cs336_spring2025_assignment1_basics
Version 1.0.4
CS336 Staff
Spring 2025
1 Assignment Overview
In this assignment, you will build all the components needed to train a standard Transformer language model
(LM) from scratch and train some models.
What you can use We expect you to build these components from scratch. In particular, you may not
use any definitions from torch.nn, torch.nn.functional, or torch.optim except for the following:
• torch.nn.Parameter
You may use any other PyTorch definitions. If you would like to use a function or class and are not
sure whether it is permitted, feel free to ask on Slack. When in doubt, consider if using it compromises the
“from-scratch” ethos of the assignment.
1 See PyTorch.org/docs/stable/nn.html#containers for a full list.
Statement on AI tools Prompting LLMs such as ChatGPT is permitted for low-level programming
questions or high-level conceptual questions about language models, but using it directly to solve the problem
is prohibited.
We strongly encourage you to disable AI autocomplete (e.g., Cursor Tab, GitHub Copilot) in your IDE
when completing assignments (though non-AI autocomplete, e.g., autocompleting function names is totally
fine). We have found that AI autocomplete makes it much harder to engage deeply with the content.
What the code looks like All the assignment code as well as this writeup are available on GitHub at:
github.com/stanford-cs336/assignment1-basics
Please git clone the repository. If there are any updates, we will notify you so you can git pull to get
the latest.
1. cs336_basics/*: This is where you write your code. Note that there’s no code in here—you can do
whatever you want from scratch!
2. adapters.py: There is a set of functionality that your code must have. For each piece of
functionality (e.g., scaled dot product attention), fill out its implementation (e.g.,
run_scaled_dot_product_attention) by simply invoking your code. Note: your changes to
adapters.py should not contain any substantive logic; this is glue code.
3. test_*.py: This contains all the tests that you must pass (e.g.,
test_scaled_dot_product_attention), which will invoke the hooks defined in adapters.py. Don’t
edit the test files.
Where to get datasets This assignment will use two pre-processed datasets: TinyStories [Eldan and Li,
2023] and OpenWebText [Gokaslan et al., 2019]. Both datasets are single, large plaintext files. If you are
doing the assignment with the class, you can find these files at /data of any non-head node machine.
If you are following along at home, you can download these files with the commands inside the README.md.
Throughout the course’s assignment handouts, we will give advice for working through parts of the
assignment with fewer or no GPU resources. For example, we will sometimes suggest downscaling
your dataset or model size, or explain how to run training code on a macOS integrated GPU or CPU.
You’ll find these “low-resource tips” in a blue box (like this one). Even if you are an enrolled Stanford
student with access to the course machines, these tips may help you iterate faster and save time, so we
recommend you read them!
Low-Resource/Downscaling Tip: Assignment 1 on Apple Silicon or CPU
With the staff solution code, we can train an LM to generate reasonably fluent text on an Apple M3
Max chip with 36 GB RAM, in under 5 minutes on Metal GPU (MPS) and about 30 minutes using the
CPU. If these words don’t mean much to you, don’t worry! Just know that if you have a reasonably
up-to-date laptop and your implementation is correct and efficient, you will be able to train a small
LM that generates simple children’s stories with decent fluency.
Later in the assignment, we will explain what changes to make if you are on CPU or MPS.
2 Byte-Pair Encoding (BPE) Tokenizer
In the first part of the assignment, we will implement and train a byte-level byte-pair encoding (BPE)
tokenizer [Sennrich et al., 2016, Wang et al., 2019]. In particular, we will represent arbitrary (Unicode)
strings as a sequence of bytes and train our BPE tokenizer on this byte sequence. Later, we will use this
tokenizer to encode text (a string) into tokens (a sequence of integers) for language modeling.
(b) How does this character’s string representation (__repr__()) differ from its printed representa-
tion?
Deliverable: A one-sentence response.
(c) What happens when this character occurs in text? It may be helpful to play around with the
following in your Python interpreter and see if it matches your expectations:
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
>>> test_string = "hello! こんにちは!"
>>> utf8_encoded = test_string.encode("utf-8")
>>> print(utf8_encoded)
b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
>>> print(type(utf8_encoded))
<class 'bytes'>
>>> # Get the byte values for the encoded string (integers from 0 to 255).
>>> list(utf8_encoded)
[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129,
161, 227, 129, 175, 33]
>>> # One byte does not necessarily correspond to one Unicode character!
>>> print(len(test_string))
13
>>> print(len(utf8_encoded))
23
>>> print(utf8_encoded.decode("utf-8"))
hello! こんにちは!
By converting our Unicode codepoints into a sequence of bytes (e.g., via the UTF-8 encoding), we
are essentially taking a sequence of codepoints (integers in the range 0 to 154,997) and transforming it
into a sequence of byte values (integers in the range 0 to 255). The 256-length byte vocabulary is much
more manageable to deal with. When using byte-level tokenization, we do not need to worry about out-of-
vocabulary tokens, since we know that any input text can be expressed as a sequence of integers from 0 to
255.
(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than
UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various
input strings.
Deliverable: A one-to-two sentence response.
(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into
a Unicode string. Why is this function incorrect? Provide an example of an input byte string
that yields incorrect results.
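For concreteness, a byte-by-byte decoder of the kind this question has in mind might look like the following sketch (the exact listing may differ):

def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Decode each byte on its own; this only works when every character is a single byte.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])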
>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
(c) Give a two byte sequence that does not decode to any Unicode character(s).
Deliverable: An example, with a one-sentence explanation.
A sentence with 10 words might only be 10 tokens long in a word-level language model, but could be 50 or
more tokens long in a character-level model (depending on the length of the words). Processing these longer
sequences requires more computation at each step of the model. Furthermore, language modeling on byte
sequences is difficult because the longer input sequences create long-term dependencies in the data.
Subword tokenization is a midpoint between word-level tokenizers and byte-level tokenizers. Note that a
byte-level tokenizer’s vocabulary has 256 entries (byte values are 0 to 255). A subword tokenizer trades off a
larger vocabulary size for better compression of the input byte sequence. For example, if the byte sequence
b'the' often occurs in our raw text training data, assigning it an entry in the vocabulary would reduce this
3-token sequence to a single token.
How do we select these subword units to add to our vocabulary? Sennrich et al. [2016] propose to use
byte-pair encoding (BPE; Gage, 1994), a compression algorithm that iteratively replaces (“merges”) the
most frequent pair of bytes with a single, new unused index. Note that this algorithm adds subword tokens
to our vocabulary to maximize the compression of our input sequences—if a word occurs in our input text
enough times, it’ll be represented as a single subword unit.
Subword tokenizers with vocabularies constructed via BPE are often called BPE tokenizers. In this
assignment, we’ll implement a byte-level BPE tokenizer, where the vocabulary items are bytes or merged
sequences of bytes, which give us the best of both worlds in terms of out-of-vocabulary handling and man-
ageable input sequence lengths. The process of constructing the BPE tokenizer vocabulary is known as
“training” the BPE tokenizer.
Vocabulary initialization The tokenizer vocabulary is a one-to-one mapping from bytestring token to
integer ID. Since we’re training a byte-level BPE tokenizer, our initial vocabulary is simply the set of all
bytes. Since there are 256 possible byte values, our initial vocabulary is of size 256.
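A minimal sketch of this initialization (the function name is illustrative; special tokens, if any, are appended under the next free IDs):

def init_byte_vocab() -> dict[int, bytes]:
    # One vocabulary entry per possible byte value, keyed by token ID.
    return {i: bytes([i]) for i in range(256)}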
Pre-tokenization Once you have a vocabulary, you could, in principle, count how often bytes occur next
to each other in your text and begin merging them starting with the most frequent pair of bytes. However,
this is quite computationally expensive, since we’d have to take a full pass over the corpus each time
we merge. In addition, directly merging bytes across the corpus may result in tokens that differ only in
punctuation (e.g., dog! vs. dog.). These tokens would get completely different token IDs, even though they
are likely to have high semantic similarity (since they differ only in punctuation).
To avoid this, we pre-tokenize the corpus. You can think of this as a coarse-grained tokenization over the
corpus that helps us count how often pairs of characters appear. For example, the word 'text' might be
a pre-token that appears 10 times. In this case, when we count how often the characters ‘t’ and ‘e’ appear
next to each other, we will see that the word ‘text’ has ‘t’ and ‘e’ adjacent and we can increment their count
by 10 instead of looking through the corpus. Since we’re training a byte-level BPE model, each pre-token is
represented as a sequence of UTF-8 bytes.
The original BPE implementation of Sennrich et al. [2016] pre-tokenizes by simply splitting on whitespace
(i.e., s.split(" ")). In contrast, we’ll use a regex-based pre-tokenizer (used by GPT-2; Radford et al., 2019)
from github.com/openai/tiktoken/pull/234/files:
>>> PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
It may be useful to interactively split some text with this pre-tokenizer to get a better sense of its
behavior:
>>> # requires `regex` package
>>> import regex as re
>>> re.findall(PAT, "some text that i'll pre-tokenize")
['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']
When using it in your code, however, you should use re.finditer to avoid storing the pre-tokenized words
as you construct your mapping from pre-tokens to their counts.
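A sketch of this counting step for a single chunk of text (the function name is illustrative; a full implementation would combine this with the chunking and multiprocessing discussed below):

import regex as re
from collections import Counter

def count_pretokens(text: str) -> Counter:
    # PAT is the GPT-2 pre-tokenization pattern shown above.
    # Iterate over matches lazily instead of materializing the full list of pre-tokens.
    counts = Counter()
    for match in re.finditer(PAT, text):
        counts[match.group().encode("utf-8")] += 1
    return counts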
Compute BPE merges Now that we’ve converted our input text into pre-tokens and represented each
pre-token as a sequence of UTF-8 bytes, we can compute the BPE merges (i.e., train the BPE tokenizer).
At a high level, the BPE algorithm iteratively counts every pair of bytes and identifies the pair with the
highest frequency (“A”, “B”). Every occurrence of this most frequent pair (“A”, “B”) is then merged, i.e.,
replaced with a new token “AB”. This new merged token is added to our vocabulary; as a result, the final
vocabulary after BPE training is the size of the initial vocabulary (256 in our case), plus the number of BPE
merge operations performed during training. For efficiency during BPE training, we do not consider pairs
that cross pre-token boundaries.2 When computing merges, deterministically break ties in pair frequency by
preferring the lexicographically greater pair. For example, if the pairs (“A”, “B”), (“A”, “C”), (“B”, “ZZ”),
and (“BA”, “A”) all have the highest frequency, we’d merge (“BA”, “A”).
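In code, this selection can be expressed in one line (a sketch, where pair_counts maps each adjacent pair of byte strings to its frequency):

def pick_merge(pair_counts: dict[tuple[bytes, bytes], int]) -> tuple[bytes, bytes]:
    # max over (count, pair) prefers higher counts and, among ties, the
    # lexicographically greater pair, matching the tie-breaking rule above.
    return max(pair_counts, key=lambda pair: (pair_counts[pair], pair))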
Special tokens Often, some strings (e.g., <|endoftext|>) are used to encode metadata (e.g., boundaries
between documents). When encoding text, it’s often desirable to treat some strings as “special tokens” that
should never be split into multiple tokens (i.e., will always be preserved as a single token). For example,
the end-of-sequence string <|endoftext|> should always be preserved as a single token (i.e., a single integer
ID), so we know when to stop generating from the language model. These special tokens must be added to
the vocabulary, so they have a corresponding fixed token ID.
Algorithm 1 of Sennrich et al. [2016] contains an inefficient implementation of BPE tokenizer training
(essentially following the steps that we outlined above). As a first exercise, it may be useful to implement
and test this function to check your understanding.
Here is a stylized example from Sennrich et al. [2016]. Consider a corpus consisting of the following text:

low low low low low
lower lower widest widest widest
newest newest newest newest newest newest
Vocabulary We initialize our vocabulary with our special token <|endoftext|> and the 256 byte
values.
Pre-tokenization For simplicity and to focus on the merge procedure, we assume in this example
that pretokenization simply splits on whitespace. When we pretokenize and count, we end up with the
frequency table.
{low: 5, lower: 2, widest: 3, newest: 6}
2 Note that the original BPE formulation [Sennrich et al., 2016] specifies the inclusion of an end-of-word token. We do not
add an end-of-word-token when training byte-level BPE models because all bytes (including whitespace and punctuation) are
included in the model’s vocabulary. Since we’re explicitly representing spaces and punctuation, the learned BPE merges will
naturally reflect these word boundaries.
It is convenient to represent this as a dict[tuple[bytes], int], e.g. {(l,o,w): 5 …}. Note that even
a single byte is a bytes object in Python. There is no byte type in Python to represent a single byte,
just as there is no char type in Python to represent a single character.
Merges We first look at every successive pair of bytes and sum the frequency of the words where they
appear: {lo: 7, ow: 7, we: 8, er: 2, wi: 3, id: 3, de: 3, es: 9, st: 9, ne: 6, ew: 6}. The pairs ('es')
and ('st') are tied, so we take the lexicographically greater pair, ('st'). We would then merge the
pre-tokens so that we end up with {(l,o,w): 5, (l,o,w,e,r): 2, (w,i,d,e,st): 3, (n,e,w,e,st): 6}.
In the second round, we see that (e, st) is the most common pair (with a count of 9) and we would
merge into {(l,o,w): 5, (l,o,w,e,r): 2, (w,i,d,est): 3, (n,e,w,est): 6}. Continuing this, the
sequence of merges we get in the end will be ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e',
'ne west', 'w i', 'wi d', 'wid est', 'low e', 'lowe r'].
If we take 6 merges, we have ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e'] and our vocab-
ulary elements would be [<|endoftext|>, [...256 BYTE CHARS], st, est, ow, low, west, ne].
With this vocabulary and set of merges, the word newest would tokenize as [ne, west].
Parallelizing pre-tokenization You will find that a major bottleneck is the pre-tokenization step. You
can speed up pre-tokenization by parallelizing your code with the built-in library multiprocessing. Con-
cretely, we recommend that in parallel implementations of pre-tokenization, you chunk the corpus while
ensuring your chunk boundaries occur at the beginning of a special token. You are free to use the starter
code at the following link verbatim to obtain chunk boundaries, which you can then use to distribute work
across your processes:
https://github.com/stanford-cs336/assignment1-basics/blob/main/cs336_basics/pretokenization_example.py
This chunking will always be valid, since we never want to merge across document boundaries. For the
purposes of the assignment, you can always split in this way. Don’t worry about the edge case of receiving
a very large corpus that does not contain <|endoftext|>.
Removing special tokens before pre-tokenization Before running pre-tokenization with the regex
pattern (using re.finditer), you should strip out all special tokens from your corpus (or your chunk, if using
a parallel implementation). Make sure that you split on your special tokens, so that no merging can occur
across the text they delimit. For example, if you have a corpus (or chunk) like [Doc 1]<|endoftext|>[Doc
2], you should split on the special token <|endoftext|>, and pre-tokenize [Doc 1] and [Doc 2] separately,
so that no merging can occur across the document boundary. This can be done using re.split with
"|".join(special_tokens) as the delimiter (with careful use of re.escape since | may occur in the special
tokens). The test test_train_bpe_special_tokens will test for this.
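For example (a sketch; the helper name is illustrative):

import regex as re

def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    if not special_tokens:
        return [text]
    # Escape each special token so characters like "|" are treated literally.
    pattern = "|".join(re.escape(tok) for tok in special_tokens)
    return re.split(pattern, text)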
Optimizing the merging step The naïve implementation of BPE training in the stylized example above
is slow because for every merge, it iterates over all byte pairs to identify the most frequent pair. However,
the only pair counts that change after each merge are those that overlap with the merged pair. Thus,
BPE training speed can be improved by indexing the counts of all pairs and incrementally updating these
counts, rather than explicitly iterating over each pair of bytes to count pair frequencies. You can get
significant speedups with this caching procedure, though we note that the merging part of BPE training is
not parallelizable in Python.
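As a sketch of the bookkeeping involved, here is one way to apply a chosen merge to a single pre-token while recording how the pair counts change (data structures are illustrative; an efficient implementation would also index which pre-tokens contain each pair):

from collections import Counter

def merge_and_update(pretoken: tuple[bytes, ...], pair: tuple[bytes, bytes],
                     count: int, pair_counts: Counter) -> tuple[bytes, ...]:
    # Replace occurrences of `pair` inside `pretoken` (which occurs `count` times in the
    # corpus), decrementing counts of pairs that disappear and incrementing new ones.
    merged, i = [], 0
    while i < len(pretoken):
        if i + 1 < len(pretoken) and (pretoken[i], pretoken[i + 1]) == pair:
            new_token = pretoken[i] + pretoken[i + 1]
            if merged:  # the pair formed with the preceding token changes
                pair_counts[(merged[-1], pretoken[i])] -= count
                pair_counts[(merged[-1], new_token)] += count
            if i + 2 < len(pretoken):  # the pair formed with the following token changes
                pair_counts[(pretoken[i + 1], pretoken[i + 2])] -= count
                pair_counts[(new_token, pretoken[i + 2])] += count
            pair_counts[pair] -= count
            merged.append(new_token)
            i += 2
        else:
            merged.append(pretoken[i])
            i += 1
    return tuple(merged)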
Low-Resource/Downscaling Tip: Profiling
You should use profiling tools like cProfile or scalene to identify the bottlenecks in your imple-
mentation, and focus on optimizing those.
Instead of jumping to training your tokenizer on the full TinyStories dataset, we recommend you
first train on a small subset of the data: a “debug dataset”. For example, you could train your tokenizer
on the TinyStories validation set instead, which is 22K documents instead of 2.12M. This illustrates a
general strategy of downscaling whenever possible to speed up development: for example, using smaller
datasets, smaller model sizes, etc. Choosing the size of the debug dataset or hyperparameter config
requires careful consideration: you want your debug set to be large enough to have the same bottlenecks
as the full configuration (so that the optimizations you make will generalize), but not so big that it
takes forever to run.
Deliverable: Write a function that, given a path to an input text file, trains a (byte-level) BPE
tokenizer. Your BPE training function should handle (at least) the following input parameters:
input_path: str Path to a text file with BPE tokenizer training data.
vocab_size: int A positive integer that defines the maximum final vocabulary size (including the
initial byte vocabulary, vocabulary items produced from merging, and any special tokens).
special_tokens: list[str] A list of strings to add to the vocabulary. These special tokens do not
otherwise affect BPE training.
Your BPE training function should return the resulting vocabulary and merges:
vocab: dict[int, bytes] The tokenizer vocabulary, a mapping from int (token ID in the vocabu-
lary) to bytes (token bytes).
merges: list[tuple[bytes, bytes]] A list of BPE merges produced from training. Each list item
is a tuple of bytes (<token1>, <token2>), representing that <token1> was merged with
<token2>. The merges should be ordered by order of creation.
To test your BPE training function against our provided tests, you will first need to implement the
test adapter at [adapters.run_train_bpe]. Then, run uv run pytest tests/test_train_bpe.py.
Your implementation should be able to pass all tests. Optionally (this could be a large time-investment),
you can implement the key parts of your training method using some systems language, for instance
C++ (consider cppyy for this) or Rust (using PyO3). If you do this, be aware of which operations
require copying vs reading directly from Python memory, and make sure to leave build instructions, or
make sure it builds using only pyproject.toml. Also note that the GPT-2 regex is not well-supported
in most regex engines and will be too slow in most that do. We have verified that Oniguruma is
reasonably fast and supports negative lookahead, but the regex package in Python is, if anything,
even faster.
Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)
(a) Train a byte-level BPE tokenizer on the TinyStories dataset, using a maximum vocabulary size
of 10,000. Make sure to add the TinyStories <|endoftext|> special token to the vocabulary.
Serialize the resulting vocabulary and merges to disk for further inspection. How long did training
take, and how much memory did it use? What is the longest token in the vocabulary? Does it make sense?
Resource requirements: ≤ 30 minutes (no GPUs), ≤ 30GB RAM
Hint You should be able to get under 2 minutes for BPE training using multiprocessing during
pretokenization and the following two facts:
Next, we’ll try training a byte-level BPE tokenizer on the OpenWebText dataset. As before, we recom-
mend taking a look at the dataset to better understand its contents.
(a) Train a byte-level BPE tokenizer on the OpenWebText dataset, using a maximum vocabulary
size of 32,000. Serialize the resulting vocabulary and merges to disk for further inspection. What
is the longest token in the vocabulary? Does it make sense?
Resource requirements: ≤ 12 hours (no GPUs), ≤ 100GB RAM
Deliverable: A one-to-two sentence response.
(b) Compare and contrast the tokenizer that you get training on TinyStories versus OpenWebText.
Deliverable: A one-to-two sentence response.
Example (bpe_encoding): BPE encoding example
For example, suppose our input string is 'the cat ate', our vocabulary is {0: b' ', 1: b'a', 2:
b'c', 3: b'e', 4: b'h', 5: b't', 6: b'th', 7: b' c', 8: b' a', 9: b'the', 10: b' at'}, and our
learned merges are [(b't', b'h'), (b' ', b'c'), (b' ', b'a'), (b'th', b'e'), (b' a', b't')].
First, our pre-tokenizer would split this string into ['the', ' cat', ' ate'].
Then, we’ll look at each pre-token and apply the BPE merges.
The first pre-token 'the' is initially represented as [b't', b'h', b'e']. Looking at our list of
merges, we identify the first applicable merge to be (b't', b'h'), and use that to transform the
pre-token into [b'th', b'e']. Then, we go back to the list of merges and identify the next applicable
merge to be (b'th', b'e'), which transforms the pre-token into [b'the']. Finally, looking back at
the list of merges, we see that there are no more that apply to the string (since the entire pre-token
has been merged into a single token), so we are done applying the BPE merges. The corresponding
integer sequence is [9].
Repeating this process for the remaining pre-tokens, we see that the pre-token ' cat' is represented
as [b' c', b'a', b't'] after applying the BPE merges, which becomes the integer sequence [7, 1,
5]. The final pre-token ' ate' is [b' at', b'e'] after applying the BPE merges, which becomes the
integer sequence [10, 3]. Thus, the final result of encoding our input string is [9, 7, 1, 5, 10,
3].
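A sketch of this per-pre-token merging loop (the helper name and list-of-bytes representation are illustrative):

def apply_merges(pretoken: list[bytes], merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Find the earliest-created merge that still applies somewhere in the pre-token.
        best = min((pair for pair in zip(pretoken, pretoken[1:]) if pair in rank),
                   key=rank.get, default=None)
        if best is None:
            return pretoken
        # Replace every occurrence of the chosen pair with the merged token.
        merged, i = [], 0
        while i < len(pretoken):
            if i + 1 < len(pretoken) and (pretoken[i], pretoken[i + 1]) == best:
                merged.append(pretoken[i] + pretoken[i + 1])
                i += 2
            else:
                merged.append(pretoken[i])
                i += 1
        pretoken = merged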
Special tokens. Your tokenizer should be able to properly handle user-defined special tokens when encod-
ing text (provided when constructing the tokenizer).
Memory considerations. Suppose we want to tokenize a large text file that we cannot fit in memory.
To efficiently tokenize this large file (or any other stream of data), we need to break it up into manageable
chunks and process each chunk in-turn, so that the memory complexity is constant as opposed to linear in
the size of the text. In doing so, we need to make sure that a token doesn’t cross chunk boundaries, else
we’ll get a different tokenization than the naïve method of tokenizing the entire sequence in-memory.
Deliverable: Implement a Tokenizer class that, given a vocabulary and a list of merges, encodes
text into integer IDs and decodes integer IDs into text. Your tokenizer should also support user-provided
special tokens (appending them to the vocabulary if they aren’t already there). We recommend the
following interface:
A constructor or class method that builds the Tokenizer from serialized vocabulary and merges files (e.g., a
from_files class method) would additionally take the following parameters:
vocab_filepath: str
merges_filepath: str
special_tokens: list[str] | None = None
def encode(self, text: str) -> list[int] Encode an input text into a sequence of token IDs.
def encode_iterable(self, iterable: Iterable[str]) -> Iterator[int] Given an iterable of
strings (e.g., a Python file handle), return a generator that lazily yields token IDs. This is
required for memory-efficient tokenization of large files that we cannot directly load into
memory.
def decode(self, ids: list[int]) -> str Decode a sequence of token IDs into text. If the underlying
bytes are not valid UTF-8, replace the malformed bytes with the official Unicode replacement character.
To test your Tokenizer against our provided tests, you will first need to implement the test adapter
at [adapters.get_tokenizer]. Then, run uv run pytest tests/test_tokenizer.py. Your imple-
mentation should be able to pass all tests.
2.7 Experiments
Problem (tokenizer_experiments): Experiments with tokenizers (4 points)
(a) Sample 10 documents from TinyStories and OpenWebText. Using your previously-trained TinyS-
tories and OpenWebText tokenizers (10K and 32K vocabulary size, respectively), encode these
sampled documents into integer IDs. What is each tokenizer’s compression ratio (bytes/token)?
Deliverable: A one-to-two sentence response.
(b) What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer? Com-
pare the compression ratio and/or qualitatively describe what happens.
Deliverable: A one-to-two sentence response.
(c) Estimate the throughput of your tokenizer (e.g., in bytes/second). How long would it take to
tokenize the Pile dataset (825GB of text)?
Deliverable: A one-to-two sentence response.
(d) Using your TinyStories and OpenWebText tokenizers, encode the respective training and devel-
opment datasets into a sequence of integer token IDs. We’ll use this later to train our language
model. We recommend serializing the token IDs as a NumPy array of datatype uint16. Why is
uint16 an appropriate choice?
Deliverable: A one-to-two sentence response.
[Figure 1: High-level schematic of the Transformer language model: a token embedding, followed by
num_layers pre-norm Transformer blocks (each containing a normed multi-head self-attention sublayer and a
normed position-wise feed-forward sublayer, each followed by a residual add), a final norm, a linear output
embedding (LM head), and a softmax producing the output probabilities.]
3.1 Transformer LM
Given a sequence of token IDs, the Transformer language model uses an input embedding to convert token
IDs to dense vectors, passes the embedded tokens through num_layers Transformer blocks, and then applies
a learned linear projection (the “output embedding” or “LM head”) to produce the predicted next-token
logits. See Figure 1 for a schematic representation.
More specifically, given a sequence of token IDs, the Transformer language model uses a token em-
bedding layer to produce a sequence of vectors. Each embedding layer takes in a tensor of integers
of shape (batch_size, sequence_length) and produces a sequence of vectors of shape (batch_size,
sequence_length, d_model).
• Elements of a batch: we apply the same Transformer forward operation on each batch element.
• Sequence length: the “position-wise” operations like RMSNorm and feed-forward operate identically
on each position of a sequence.
• Attention heads: the attention operation is batched across attention heads in a “multi-headed”
attention operation.
It is useful to have an ergonomic way of performing such operations in a way that fully utilizes the GPU,
and is easy to read and understand. Many PyTorch operations can take in excess “batch-like” dimensions
at the start of a tensor and repeat/broadcast the operation across these dimensions efficiently.
For instance, say we are doing a position-wise, batched operation. We have a “data tensor” D of shape
(batch_size, sequence_length, d_model), and we would like to do a batched vector-matrix multiply
against a matrix A of shape (d_model, d_model). In this case, D @ A will do a batched matrix multiply,
which is an efficient primitive in PyTorch, where the (batch_size, sequence_length) dimensions are
batched over.
Because of this, it is helpful to assume that your functions may be given additional batch-like dimensions
and to keep those dimensions at the start of the PyTorch shape. To organize tensors so they can be batched
in this manner, they might need to be shaped using many steps of view, reshape and transpose. This can
be a bit of a pain, and it often gets hard to read what the code is doing and what the shapes of your tensors
are.
A more ergonomic option is to use einsum notation within torch.einsum, or rather use framework
agnostic libraries like einops or einx. The two key ops are einsum, which can do tensor contractions with
arbitrary dimensions of input tensors, and rearrange, which can reorder, concatenate, and split arbitrary
dimensions. It turns out almost all operations in machine learning are some combination of dimension
juggling and tensor contraction with the occasional (usually pointwise) nonlinear function. This means that
a lot of your code can be more readable and flexible when using einsum notation.
We strongly recommend learning and using einsum notation for the class. Students who have not
been exposed to einsum notation before should use einops (docs here), and students who are already
comfortable with einops should learn the more general einx (here).4 Both packages are already installed
in the environment we’ve supplied.
Here we give some examples of how einsum notation can be used. These are a supplement to the
documentation for einops, which you should read first.
import torch
from einops import rearrange, einsum
## Basic implementation
Y = D @ A.T
# Hard to tell the input and output shapes and what they mean.
# What shapes can D and A have, and do any of these have unexpected behavior?
## Or, a batched version where D can have any leading dimensions but A is constrained.
Y = einsum(D, A, "... d_in, d_out d_in -> ... d_out")
We have a batch of images, and for each image we want to generate 10 dimmed versions based on some
scaling factor:
images = torch.randn(64, 128, 128, 3) # (batch, height, width, channel)
dim_by = torch.linspace(start=0.0, end=1.0, steps=10)
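# Without einsum notation, one way to build the 10 dimmed copies is an explicit
# stack over the scaling factors (a sketch for comparison with the einsum version below):
dimmed_images = torch.stack([images * factor for factor in dim_by], dim=1)
# dimmed_images has shape (batch, dim_value, height, width, channel)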
## Or in one go:
dimmed_images = einsum(
images, dim_by,
"batch height width channel, dim_value -> batch dim_value height width channel"
)
4 It’s worth noting that while einops has a great amount of support, einx is not as battle-tested. You should feel free to fall
back to using einops with some more plain PyTorch if you find any limitations or bugs in einx.
Example (einstein_example3): Pixel mixing with einops.rearrange
Suppose we have a batch of images represented as a tensor of shape (batch, height, width,
channel), and we want to perform a linear transformation across all pixels of the image, but this
transformation should happen independently for each channel. Our linear transformation is
represented as a matrix B of shape (height × width, height × width).
channels_last = torch.randn(64, 32, 32, 3) # (batch, height, width, channel)
B = torch.randn(32*32, 32*32)
channels_last_flat = channels_last.view(64, 32 * 32, 3)  # flatten the pixels
channels_first_flat = channels_last_flat.transpose(1, 2)  # move channels in front of the pixels
channels_first_flat_transformed = channels_first_flat @ B.T  # mix pixels, independently per channel
channels_last_flat_transformed = channels_first_flat_transformed.transpose(1, 2)
channels_last_transformed = channels_last_flat_transformed.view(*channels_last.shape)
Or, if you’re feeling crazy: all in one go using einx.dot (einx equivalent of einops.einsum)
import einx

height = width = 32
channels_last_transformed = einx.dot(
"batch row_in col_in channel, (row_out col_out) (row_in col_in)"
"-> batch row_out col_out channel",
channels_last, B,
col_in=width, col_out=width
)
The first implementation here could be improved by placing comments before and after to indicate
what the input and output shapes are, but this is clunky and susceptible to bugs. With einsum
notation, documentation is implementation!
Einsum notation can handle arbitrary input batching dimensions, but also has the key benefit of being
self-documenting. It’s much clearer what the relevant shapes of your input and output tensors are in code
that uses einsum notation. For the remaining tensors, you can consider using Tensor type hints, for instance
using the jaxtyping library (not specific to Jax).
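For instance (an illustrative function, not part of the assignment interface):

import torch
from jaxtyping import Float

def scale_positions(
    x: Float[torch.Tensor, "... seq d_model"],
    scales: Float[torch.Tensor, "seq"],
) -> Float[torch.Tensor, "... seq d_model"]:
    # The annotations document the expected shapes; "..." stands for any number of
    # leading batch-like dimensions.
    return x * scales[:, None]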
We will talk more about the performance implications of using einsum notation in assignment 2, but for
now know that they’re almost always better than the alternative!
• RMSNorm weights: initialize to 1

Each Linear module computes
y = W x. (3)
Note that we do not include a bias term, following most modern LLMs.
Problem (linear): Implementing the linear module (1 point)
Deliverable: Implement a Linear class that inherits from torch.nn.Module and performs a linear
transformation. Your implementation should follow the interface of PyTorch’s built-in nn.Linear
module, except for not having a bias argument or parameter. We recommend the following interface:
def forward(self, x: torch.Tensor) -> torch.Tensor Apply the linear transformation to the
input.
Deliverable: Implement the Embedding class that inherits from torch.nn.Module and performs an
embedding lookup. Your implementation should follow the interface of PyTorch’s built-in
nn.Embedding module. We recommend the following interface:
embedding_dim: int Dimension of the embedding vectors, i.e., dmodel
device: torch.device | None = None Device to store the parameters on
dtype: torch.dtype | None = None Data type of the parameters
def forward(self, token_ids: torch.Tensor) -> torch.Tensor Lookup the embedding vectors
for the given token IDs.
• store the embedding matrix with the d_model being the final dimension
• of course, don’t use nn.Embedding or nn.functional.embedding
Again, use the settings from above for initialization, and use torch.nn.init.trunc_normal_ to
initialize the weights.
To test your implementation, implement the test adapter at [adapters.run_embedding]. Then, run
uv run pytest -k test_embedding.
You should upcast your input to torch.float32 to prevent overflow when you square the input. Overall,
your forward method should look like:
in_dtype = x.dtype
x = x.to(torch.float32)
# ... perform the RMSNorm computation in float32, producing `result` ...
return result.to(in_dtype)
3.5.2 Position-Wise Feed-Forward Network
Figure 3: Comparing the SiLU (aka Swish) and ReLU activation functions.
In the original Transformer paper (section 3.3 of Vaswani et al. [2017]), the Transformer feed-forward network
consists of two linear transformations with a ReLU activation (ReLU(x) = max(0, x)) between them. The
dimensionality of the inner feed-forward layer is typically 4x the input dimensionality.
However, modern language models tend to incorporate two main changes compared to this original design:
they use another activation function and employ a gating mechanism. Specifically, we will implement the
“SwiGLU” activation function adopted in LLMs like Llama 3 [Grattafiori et al., 2024] and Qwen 2.5 [Yang
et al., 2024], which combines the SiLU (often called Swish) activation with a gating mechanism called a
Gated Linear Unit (GLU). We will also omit the bias terms sometimes used in linear layers, following most
modern LLMs since PaLM [Chowdhery et al., 2022] and LLaMA [Touvron et al., 2023].
The SiLU or Swish activation function [Hendrycks and Gimpel, 2016, Elfwing et al., 2017] is defined as
follows:
SiLU(x) = x · σ(x) = x / (1 + e−x ) (5)
As can be seen in Figure 3, the SiLU activation function is similar to the ReLU activation function, but
is smooth at zero.
Gated Linear Units (GLUs) were originally defined by Dauphin et al. [2017] as the element-wise product
of a linear transformation passed through a sigmoid function and another linear transformation:

GLU(x, W1 , W2 ) = σ(W1 x) ⊙ W2 x, (6)

where ⊙ represents element-wise multiplication. Gated Linear Units are suggested to “reduce the vanishing
gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear
capabilities.”
Putting the SiLU/Swish and GLU together, we get the SwiGLU, which we will use for our feed-forward
networks:
FFN(x) = SwiGLU(x, W1 , W2 , W3 ) = W2 (SiLU(W1 x) ⊙ W3 x), (7)
where x ∈ Rdmodel , W1 , W3 ∈ Rdff ×dmodel , W2 ∈ Rdmodel ×dff , and canonically, dff = (8/3) dmodel .
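Read as code, Eq 7 is simply (a sketch, assuming each weight is stored as an (out_features, in_features) matrix and applied as x @ W.T):

import torch

def swiglu_ffn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, w3: torch.Tensor) -> torch.Tensor:
    # FFN(x) = W2 (SiLU(W1 x) ⊙ W3 x), with SiLU(a) = a * sigmoid(a).
    gate = x @ w1.T
    return (gate * torch.sigmoid(gate) * (x @ w3.T)) @ w2.T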
Shazeer [2020] first proposed combining the SiLU/Swish activation with GLUs and conducted experiments
showing that SwiGLU outperforms baselines like ReLU and SiLU (without gating) on language modeling
tasks. Later in the assignment, you will compare SwiGLU and SiLU. Though we’ve mentioned some heuristic
arguments for these components (and the papers provide more supporting evidence), it’s good to keep an
empirical perspective: a now famous quote from Shazeer’s paper is
We offer no explanation as to why these architectures seem to work; we attribute their success,
as all else, to divine benevolence.
Rki = [ cos(θi,k )  −sin(θi,k )
        sin(θi,k )   cos(θi,k ) ] . (8)

The full rotation matrix Ri is block-diagonal, with the 2 × 2 blocks Rki for k = 1, . . . , d/2 on its diagonal
and 2 × 2 zero matrices everywhere else. While one could construct the full d × d matrix, a good solution
should use the properties of this matrix to implement the transformation more efficiently. Since we only
care about the relative rotation of tokens within a given sequence, we can reuse the values we compute for
cos(θi,k ) and sin(θi,k ) across layers, and different batches. If you would like to optimize it, you may use a
single RoPE module referenced by all layers, and it can have a 2d pre-computed buffer of sin and cos values
created during init with self.register_buffer(persistent=False), instead of a nn.Parameter (because
we do not want to learn these fixed cosine and sine values). The exact same rotation process we did for
our q (i) is then done for k (j) , rotating by the corresponding Rj . Notice that this layer has no learnable
parameters.
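A minimal sketch of the buffer precomputation described above (class and parameter names are illustrative, and we assume the common RoPE angle definition θi,k = i · Θ^(−2(k−1)/d) with base Θ and even d_k):

import torch

class RotaryPositionalEmbedding(torch.nn.Module):
    def __init__(self, theta: float, d_k: int, max_seq_len: int):
        super().__init__()
        # One inverse frequency per pair of dimensions, one angle per (position, pair).
        inv_freq = theta ** (-torch.arange(0, d_k, 2).float() / d_k)
        angles = torch.arange(max_seq_len).float()[:, None] * inv_freq[None, :]
        # Fixed (non-learned) buffers, excluded from the state dict via persistent=False.
        self.register_buffer("cos", torch.cos(angles), persistent=False)
        self.register_buffer("sin", torch.sin(angles), persistent=False)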
Problem (rope): Implement RoPE (2 points)
softmax(v)i = exp(vi ) / ∑_{j=1}^{n} exp(vj ). (10)
Note that exp(vi ) can become inf for large values (then, inf/inf = NaN). We can avoid this by noticing
that the softmax operation is invariant to adding any constant c to all inputs. We can leverage this property
for numerical stability—typically, we will subtract the largest entry of oi from all elements of oi , making the
new largest entry 0. You will now implement softmax, using this trick for numerical stability.
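In code, the trick amounts to (a sketch; the function name is illustrative):

import torch

def stable_softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    # Subtract the per-slice maximum so the largest exponent is exp(0) = 1.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=dim, keepdim=True)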
Deliverable: Write a function to apply the softmax operation on a tensor. Your function should
take two parameters: a tensor and a dimension i, and apply softmax to the i-th dimension of the input
tensor. The output tensor should have the same shape as the input tensor, but its i-th dimension will
now have a normalized probability distribution. Use the trick of subtracting the maximum value in
the i-th dimension from all elements of the i-th dimension to avoid numerical stability issues.
To test your implementation, complete [adapters.run_softmax] and make sure it passes uv run
pytest -k test_softmax_matches_pytorch.
where Q ∈ Rn×dk , K ∈ Rm×dk , and V ∈ Rm×dv . Here, Q, K and V are all inputs to this operation—note
that these are not the learnable parameters. If you’re wondering why this isn’t QK ⊤ , see 3.3.1.
Masking: It is sometimes convenient to mask the output of an attention operation. A mask should have
the shape M ∈ {True, False}n×m , and each row i of this boolean matrix indicates which keys the query
i should attend to. Canonically (and slightly confusingly), a value of True at position (i, j) indicates that
the query i does attend to the key j, and a value of False indicates that the query does not attend to the
key. In other words, “information flows” at (i, j) pairs with value True. For example, consider a 1 × 3 mask
matrix with entries [[True, True, False]]. The single query vector attends only to the first two keys.
Computationally, it will be much more efficient to use masking than to compute attention on subsequences,
and we can do this by taking the pre-softmax values Q⊤K/√dk and adding −∞ to any entry of the mask
matrix that is False.
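In code, this masking step might look like the following sketch, where scores are the pre-softmax values and mask is the boolean matrix described above:

import torch

def apply_attention_mask(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Entries where mask is False get -inf and therefore zero probability after softmax.
    return scores.masked_fill(~mask, float("-inf"))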
Deliverable: Implement the scaled dot-product attention function. Your implementation should
handle keys and queries of shape (batch_size, ..., seq_len, d_k) and values of shape
(batch_size, ..., seq_len, d_v), where ... represents any number of other batch-like
dimensions (if provided). The implementation should return an output with the shape (batch_size,
..., d_v). See section 3.3 for a discussion on batch-like dimensions.
Your implementation should also support an optional user-provided boolean mask of shape (seq_len,
seq_len). The attention probabilities of positions with a mask value of True should collectively sum
to 1, and the attention probabilities of positions with a mask value of False should be zero.
To test your implementation against our provided tests, you will need to implement the test adapter
at [adapters.run_scaled_dot_product_attention].
uv run pytest -k test_scaled_dot_product_attention tests your implementation on third-order
input tensors, while uv run pytest -k test_4d_scaled_dot_product_attention tests your
implementation on fourth-order input tensors.
with Qi , Ki , Vi being slice number i ∈ {1, . . . , h} of size dk or dv of the embedding dimension for Q, K, and
V respectively. With Attention being the scaled dot-product attention operation defined in §3.5.4. From
this we can form the multi-head self-attention operation:
Here, the learnable parameters are WQ ∈ Rhdk ×dmodel , WK ∈ Rhdk ×dmodel , WV ∈ Rhdv ×dmodel , and WO ∈
Rdmodel ×hdv . Since the Qs, K, and V s are sliced in the multi-head attention operation, we can think of WQ ,
WK and WV as being separated for each head along the output dimension. When you have this working,
you should be computing the key, value, and query projections in a total of three matrix multiplies.5
5 As a stretch goal, try combining the key, query, and value projections into a single weight matrix so you only need a single
matrix multiply.
Causal masking. Your implementation should prevent the model from attending to future tokens in the
sequence. In other words, if the model is given a token sequence t1 , . . . , tn , and we want to calculate the
next-word predictions for the prefix t1 , . . . , ti (where i < n), the model should not be able to access (attend
to) the token representations at positions ti+1 , . . . , tn since it will not have access to these tokens when
generating text during inference (and these future tokens leak information about the identity of the true
next word, trivializing the language modeling pre-training objective). For an input token sequence t1 , . . . , tn
we can naively prevent access to future tokens by running multi-head self-attention n times (for the n unique
prefixes in the sequence). Instead, we’ll use causal attention masking, which allows token i to attend to all
positions j ≤ i in the sequence. You can use torch.triu or a broadcasted index comparison to construct
this mask, and you should take advantage of the fact that your scaled dot-product attention implementation
from §3.5.4 already supports attention masking.
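For example, a broadcasted index comparison gives the mask directly (a sketch; True means the query may attend to that key):

import torch

def causal_mask(seq_len: int, device=None) -> torch.Tensor:
    # mask[i, j] is True exactly when j <= i, so token i attends only to positions j <= i.
    idx = torch.arange(seq_len, device=device)
    return idx[None, :] <= idx[:, None]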
Applying RoPE. RoPE should be applied to the query and key vectors, but not the value vectors. Also,
the head dimension should be handled as a batch dimension, because in multi-head attention, attention is
being applied independently for each head. This means that precisely the same RoPE rotation should be
applied to the query and key vectors for each head.
Following Vaswani et al. [2017], set dk = dv = dmodel /h. To test your implementation against our
provided tests, implement the test adapter at [adapters.run_multihead_self_attention]. Then,
run uv run pytest -k test_multihead_self_attention to test your implementation.
Implement the pre-norm Transformer block as described in §3.5 and illustrated in Figure 2. Your
Transformer block should accept (at least) the following parameters.
To test your implementation, implement the adapter [adapters.run_transformer_block]. Then
run uv run pytest -k test_transformer_block to test your implementation.
Deliverable: Transformer block code that passes the provided tests.
Now we put the blocks together, following the high level diagram in Figure 1. Follow our description of
the embedding in Section 3.1.1, feed this into num_layers Transformer blocks, and then pass that into the
three output layers to obtain a distribution over the vocabulary.
Time to put it all together! Implement the Transformer language model as described in §3.1
and illustrated in Figure 1. At minimum, your implementation should accept all the aforementioned
construction parameters for the Transformer block, as well as these additional parameters:
vocab_size: int The size of the vocabulary, necessary for determining the dimensionality of the token
embedding matrix.
context_length: int The maximum context length, necessary for determining the dimensionality of
the position embedding matrix.
num_layers: int The number of Transformer blocks to use.
To test your implementation against our provided tests, you will first need to implement the test
adapter at [adapters.run_transformer_lm]. Then, run uv run pytest -k test_transformer_lm
to test your implementation.
Deliverable: A Transformer LM module that passes the above tests.
Resource accounting. It is useful to be able to understand how the various parts of the Transformer
consume compute and memory. We will go through the steps to do some basic “FLOPs accounting.” The
vast majority of FLOPS in a Transformer are matrix multiplies, so our core approach is simple:
1. Write down all the matrix multiplies in a Transformer forward pass.
2. Convert each matrix multiply into FLOPs required.
For this second step, the following fact will be useful:
Rule: Given A ∈ Rm×n and B ∈ Rn×p , the matrix-matrix product AB requires 2mnp FLOPs.
To see this, note that (AB)[i, j] = A[i, :] · B[:, j], and that this dot product requires n additions and n
multiplications (2n FLOPs). Then, since the matrix-matrix product AB has m × p entries, the total number
of FLOPs is (2n)(mp) = 2mnp.
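For example, multiplying the (context_length × d_model) matrix of final hidden states by the (d_model × vocab_size) output embedding requires 2 · context_length · d_model · vocab_size FLOPs; with the GPT-2 XL configuration below, that is 2 · 1024 · 1600 · 50257 ≈ 1.6 × 10^11 FLOPs.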
Now, before you do the next problem, it can be helpful to go through each component of your Transformer
block and Transformer LM, and list out all the matrix multiplies and their associated FLOPs costs.
vocab_size : 50,257
context_length : 1,024
num_layers : 48
d_model : 1,600
num_heads : 25
d_ff : 6,400
Suppose we constructed our model using this configuration. How many trainable parameters
would our model have? Assuming each parameter is represented using single-precision floating
point, how much memory is required to just load this model?
Deliverable: A one-to-two sentence response.
(b) Identify the matrix multiplies required to complete a forward pass of our GPT-2 XL-shaped
model. How many FLOPs do these matrix multiplies require in total? Assume that our input
sequence has context_length tokens.
Deliverable: A list of matrix multiplies (with descriptions), and the total number of FLOPs
required.
(c) Based on your analysis above, which parts of the model require the most FLOPs?
Deliverable: A one-to-two sentence response.
(d) Repeat your analysis with GPT-2 small (12 layers, 768 d_model, 12 heads), GPT-2 medium (24
layers, 1024 d_model, 16 heads), and GPT-2 large (36 layers, 1280 d_model, 20 heads). As the
model size increases, which parts of the Transformer LM take up proportionally more or less of
the total FLOPs?
Deliverable: For each model, provide a breakdown of model components and its associated
FLOPs (as a proportion of the total FLOPs required for a forward pass). In addition, provide a
one-to-two sentence description of how varying the model size changes the proportional FLOPs
of each component.
(e) Take GPT-2 XL and increase the context length to 16,384. How does the total FLOPs for one
forward pass change? How do the relative contribution of FLOPs of the model components
change?
Deliverable: A one-to-two sentence response.
4 Training a Transformer LM
We now have the steps to preprocess the data (via tokenizer) and the model (Transformer). What remains
is to build all of the code to support training. This consists of the following:
• Loss: we need to define the loss function (cross-entropy).
• Optimizer: we need to define the optimizer to minimize this loss (AdamW).
• Training loop: we need all the supporting infrastructure that loads data, saves checkpoints, and
manages training.
ℓ(θ; D) = − (1 / (|D| m)) ∑_{x∈D} ∑_{i=1}^{m} log pθ (xi+1 | x1:i ). (16)
(Note that a single forward pass in the Transformer yields pθ (xi+1 | x1:i ) for all i = 1, . . . , m.)
In particular, the Transformer computes logits oi ∈ Rvocab_size for each position i, which results in:6
p(xi+1 | x1:i ) = softmax(oi )[xi+1 ] = exp(oi [xi+1 ]) / ∑_{a=1}^{vocab_size} exp(oi [a]). (17)
The cross entropy loss is generally defined with respect to the vector of logits oi ∈ Rvocab_size and target
xi+1 .7
Implementing the cross entropy loss requires some care with numerical issues, just like in the case of
softmax.
Problem (cross_entropy): Implement Cross entropy
Deliverable: Write a function to compute the cross entropy loss, which takes in predicted logits
(oi ) and targets (xi+1 ) and computes the cross entropy ℓi = − log softmax(oi )[xi+1 ]. Your function
should handle the following:
• Handle any additional batch dimensions and return the average across the batch. As with sec-
tion 3.3, we assume batch-like dimensions always come first, before the vocabulary size dimension.
Perplexity Cross entropy suffices for training, but when we evaluate the model, we also want to report
perplexity. For a sequence of length m where we suffer cross-entropy losses ℓ1 , . . . , ℓm :
perplexity = exp( (1/m) ∑_{i=1}^{m} ℓi ). (18)
6 Note that oi [k] refers to the value at index k of the vector oi .
7 This corresponds to the cross entropy between the Dirac delta distribution over xi+1 and the predicted softmax(oi ) distri-
bution.
4.2 The SGD Optimizer
Now that we have a loss function, we will begin our exploration of optimizers. The simplest gradient-based
optimizer is Stochastic Gradient Descent (SGD). We start with randomly initialized parameters θ0 . Then
for each step t = 0, . . . , T − 1, we perform the following update:

θt+1 = θt − α ∇L(θt ; Bt ),

where α is the learning rate and Bt is the batch of data sampled at step t. In PyTorch, an optimizer is
implemented as a subclass of torch.optim.Optimizer; such a subclass implements (at least) the following
two methods:
def __init__(self, params, ...) should initialize your optimizer. Here, params will be a collection of
parameters to be optimized (or parameter groups, in case the user wants to use different hyperpa-
rameters, such as learning rates, for different parts of the model). Make sure to pass params to the
__init__ method of the base class, which will store these parameters for use in step. You can take
additional arguments depending on the optimizer (e.g., the learning rate is a common one), and pass
them to the base class constructor as a dictionary, where keys are the names (strings) you choose for
these parameters.
def step(self) should make one update of the parameters. During the training loop, this will be called
after the backward pass, so you have access to the gradients on the last batch. This method should
iterate through each parameter tensor p and modify it in place, i.e., update p.data (the tensor holding
that parameter's values) based on the gradient p.grad (if it exists), which is the tensor representing the
gradient of the loss with respect to that parameter.
The PyTorch optimizer API has a few subtleties, so it’s easier to explain it with an example. To make
our example richer, we’ll implement a slight variation of SGD where the learning rate decays over training,
starting with an initial learning rate α and taking successively smaller steps over time:
θt+1 = θt − (α / √(t + 1)) ∇L(θt ; Bt ) (20)
Let’s see how this version of SGD would be implemented as a PyTorch Optimizer:
import math

class SGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        if lr < 0:
            raise ValueError(f"Invalid learning rate: {lr}")
        defaults = {"lr": lr}
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            lr = group["lr"]  # Learning rate for this parameter group.
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]  # Per-parameter state (here: the iteration number).
                t = state.get("t", 0)
                grad = p.grad.data
                p.data -= lr / math.sqrt(t + 1) * grad  # In-place update, applying Eq 20.
                state["t"] = t + 1
        return loss
In __init__, we pass the parameters to the optimizer, as well as default hyperparameters, to the base
class constructor (the parameters might come in groups, each with different hyperparameters). In case the
parameters are just a single collection of torch.nn.Parameter objects, the base constructor will create a
single group and assign it the default hyperparameters. Then, in step, we iterate over each parameter group,
then over each parameter in that group, and apply Eq 20. Here, we keep the iteration number as a state
associated with each parameter: we first read this value, use it in the gradient update, and then update it.
The API specifies that the user might pass in a callable closure to re-compute the loss before the optimizer
step. We won’t need this for the optimizers we’ll use, but we add it to comply with the API.
To see this working, we can use the following minimal example of a training loop:
weights = torch.nn.Parameter(5 * torch.randn((10, 10)))  # A toy learnable parameter (illustrative setup).
opt = SGD([weights], lr=1)  # (illustrative learning rate)

for t in range(100):
    opt.zero_grad()  # Reset the gradients for all learnable parameters.
    loss = (weights**2).mean()  # Compute a scalar loss value.
    print(loss.cpu().item())
    loss.backward()  # Run backward pass, which computes gradients.
    opt.step()  # Run optimizer step.
This is the typical structure of a training loop: in each iteration, we will compute the loss and run a
step of the optimizer. When training language models, our learnable parameters will come from the model
(in PyTorch, m.parameters() gives us this collection). The loss will be computed over a sampled batch of
data, but the basic structure of the training loop will be the same.
As we will see, one of the hyperparameters that affects training the most is the learning rate. Let’s
see that in practice in our toy example. Run the SGD example above with three other values for the
learning rate: 1e1, 1e2, and 1e3, for just 10 training iterations. What happens with the loss for each
of these learning rates? Does it decay faster, slower, or does it diverge (i.e., increase over the course of
training)?
Deliverable: A one-two sentence response with the behaviors you observed.
4.3 AdamW
Modern language models are typically trained with more sophisticated optimizers, instead of SGD. Most
optimizers used recently are derivatives of the Adam optimizer [Kingma and Ba, 2015]. We will use AdamW
[Loshchilov and Hutter, 2019], which is in wide use in recent work. AdamW proposes a modification to Adam
that improves regularization by adding weight decay (at each iteration, we pull the parameters towards 0),
in a way that is decoupled from the gradient update. We will implement AdamW as described in algorithm
2 of Loshchilov and Hutter [2019].
AdamW is stateful: for each parameter, it keeps track of a running estimate of its first and second
moments. Thus, AdamW uses additional memory in exchange for improved stability and convergence.
Besides the learning rate α, AdamW has a pair of hyperparameters (β1 , β2 ) that control the updates to the
moment estimates, and a weight decay rate λ. Typical applications set (β1 , β2 ) to (0.9, 0.999), but large
language models like LLaMA [Touvron et al., 2023] and GPT-3 [Brown et al., 2020] are often trained with
(0.9, 0.95). The algorithm can be written as follows, where ϵ is a small value (e.g., 10−8 ) used to improve
numerical stability in case we get extremely small values in v:
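In outline, one standard presentation of the resulting update is, for each step t = 1, 2, . . . (with the moment estimates m and v initialized to zero and g denoting the gradient of the loss on the batch sampled at step t):

m ← β1 m + (1 − β1 ) g
v ← β2 v + (1 − β2 ) g ⊙ g
αt ← α √(1 − β2^t) / (1 − β1^t)
θ ← θ − αt m / (√v + ϵ)        (elementwise division and square root)
θ ← θ − αλθ                    (decoupled weight decay)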
Let us compute how much memory and compute running AdamW requires. Assume we are using
float32 for every tensor.
(a) How much peak memory does running AdamW require? Decompose your answer based on the
memory usage of the parameters, activations, gradients, and optimizer state. Express your answer
in terms of the batch_size and the model hyperparameters (vocab_size, context_length,
num_layers, d_model, num_heads). Assume d_ff = 4 × d_model.
For simplicity, when calculating memory usage of activations, consider only the following compo-
nents:
• Transformer block
– RMSNorm(s)
– Multi-head self-attention sublayer: QKV projections, Q⊤K matrix multiply, softmax,
weighted sum of values, output projection.
– Position-wise feed-forward: W1 matrix multiply, SiLU, W2 matrix multiply
• final RMSNorm
• output embedding
• cross-entropy on logits
Deliverable: An algebraic expression for each of parameters, activations, gradients, and opti-
mizer state, as well as the total.
(b) Instantiate your answer for a GPT-2 XL-shaped model to get an expression that only depends on
the batch_size. What is the maximum batch size you can use and still fit within 80GB memory?
Deliverable: An expression that looks like a · batch_size + b for numerical values a, b, and a
number representing the maximum batch size.
(c) How many FLOPs does running one step of AdamW take?
Deliverable: An algebraic expression, with a brief justification.
(d) Model FLOPs utilization (MFU) is defined as the ratio of observed throughput (tokens per second)
relative to the hardware’s theoretical peak FLOP throughput [Chowdhery et al., 2022]. An
NVIDIA A100 GPU has a theoretical peak of 19.5 teraFLOP/s for float32 operations. Assuming
you are able to get 50% MFU, how long would it take to train a GPT-2 XL for 400K steps and a
batch size of 1024 on a single A100? Following Kaplan et al. [2020] and Hoffmann et al. [2022],
assume that the backward pass has twice the FLOPs of the forward pass.
Deliverable: The number of days training would take, with a brief justification.
8 It’s sometimes common to use a schedule where the learning rate rises back up (restarts) to help get past local minima.
Problem (learning_rate_schedule): Implement cosine learning rate schedule with warmup
Write a function that takes t, αmax , αmin , Tw and Tc , and returns the learning rate αt according to
the scheduler defined above. Then implement [adapters.get_lr_cosine_schedule] and make sure
it passes uv run pytest -k test_get_lr_cosine_schedule.
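A minimal sketch of one common convention for this schedule (linear warmup to αmax over the first Tw steps, cosine decay to αmin until step Tc, constant afterwards); double-check it against the definition above and the provided tests:

import math

def lr_cosine_schedule(t: int, alpha_max: float, alpha_min: float, T_w: int, T_c: int) -> float:
    if t < T_w:
        # Linear warmup from 0 up to alpha_max over the first T_w steps.
        return alpha_max * t / T_w
    if t <= T_c:
        # Cosine decay from alpha_max down to alpha_min between steps T_w and T_c.
        progress = (t - T_w) / (T_c - T_w)
        return alpha_min + 0.5 * (1 + math.cos(math.pi * progress)) * (alpha_max - alpha_min)
    # After T_c, hold the learning rate at alpha_min.
    return alpha_min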
Write a function that implements gradient clipping. Your function should take a list of parameters
and a maximum ℓ2-norm. It should modify each parameter gradient in place. Use ϵ = 10⁻⁶ (the
PyTorch default). Then, implement the adapter [adapters.run_gradient_clipping] and make sure
it passes uv run pytest -k test_gradient_clipping.
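A minimal sketch of one way to do this (computing a single global ℓ2 norm over all gradients and rescaling in place when it exceeds the maximum); verify it against the tests:

import torch

def clip_gradients(parameters, max_l2_norm: float, eps: float = 1e-6) -> None:
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    # Global l2 norm over all gradients, treated as one flattened vector.
    total_norm = torch.sqrt(sum(torch.sum(g ** 2) for g in grads))
    if total_norm > max_l2_norm:
        # Rescale every gradient in place so the combined norm becomes (approximately) max_l2_norm.
        scale = max_l2_norm / (total_norm + eps)
        for g in grads:
            g.mul_(scale)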
5 Training loop
We will now finally put together the major components we’ve built so far: the tokenized data, the model,
and the optimizer.
Deliverable: Write a function that takes a numpy array x (integer array with token IDs), a
batch_size, a context_length and a PyTorch device string (e.g., 'cpu' or 'cuda:0'), and returns
a pair of tensors: the sampled input sequences and the corresponding next-token targets. Both ten-
sors should have shape (batch_size, context_length) containing token IDs, and both should be
placed on the requested device. To test your implementation against our provided tests, you will first
need to implement the test adapter at [adapters.run_get_batch]. Then, run uv run pytest -k
test_get_batch to test your implementation.
If you are planning to train your LM on CPU or Apple Silicon, you need to move your data
to the correct device (and similarly, you should use the same device for your model later on).
If you are on CPU, you can use the 'cpu' device string, and on Apple Silicon (M* chips), you
can use the 'mps' device string.
For more on MPS, check out these resources:
• https://developer.apple.com/metal/pytorch/
• https://pytorch.org/docs/main/notes/mps.html
What if the dataset is too big to load into memory? We can use a Unix system call named mmap which
maps a file on disk to virtual memory, and lazily loads the file contents when that memory location is
accessed. Thus, you can “pretend” you have the entire dataset in memory. Numpy implements this through
np.memmap (or the flag mmap_mode='r' to np.load, if you originally saved the array with np.save), which
will return a numpy array-like object that loads the entries on-demand as you access them. When sampling
from your dataset (i.e., a numpy array) during training, be sure to load the dataset in memory-
mapped mode (via np.memmap or the flag mmap_mode='r' to np.load, depending on how you saved the
array). Make sure you also specify a dtype that matches the array that you’re loading. It may be helpful
to explicitly verify that the memory-mapped data looks correct (e.g., doesn’t contain values beyond the
expected vocabulary size).
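A minimal sketch of the sampling function together with memory-mapped loading (the file name, dtype, and vocabulary size below are placeholders; use whatever you wrote to disk):

import numpy as np
import torch

def get_batch(x, batch_size: int, context_length: int, device: str):
    # Each sampled window of length context_length is paired with the same window shifted by one token.
    starts = np.random.randint(0, len(x) - context_length, size=batch_size)
    inputs = np.stack([x[s : s + context_length] for s in starts]).astype(np.int64)
    targets = np.stack([x[s + 1 : s + 1 + context_length] for s in starts]).astype(np.int64)
    return torch.from_numpy(inputs).to(device), torch.from_numpy(targets).to(device)

# Memory-mapped loading: entries are read from disk only when accessed.
tokens = np.load("tokenized_train.npy", mmap_mode="r")   # placeholder path
assert int(tokens.max()) < 10_000                        # sanity check against your vocabulary size (scans the file once)
xb, yb = get_batch(tokens, batch_size=32, context_length=256, device="cpu")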
5.2 Checkpointing
In addition to loading data, we will also need to save models as we train. When running jobs, we often
want to be able to resume a training run that for some reason stopped midway (e.g., due to your job timing
out, machine failure, etc). Even when all goes well, we might also want to later have access to intermediate
models (e.g., to study training dynamics post-hoc, take samples from models at different stages of training,
etc).
A checkpoint should have all the states that we need to resume training. We of course want to be able
to restore model weights at a minimum. If using a stateful optimizer (such as AdamW), we will also need
to save the optimizer’s state (e.g., in the case of AdamW, the moment estimates). Finally, to resume the
learning rate schedule, we will need to know the iteration number we stopped at. PyTorch makes it easy to
save all of these: every nn.Module has a state_dict() method that returns a dictionary with all learnable
weights; we can restore these weights later with the sister method load_state_dict(). The same goes
for any torch.optim.Optimizer. Finally, torch.save(obj, dest) can dump an object (e.g., a dictionary
containing tensors in some values, but also regular Python objects like integers) to a file (path) or file-like
object, which can then be loaded back into memory with torch.load(src).
def save_checkpoint(model, optimizer, iteration, out) should dump all the state from the
first three parameters into the file-like object out. You can use the state_dict method of both
the model and the optimizer to get their relevant states and use torch.save(obj, out) to dump
obj into out (PyTorch supports either a path or a file-like object here). A typical choice is to
have obj be a dictionary, but you can use whatever format you want as long as you can load your
checkpoint later.
This function expects the following parameters:
model: torch.nn.Module
optimizer: torch.optim.Optimizer
iteration: int
out: str | os.PathLike | typing.BinaryIO | typing.IO[bytes]
def load_checkpoint(src, model, optimizer) should load a checkpoint from src (path or file-
like object), and then recover the model and optimizer states from that checkpoint. Your
function should return the iteration number that was saved to the checkpoint. You can use
torch.load(src) to recover what you saved in your save_checkpoint implementation, and the
load_state_dict method in both the model and optimizers to return them to their previous
states.
This function expects the following parameters:
src: str | os.PathLike | typing.BinaryIO | typing.IO[bytes]
model: torch.nn.Module
optimizer: torch.optim.Optimizer
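A minimal sketch of both functions, assuming a dictionary-based checkpoint layout (any layout you can reload later works just as well):

import torch

def save_checkpoint(model, optimizer, iteration, out):
    # Bundle everything needed to resume training into one serializable object.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "iteration": iteration},
        out,
    )

def load_checkpoint(src, model, optimizer):
    checkpoint = torch.load(src)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["iteration"]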
5.3 Training loop
Now, it’s finally time to put all of the components you implemented together into your main training script.
It will pay off to make it easy to start training runs with different hyperparameters (e.g., by taking them
as command-line arguments), since you will be doing these many times later to study how different choices
impact training.
Deliverable: Write a script that runs a training loop to train your model on user-provided input.
In particular, we recommend that your training script allow for (at least) the following:
• Ability to configure and control the various model and optimizer hyperparameters.
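For instance, a minimal command-line skeleton might look like the following (the flag names and default values here are illustrative placeholders, not requirements):

import argparse

def main():
    # Expose hyperparameters as flags so that new runs are just new command lines.
    parser = argparse.ArgumentParser(description="Train a Transformer LM")
    parser.add_argument("--train-data", required=True)
    parser.add_argument("--valid-data", required=True)
    parser.add_argument("--vocab-size", type=int, default=10_000)
    parser.add_argument("--context-length", type=int, default=256)
    parser.add_argument("--d-model", type=int, default=512)
    parser.add_argument("--num-layers", type=int, default=4)
    parser.add_argument("--num-heads", type=int, default=16)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--total-steps", type=int, default=5000)
    parser.add_argument("--device", default="cpu")
    args = parser.parse_args()

    # Construct the tokenized dataset, model, and optimizer from args, then loop:
    #   for step in range(args.total_steps):
    #       sample a batch, compute the loss, backpropagate, clip gradients,
    #       step the optimizer with the scheduled learning rate, and periodically
    #       log metrics, evaluate on the validation set, and save a checkpoint.

if __name__ == "__main__":
    main()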
6 Generating text
Now that we can train models, the last piece we need is the ability to generate text from our model.
Recall that a language model takes in a (possibly batched) integer sequence of length (sequence_length)
and produces a matrix of size (sequence_length × vocab_size), where each element of the sequence is a
probability distribution predicting the next word after that position. We will now write a few functions to
turn this into a sampling scheme for new sequences.
Softmax By standard convention, the language model output is the output of the final linear layer (the
“logits”), so we have to turn it into a normalized probability distribution via the softmax operation, which
we saw earlier in Eq 10.
Decoding To generate text (decode) from our model, we will provide the model with a sequence of prefix
tokens (the “prompt”), and ask it to produce a probability distribution over the vocabulary that predicts
the next word in the sequence. Then, we will sample from this distribution over the vocabulary items to
determine the next output token.
Concretely, one step of the decoding process should take in a sequence x1...t and return a token xt+1 via
the following equation,
P(x_{t+1} = i \mid x_{1 \dots t}) = \frac{\exp(v_i)}{\sum_j \exp(v_j)}, \qquad v = \mathrm{TransformerLM}(x_{1 \dots t})_t \in \mathbb{R}^{\text{vocab\_size}},
where TransformerLM is our model which takes as input a sequence of sequence_length and produces a
matrix of size (sequence_length × vocab_size), and we take the last element of this matrix, as we are
looking for the next word prediction at the t-th position.
This gives us a basic decoder by repeatedly sampling from these one-step conditionals (appending our
previously-generated output token to the input of the next decoding timestep) until we generate the end-of-
sequence token <|endoftext|> (or a user-specified maximum number of tokens to generate).
Decoder tricks We will be experimenting with small models, and small models can sometimes generate
very low-quality text. Two simple decoder tricks can help mitigate these issues. First, in temperature scaling we
modify our softmax with a temperature parameter τ , where the new softmax is
\mathrm{softmax}(v, \tau)_i = \frac{\exp(v_i / \tau)}{\sum_{j=1}^{|\text{vocab\_size}|} \exp(v_j / \tau)}. \qquad (24)
Note how setting τ → 0 makes it so that the largest element of v dominates, and the output of the softmax
becomes a one-hot vector concentrated at this maximal element.
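A minimal sketch of this temperature-scaled softmax (subtracting the row-wise maximum before exponentiating, for numerical stability):

import torch

def softmax_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    scaled = logits / temperature
    # Subtracting the max does not change the result but avoids overflow in exp.
    scaled = scaled - scaled.max(dim=-1, keepdim=True).values
    exp = torch.exp(scaled)
    return exp / exp.sum(dim=-1, keepdim=True)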
Second, another trick is nucleus or top-p sampling, where we modify the sampling distribution by truncating
low-probability words. Let q be a probability distribution that we get from a (temperature-scaled)
softmax of size (vocab_size). Nucleus sampling with hyperparameter p produces the next token according
to the equation

P(x_{t+1} = i \mid q) =
\begin{cases}
\dfrac{q_i}{\sum_{j \in V(p)} q_j} & \text{if } i \in V(p) \\
0 & \text{otherwise}
\end{cases}

where V(p) is the smallest set of indices such that \sum_{j \in V(p)} q_j \ge p. You can compute this quantity easily by
first sorting the probability distribution q by magnitude and selecting the largest vocabulary elements until
their cumulative probability reaches the target level p.
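A minimal sketch of one way to implement this for a single next-token distribution (the function and variable names are illustrative):

import torch

def top_p_sample(probs: torch.Tensor, p: float) -> int:
    # Sort probabilities in descending order and keep the smallest prefix whose mass reaches p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff]
    kept = kept / kept.sum()                      # renormalize over V(p)
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())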
Problem (decoding): Decoding (3 points)
Deliverable: Implement a function to decode from your language model. We recommend that you
support the following features:
• Generate completions for a user-provided prompt (i.e., take in some x1...t and sample a completion
until you hit an <|endoftext|> token).
7 Experiments
Now it is time to put everything together and train (small) language models on a pretraining dataset.
For your training and evaluation code, create experiment tracking infrastructure that allows you to
track your experiments and loss curves with respect to gradient steps and wallclock time.
Deliverable: Logging infrastructure code for your experiments and an experiment log (a document
of all the things you tried) for the assignment problems below in this section.
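One lightweight option (a sketch; hosted trackers such as Weights & Biases work equally well) is to append one JSON record per logging call, keyed by both the step count and the elapsed wallclock time:

import json
import time

class JsonlLogger:
    def __init__(self, path: str):
        self.path = path                 # e.g., "runs/tinystories_baseline.jsonl" (placeholder)
        self.start_time = time.time()

    def log(self, step: int, **metrics) -> None:
        # One record per call; easy to load later for plotting loss curves.
        record = {"step": step, "wallclock_sec": time.time() - self.start_time, **metrics}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")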
7.2 TinyStories
We are going to start with a very simple dataset (TinyStories; Eldan and Li, 2023) where models will train
quickly, and we can see some interesting behaviors. The instructions for getting this dataset are in Section 1.
An example of what this dataset looks like is below.
Once upon a time there was a little boy named Ben. Ben loved to explore the world around him.
He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was
walking through the store when he came across a very special vase. When Ben saw it he was amazed!
He said, “Wow, that is a really amazing vase! Can I buy it?” The shopkeeper smiled and said, “Of
course you can. You can take it home and show all your friends how amazing it is!” So Ben took the
vase home and he was so proud of it! He called his friends over and showed them the amazing vase.
All his friends thought the vase was beautiful and couldn’t believe how lucky Ben was. And that’s how
Ben found an amazing vase in the store!
Hyperparameter tuning We will tell you some very basic hyperparameters to start with and ask you to
find some settings for others that work well.
vocab_size 10000. Typical vocabulary sizes are in the tens to hundreds of thousands. You should vary this
and see how the vocabulary and model behavior changes.
context_length 256. Simple datasets such as TinyStories might not need long sequence lengths, but for
the later OpenWebText data, you may want to vary this. Try varying this and seeing the impact on
both the per-iteration runtime and the final perplexity.
d_model 512. This is slightly smaller than the 768 dimensions used in many small Transformer papers, but
this will make things faster.
d_ff 1344. This is roughly (8/3) · d_model while being a multiple of 64, which is good for GPU performance.
You should do some trial and error to find good defaults for the following other hyperparameters:
learning rate, learning rate warmup, other AdamW hyperparameters (β1 , β2 , ϵ), and weight decay.
You can find some typical choices of such hyperparameters in Kingma and Ba [2015].
Putting it together Now you can put everything together by getting a trained BPE tokenizer, tokenizing
the training dataset, and running this in the training loop that you wrote. Important note: If
your implementation is correct and efficient, the above hyperparameters should result in a roughly 30-40
minute runtime on 1 H100 GPU. If you have runtimes that are much longer, please check and make sure
your dataloading, checkpointing, or validation loss code is not bottlenecking your runtimes and that your
implementation is properly batched.
Tips and tricks for debugging model architectures We highly recommend getting comfortable with
your IDE’s built-in debugger (e.g., VSCode/PyCharm), which will save you time compared to debugging
with print statements. If you use a text editor, you can use something more like pdb. A few other good
practices when debugging model architectures are:
• A common first step when developing any neural net architecture is to overfit to a single minibatch. If
your implementation is correct, you should be able to quickly drive the training loss to near-zero.
• Set debug breakpoints in various model components, and inspect the shapes of intermediate tensors to
make sure they match your expectations.
• Monitor the norms of activations, model weights, and gradients to make sure they are not exploding
or vanishing.
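For instance, the global gradient norm can be logged each step with a few lines (a sketch, assuming your model is a standard nn.Module):

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    # l2 norm of all gradients flattened into a single vector; spikes suggest instability.
    squares = [p.grad.detach().pow(2).sum() for p in model.parameters() if p.grad is not None]
    return torch.sqrt(torch.stack(squares).sum()).item() if squares else 0.0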
The learning rate is one of the most important hyperparameters to tune. Taking the base model
you’ve trained, answer the following questions:
(a) Perform a hyperparameter sweep over the learning rates and report the final losses (or note
divergence if the optimizer diverges).
Deliverable: Learning curves associated with multiple learning rates. Explain your hyperpa-
rameter search strategy.
Deliverable: A model with validation loss (per-token) on TinyStories of at most 1.45
Low-Resource/Downscaling Tip: Train for few steps on CPU or Apple Silicon
If you are running on cpu or mps, you should instead reduce the total number of tokens
processed to 40,000,000, which will be sufficient to produce reasonably fluent text. You may
also increase the target validation loss from 1.45 to 2.00.
Running our solution code with a tuned learning rate on an M3 Max chip and 36 GB of
RAM, we use batch size × total step count × context length = 32 × 5000 × 256 = 40,960,000
tokens, which takes 1 hour and 22 minutes on cpu and 36 minutes on mps. At step 5000,
we achieve a validation loss of 1.80.
Some additional tips:
• When using X training steps, we suggest adjusting the cosine learning rate decay
schedule to terminate its decay (i.e., reach the minimum learning rate) at precisely
step X.
• When using mps, do not use TF32 kernels, i.e., do not set
torch.set_float32_matmul_precision('high')
as you might with cuda devices. We tried enabling TF32 kernels with mps (torch
version 2.6.0) and found the backend will use silently broken kernels that cause unstable
training.
• You can speed up training by JIT-compiling your model with torch.compile. Specif-
ically:
– On cpu, compile your model with
model = torch.compile(model)
– On mps, you can somewhat optimize the backward pass using
model = torch.compile(model, backend="aot_eager")
Compilation with Inductor is not supported on mps as of torch version 2.6.0.
(b) Folk wisdom is that the best learning rate is “at the edge of stability.” Investigate how the point
at which learning rates diverge is related to your best learning rate.
Deliverable: Learning curves of increasing learning rate which include at least one divergent
run and an analysis of how this relates to convergence rates.
Now let’s vary the batch size and see what happens to training. Batch sizes are important – they let us get
higher efficiency from our GPUs by doing larger matrix multiplies, but is it true that we always want batch
sizes to be large? Let’s run some experiments to find out.
Vary your batch size all the way from 1 to the GPU memory limit. Try at least a few batch sizes
in between, including typical sizes like 64 and 128.
Deliverable: Learning curves for runs with different batch sizes. The learning rates should be
optimized again if necessary.
Deliverable: A few sentences discussing your findings on batch sizes and their impact on
training.
With your decoder in hand, we can now generate text! We will generate from the model and see how
good it is. As a reference, you should get outputs that look at least as good as the example below.
Example (ts_generate_example): Sample output from a TinyStories language model
Once upon a time, there was a pretty girl named Lily. She loved to eat gum, especially the big black
one. One day, Lily’s mom asked her to help cook dinner. Lily was so excited! She loved to help her
mom. Lily’s mom made a big pot of soup for dinner. Lily was so happy and said, “Thank you, Mommy!
I love you.” She helped her mom pour the soup into a big bowl. After dinner, Lily’s mom made some
yummy soup. Lily loved it! She said, “Thank you, Mommy! This soup is so yummy!” Her mom smiled
and said, “I’m glad you like it, Lily.” They finished cooking and continued to cook together. The end.
If instead you used the low-resource configuration with 40M tokens processed, you should see gen-
erations that still resemble English but are not as fluent as above. For example, our sample output
from a TinyStories language model trained on 40M tokens is below:
Once upon a time, there was a little girl named Sue. Sue had a tooth that she loved very much. It
was his best head. One day, Sue went for a walk and met a ladybug! They became good friends and
played on the path together.
“Hey, Polly! Let’s go out!” said Tim. Sue looked at the sky and saw that it was difficult to find a
way to dance shining. She smiled and agreed to help the talking!”
As Sue watched the sky moved, what it was. She
Using your decoder and your trained checkpoint, report the text generated by your model. You
may need to manipulate decoder parameters (temperature, top-p, etc.) to get fluent outputs.
Deliverable: Text dump of at least 256 tokens of text (or until the first <|endoftext|> token),
and a brief comment on the fluency of this output and at least two factors which affect how good or
bad this output is.
Ablation 1: layer normalization It is often said that layer normalization is important for the stability
of Transformer training. But perhaps we want to live dangerously. Let’s remove RMSNorm from each of
our Transformer blocks and see what happens.
Remove all of the RMSNorms from your Transformer and train. What happens at the previous
optimal learning rate? Can you get stability by using a lower learning rate?
Deliverable: A learning curve for when you remove RMSNorms and train, as well as a learning
curve for the best learning rate.
Deliverable: A few sentence commentary on the impact of RMSNorm.
Let’s now investigate another layer normalization choice that seems arbitrary at first glance. Pre-norm
Transformer blocks are defined as
z = x + MultiHeadedSelfAttention(RMSNorm(x))
y = z + FFN(RMSNorm(z)).
This is one of the few ‘consensus’ modifications to the original Transformer architecture, which used a
post-norm approach as
z = RMSNorm(x + MultiHeadedSelfAttention(x))
y = RMSNorm(z + FFN(z)).
Let’s revert back to the post-norm approach and see what happens.
Modify your pre-norm Transformer implementation into a post-norm one. Train with the post-norm
model and see what happens.
Deliverable: A learning curve for a post-norm transformer, compared to the pre-norm one.
We see that layer normalization has a major impact on the behavior of the transformer, and that even
the position of the layer normalization is important.
Ablation 2: position embeddings We will next investigate the impact of the position embeddings on
the performance of the model. Specifically, we will compare our base model (with RoPE) with not including
position embeddings at all (NoPE). It turns out that decoder-only transformers, i.e., those with a causal
mask as we have implemented, can in theory infer relative or absolute position information without being
provided with position embeddings explicitly [Tsai et al., 2019, Kazemnejad et al., 2023]. We will now test
empirically how NoPE performs compared to RoPE.
Modify your Transformer implementation with RoPE to remove the position embedding information
entirely, and see what happens.
Deliverable: A learning curve comparing the performance of RoPE and NoPE.
Ablation 3: SwiGLU vs. SiLU Next, we will follow Shazeer [2020] and test the importance of gating
in the feed-forward network, by comparing the performance of SwiGLU feed-forward networks versus feed-
forward networks using SiLU activations but no gated linear unit (GLU), i.e., FFNSiLU(x) = W2 SiLU(W1 x).
Recall that in our SwiGLU implementation, we set the dimensionality of the inner feed-forward layer to
be roughly dff = (8/3) dmodel (while ensuring that dff mod 64 = 0, to make use of GPU tensor cores). In your
FFNSiLU implementation you should set dff = 4 × dmodel , to approximately match the parameter count of
the SwiGLU feed-forward network (which has three instead of two weight matrices).
Deliverable: A learning curve comparing the performance of SwiGLU and SiLU feed-forward
networks, with approximately matched parameter counts.
Deliverable: A few sentences discussing your findings.
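For reference, one way to realize the non-gated baseline is sketched below (the module structure and initialization are placeholders; reuse whatever conventions your SwiGLU implementation already follows):

import torch
import torch.nn as nn

class FFNSiLU(nn.Module):
    # Position-wise feed-forward with SiLU activation and no gating: W2 SiLU(W1 x).
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Placeholder initialization; match the scheme used by your SwiGLU block.
        self.w1 = nn.Parameter(torch.randn(d_ff, d_model) * d_model ** -0.5)
        self.w2 = nn.Parameter(torch.randn(d_model, d_ff) * d_ff ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.w1.T
        h = h * torch.sigmoid(h)          # SiLU(x) = x * sigmoid(x)
        return h @ self.w2.T

With d_ff = 4 × d_model, this roughly matches the parameter count of the three-matrix SwiGLU block.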
In the remainder of the assignment, we will move to a larger-scale, noisier web dataset (OpenWebText),
experimenting with architecture modifications and (optionally) making a submission to the
course leaderboard.
It takes a long time to train an LM to fluency on OpenWebText, so we suggest that online students
with limited GPU access continue testing modifications on TinyStories (using validation loss as a metric
to evaluate performance).
Baseball Prospectus director of technology Harry Pavlidis took a risk when he hired Jonathan Judge.
Pavlidis knew that, as Alan Schwarz wrote in The Numbers Game, “no corner of American culture
is more precisely counted, more passionately quantified, than performances of baseball players.” With
a few clicks here and there, you can findout that Noah Syndergaard’s fastball revolves more than 2,100
times per minute on its way to the plate, that Nelson Cruz had the game’s highest average exit velocity
among qualified hitters in 2016 and myriad other tidbits that seem ripped from a video game or science
fiction novel. The rising ocean of data has empowered an increasingly important actor in baseball’s
culture: the analytical hobbyist.
That empowerment comes with added scrutiny – on the measurements, but also on the people
and publications behind them. With Baseball Prospectus, Pavlidis knew all about the backlash that
accompanies quantitative imperfection. He also knew the site’s catching metrics needed to be reworked,
and that it would take a learned mind – someone who could tackle complex statistical modeling problems
– to complete the job.
“He freaks us out.” Harry Pavlidis
Pavlidis had a hunch that Judge “got it” based on the latter’s writing and their interaction at a site-
sponsored ballpark event. Soon thereafter, the two talked over drinks. Pavlidis’ intuition was validated.
Judge was a fit for the position – better yet, he was a willing fit. “I spoke to a lot of people,” Pavlidis
said, “he was the only one brave enough to take it on.” [...]
Note: You may have to re-tune your hyperparameters such as learning rate or batch size for this experiment.
Train your language model on OpenWebText with the same model architecture and total training
iterations as TinyStories. How well does this model do?
Deliverable: A learning curve of your language model on OpenWebText. Describe the difference
in losses from TinyStories – how should we interpret these losses?
Deliverable: Generated text from OpenWebText LM, in the same format as the TinyStories
outputs. How is the fluency of this text? Why is the output quality worse even though we have the
same model and compute budget as TinyStories?
Rules for the leaderboard There are no restrictions other than the following:
Runtime Your submission can run for at most 1.5 hours on an H100. You can enforce this by setting
--time=01:30:00 in your slurm submission script.
Data You may only use the OpenWebText training dataset that we provide.
Otherwise, you are free to do whatever your heart desires.
If you are looking for some ideas on what to implement, you can check out some of these resources:
• State-of-the-art open-source LLM families, such as Llama 3 [Grattafiori et al., 2024] or Qwen 2.5 [Yang
et al., 2024].
• The NanoGPT speedrun repository (https://github.com/KellerJordan/modded-nanogpt), where
community members post many interesting modifications for “speedrunning” small-scale language
model pretraining. For example, a common modification that dates back to the original Transformer
paper is to tie the weights of the input and output embeddings together (see Vaswani et al. [2017]
(Section 3.4) and Chowdhery et al. [2022] (Section 2)). If you do try weight tying, you may have to
decrease the standard deviation of the embedding/LM head init.
You will want to test these on either a small subset of OpenWebText or on TinyStories before trying the
full 1.5-hour run.
As a caveat, we do note that some of the modifications you may find working well in this leaderboard
may not generalize to larger-scale pretraining. We will explore this idea further in the scaling laws unit of
the course.
You will train a model under the leaderboard rules above with the goal of minimizing the validation
loss of your language model within 1.5 H100-hours.
Deliverable: The final validation loss that was recorded, an associated learning curve that clearly
shows a wallclock-time x-axis that is less than 1.5 hours, and a description of what you did. We expect
a leaderboard submission to beat at least the naive baseline of a 5.0 loss.
References
Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent
English?, 2023. arXiv:2305.07759.
Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword
units. In Proc. of ACL, 2016.
Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords,
2019. arXiv:1909.03341.
Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994. ISSN
0898-9788.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are
unsupervised multitask learners, 2019.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding
by generative pre-training, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-
attention. In Proc. of IWSLT, 2019.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan,
Liwei Wang, and Tie-Yan Liu. On layer normalization in the Transformer architecture. In Proc. of ICML,
2020.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. arXiv:1607.06450.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin,
Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
arXiv:2302.13971.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Proc. of NeurIPS, 2019.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh
Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur
Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere,
Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra,
Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton
Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt,
David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes,
Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic,
Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind
Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar,
Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evti-
mov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet
Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu
Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua
Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate
Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley
Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen,
Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke
de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria
Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si,
Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev,
Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Peng-
wei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura,
Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Sil-
veira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain
Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hos-
seini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang
Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon
Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Syd-
ney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias
Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ra-
manathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic,
Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang
Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh
Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre
Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain,
Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay
Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo,
Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poul-
ton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie Franco, Anuj Goyal, Aparajita Saraf, Arka-
bandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James,
Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing
Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido,
Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim,
Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia
Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine,
Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Ed-
ward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric-Tuan
Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian,
Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Flo-
rez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi,
Zhang, Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen
Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan,
Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weiss-
man, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang,
Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe
Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang,
Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik
Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang,
Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu,
Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus,
Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan
Keneally, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov,
Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Moham-
mad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa,
Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Nor-
man Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh,
Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyag-
ina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub,
Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah
Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh
Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh
Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng
Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang,
Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve
Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny
Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo
Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked,
Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla,
Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen
Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo
Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi,
Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary
DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The llama 3 herd of
models, 2024. URL https://arxiv.org/abs/2407.21783.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang,
Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue,
Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang
Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5
technical report. arXiv preprint arXiv:2412.15115, 2024.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar
Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael
Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk
Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito,
David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor
Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang,
Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with pathways, 2022.
arXiv:2204.02311.
Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error
linear units, 2016. arXiv:1606.08415.
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function
approximation in reinforcement learning, 2017. URL https://arxiv.org/abs/1702.03118.
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolu-
tional networks, 2017. URL https://arxiv.org/abs/1612.08083.
Noam Shazeer. GLU variants improve transformer, 2020. arXiv:2002.05202.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary
position embedding, 2021.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. of ICLR, 2019.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee-
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. In Proc. of NeurIPS, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott
Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
arXiv:2001.08361.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan,
Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language
models, 2022. arXiv:2203.15556.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degenera-
tion. In Proc. of ICLR, 2020.
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov.
Transformer dissection: An unified understanding for transformer's attention via the lens of kernel. In
Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pages 4344–4353, Hong Kong, China, November 2019.
Association for Computational Linguistics. doi: 10.18653/v1/D19-1443. URL https://aclanthology.
org/D19-1443/.
Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan, Payel Das, and Siva Reddy. The impact of
positional encoding on length generalization in transformers. In Thirty-seventh Conference on Neural
Information Processing Systems, 2023. URL https://openreview.net/forum?id=Drrl2gcjzl.