Building LLaMA 3 From Scratch With Python
We won’t be using a GPU for this blog, but you’ll need at least 17 GB of RAM
because we are going to load some files that are more than 15 GB in size. If
this is an issue for you, you can use Kaggle as a solution: since we don’t need a GPU, a CPU-only Kaggle notebook provides 30 GB of RAM.
To avoid copying and pasting the code from this blog, here is the GitHub
Repository containing the notebook file with all the code and information:
Here is the blog link which guides you on how to create a 2.3+ million
parameter LLM from scratch:
Prerequisites
The good part is that we won’t be using object-oriented programming (OOP), just plain Python. However, you should have a basic understanding of neural networks and the Transformer architecture. These are the only two prerequisites needed to follow along with the blog.
Here are some key points about LLaMA 2 and LLaMA 3, in case you are already familiar with their architecture and just want a quick comparison:
| Feature | Llama 3 | Llama 2 |
| --- | --- | --- |
| Tokenizer | Tiktoken (developed by OpenAI) | SentencePiece |
| Number of parameters | 8B, 70B | 70B, 13B, 7B |
| Computational requirements | Very high (70B model) | Very high (70B model) |
| Reinforcement learning from human feedback | Yes | Yes |
| Number of languages supported | 30 languages | 20 languages |
From apps4rent.co
Let’s look into the most important components of LLaMA 3 with a bit more
detail:
Imagine you’re studying for a big exam, and you have a massive textbook full
of chapters. Each chapter represents a different topic, but some chapters are
more crucial for understanding the subject than others.
Now, before diving into the entire textbook, you decide to evaluate the
importance of each chapter. You don’t want to spend the same amount of
time on every chapter; you want to focus more on the critical ones.
This is where Pre-normalization using RMSNorm comes into play for large
language models (LLMs) like ChatGPT. It’s like assigning a weight to each
chapter based on its significance. Chapters that are fundamental to the
subject get higher weights, while less important ones get lower weights.
So, before going deeply into studying, you adjust your study plan based on
the weighted importance of each chapter. You allocate more time and effort
to the chapters with higher weights, ensuring you grasp the core concepts
thoroughly.
Now, imagine if you had a magic pen that automatically adjusted the size and
style of your handwriting based on how important each point is. If
something is really crucial, the pen writes it bigger and clearer, making it
stand out. If it’s less important, the pen writes it smaller, but still legible.
SwiGLU is like that magic pen for large language models (LLMs) like
ChatGPT. Before generating text, SwiGLU adjusts the importance of each
word or phrase based on its relevance to the context. Just like the magic pen
adjusts the size and style of your writing, SwiGLU adjusts the emphasis of
each word or phrase.
So, when the LLM generates text, it can give more prominence to the
important parts, making them more noticeable and ensuring they contribute
more to the overall understanding of the text. This way, SwiGLU helps LLMs
produce text that’s clearer and easier to understand, much like how the
magic pen helps you create clearer explanations for your students on the
whiteboard. Further details on SwiGLU can be found in the associated paper.
Imagine you’re in a classroom, and you want to assign seats to students for
group discussions. Typically, you might arrange the seats in rows and
columns, with each student having a fixed position. However, in some cases,
you want to create a more dynamic seating arrangement where students can
move around and interact more freely.
ROPE is like a special seating arrangement that allows students to rotate and
change positions while still maintaining their relative positions to each
other. Instead of being fixed in one place, students can now move around in
a circular motion, allowing for more fluid interactions.
Let’s start with a simple example. Suppose we have a text corpus with the
words: “ab”, “bc”, “bcd”, and “cde”. We begin by initializing our vocabulary
with all the individual characters in the text corpus, so our initial vocabulary
is {“a”, “b”, “c”, “d”, “e”}.
Next, we calculate the frequency of each character in the text corpus. For
our example, the frequencies are: {“a”: 1, “b”: 3, “c”: 3, “d”: 2, “e”: 1}.
Now, we start the merging process, repeating the following steps until our vocabulary reaches the desired size (a short code sketch of this loop follows the worked example):
1. We count the frequency of every pair of adjacent characters. The most frequent pair is “bc”, so we merge it to form a new subword unit “bc”, update the frequency counts, and add “bc” to the vocabulary, giving {“a”, “b”, “c”, “d”, “e”, “bc”}.
2. We repeat the process. The next most frequent pair is “cd”. We merge “cd”
to form a new subword unit “cd” and update the frequency counts. The
updated frequency is {“a”: 1, “b”: 2, “c”: 1, “d”: 1, “e”: 1, “bc”: 2, “cd”: 2}.
We add “cd” to the vocabulary, resulting in {“a”, “b”, “c”, “d”, “e”, “bc”,
“cd”}.
3. Continuing the process, the next frequent pair is “de”. We merge “de” to
form the subword unit “de” and update the frequency counts to {“a”: 1,
“b”: 2, “c”: 1, “d”: 1, “e”: 0, “bc”: 2, “cd”: 1, “de”: 1}. We add “de” to the
vocabulary, making it {“a”, “b”, “c”, “d”, “e”, “bc”, “cd”, “de”}.
4. Next, we find “ab” as the most frequent pair. We merge “ab” to form the
subword unit “ab” and update the frequency counts to {“a”: 0, “b”: 1, “c”:
1, “d”: 1, “e”: 0, “bc”: 2, “cd”: 1, “de”: 1, “ab”: 1}. We add “ab” to the
vocabulary, which becomes {“a”, “b”, “c”, “d”, “e”, “bc”, “cd”, “de”, “ab”}.
5. Then, the next frequent pair is “bcd”. We merge “bcd” to form the
subword unit “bcd” and update the frequency counts to {“a”: 0, “b”: 0, “c”:
0, “d”: 0, “e”: 0, “bc”: 1, “cd”: 0, “de”: 1, “ab”: 1, “bcd”: 1}. We add “bcd” to
the vocabulary, resulting in {“a”, “b”, “c”, “d”, “e”, “bc”, “cd”, “de”, “ab”,
“bcd”}.
6. Finally, the most frequent pair is “cde”. We merge “cde” to form the
subword unit “cde” and update the frequency counts to {“a”: 0, “b”: 0, “c”:
0, “d”: 0, “e”: 0, “bc”: 1, “cd”: 0, “de”: 0, “ab”: 1, “bcd”: 1, “cde”: 1}. We add
“cde” to the vocabulary, making it {“a”, “b”, “c”, “d”, “e”, “bc”, “cd”, “de”,
“ab”, “bcd”, “cde”}.
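Here is a toy sketch of that merge loop in plain Python. It is illustrative only (not the real tiktoken training code), and because real BPE re-tokenizes the corpus after every merge, the later merge order can differ slightly from the hand-worked counts above.

# A toy sketch of the BPE merge loop (illustrative only, not the real tiktoken training code)
from collections import Counter

corpus = ["ab", "bc", "bcd", "cde"]
words = [list(w) for w in corpus]   # each word as a list of symbols
vocab = set("abcde")                # initial character vocabulary
target_vocab_size = 11

while len(vocab) < target_vocab_size:
    # Count adjacent symbol pairs across the corpus
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    # Merge the most frequent pair into a single subword unit
    (a, b), _ = pairs.most_common(1)[0]
    vocab.add(a + b)
    for word in words:
        i = 0
        while i < len(word) - 1:
            if word[i] == a and word[i + 1] == b:
                word[i:i + 2] = [a + b]
            else:
                i += 1

print(sorted(vocab, key=lambda s: (len(s), s)))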
This technique can improve the performance of LLMs and helps handle rare and out-of-vocabulary words. The big difference between tiktoken BPE and SentencePiece BPE is that tiktoken BPE doesn’t always split words into smaller parts if the whole word is already known. For example, if “hugging” is in the vocabulary, it stays as one token instead of being split into [“hug”, “ging”].
After installing the required libraries, we need to download some files. Since
we’re going to replicate the architecture of llama-3–8B, you must have an
account on HuggingFace. Additionally, since llama-3 is a gated model, you
have to accept their terms and conditions to access model content.
Once you’ve completed both of these steps, we can download the files. There are two ways to do that:
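One option is to log in directly from the notebook with the huggingface_hub login helper; a minimal sketch:

# Log in to Hugging Face from the notebook (required because Llama 3 is a gated model)
from huggingface_hub import notebook_login
notebook_login()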
Once you run this cell, it will ask you to enter the token. If there is an error during login, retry it, but make sure to uncheck “Add token as git credential”. After that, we just need to run some simple Python code to download the three files that are the backbone of the llama-3-8B architecture.
# Import the necessary function from the huggingface_hub library
from huggingface_hub import hf_hub_download
# Specify the directory where you want to save the downloaded files
save_directory = "llama-3-8B/" # Replace with your desired path
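With the login in place, the three files can be fetched with hf_hub_download. A minimal sketch, assuming the standard meta-llama/Meta-Llama-3-8B repository layout in which the original/ folder holds the PyTorch checkpoint, config, and tokenizer:

# Download the three backbone files into save_directory (keeps the original/ subfolder)
for filename in [
    "original/params.json",
    "original/tokenizer.model",
    "original/consolidated.00.pth",
]:
    hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3-8B",
        filename=filename,
        local_dir=save_directory,
    )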
Once all the files are downloaded, we need to import the libraries that we
will be using throughout this blog.
# Tokenization library
import tiktoken
# PyTorch library
import torch
# JSON handling
import json
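The tokenizer.model file is a plain list of BPE merge ranks, which tiktoken can load with its load_tiktoken_bpe helper (reading a local file this way requires the blobfile package). A sketch, assuming the download path used above:

# Load the BPE ranks from the downloaded tokenizer.model file
from tiktoken.load import load_tiktoken_bpe

tokenizer_path = "llama-3-8B/original/tokenizer.model"
tokenizer_model = load_tiktoken_bpe(tokenizer_path)

# Total base vocabulary size (special tokens are added separately later)
len(tokenizer_model)  # 128000

# Peek at 10 of the learned byte-pair entries (the ones shown below)
dict(list(tokenizer_model.items())[5600:5610])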
The length attribute shows the total base vocabulary size, that is, the number of byte-pair entries learned from the training data. tokenizer_model itself is a dictionary.
#### OUTPUT ####
{
 b'mitted': 5600,
 b" $('#": 5601,
 b' saw': 5602,
 b' approach': 5603,
 b'ICE': 5604,
 b' saying': 5605,
 b' anyone': 5606,
 b'meta': 5607,
 b'SD': 5608,
 b' song': 5609
}
When we print 10 items from it, you will see byte strings that have been formed by the BPE algorithm, similar to the example we discussed earlier. The keys are byte sequences from BPE training, while the values are merge ranks based on frequency.
consolidated.00.pth — contains the learned parameters (weights) of Llama-
3–8B. These parameters include information about how the model
understands and processes language, such as how it represents tokens,
computes attention, performs feed-forward transformations, and normalizes
its outputs.
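Loading this checkpoint is a single torch.load call (this is the 15+ GB file, which is why the RAM requirement mentioned at the start matters). A sketch, assuming the download path used above:

# Load the model weights from the checkpoint file
model = torch.load("llama-3-8B/original/consolidated.00.pth")
# Print the first few weight names to see what the checkpoint contains
print(json.dumps(list(model.keys())[:11], indent=4))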
#### OUTPUT ####
[
 'tok_embeddings.weight',
 'layers.0.attention.wq.weight',
 'layers.0.attention.wk.weight',
 'layers.0.attention.wv.weight',
 'layers.0.attention.wo.weight',
 'layers.0.feed_forward.w1.weight',
 'layers.0.feed_forward.w3.weight',
 'layers.0.feed_forward.w2.weight',
 'layers.0.attention_norm.weight',
 'layers.0.ffn_norm.weight',
 'layers.1.attention.wq.weight'
]
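The second file, params.json, holds the model configuration. A sketch to load it into a dictionary, again assuming the download path from above:

# Open and load the params.json configuration file
with open("llama-3-8B/original/params.json", "r") as f:
    config = json.load(f)
config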
#### OUTPUT ####
{
 'dim': 4096,
 'n_layers': 32,
 'n_heads': 32,
 'n_kv_heads': 8,
 'vocab_size': 128256,
 'multiple_of': 1024,
 'ffn_dim_multiplier': 1.3,
 'norm_eps': 1e-05,
 'rope_theta': 500000.0
}
# Dimension
dim = config["dim"]
# Layers
n_layers = config["n_layers"]
# Heads
n_heads = config["n_heads"]
# KV_heads
n_kv_heads = config["n_kv_heads"]
# Vocabulary
vocab_size = config["vocab_size"]
# Multiple
multiple_of = config["multiple_of"]
# Multiplier
ffn_dim_multiplier = config["ffn_dim_multiplier"]
# Epsilon
norm_eps = config["norm_eps"]
# RoPE
rope_theta = torch.tensor(config["rope_theta"])
special_tokens = [
"<|begin_of_text|>", # Marks the beginning of a text sequence.
"<|end_of_text|>", # Marks the end of a text sequence.
"<|reserved_special_token_0|>", # Reserved for future use.
"<|reserved_special_token_1|>", # Reserved for future use.
"<|reserved_special_token_2|>", # Reserved for future use.
"<|reserved_special_token_3|>", # Reserved for future use.
"<|start_header_id|>", # Indicates the start of a header ID.
"<|end_header_id|>", # Indicates the end of a header ID.
"<|reserved_special_token_4|>", # Reserved for future use.
"<|eot_id|>", # Marks the end of a turn (in a conversational context).
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]  # A large set of additional reserved special tokens
Next we define the rules for splitting text into tokens by specifying different
patterns to match various types of substrings in the input text. Here’s how
we can do that.
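A sketch of that split pattern, following Meta's reference Llama 3 tokenizer (the same style of regex used for OpenAI's cl100k_base encoding):

# Regex that decides where raw text may be split before BPE is applied
# (pattern follows Meta's reference Llama 3 tokenizer)
tokenize_breaker = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"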
We need to code a simple tokenizer function using the TikToken BPE, which
takes three inputs: tokenizer_model, tokenize_breaker, and special_tokens.
This function will encode/decode our input text accordingly.
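A minimal sketch of that tokenizer, built with tiktoken.Encoding from the three pieces above (the name string is an arbitrary label):

# Build the tokenizer: BPE ranks + split pattern + special tokens
tokenizer = tiktoken.Encoding(
    name="llama-3-tokenizer",                # arbitrary label
    pat_str=tokenize_breaker,                # regex for splitting text
    mergeable_ranks=tokenizer_model,         # BPE ranks loaded from tokenizer.model
    # Special tokens get IDs starting right after the 128,000 base tokens
    special_tokens={token: len(tokenizer_model) + i for i, token in enumerate(special_tokens)},
)

# Quick round-trip check: encoding then decoding should give back the original text
tokenizer.decode(tokenizer.encode("hello world!"))  # 'hello world!'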
# input prompt
prompt = "the answer to the ultimate question of life, the universe, and everything is "
# Encode the prompt using the tokenizer and prepend a special token (128000)
tokens = [128000] + tokenizer.encode(prompt)
We encoded our input text “the answer to the ultimate question of life, the
universe, and everything is ” starting with a special token.
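Next, the token IDs are mapped to 4096-dimensional vectors using the tok_embeddings.weight matrix from the checkpoint. A minimal sketch of that lookup:

# Convert the token list to a tensor
tokens = torch.tensor(tokens)
# Build an embedding layer and copy in the pretrained token-embedding weights
embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
# Look up an (unnormalized) embedding for every token: shape [17, 4096]
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)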
These embeddings are not normalized, and skipping normalization would seriously affect the results. In the next section, we will apply RMS normalization to our input vectors.
# Calculating RMSNorm
def rms_norm(tensor, norm_weights):
    # Calculate the mean of the square of tensor values along the last dimension
    squared_mean = tensor.pow(2).mean(-1, keepdim=True)
    # Divide by the root mean square (norm_eps for numerical stability) and scale by the learned weights
    return tensor * torch.rsqrt(squared_mean + norm_eps) * norm_weights
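Applying this function to our embeddings with the first layer's attention-norm weights gives the normalized input for layer 0:

# Normalize the token embeddings with layer 0's attention-norm weights
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape  # Output: torch.Size([17, 4096])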
You may already know that the dimension won’t change because we are only
normalizing the vectors and nothing else.
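Next, we pull the query weights for the first layer out of the checkpoint and reshape them so that each attention head gets its own slice; a sketch of that step:

# Retrieve the query weight matrix of the first layer and split it into heads
q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape  # Output: torch.Size([32, 128, 4096])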
Here, 32 is the number of attention heads in Llama-3, 128 is the size of the
query vector, and 4096 is the size of the token embedding.
We can access the query weight matrix of the first head of the first layer
using:
# Extract the query weight for the first head of the first layer of attention
q_layer0_head0 = q_layer0[0]
# Print the shape of the extracted query weight tensor for the first head
q_layer0_head0.shape
To find the query vector for each token, we multiply the query weights with
the token embedding.
# Matrix multiplication: token embeddings with the transpose of the query weights for the first head
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
The query vectors don’t inherently know their position in the prompt, so
we’ll use RoPE to make them aware of it.
Implementing RoPE
We split the query vectors into pairs and then apply a rotational angle shift to
each pair.
# Split the query vectors into pairs of values along the last dimension
q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
# Print the shape of the resulting tensor after splitting into pairs
q_per_token_split_into_pairs.shape # Output: torch.Size([17, 64, 2])
We now view each pair as a single complex number for every token’s query element, and then rotate the queries according to their position using complex multiplication, which has the same effect as applying a 2x2 rotation matrix.
# View each pair as a single complex number
q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
q_per_token_as_complex_numbers.shape
# Output: torch.Size([17, 64])
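The rotation angle for each of the 64 pairs comes from rope_theta; a sketch of how those per-pair frequencies are built:

# Split the 0..1 range into 64 parts, one per query-vector pair
zero_to_one_split_into_64_parts = torch.tensor(range(64)) / 64
# Per-pair rotation frequencies derived from rope_theta
freqs = 1.0 / (rope_theta ** zero_to_one_split_into_64_parts)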
# Calculate frequencies for each token using the outer product of arange(17) and freqs
freqs_for_each_token = torch.outer(torch.arange(17), freqs)
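Turning these angles into unit-length complex numbers and multiplying performs the rotation itself:

# Convert the per-token angles into complex rotations of unit magnitude
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)
# Rotate every query element by its position-dependent angle
q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis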
q_per_token_as_complex_numbers_rotated.shape
# Output: torch.Size([17, 64])
After obtaining the rotated vector, we can revert back to our original queries
as pairs by viewing the complex numbers as real numbers again.
The rotated pairs are now merged, resulting in a new query vector (rotated
query vector) that has the shape [17x128], where 17 is the number of tokens
and 128 is the dimension of the query vector.
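A sketch of those two steps, reusing the names from above:

# View the rotated complex numbers as (real, imaginary) pairs again
q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)
# Merge the pairs back into flat 128-dimensional rotated queries: shape [17, 128]
q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)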
For keys, the process is similar, but keep in mind that key vectors are also
128-dimensional. Keys have only 1/4th the number of weights as queries
because they are shared across 4 heads at a time to minimize computations.
Keys are also rotated to include positional information, similar to queries.
# Extract the weight tensor for the attention mechanism's key in the first layer
k_layer0 = model["layers.0.attention.wk.weight"]
# Reshape key weight for the first layer of attention to separate heads
k_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] // n_kv_heads, dim)
# Extract the key weight for the first head of the first layer of attention
k_layer0_head0 = k_layer0[0]
# Print the shape of the extracted key weight tensor for the first head
k_layer0_head0.shape # Output: torch.Size([128, 4096])
# Compute keys per token: token embeddings times the transpose of the key weights
k_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)
# Print the shape of the resulting tensor representing keys per token
k_per_token.shape # Output: torch.Size([17, 128])
# Split the key vectors into pairs, as was done for the queries
k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
# Print the shape of the resulting tensor after splitting into pairs
k_per_token_split_into_pairs.shape # Output: torch.Size([17, 64, 2])
# Convert key per token to complex numbers
k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
# Print the shape of the resulting tensor representing key per token as complex numbers
k_per_token_as_complex_numbers.shape # Output: torch.Size([17, 64])
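The keys are then rotated with the same freqs_cis and merged back into 128-dimensional vectors, just like the queries; a sketch:

# Rotate the keys with the same position-dependent complex rotations
k_per_token_as_complex_numbers_rotated = k_per_token_as_complex_numbers * freqs_cis
# Back to (real, imaginary) pairs, then merge into flat 128-dimensional rotated keys
k_per_token_rotated = torch.view_as_real(k_per_token_as_complex_numbers_rotated).view(k_per_token.shape)
k_per_token_rotated.shape  # Output: torch.Size([17, 128])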
We now have the rotated queries and keys for each token, with each being of
size [17x128].
# Compute attention scores: rotated queries times the transpose of rotated keys, scaled by the square root of the head dimension
qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T) / (128) ** 0.5
# Print the shape of the resulting tensor representing query-key dot products per token
qk_per_token.shape # Output: torch.Size([17, 17])
We need to mask the query-key scores. During training, future token query-key scores are masked because we only learn to predict tokens using past tokens. During inference we do the same: the scores for future positions are set to negative infinity so that, after the softmax, their probabilities become zero.
Now, we have to apply a mask to the query-key per token vector. Additionally,
we want to apply softmax on top of it to convert the output scores into
probabilities. This helps in selecting the most likely token or sequence of
tokens from the model’s vocabulary, making the model’s predictions more
interpretable and suitable for tasks like language generation and
classification.
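A sketch of the masking and softmax step, using an upper-triangular matrix of negative infinity to hide the future positions:

# Build a [17, 17] mask with -inf above the diagonal (future positions)
mask = torch.full((len(tokens), len(tokens)), float("-inf"))
mask = torch.triu(mask, diagonal=1)
# Add the mask to the scores, then softmax to get attention probabilities
qk_per_token_after_masking = qk_per_token + mask
qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)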
Now for the value matrix, the last piece of the self-attention computation. Like the keys, the value weights are shared across every 4 attention heads to save computation, so the value weight matrix has the shape [8x128x4096].
# Retrieve the value weight tensor for the first layer of attention
v_layer0 = model["layers.0.attention.wv.weight"]
# Reshape value weight for the first layer of attention to separate heads
v_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] // n_kv_heads, dim)
Similar to the query and key matrices, the value matrix for the first layer and
first head can be obtained using:
# Extract the value weight for the first head of the first layer of attention
v_layer0_head0 = v_layer0[0]
# Print the shape of the extracted value weight tensor for the first head
v_layer0_head0.shape
Using the value weights, we compute the attention values for each token,
resulting in a matrix of size [17x128]. Here, 17 denotes the number of tokens
in the prompt, and 128 indicates the dimension of the value vector for each
token.
# Compute values per token: token embeddings times the transpose of the value weights
v_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)
# Print the shape of the resulting tensor representing values per token
v_per_token.shape # Output: torch.Size([17, 128])
Multiplying the attention probabilities by these values gives us the attention output for the first head of the first layer, in other words the self-attention result.
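A sketch of that final multiplication:

# Weight the value vectors by the attention probabilities: shape [17, 128]
qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
qkv_attention.shape  # Output: torch.Size([17, 128])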
Now that the QKV attention matrix for all 32 heads in the first layer is
obtained, all attention scores will be merged into one large matrix of size
[17x4096].
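For reference, here is a condensed sketch of the loop that repeats the steps above for every head of layer 0 and collects the results in qkv_attention_store (it reuses the names defined earlier):

# Condensed per-head loop for layer 0
qkv_attention_store = []
for head in range(n_heads):
    q_per_token = torch.matmul(token_embeddings, q_layer0[head].T)
    k_per_token = torch.matmul(token_embeddings, k_layer0[head // 4].T)  # key weights shared across 4 heads
    v_per_token = torch.matmul(token_embeddings, v_layer0[head // 4].T)  # value weights shared across 4 heads
    # Rotate queries and keys with RoPE
    q_rotated = torch.view_as_real(torch.view_as_complex(q_per_token.float().view(q_per_token.shape[0], -1, 2)) * freqs_cis).view(q_per_token.shape)
    k_rotated = torch.view_as_real(torch.view_as_complex(k_per_token.float().view(k_per_token.shape[0], -1, 2)) * freqs_cis).view(k_per_token.shape)
    # Masked, scaled dot-product attention followed by the value multiplication
    qk_per_token = torch.matmul(q_rotated, k_rotated.T) / (128) ** 0.5
    mask = torch.triu(torch.full((len(tokens), len(tokens)), float("-inf")), diagonal=1)
    scores = torch.nn.functional.softmax(qk_per_token + mask, dim=1).to(torch.bfloat16)
    qkv_attention_store.append(torch.matmul(scores, v_per_token))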
# Concatenate QKV attentions from all heads along the last dimension
stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
One of the last steps for layer 0 attention is to multiply the stacked QKV matrix with the output weight matrix of the layer.
# Calculate the embedding delta by matrix multiplication with the output weight matrix
embedding_delta = torch.matmul(stacked_qkv_attention, model["layers.0.attention.wo.weight"].T)
We now have the change in the embedding values after attention, which
should be added to the original token embeddings.
# Add the embedding delta to the unnormalized token embeddings to get the edited embedding
embedding_after_edit = token_embeddings_unnormalized + embedding_delta
# Normalize the edited embeddings using root mean square normalization and the layer's ffn-norm weights
embedding_after_edit_normalized = rms_norm(embedding_after_edit, model["layers.0.ffn_norm.weight"])
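Before moving on, the normalized embedding passes through the SwiGLU feed-forward network described earlier. A minimal sketch for layer 0, using the w1/w2/w3 weight names from the checkpoint:

# Load the three feed-forward projection matrices for layer 0
w1 = model["layers.0.feed_forward.w1.weight"]
w2 = model["layers.0.feed_forward.w2.weight"]
w3 = model["layers.0.feed_forward.w3.weight"]
# SwiGLU: silu(x @ w1.T) gated by (x @ w3.T), then projected back down with w2
output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
# Residual connection: this is the output embedding of layer 0
layer_0_embedding = embedding_after_edit + output_after_feedforward
layer_0_embedding.shape  # Output: torch.Size([17, 4096])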
Merging everything
Now that everything is ready, we need to merge our code and run the same computation for the remaining 31 layers.
# Initialize final embedding with unnormalized token embeddings
final_embedding = token_embeddings_unnormalized
# Iterate through each of the 32 layers
for layer in range(n_layers):
    qkv_attention_store = []
    # Normalize the current embedding using root mean square normalization and the layer's attention-norm weights
    layer_embedding_norm = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"])
    # Retrieve query, key, value, and output weights for the attention mechanism and split them into heads
    q_layer = model[f"layers.{layer}.attention.wq.weight"]
    q_layer = q_layer.view(n_heads, q_layer.shape[0] // n_heads, dim)
    k_layer = model[f"layers.{layer}.attention.wk.weight"]
    k_layer = k_layer.view(n_kv_heads, k_layer.shape[0] // n_kv_heads, dim)
    v_layer = model[f"layers.{layer}.attention.wv.weight"]
    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] // n_kv_heads, dim)
    w_layer = model[f"layers.{layer}.attention.wo.weight"]
    # Per-head attention, exactly as done for layer 0 above
    for head in range(n_heads):
        q_per_token = torch.matmul(layer_embedding_norm, q_layer[head].T)
        k_per_token = torch.matmul(layer_embedding_norm, k_layer[head // 4].T)  # key weights shared across 4 heads
        v_per_token = torch.matmul(layer_embedding_norm, v_layer[head // 4].T)  # value weights shared across 4 heads
        # Rotate queries and keys with RoPE, then apply masked scaled dot-product attention
        q_rotated = torch.view_as_real(torch.view_as_complex(q_per_token.float().view(q_per_token.shape[0], -1, 2)) * freqs_cis).view(q_per_token.shape)
        k_rotated = torch.view_as_real(torch.view_as_complex(k_per_token.float().view(k_per_token.shape[0], -1, 2)) * freqs_cis).view(k_per_token.shape)
        qk_per_token = torch.matmul(q_rotated, k_rotated.T) / (128) ** 0.5
        mask = torch.triu(torch.full((len(tokens), len(tokens)), float("-inf")), diagonal=1)
        scores = torch.nn.functional.softmax(qk_per_token + mask, dim=1).to(torch.bfloat16)
        qkv_attention_store.append(torch.matmul(scores, v_per_token))
    # Concatenate QKV attentions from all heads along the last dimension
    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
    # Calculate embedding delta by matrix multiplication with the output weight matrix
    embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T)
    # Add the embedding delta to the current embedding to get the edited embedding
    embedding_after_edit = final_embedding + embedding_delta
    # Normalize the edited embedding using root mean square normalization and the layer's ffn-norm weights
    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f"layers.{layer}.ffn_norm.weight"])
    # SwiGLU feed-forward network, as done for layer 0 above
    w1 = model[f"layers.{layer}.feed_forward.w1.weight"]
    w2 = model[f"layers.{layer}.feed_forward.w2.weight"]
    w3 = model[f"layers.{layer}.feed_forward.w3.weight"]
    output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
    # Update the final embedding with the edited embedding plus the output from the feed-forward network
    final_embedding = embedding_after_edit + output_after_feedforward
# Normalize the final embedding using root mean square normalization and the final norm weights
final_embedding = rms_norm(final_embedding, model["norm.weight"])
To predict the next value, we utilize the embedding of the last token.
# Calculate logits by matrix multiplication between the final embedding and the output weight matrix
logits = torch.matmul(final_embedding[-1], model["output.weight"].T)
# Find the index of the maximum value along the last dimension to determine the next token
next_token = torch.argmax(logits, dim=-1)
So, our input was “the answer to the ultimate question of life, the universe, and everything is ”, and the output for it is “42”, which is the correct answer. You can experiment with different input texts by simply changing these two lines; the rest of the code remains the same!
# input prompt
prompt = "Your Input"
# Encode the prompt using the tokenizer and prepend a special token (128000)
tokens = [128000] + tokenizer.encode(prompt)
Hope you have enjoyed and learned new things from this blog!