The Transformer is a revolutionary neural network architecture introduced in the 2017 paper
"Attention Is All You Need." It completely abandoned the sequential recurrence (RNNs/LSTMs)
and convolution typically used for sequence-to-sequence tasks like machine translation, relying
entirely on a mechanism called Self-Attention. This parallel processing capability is what made
it vastly more efficient and scalable, leading to the development of modern large language
models (LLMs) like GPT and BERT.
🏗️ Core Architecture
The original Transformer model follows the standard Encoder-Decoder structure:
1. Encoder Stack: Processes the input sequence (e.g., an English sentence) and creates an abstract,
continuous representation of it. The original model uses a stack of six identical encoder layers.
2. Decoder Stack: Uses the encoded representation, along with the partially generated output
sequence, to predict the next token (e.g., a word in the translated French sentence). It also
consists of six identical decoder layers.
Input Processing
Before entering the stacks, the input sequence undergoes two key steps:
Embeddings: Each word or sub-word (token) is converted into a numerical vector (an
embedding) that captures its semantic meaning.
Positional Encoding: Since the Transformer processes all tokens in parallel, it loses the natural
sequential order. Positional Encoding adds a vector to each input embedding, providing the
model with information about the token's absolute or relative position in the sequence. The
original paper used sine and cosine functions for this, but later models often use learned
embeddings.
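As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original paper. The function name and array shapes are my own choices for this example, not a reference implementation.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle).

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2): even dimensions only
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```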
🧠 The Self-Attention Mechanism
Self-attention is the heart of the Transformer. For every token in the input sequence, it calculates
an attention score with every other token (including itself) to determine how much to focus on
them when computing the current token's new representation. To do this, each token's input
vector is projected into three separate (and typically lower-dimensional) vectors:
| Vector | Function | Analogy |
| --- | --- | --- |
| Query ($\mathbf{Q}$) | The current token being processed (what I'm looking for). | A question in a search engine. |
| Key ($\mathbf{K}$) | The tokens in the sequence that are being compared against the Query (what I have available). | The index/label of the documents. |
| Value ($\mathbf{V}$) | The actual information from the tokens that will be aggregated. | The content of the documents. |
The calculation for the output of the Scaled Dot-Product Attention is given by the formula:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) =
\text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d_k}}\right)\mathbf{V}$$
1. Attention Score: A dot product is computed between the $\mathbf{Q}$ vector of the current
token and the $\mathbf{K}$ vectors of all tokens in the input. This measures the relevance or
compatibility.
2. Scaling: The scores are divided by $\sqrt{d_k}$ (the square root of the dimension of the key
vectors) to prevent the dot products from becoming too large and destabilizing the training
process.
3. Normalization: The Softmax function is applied to the scaled scores, turning them into
attention weights that sum to 1.
4. Weighted Sum: The attention weights are multiplied by the $\mathbf{V}$ vectors. Summing
these weighted values produces the final, context-aware output for the current token.
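The four steps above map almost line-for-line onto code. Below is a minimal NumPy sketch of scaled dot-product attention; the function and argument names are my own, and it is meant only to illustrate the formula, not to serve as a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v);
    mask (optional): boolean (len_q, len_k), True where attention is allowed.
    """
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)                 # steps 1-2: dot products, then scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)         # masked positions get ~zero weight
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # step 3: weights sum to 1 per query
    return weights @ V                                # step 4: weighted sum of values
```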
Multi-Head Attention
Instead of performing the attention calculation once, Multi-Head Attention runs scaled
dot-product attention several times in parallel (8 heads in the original paper). Each "head"
uses its own independently learned linear projections ($\mathbf{W}^Q, \mathbf{W}^K,
\mathbf{W}^V$) to transform the input vectors into $\mathbf{Q}, \mathbf{K}, \mathbf{V}$.
This allows the model to:
Jointly attend to information from different representation subspaces: One head might learn
to focus on syntactic relationships, while another might focus on semantic meaning.
Better capture long-range dependencies: By having multiple perspectives on the sequence
simultaneously.
The outputs from all attention heads are concatenated and then linearly projected to produce the
final output of the Multi-Head Attention layer.
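As a rough sketch of how the heads are wired together, reusing the scaled_dot_product_attention function from above: the weight names (Wq, Wk, Wv, Wo) are placeholders, and the per-head projections are folded into single $d_{model} \times d_{model}$ matrices whose outputs are split into heads, a common (and equivalent) implementation choice rather than the paper's exact notation.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                     # assumes d_model divides evenly

    def split_heads(t):                               # (seq_len, d_model) -> (heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Each head attends independently over its own lower-dimensional projections.
    heads = [scaled_dot_product_attention(Q[h], K[h], V[h]) for h in range(num_heads)]

    concat = np.concatenate(heads, axis=-1)           # (seq_len, d_model): concatenate heads
    return concat @ Wo                                # final linear projection
```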
🧠 Encoder and Decoder Layers
Encoder Layer Components
Each encoder layer has two main sub-layers, each with a residual connection and Layer
Normalization (which helps stabilize training):
1. Multi-Head Self-Attention: Allows the encoder to look at all other tokens in the input sequence
to compute a better representation for the current token.
2. Feed-Forward Network (FFN): A simple, position-wise fully-connected network applied
independently and identically to each position (token) in the sequence. It consists of two linear
transformations with a ReLU activation in between: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2
+ b_2$.
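Putting the two sub-layers together, one (post-norm) encoder layer can be sketched as follows. The layer_norm here is simplified (no learned gain/bias), the FFN weights (W1, b1, W2, b2) are placeholders, and the attention sub-layer is passed in as a function for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)                   # simplified: no learned parameters

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # FFN(x) = max(0, xW1 + b1)W2 + b2

def encoder_layer(x, self_attention, ffn_weights):
    """Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)), as in the original paper."""
    x = layer_norm(x + self_attention(x))                  # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x, *ffn_weights))      # sub-layer 2: position-wise FFN
    return x
```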
Decoder Layer Components
Each decoder layer has three main sub-layers, also followed by residual connections and layer
normalization:
1. Masked Multi-Head Self-Attention: This is the same as the encoder's self-attention, but with a
mask applied. The mask ensures that when predicting the next token, the decoder can only
attend to the previously generated tokens, preventing it from "cheating" by looking ahead (a
minimal mask sketch follows this list).
2. Encoder-Decoder Multi-Head Attention: The Queries ($\mathbf{Q}$) come from the
previous masked attention layer in the decoder, but the Keys ($\mathbf{K}$) and Values
($\mathbf{V}$) come from the output of the entire encoder stack. This is the mechanism that
allows the decoder to focus on relevant parts of the input sequence.
3. Feed-Forward Network (FFN): Identical to the one in the encoder.
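The mask in sub-layer 1 is just a lower-triangular boolean matrix that can be passed to the attention sketch shown earlier; a minimal version:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Used with the earlier sketch: positions where the mask is False receive a large
# negative score before the softmax, so their attention weight is effectively zero.
# scaled_dot_product_attention(Q, K, V, mask=causal_mask(seq_len))
```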
The final output of the decoder stack is passed through a Linear Layer and a Softmax function
to generate a probability distribution over the vocabulary, which determines the predicted next
token.
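That final projection can be sketched in a few lines; W_vocab and b_vocab are hypothetical names for the output projection parameters, and only the last position is projected, as is typical when generating one token at a time.

```python
import numpy as np

def next_token_distribution(decoder_output, W_vocab, b_vocab):
    """decoder_output: (seq_len, d_model); W_vocab: (d_model, vocab_size)."""
    logits = decoder_output[-1] @ W_vocab + b_vocab   # project last position onto the vocabulary
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                        # probability of each token coming next
```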
🚀 Impact and Variations
The Transformer architecture, especially its self-attention mechanism, is a massive leap forward
because it:
Parallelizes Computation: Unlike Recurrent Neural Networks (RNNs), which must process
tokens one by one, the Transformer processes the entire sequence simultaneously, drastically
reducing training time.
Handles Long-Range Dependencies: The self-attention mechanism can directly link any two
tokens in a sequence, regardless of how far apart they are, avoiding the vanishing-gradient
issues that made long-range dependencies hard for RNNs to learn.
This architecture forms the basis for numerous groundbreaking models, often simplified to use
only the Encoder or only the Decoder:
Encoder-only Models (e.g., BERT): Excellent for understanding and encoding context from an
input (tasks like classification, question answering).
Decoder-only Models (e.g., GPT): Used for generative tasks (text generation, language
modeling) as they are trained to predict the next word autoregressively.
Encoder-Decoder Models (e.g., T5, BART): Used for sequence-to-sequence tasks like
translation and summarization.
Would you like a more detailed explanation of the Multi-Head Attention formula or how the
Positional Encoding works?