Transformers
MTH424
Yeditepe University
Uğur Ünal, PhD
Senior Research Engineer
Huawei Turkey R&D Center
15/04/2025
Seq2Seq and Attention Mechanism
• Translating a sentence
• -> The cat sat on the mat (EN)
• -> Le chat s’est assis sur le tapis (FR)
• Problem?
• Solution: "Attention Is All You Need"
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Working Principle of the Original Seq2Seq
Figure: the Encoder RNN (cells A) reads the inputs x_1, x_2, x_3, …, x_m and produces hidden states h_1, h_2, h_3, …, h_m; the final state becomes the context vector s_0 handed to the Decoder RNN (cells A′), which consumes x′_1, …, x′_t and produces states s_1, …, s_t.
Calculation of Attention
Q (Query) – what you are looking for
K (Key) – the information you carry, used for matching against queries
V (Value) – what you offer once a match is found
1. Calculate the scores (dot products of queries and keys)
2. Scale the scores (divide by √d_k)
3. Apply softmax to obtain the attention weights
4. Multiply the weights by the values
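A minimal NumPy sketch of these four steps (score, scale, softmax, weighted sum); the toy shapes and random inputs are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                   # 1. calculate the scores
    scores = scores / np.sqrt(d_k)                     # 2. scale the scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # 3. softmax
    return weights @ V                                 # 4. multiply by the values

# toy example: 3 queries, 4 key/value pairs, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)     # (3, 8)
```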
Effect of the Attention Mechanism in NLP
Figure: BLEU score vs. sequence length (# words): Seq2Seq with attention keeps a high BLEU score on long sentences, while Seq2Seq without attention degrades as the sequence grows.
The attention mechanism in RNN (1)
Weight importance/similarity calculation:
• Linear maps: k_i = W_K h_i (for each encoder state) and q = W_Q s_0 (for the decoder state)
• Inner product: score_i = k_i^T q
• Normalization: [α_1, …, α_m] = softmax([score_1, …, score_m])
Figure: the weights α_1, α_2, α_3, …, α_m are computed between the decoder state s_0 and the encoder states h_1, h_2, h_3, …, h_m produced by the Encoder RNN from x_1, …, x_m.
The attention mechanism in RNN (2)
Context vector calculation:
• c_0 = α_1 h_1 + α_2 h_2 + … + α_m h_m (weighted average of all encoder states)
Figure: the weights α_1, …, α_m combine the Encoder RNN states h_1, …, h_m (computed from x_1, …, x_m) and the decoder state s_0 into the context vector c_0.
The attention mechanism in RNN (3)
Hidden state calculation of the Decoder:
• s_1 = tanh(A′ · [x′_1; s_0; c_0] + b), i.e. the new decoder state depends on the decoder input x′_1, the previous state s_0, and the context vector c_0.
Figure: the Decoder RNN cell A′ takes x′_1, s_0 and c_0 and produces s_1, alongside the Encoder RNN states h_1, …, h_m.
Attention Mechanism in RNN (4)
At every decoding step t the same three calculations are repeated:
• Weight importance/similarity calculation: α_i = softmax over (W_K h_i)^T (W_Q s_t)
• Context vector calculation: c_t = α_1 h_1 + … + α_m h_m
• Hidden state calculation of the Decoder: s_{t+1} = tanh(A′ · [x′_{t+1}; s_t; c_t] + b)
Figure: the Decoder RNN cells A′ produce s_1, s_2, …, s_t from inputs x′_1, x′_2, …, x′_t; each step recomputes the weights α_1, …, α_m over the Encoder RNN states h_1, …, h_m and a fresh context vector.
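A rough NumPy sketch of one decoder step with attention, following the three calculations above; the bilinear similarity (W_K, W_Q) and the concatenation order [x′; s; c] are assumptions consistent with the slide's notation, and all weights are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step_with_attention(H, s_prev, x_prev, W_K, W_Q, A_dec, b):
    """H: encoder states (m, d); s_prev: previous decoder state (d,);
    x_prev: current decoder input (d_in,). Returns (s_next, alpha)."""
    # 1. weight importance / similarity calculation
    keys = H @ W_K.T                      # linear map of the encoder states
    query = W_Q @ s_prev                  # linear map of the decoder state
    alpha = softmax(keys @ query)         # inner product + normalization
    # 2. context vector: weighted average of the encoder states
    c = alpha @ H
    # 3. decoder hidden state from [x'; s; c]
    s_next = np.tanh(A_dec @ np.concatenate([x_prev, s_prev, c]) + b)
    return s_next, alpha

# toy shapes: m = 5 encoder states of size d = 16, decoder inputs of size 8
m, d, d_in = 5, 16, 8
rng = np.random.default_rng(1)
H = rng.normal(size=(m, d))
s0, x1 = rng.normal(size=d), rng.normal(size=d_in)
W_K, W_Q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A_dec, b = rng.normal(size=(d, d_in + d + d)), np.zeros(d)
s1, alpha = decoder_step_with_attention(H, s0, x1, W_K, W_Q, A_dec, b)
print(s1.shape, alpha.shape)              # (16,) (5,)
```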
Weight importance/similarity visualization
Figure: visualization of the attention weights between the Encoder RNN (source language: Chinese) and the Decoder RNN (target language: English).
Self-attention mechanism in RNN (1)
Review: hidden state calculation for SimpleRNN: h_t = tanh(A · [x_t; h_{t−1}] + b)
SimpleRNN + self-attention hidden state calculation: h_2 = tanh(A · [x_2; c_1] + b), with the initial context c_0 = 0
Context vector calculation: c_1 = ? (computed on the next slide)
Figure: the RNN cells A consume x_1 and x_2; the context vector replaces the previous hidden state in the recurrence.
Self-attention mechanism in RNN (2)
Weight importance/similarity calculation: the weights α_1, α_2 compare the newest hidden state with all hidden states computed so far (h_1, h_2).
Context vector calculation: the new context vector is the weighted average α_1 h_1 + α_2 h_2 (starting from c_0 = 0).
Figure: the RNN cells A consume x_1 and x_2 and produce h_1 and h_2, from which the weights and the new context vector are formed.
Self-attention mechanism in RNN (3)
Hidden state calculation: h_{t+1} = tanh(A · [x_{t+1}; c_t] + b)
Weight importance/similarity calculation: the weights α_1, α_2, α_3, … compare the newest hidden state with all hidden states computed so far.
Context vector calculation: the new context vector is the weighted average α_1 h_1 + α_2 h_2 + α_3 h_3 + … of those states.
Figure: the RNN cells A consume x_1, x_2, x_3, …, x_m; each step produces a new hidden state, new weights, and a new context vector.
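A small sketch of the self-attentive SimpleRNN update described above; the dot-product similarity used for the weights is an assumption (the slide only names a similarity calculation), and the random weight matrix is a placeholder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attentive_rnn(X, A, b):
    """X: inputs (m, d_in); A: (d, d_in + d); b: (d,). Returns all states (m, d)."""
    d = b.shape[0]
    states, c = [], np.zeros(d)                 # c_0 = 0
    for x_t in X:
        # h_t = tanh(A . [x_t ; c_{t-1}] + b)
        h_t = np.tanh(A @ np.concatenate([x_t, c]) + b)
        states.append(h_t)
        # weights compare the newest state with all states so far,
        # the context vector is their weighted average
        H = np.stack(states)
        alpha = softmax(H @ h_t)
        c = alpha @ H
    return np.stack(states)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))                     # m = 6 inputs of size 8
A, b = rng.normal(size=(16, 8 + 16)), np.zeros(16)
print(self_attentive_rnn(X, A, b).shape)        # (6, 16)
```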
Review: Attention in Seq2Seq
• Query: the decoder state s_j (q_j = W_Q s_j)
• Key: the encoder states h_1, …, h_m (k_i = W_K h_i)
• Value: the encoder states h_1, …, h_m (v_i = W_V h_i)
• Weight: α_{ij} = softmax over the key-query scores k_i^T q_j
• Context vector: c_j = α_{1j} v_1 + α_{2j} v_2 + … + α_{mj} v_m
Figure: at decoding step j, the weights α_{1j}, α_{2j}, α_{3j}, …, α_{mj} combine the values v_1, v_2, v_3, …, v_m into the context vector c_j; the encoder (cells A) produces h_1, …, h_m from x_1, …, x_m, and the decoder (cells A′) produces s_1, s_2, …, s_j from x′_1, x′_2, …, x′_j starting at s_0.
Transformers Architecture Overview
Background of Transformer
The Transformer was originally proposed to solve machine translation, which is a seq2seq problem. Take translation from French to English as an example and first look at the model as a black box: in a machine translation task, a sentence in one language is taken as input and translated into a sentence in another language as output.
Figure: the Transformer as a black box, mapping the input sentence to the output sentence.
Transformer Architecture Overview - Encoder
• N blocks are stacked.
• Each block has two sub-layers:
  – Multi-Head Attention (Self-Attention) + Add (residual connection) + Norm (LayerNorm)
  – Feed Forward + Add (residual connection) + Norm (LayerNorm)
• The output of Block 1 … Block N-1 is the input to the next block.
• The output of Block N is fed to every block of the decoder.
Figure: the original Transformer diagram: Input Embedding + Positional Encoding feed N× stacked blocks of [Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm]; the decoder side (with Masked Multi-Head Attention) ends in Linear + Softmax producing the output probabilities.
Transformer Architecture Overview - Decoder
• N blocks are stacked.
• Each block has three sub-layers:
  – Masked Multi-Head Attention (Self-Attention) + Add (residual connection) + Norm (LayerNorm)
  – Multi-Head Attention (Co-Attention with the encoder output) + Add (residual connection) + Norm (LayerNorm)
  – Feed Forward + Add (residual connection) + Norm (LayerNorm)
• The output of Block 1 … Block N-1 is the input to the next block.
• The output of Block N is the input to the subsequent linear layer.
Figure: the decoder side of the Transformer diagram: the target embeddings (shifted right) plus Positional Encoding feed N× stacked blocks of [Masked Multi-Head Attention, Add & Norm, Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm], followed by Linear + Softmax over the output probabilities.
Input module
The input part can be divided into two parts:
• Source text embedding layer + position encoder (feeds the Encoder part on the left)
• Target text embedding layer + position encoder (feeds the Decoder part on the right)
Input part: numeric encoding
In NLP tasks, to let the computer process string data (e.g., "I am a student"), we first give each word a numeric code, i.e. an index into the vocabulary.
For example, if the size of the English vocabulary is 20000, "I am a student" may correspond to the dictionary {"I": 37, "am": 4201, "a": 4110, "student": 1997}.
Inputs is therefore an array of indices. For ease of processing, the input length is usually fixed; assuming the length is fixed at 10, the blank character 0 fills the remaining positions.
I am a student
Inputs: 37 4201 4110 1997 0 ... 0    shape: (10,)
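A small sketch of this numeric encoding and padding step; the vocabulary indices (37, 4201, 4110, 1997) are taken from the slide, while the helper names are illustrative assumptions.

```python
# Map each word to its vocabulary index and pad to a fixed length of 10.
vocab = {"I": 37, "am": 4201, "a": 4110, "student": 1997}
PAD, MAX_LEN = 0, 10

def encode(sentence, vocab, max_len=MAX_LEN):
    ids = [vocab[w] for w in sentence.split()]
    return ids + [PAD] * (max_len - len(ids))   # fill the remaining positions with 0

inputs = encode("I am a student", vocab)
print(inputs)        # [37, 4201, 4110, 1997, 0, 0, 0, 0, 0, 0]
print(len(inputs))   # shape: (10,)
```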
Input part: word vector transformation
Next comes the text embedding layer. Its function is to convert each numeric code into a vector (usually called a word vector), so that the features of a word can be represented in a high-dimensional vector space; like an ID card, the longer the code, the more information it can carry. The word vector dimension in the paper is 512.
In practice, the embedding layer is implemented as a single-layer MLP of width 512 (equivalent to a lookup table).
Figure: after embedding, the (10,) index array becomes a (10, 512) matrix of word vectors.
Input part: positional encoding
Unlike RNNs, which inherently model word order, Transformers use positional encoding to inject sequential information. The word vector is combined with the positional data to produce a position-aware representation.
Figure: the (10, 512) embedding matrix is added element-wise to a (10, 512) positional-encoding matrix, giving the position-aware input of shape (10, 512).
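A sketch of the embedding lookup plus the sinusoidal positional encoding from the original paper, added together to give the position-aware (10, 512) representation; the random embedding table stands in for a learned one.

```python
import numpy as np

vocab_size, d_model, max_len = 20000, 512, 10
rng = np.random.default_rng(3)
embedding_table = rng.normal(size=(vocab_size, d_model))   # learned in practice

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)), cos for odd dims."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

inputs = np.array([37, 4201, 4110, 1997, 0, 0, 0, 0, 0, 0])   # (10,)
embedded = embedding_table[inputs]                             # (10, 512)
x = embedded + positional_encoding(max_len, d_model)           # (10, 512)
print(x.shape)
```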
Summary: Input Module Overview
Each word is converted into a word vector using a word embedding algorithm, and the positional encoding information is superimposed on it. In the Transformer, the word embedding dimension is 512; the result is then fed into the self-attention layer.
Figure: the input words "Je" and "suis" are mapped to word embeddings X1, X2; the positional codes t1, t2 are added to give the time-step-aware vectors x1 = X1 + t1 and x2 = X2 + t2, which are fed into Encoder #0 (the figure shows the stack Encoder #0, Encoder #1 and Decoder #0, Decoder #1).
Understanding of the Self-Attention Mechanism
To understand the Multi-Head Attention layer, first understand the Self-Attention layer. Self-Attention is about focusing on the parts that need attention.
How does the computer know that "it" refers to "animal" in the following sentence?
That is what Self-Attention needs to do: associate "it" with "animal".
The animal didn't cross the street because it was too tired.
Figure: attention weights of "it" in the multi-head self-attention layer.
Principle of the Self-Attention Mechanism (1)
For Self-Attention, Q (Query), K (Key), and V (Value):
The three matrices all come from the same input. The calculation can be expressed as a weight-allocation function, which determines how much attention the model pays to each part of the input data.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Principle of the Self-Attention Mechanism (2)
Calculate the similarity between the query and each key; this similarity is usually computed with a dot product or another similarity function.
Apply the softmax operation to the similarities to obtain the weight distribution.
The context vector is obtained by multiplying the weight distribution with the values.
Figure: the input X of shape (N, 512) is multiplied by weight matrices of shape (512, 64) to produce Q, K, V of shape (N, 64); the attention output is also (N, 64).
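A sketch that reproduces the shapes on this slide: the input X (N, 512) is projected with (512, 64) matrices to obtain Q, K, V of shape (N, 64), and the scaled dot-product attention then gives an (N, 64) output; the weight matrices are random placeholders.

```python
import numpy as np

N, d_model, d_k = 10, 512, 64
rng = np.random.default_rng(4)
X = rng.normal(size=(N, d_model))                 # (N, 512)
W_Q = rng.normal(size=(d_model, d_k))             # (512, 64)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # each (N, 64)
scores = Q @ K.T / np.sqrt(d_k)                   # similarity, scaled
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> weight distribution
Z = weights @ V                                   # context vectors, (N, 64)
print(Z.shape)
```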
Principle of the Multi-Head Self-Attention Mechanism
Multi-head attention is a method of extending self-attention that allows the model to capture information from different representation subspaces.
Figure: the 8 heads each produce an (N, 64) output; concatenated they form (N, 64·8) = (N, 512), which is multiplied by an output matrix of shape (512, 512) to give the final (N, 512) result.
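A sketch of multi-head attention with 8 heads of size 64, concatenated back to (N, 512) and mixed by a (512, 512) output matrix, matching the shapes on the slide; all weights are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=8):
    """X: (N, 512); W_Q/W_K/W_V: lists of per-head (512, 64) matrices; W_O: (512, 512)."""
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]           # (N, 64) each
        d_k = Q.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)       # (N, 64)
    concat = np.concatenate(heads, axis=-1)                     # (N, 64*8) = (N, 512)
    return concat @ W_O                                         # (N, 512)

N, d_model, d_k, n_heads = 10, 512, 64, 8
rng = np.random.default_rng(5)
X = rng.normal(size=(N, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(n_heads)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(n_heads)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(n_heads)]
W_O = rng.normal(size=(d_model, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)        # (10, 512)
```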
Question?
• What can be the advantages of Multi-Head Self-Attention?
Effect of the Multi-Head Self-Attention Mechanism in the Encoder
When encoding "it", one attention head focuses on "The animal" and another on "tired". The model's representation of "it" therefore incorporates part of the representation of both "animal" and "tired".
The essence of Multi-Head Attention is that, while the total number of parameters stays unchanged, attention is computed in different subspaces of the original high-dimensional space, and the attention information from the different subspaces is combined in the last step.
Because attention has different distributions in different subspaces, Multi-Head Attention effectively searches for relationships between sequence positions from different angles and then synthesizes the relationships captured in the different subspaces in the final concatenation step.
The animal didn't cross the street because it was too tired.
Add & Norm on the Multi-Head Self-Attention Layer
The feature matrix Z is obtained through the process just described. Z is then added to the original input X (a residual connection), giving the result R.
Finally, R undergoes layer normalization to obtain R*, which is the output of the multi-head attention layer. All of X, Z, R, and R* have shape (N, 512).
Figure: Layer Normalization normalizes over the word-vector (feature) dimension F within each position, while Batch Normalization normalizes over the N samples / C sequence positions for each feature.
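A small NumPy illustration of the axis difference sketched in the figure: layer normalization takes statistics over the 512 features of each position, whereas batch normalization would take them over the N positions/samples for each feature; the toy tensor and eps value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
R = rng.normal(size=(10, 512))   # (N, 512): N positions, 512 features per position
eps = 1e-6

# Layer normalization: statistics over the feature dimension, per position
layer_norm = (R - R.mean(axis=1, keepdims=True)) / (R.std(axis=1, keepdims=True) + eps)

# Batch normalization (for comparison): statistics over the positions/samples, per feature
batch_norm = (R - R.mean(axis=0, keepdims=True)) / (R.std(axis=0, keepdims=True) + eps)

print(layer_norm[0].mean(), layer_norm[0].std())        # ~0 and ~1 within one position
print(batch_norm[:, 0].mean(), batch_norm[:, 0].std())  # ~0 and ~1 within one feature
```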
Feedforward fully connected layer
The feedforward network (Feed Forward) is a fully connected network consisting of two linear layers.
Effect: the attention mechanism alone may not fit complex mappings well enough; adding this two-layer network increases the model's capacity (experimental results are better with it).
As with Z, a residual connection and layer normalization are applied to obtain the output of the feedforward fully connected layer.
Figure: R* (N, 512) → Fully Connected Layer (64) + ReLU activation → (N, 64) → Fully Connected Layer (512) + Dropout (0.1) → (N, 512) → residual add with R* → Layer Normalization → output (N, 512).
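A sketch of the feedforward sub-layer: two linear layers with a ReLU in between, followed by the residual add and layer normalization; the inner width is a parameter (64 in the slide's figure, 2048 in the original paper), dropout is omitted for brevity, and the random weights and eps are placeholder assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward_sublayer(R_star, W1, b1, W2, b2):
    """R_star: (N, 512) -> (N, 512). Two linear layers + ReLU, then Add & Norm."""
    hidden = np.maximum(0, R_star @ W1 + b1)     # first FC layer + ReLU
    out = hidden @ W2 + b2                       # second FC layer back to 512
    return layer_norm(R_star + out)              # residual connection + LayerNorm

N, d_model, d_ff = 10, 512, 64                   # d_ff = 2048 in the original paper
rng = np.random.default_rng(7)
R_star = rng.normal(size=(N, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward_sublayer(R_star, W1, b1, W2, b2).shape)   # (10, 512)
```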
Summary Review: Encoder Workflow
In short, the encoder maps its input X to a feature representation R* of the same dimension (N, 512) through the multi-head attention layer and the feedforward fully connected layer.
Figure: input features (N, 512) → multi-head self-attention layer of the encoder → feedforward fully connected layer of the encoder → R* (N, 512).
Masked multi-head self-attention layer
The only difference between the masked multi-head self-attention layer and the original multi-head attention layer is that a mask is added.
Specifically, a sequence mask is used. The sequence mask prevents the decoder from seeing future information: for a sequence, at time step t the decoded output should depend only on the outputs before time t, not on the outputs after it.
Therefore, the information after t needs to be hidden, as in the sketch below.
Figure: sequence mask for the target "I am ...": at t = 1 nothing has been generated yet; at t = 2 only "I" is visible (mask row [0, -inf, -inf, …]); at t = 3 only "I am" is visible (mask row [0, 0, -inf, …]); masked positions are set to -inf before the softmax.
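A sketch of the sequence mask: an upper-triangular matrix of -inf added to the attention scores before the softmax, so position t cannot attend to positions after t; the toy size is an assumption.

```python
import numpy as np

def causal_mask(t):
    """(t, t) mask: 0 on and below the diagonal, -inf above it."""
    mask = np.triu(np.ones((t, t)), k=1)          # 1s strictly above the diagonal
    return np.where(mask == 1, -np.inf, 0.0)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[0])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                         # toy attention scores
print(causal_mask(4))                             # each row shows what that step may see
print(masked_softmax(scores))                     # weights over visible positions only
```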
Multi-head attention layer connected to the encoder (co-attention)
It differs from the other multi-head attention layers in that:
• K and V come from the output of the last encoder block;
• Q comes from the output of the previous decoder layer (the masked multi-head attention layer).
Figure: K and V arrive from the Encoder, Q from the layer below in the Decoder.
Last output part of the Decoder
The output from the decoder, of shape (N, 512), passes through a Linear fully connected layer with vocab_size output nodes, producing the logits (feature values over indices 0, 1, 2, …, vocab_size-1). Softmax then converts the logits into probabilities (log_probs). The index with the highest probability (here 2) is mapped back to the corresponding word: "student".
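A sketch of this last output part: a linear layer with vocab_size nodes produces logits, softmax turns them into probabilities, and the index with the highest probability is looked up in the vocabulary; the tiny vocabulary and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
N, d_model, vocab_size = 4, 512, 6
index_to_word = ["<pad>", "I", "student", "am", "a", "<eos>"]   # toy vocabulary

decoder_out = rng.normal(size=(N, d_model))          # output from the decoder
W, b = rng.normal(size=(d_model, vocab_size)), np.zeros(vocab_size)

logits = decoder_out @ W + b                         # (N, vocab_size)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)           # softmax over the vocabulary
best = probs.argmax(axis=-1)                         # index of the highest probability
print([index_to_word[i] for i in best])              # predicted word per position
```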
Self-Attention Performance Comparison
Self-attention has the same constant maximum path length as a fully connected network (FNN), so the Transformer is well suited to modeling long-distance dependencies; compared with an RNN, it also uses its parameters more effectively and handles variable-length sequences better.
Because a convolutional layer has a limited receptive field, a deep stack of layers is usually needed to obtain a global receptive field. In contrast, the constant maximum path length lets self-attention model long-term dependencies within a constant number of layers.
The constant number of sequential operations and the constant maximum path length make self-attention more parallelizable than an RNN and better at long-range modeling.
Advantages or Disadvantages of Transformers?
Transformer-based Pretraining
BERT pre-training model
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, a 2019 paper by the Google team. Similar to GPT (Generative Pre-Training), its goal is to learn a general model that can be applied to a large number of tasks and adapted to different types of tasks with only slight modifications.
Figure: BERT stacks Transformer encoder blocks (Trm); the input embeddings E1, E2, …, En pass through the stacked blocks to produce contextual representations T1, T2, …, Tn. Each Trm block is a standard encoder block (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm) with Positional Encoding added to the Input Embedding.
Fine-tuning of BERT in different tasks
Figure: BERT fine-tuned for four kinds of tasks: single-sentence classification, sentence-pair classification, question answering (Q&A), and sequence annotation (tagging).
GPT Series Pre-trained Language Models (1)
The simplest intelligence-test game: guess the next word. "I ate a lot today, I'm so ____" (full? hungry?)
Language model: words are processed as vectors, and the next word is continuously predicted through vector calculation. In essence, the model collects statistics on the probability relationships between words from a massive text corpus.
Transformer language model: words are represented as vectors; the model takes a sequence of word vectors as input and predicts the vector of the next word.
Figure: (1) a partial sentence ("The capital of China is") is fed as a word-vector sequence into the Transformer language model; (2) the model predicts the vector of the next word; (3) the vector is converted into probabilities over candidate words (Beijing 0.8, Nanjing 0.1, Tokyo 0.02, Washington 0.01, …); (4) a word with a high probability is sampled ("Beijing").
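A toy sketch of the loop in this figure: feed the partial sentence to a language model, get a probability distribution over candidate next words, and sample a high-probability word. The `language_model` function here is a stand-in that simply returns the slide's example probabilities, not a real Transformer.

```python
import numpy as np

def language_model(prefix):
    """Stand-in for the Transformer language model: returns the slide's
    example distribution over candidate next words for any prefix."""
    return {"Beijing": 0.8, "Nanjing": 0.1, "Tokyo": 0.02, "Washington": 0.01, "...": 0.07}

def generate_next_word(prefix, rng):
    probs = language_model(prefix)                    # steps 2-3: predict + convert to probabilities
    words = list(probs)
    p = np.array([probs[w] for w in words])
    p = p / p.sum()
    return rng.choice(words, p=p)                     # step 4: sample a high-probability word

rng = np.random.default_rng(9)
print(generate_next_word("The capital of China is", rng))   # most often "Beijing"
```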
GPT Series Pre-trained Language Models (2)
By constantly predicting the next word, the model can accomplish various tasks:
[Factual Q&A] The largest lake in the world is Lake Superior.
[Translation] "Apple" is "apple" in English.
[News Continuation] Scientists have discovered a new dinosaur fossil in the Hengduan Mountains of Yunnan, a species of dinosaur that has never been found before. The fossil belongs to the middle-to-late Jurassic period, and scientists have placed the site under perimeter patrol protection...
[Sentiment Classification] "I'm tired but happy to go out today" is positive.
[Text Summarization] The 29th Summer Olympic Games, also known as the Beijing Olympic Games, opened at 8 p.m. on August 8, 2008 in Beijing, China. In short, the opening ceremony of the 2008 Beijing Olympic Games was a complete success.