
Transformers

MTH424
Yeditepe University
Uğur Ünal, PhD
Senior Research Engineer
Huawei Turkey R&D Center
15/04/2025
Seq2Seq and Attention Mechanism
• Translating a sentence:
• EN: The cat sat on the mat
• FR: Le chat s’est assis sur le tapis

• Problem?
• Solution: Attention is all you need
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Working Principle of Original Seq2Seq
[Figure: the encoder RNN (cells A) reads the inputs x₁ … x_m and produces hidden states h₁ … h_m; the final encoder state is passed as a context vector to the decoder RNN (cells A′), which starts from s₀ and produces states s₁ … s_t from inputs x′₁ … x′_t.]
Calculation of Attention
Q (Query) – what you are looking for
K (Key) – the label each item offers for matching
V (Value) – the information each item carries

1. Calculate the scores (compare the query with each key)

2. Scale the scores (divide by √d_k)

3. Apply Softmax to obtain the weights

4. Multiply the weights by the Values and sum
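The four steps above correspond directly to the scaled dot-product attention formula used later in the deck. Below is a minimal NumPy sketch of those steps; the toy shapes (three positions, d_k = 4) are illustrative assumptions, not values from the slides.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 1-4 from the slide: score, scale, softmax, weighted sum."""
    d_k = K.shape[-1]
    scores = Q @ K.T                   # 1. calculate scores (dot products)
    scaled = scores / np.sqrt(d_k)     # 2. scale by sqrt(d_k)
    weights = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # 3. softmax
    return weights @ V                 # 4. multiply by values and sum

# Toy example: 3 positions, d_k = d_v = 4 (assumed sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```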
Effect of Attention Mechanism in NLP
[Figure: BLEU score versus sequence length (number of words), comparing Seq2Seq with attention against Seq2Seq without attention.]
The attention mechanism in RNN (1)
 Weight importance/similarity calculation (between the decoder state $s_0$ and each encoder state $h_i$):
  Linear maps: $k_i = W_K h_i$ for $i = 1, \dots, m$, and $q_0 = W_Q s_0$
  Inner product: $\tilde\alpha_i = k_i^\top q_0$
  Normalization: $[\alpha_1, \dots, \alpha_m] = \mathrm{Softmax}([\tilde\alpha_1, \dots, \tilde\alpha_m])$
[Figure: the encoder RNN (cells A) over x₁ … x_m produces h₁ … h_m; together with the initial decoder state s₀ these yield the weights α₁ … α_m.]
The attention mechanism in RNN (2)
 Context vector calculation (weighted sum of the encoder states):
  $c_0 = \alpha_1 h_1 + \alpha_2 h_2 + \dots + \alpha_m h_m$
[Figure: the weights α₁ … α_m over the encoder states h₁ … h_m, together with the decoder state s₀, form the context vector c₀.]
The attention mechanism in RNN (3)
 Hidden state calculation of the decoder (the context vector is fed into the decoder cell together with the input and the previous state):
  $s_1 = \tanh\big(A' \cdot [x'_1;\, s_0;\, c_0] + b\big)$
[Figure: the decoder RNN (cell A′) computes s₁ from the decoder input x′₁, the previous state s₀, and the context vector c₀ built over the encoder states h₁ … h_m.]
Attention Mechanism in RNN (4)
 The same three steps are repeated at every decoder time step t:
  Weight importance/similarity calculation: $\alpha_i = \mathrm{Softmax}_i\big((W_K h_i)^\top W_Q s_t\big)$
  Context vector calculation: $c_t = \sum_{i=1}^{m} \alpha_i h_i$
  Hidden state calculation of the decoder: $s_{t+1} = \tanh\big(A' \cdot [x'_{t+1};\, s_t;\, c_t] + b\big)$
[Figure: the decoder RNN (cells A′) unrolls over x′₁, x′₂, …, x′_t, recomputing the weights α₁ … α_m and new context vectors c₀, c₁, … against the encoder states h₁ … h_m at every step.]
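A loop-free sketch of one decoder step, following the three equations above. This is a minimal NumPy illustration under assumed sizes (hidden size 8, m = 5 encoder states); the matrices W_K, W_Q, A′ and the bias b are randomly initialized stand-ins, not trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step_with_attention(h, s_t, x_next, W_K, W_Q, A_prime, b):
    """One decoder step: attention weights -> context vector -> next hidden state."""
    alpha = softmax((h @ W_K.T) @ (W_Q @ s_t))   # similarity of s_t with every h_i
    c_t = alpha @ h                               # context vector: weighted sum of encoder states
    z = np.concatenate([x_next, s_t, c_t])        # [x'_{t+1}; s_t; c_t]
    s_next = np.tanh(A_prime @ z + b)             # next decoder hidden state
    return s_next, c_t, alpha

# Assumed toy dimensions: m = 5 encoder states, hidden and input size d = 8
rng = np.random.default_rng(1)
d, m = 8, 5
h = rng.normal(size=(m, d))           # encoder hidden states h_1 ... h_m
s_t = rng.normal(size=d)              # current decoder state
x_next = rng.normal(size=d)           # next decoder input x'_{t+1}
W_K, W_Q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A_prime, b = rng.normal(size=(d, 3 * d)), np.zeros(d)

s_next, c_t, alpha = decoder_step_with_attention(h, s_t, x_next, W_K, W_Q, A_prime, b)
print(s_next.shape, alpha.round(2))
```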
Weight importance/similarity visualization
[Figure: visualization of the attention weights between an encoder RNN over a Chinese source sentence and a decoder RNN over its English translation.]


Self-attention mechanism in RNN (1)
 Review: hidden state calculation for SimpleRNN: $h_t = \tanh\big(A \cdot [x_t;\, h_{t-1}] + b\big)$
 SimpleRNN + self-attention hidden state calculation: the previous hidden state is replaced by a context vector, e.g. $h_2 = \tanh\big(A \cdot [x_2;\, c_1] + b\big)$, starting from $c_0 = 0$
 Context vector calculation: $c_1$ is computed from the states seen so far (next slide).
[Figure: an RNN (cells A) over inputs x₁, x₂ with hidden states h₁, h₂ and context vectors c₀ = 0 and c₁.]
Self-attention mechanism in RNN (2)
 Weight importance/similarity calculation: the newest hidden state is compared with every state computed so far (e.g., by inner product), and the scores are normalized with Softmax to give α₁, α₂.
 Context vector calculation: the new context vector is the weighted sum $c = \alpha_1 h_1 + \alpha_2 h_2$.
[Figure: weights α₁, α₂ over the states h₁, h₂ produce the next context vector.]
Self-attention mechanism in RNN (3)
 The same steps repeat at every time step:
  Hidden state calculation: $h_t = \tanh\big(A \cdot [x_t;\, c_{t-1}] + b\big)$
  Weight importance/similarity calculation: $[\alpha_1, \dots, \alpha_t] = \mathrm{Softmax}$ of the similarities between $h_t$ and $h_1, \dots, h_t$
  Context vector calculation: $c_t = \sum_{i=1}^{t} \alpha_i h_i$
[Figure: the RNN unrolled over x₁ … x_m, with weights α₁, α₂, α₃, … over the accumulated states h₁, h₂, h₃, … producing context vectors c₁, c₂, ….]
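A minimal NumPy sketch of the loop just described, assuming plain dot-product similarity and randomly initialized A and b (the slides do not fix these details, so treat the specifics as illustrative).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_with_self_attention(xs, A, b, d):
    """SimpleRNN whose recurrence uses a self-attention context vector
    instead of the previous hidden state: h_t = tanh(A @ [x_t; c_{t-1}] + b)."""
    c = np.zeros(d)                                        # c_0 = 0
    states = []
    for x_t in xs:
        h_t = np.tanh(A @ np.concatenate([x_t, c]) + b)    # hidden state
        states.append(h_t)
        H = np.stack(states)                               # h_1 ... h_t
        alpha = softmax(H @ h_t)                           # similarity of h_t with all states so far
        c = alpha @ H                                      # new context vector (weighted sum)
    return np.stack(states)

# Assumed toy sizes: input size 4, hidden size d = 6, sequence length 5
rng = np.random.default_rng(2)
d, d_in, T = 6, 4, 5
xs = rng.normal(size=(T, d_in))
A, b = rng.normal(size=(d, d_in + d)) * 0.1, np.zeros(d)
print(rnn_with_self_attention(xs, A, b, d).shape)  # (5, 6)
```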
Review: Attention in Seq2Seq
 Query: $q_j = W_Q s_j$ (from the decoder state); Key: $k_i = W_K h_i$; Value: $v_i = W_V h_i$ (from the encoder states)
 Weight: $[\alpha_{1j}, \dots, \alpha_{mj}] = \mathrm{Softmax}\big([k_1^\top q_j, \dots, k_m^\top q_j]\big)$
 Context vector: $c_j = \sum_{i=1}^{m} \alpha_{ij} v_i$
[Figure: the encoder RNN (cells A) over x₁ … x_m yields values v₁ … v_m; for each state s_j of the decoder RNN (cells A′) over x′₁, x′₂, …, the weights α₁ⱼ … α_mⱼ produce the context vectors c₁, c₂, …, c_j.]
Transformers Architecture Overview
Background of Transformer
 The Transformer was originally proposed to solve machine translation, which is a seq2seq task.
 Take translation from French to English as an example and first look at the model as a black box: in a machine translation task, a sentence in one language is taken as input and is translated into a sentence in another language as output.
[Figure: black-box view: input sentence → The Transformer → output sentence.]
Transformer Architecture Overview - Encoder
 N blocks are stacked.
 Each block has two layers:
  Multi-Head Attention (Self-Attention) + Add (Residual Connection) + Norm (LayerNorm)
  Feed Forward + Add (Residual Connection) + Norm (LayerNorm)
 Output of Block 1 – Block N−1: input to the next block.
 Output of Block N: input to each decoder block (its encoder–decoder attention layer).
[Figure: the full Transformer diagram: Input Embedding + Positional Encoding feed the N× encoder stack (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm); the decoder stack (Masked Multi-Head Attention, Multi-Head Attention, Feed Forward, each with Add & Norm) reads the outputs shifted right and the encoder output, and ends in Linear + Softmax producing the Output Probabilities.]
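For concreteness, here is a compact PyTorch sketch of one such encoder block, using the post-layer-norm arrangement the slide describes (sub-layer, residual add, LayerNorm). The dimensions (d_model = 512, 8 heads, feed-forward width 2048) are assumptions for illustration, not prescribed by the slide.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention + Add & Norm, feed-forward + Add & Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Layer 1: multi-head self-attention, then residual add and LayerNorm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Layer 2: position-wise feed-forward, then residual add and LayerNorm
        return self.norm2(x + self.dropout(self.ffn(x)))

# A stack of N = 6 blocks applied to a (batch=2, seq_len=10, d_model=512) input
blocks = nn.ModuleList(EncoderBlock() for _ in range(6))
x = torch.randn(2, 10, 512)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([2, 10, 512])
```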


Transformer Architecture Overview - Decoder
 N blocks are stacked.
 Each block has three layers:
  Masked Multi-Head Attention (Self-Attention) + Add (Residual Connection) + Norm (LayerNorm)
  Multi-Head Attention (Co-Attention with the encoder output) + Add (Residual Connection) + Norm (LayerNorm)
  Feed Forward + Add (Residual Connection) + Norm (LayerNorm)
 Output of Block 1 – Block N−1: input to the next block.
 Output of Block N: input to the subsequent linear layer.
[Figure: the same full Transformer diagram, with the decoder stack highlighted: Masked Multi-Head Attention, encoder–decoder Multi-Head Attention, and Feed Forward, each followed by Add & Norm, ending in Linear + Softmax over the Output Probabilities.]
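A matching PyTorch sketch of one decoder block with its three sub-layers (masked self-attention, encoder–decoder attention, feed-forward), again with assumed dimensions and post-norm residual connections.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, encoder-decoder attention, feed-forward."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, enc_out, causal_mask):
        # 1. Masked multi-head self-attention over the (shifted-right) target sequence
        attn, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norms[0](y + self.dropout(attn))
        # 2. Encoder-decoder attention: Q from the decoder, K and V from the encoder output
        attn, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norms[1](y + self.dropout(attn))
        # 3. Position-wise feed-forward
        return self.norms[2](y + self.dropout(self.ffn(y)))

# Toy usage: target length 7, source length 10
y, enc_out = torch.randn(2, 7, 512), torch.randn(2, 10, 512)
mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)  # hide future positions
print(DecoderBlock()(y, enc_out, mask).shape)  # torch.Size([2, 7, 512])
```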
Input module
 The input part can be divided into two parts:
  Source text embedding layer + position encoder (in front of the encoder, on the left)
  Target text embedding layer + position encoder (in front of the decoder, on the right)
[Figure: source text embedding layer and position encoder feeding the encoder; target text embedding layer and position encoder feeding the decoder.]
Input part: number code
 In an NLP task, to make it easier for the computer to process string data (e.g., "I am a student"), we first give each word a numeric code based on the vocabulary.
 For example, if the English vocabulary has 20000 entries, "I am a student" may correspond to the dictionary {"I": 37, "am": 4201, "a": 4110, "student": 1997}.
 So Inputs is simply an array of indices. During model input, the input length is usually fixed for ease of processing. Assume the length is fixed to 10; the blank (padding) index 0 fills the remaining positions.

 I am a student → Inputs: 37 4201 4110 1997 0 ... 0, shape: (10,)
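A small Python sketch of this lookup-and-pad step; the vocabulary below reuses the indices from the slide, and the padding length of 10 follows the example.

```python
# Toy vocabulary reusing the indices from the slide; index 0 is reserved for padding.
vocab = {"<pad>": 0, "I": 37, "am": 4201, "a": 4110, "student": 1997}
MAX_LEN = 10

def encode(sentence, vocab, max_len=MAX_LEN):
    """Map each word to its numeric code and pad with 0 up to a fixed length."""
    ids = [vocab[word] for word in sentence.split()]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

print(encode("I am a student", vocab))
# [37, 4201, 4110, 1997, 0, 0, 0, 0, 0, 0]  -> shape (10,)
```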
Input part: word vector transformation
 Next comes the text embedding layer. Its function is to convert each word index into a vector (usually called a word vector), so that features of a word can be represented in a high-dimensional vector space, similar to an ID card: the longer the representation, the more information it can carry. The word vector dimension in the paper is 512.
 In practice the embedding layer is implemented as a single linear layer (width 512), i.e., a lookup table of word vectors.
[Figure: Inputs of shape (10,), i.e. 37 4201 4110 1997 0 ... 0, are mapped by the embedding layer to a matrix of word vectors of shape (10, 512).]


Input part: location code
 Unlike RNNs, which inherently model word order, Transformers use position encoding to inject sequential information. The word vector is combined (element-wise added) with positional information to produce a position-aware representation.
[Figure: the (10, 512) embedding matrix is added to a (10, 512) positional encoding matrix, yielding the (10, 512) position-aware input.]
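The slides do not give the position-encoding formula, but the original paper uses fixed sinusoidal encodings; the sketch below combines an embedding lookup with that sinusoidal scheme for a (10, 512) input, matching the shapes in the figure. Treat it as one possible implementation rather than the deck's exact code.

```python
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pos = torch.arange(max_len).unsqueeze(1).float()                    # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-torch.log(torch.tensor(10000.0)) / d_model))    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

vocab_size, d_model, max_len = 20000, 512, 10
embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)

inputs = torch.tensor([37, 4201, 4110, 1997, 0, 0, 0, 0, 0, 0])  # (10,)
x = embedding(inputs)                                             # (10, 512) word vectors
x = x + sinusoidal_positional_encoding(max_len, d_model)          # add positional information
print(x.shape)  # torch.Size([10, 512])
```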


Summary: Input Module Overview
 Each word is converted into a word vector using a word embedding algorithm, and the positional-encoding information is superimposed on it. In the Transformer, the dimension of the word embedding vector is 512, and the result is then fed into the self-attention layer.
[Figure: for the input "Je suis", the time-step-based word embeddings x₁, x₂ are added to the location codes t₁, t₂ to give the position-aware embeddings X₁, X₂, which are fed into the stacked encoders (Encoder #0, Encoder #1) and decoders (Decoder #0, Decoder #1).]
Understanding of the Self-Attention Mechanism
 To understand the Multi-Head Attention layer, first understand the Self-Attention layer. Self-attention is about focusing on the parts that need attention.
 How does the computer know that "it" refers to "animal" in the following sentence?

 The animal didn't cross the street because it was too tired.

 That is what self-attention needs to do: associate "it" with "animal".
[Figure: visualization of the multi-head self-attention layer on the example sentence.]
Principle of the Self-Attention Mechanism (1)
 For self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input. The calculation can be viewed as a weight-allocation function that computes how much attention the model pays to each part of the input data:

 $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Principle of the Self-Attention Mechanism (2)
 Calculate the similarity between the query and each key; this similarity is usually a dot product or another similarity function.
 Apply the softmax operation to the similarities to obtain the weight distribution.
 Multiply the weight distribution by the values to obtain the context vector.

 Shapes for a single head: the input X is (N, 512), each projection matrix is (512, 64), and Q, K, V and the head output are (N, 64).
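A NumPy sketch of those shapes for one head: an (N, 512) input is projected by assumed (512, 64) matrices into Q, K, V, and scaled dot-product attention returns an (N, 64) output.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
N, d_model, d_k = 10, 512, 64

X = rng.normal(size=(N, d_model))            # same input for Q, K and V
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (N, 64)
weights = softmax(Q @ K.T / np.sqrt(d_k))    # (N, N) attention weights
Z = weights @ V                              # (N, 64) single-head output
print(Q.shape, weights.shape, Z.shape)
```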
Principle of the Multi-Head Self-Attention Mechanism
 Multi-head attention is a method of extending self-attention which allows the model to capture information from different representation subspaces.
 Each of the 8 heads produces an (N, 64) output; the heads are concatenated into (N, 64*8) = (N, 512) and multiplied by an output matrix of shape (512, 512) to give the final (N, 512) result.
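A sketch of that split-concatenate-project pattern in NumPy, wrapping the single-head computation from the previous sketch into a helper; the per-head projection matrices and the output matrix W_O are random placeholders with the shapes from the slide.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    """Run each head in its own 64-dim subspace, concatenate, then project with W_O."""
    outputs = [single_head(X, *head) for head in heads]   # 8 x (N, 64)
    return np.concatenate(outputs, axis=-1) @ W_O         # (N, 512) @ (512, 512)

rng = np.random.default_rng(4)
N, d_model, d_k, n_heads = 10, 512, 64, 8
X = rng.normal(size=(N, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
print(multi_head_attention(X, heads, W_O).shape)  # (10, 512)
```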
Question?
• What can be the advantages of multi-head self-attention?
Effect of the Multi-Head Self-Attention Mechanism in the Encoder
 When encoding "it", one attention head focuses on "The animal" and another on "tired". The model's representation of "it" therefore incorporates part of the representation of both "animal" and "tired".
 The essence of multi-head attention is that, with the total number of parameters unchanged, attention is computed in different subspaces of the original high-dimensional space, and the attention information from the different subspaces is combined in the last step.
 Because attention has different distributions in different subspaces, multi-head attention effectively searches for relationships between sequence elements from different angles and then synthesizes the relationships captured in the different subspaces in the final concatenation step.

 The animal didn't cross the street because it was too tired.
Add & Norm on the Multi-Head Self-Attention Layer
 The feature matrix Z obtained through the process above is added to the original input X as a residual connection, giving the result R:
  R = X + Z, with X, Z and R all of shape (N, 512).
 Finally, R undergoes layer normalization to obtain R*, which is the output of the multi-head attention layer:
  R* = LayerNorm(R), shape (N, 512).
[Figure: inside the encoder, the multi-head self-attention layer is followed by Layer Normalization; a comparison sketch shows Layer Normalization normalizing over the word-vector (feature) dimension F of each position, while Batch Normalization normalizes over the N samples and the sequence-length dimension C.]


Feedforward fully connected layer
 The feed-forward network is a fully connected network consisting of two linear layers.
 Effect: the attention mechanism alone may not fit complex mappings well enough, so the model's capacity is enhanced by adding this two-layer network (experimental results show this works better).
 After the feed-forward output is obtained, a residual connection and layer normalization are applied again to give the output R1 of the feedforward fully connected layer.
[Figure: in the encoder's feedforward fully connected layer, R* of shape (N, 512) passes through Fully Connected Layer (64) → ReLU → Fully Connected Layer (512) → Dropout (0.1), is added back to R* (residual connection), and goes through Layer Normalization to give R1 of shape (N, 512). Note that the original paper uses an inner feed-forward dimension of 2048.]
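A PyTorch sketch of this sub-layer as drawn in the figure (linear → ReLU → linear → dropout, residual add, LayerNorm). The inner width is left as a parameter since the slide shows 64 while the original paper uses 2048.

```python
import torch
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    """Two-layer feed-forward network with residual connection and LayerNorm."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first fully connected layer
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # second fully connected layer, back to 512
            nn.Dropout(dropout),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, r_star):
        # Residual connection around the two-layer network, then layer normalization
        return self.norm(r_star + self.net(r_star))

r_star = torch.randn(10, 512)                 # (N, 512) output of the attention sub-layer
print(FeedForwardSublayer()(r_star).shape)    # torch.Size([10, 512])
```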
Summary Review: Encoder Workflow
 In short, an encoder block transforms the input features X of shape (N, 512) into features R* of the same shape by passing them through the multi-head self-attention layer and then the feedforward fully connected layer.
[Figure: input features (N, 512) → multi-head self-attention layer of the encoder → feedforward fully connected layer of the encoder → R* of shape (N, 512).]
Masked multi-head self-attention layer
 The only difference between the masked multi-head self-attention layer and the original multi-head attention layer is that a mask is added.
 Specifically, a sequence mask is used to prevent the decoder from seeing future information.
 That is, for a sequence at time step t, the decoded output should depend only on the outputs before time t, not on the outputs after time t.
 Therefore, the information after t needs to be hidden by adding -inf to its attention scores before the softmax. For example, for a short target sequence (0 = visible, -inf = hidden):
  t = 1 ("I"):    0  -inf -inf -inf
  t = 2 ("I am"): 0   0   -inf -inf
  t = 3:          0   0    0   -inf
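A short PyTorch sketch of building such an additive sequence (causal) mask and applying it before the softmax; the length-4 sequence is an assumed toy size.

```python
import torch

def causal_mask(size):
    """Additive sequence mask: 0 on and below the diagonal, -inf above it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

scores = torch.randn(4, 4)               # raw attention scores Q K^T / sqrt(d_k)
masked = scores + causal_mask(4)         # hide positions after the current time step
weights = torch.softmax(masked, dim=-1)  # future positions receive weight 0
print(causal_mask(4))
print(weights)
```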
Multi-head attention layer related to the encoder
 It differs from the other multi-head attention layers in that:
  K and V come from the output of the last encoder block;
  Q comes from the output of the previous decoder sub-layer.
[Figure: inside the decoder block, K and V arrive from the encoder while Q arrives from the layer below in the decoder.]
Last output part in the Decoder
 The decoder output of shape (N, 512) goes through a fully connected (Linear) layer with vocab_size output nodes, producing the feature values (logits) over the indices 0, 1, 2, …, vocab_size−1.
 Softmax converts the logits into probability values (log_probs).
 The index with the highest probability (here index 2) is selected, and the corresponding word in the vocabulary ("student") is the output.
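A PyTorch sketch of this output head; vocab_size = 20000 follows the earlier vocabulary example, and the weights are untrained, so the predicted index is only illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 20000, 512
output_head = nn.Linear(d_model, vocab_size)       # fully connected layer with vocab_size nodes

decoder_output = torch.randn(10, d_model)          # (N, 512) from the last decoder block
logits = output_head(decoder_output)               # (N, vocab_size) feature values (logits)
log_probs = torch.log_softmax(logits, dim=-1)      # probability values (log_probs)
predicted_index = log_probs.argmax(dim=-1)         # index of the highest probability per position
print(predicted_index.shape)                       # torch.Size([10])
# Each index would then be mapped back to its word, e.g. 1997 -> "student".
```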
Self-Attention Performance Comparison

 Self-attention has the same constant maximum path length as a fully connected network, so the Transformer is well suited to modeling long-distance dependencies; at the same time it is more parameter-efficient than a fully connected network and handles variable-length sequences better.
 Because a convolutional layer has a limited receptive field, a deep stack of layers is usually necessary to obtain a global receptive field. In contrast, the constant maximum path length lets self-attention model long-range dependencies within a constant number of layers.
 The constant number of sequential operations and the constant maximum path length make self-attention more parallelizable than an RNN and better at long-range modeling.
Advantages or Disadvantages of
Transformers?
Transformer-based Pretraining
BERT pre-training model
 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, a 2019 paper by the Google team. Similar to GPT (Generative Pre-Training), the goal is to learn a general model that can be applied to a large number of tasks and adapted to different types of tasks with only slight modifications.
[Figure: BERT stacks Transformer encoder blocks (Trm) over the input embeddings E₁ … E_n to produce representations T₁ … T_n; each block is the standard encoder block of Input Embedding + Positional Encoding, Multi-Head Attention, Add & Norm, and Feed Forward, repeated N×.]
Fine-tuning of BERT in different tasks
[Figure: BERT fine-tuning setups for four task types: single-sentence classification, sentence-pair classification, question answering (Q&A), and sequence labeling (annotation).]


GPT Series Pre-trained Language Model (1)
 The simplest intelligence-test game: guess what the next word is. "I ate a lot today, I'm so ____" (full? hungry?)
 Language model: words are represented as vectors and the next word is continuously predicted through vector calculation. The essence is to learn the statistical (probability) relationships between words from a massive text corpus.
 Transformer language model: words are used as vectors; the model takes a word-vector sequence as input and predicts the next word vector.

 1. Input: a partial sentence (word-vector sequence), e.g. "The capital of China is".
 2. The Transformer (vector calculation model) predicts the vector of the next word.
 3. The vector is converted into probabilities over candidate words, e.g. Beijing 0.8, Nanjing 0.1, Tokyo 0.02, Washington 0.01, ...
 4. A word with high probability is sampled: "Beijing".
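A small Python sketch of steps 3 and 4, using the candidate probabilities shown on the slide; sampling without a temperature parameter is an illustrative simplification.

```python
import random

# Step 3: probabilities of candidate next words for "The capital of China is"
# (remaining probability mass is spread over the rest of the vocabulary)
candidates = {"Beijing": 0.8, "Nanjing": 0.1, "Tokyo": 0.02, "Washington": 0.01}

# Step 4: sample a word, so high-probability words are chosen most of the time
words, probs = zip(*candidates.items())
next_word = random.choices(words, weights=probs, k=1)[0]
print("The capital of China is", next_word)
```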
GPT Series Pre-trained Language Model (2)
 Constantly predicting the next word can accomplish various tasks:

 [Factual Q&A] The largest lake in the world is Lake Superior.

 [Translation] "Apple" is "apple" in English.

 [News Continuation] Scientists have discovered a new dinosaur fossil in the Hengduan Mountains of Yunnan, a species of dinosaur that has never been discovered before. The fossil belongs to the middle-to-late Jurassic period, and scientists have placed the site under perimeter patrol protection...

 [Sentiment Classification] "I'm tired but happy to go out today" is positive.

 [Text Summarization] The 29th Summer Olympic Games, also known as the Beijing Olympic Games, opened at 8 p.m. on August 8, 2008 in Beijing, China. In a word, the opening ceremony of the 2008 Beijing Olympic Games was a complete success.
