Sequence-To-Sequence, Attention, Transformer
Contents
Sequence-To-Sequence
Attention
Transformer
BERT
GPT, GPT-2 and GPT-3
Sequence-To-Sequence
In the context of Machine Learning a sequence is an ordered data structure, whose successive
elements are somehow correlated.
Examples: texts (sequences of words or characters), speech signals, video (sequences of frames) and other time series.
The crucial property of sequences is the correlation between the individual datapoints. This
means that for each element (datapoint) of the sequence, information is not only provided by
its individual feature-vector, but also by the neighboring datapoints. For each element of the
sequence, the neighboring elements are called its context, and we can understand an individual
element by taking its context into account.
An example of how the human brain exploits context information for recognition is given
in the image below.
For all types of sequential data, Machine Learning algorithms should learn models which regard
not only individual feature vectors, but also contextual information. For example, Recurrent Neural
Networks (RNNs) are capable of doing so. In this section more complex ML architectures, suitable
for sequential data, will be described. Some of these architectures integrate Recurrent Neural
Networks. More recent architectures, Transformers, integrate the concept of Attention. Both
Attention itself and its integration in Transformers will be described in this section.
Sequence-To-Sequence (Seq2Seq) models map input sequences $x = (x_1, x_2, \ldots, x_{T_x})$
to output sequences $y = (y_1, y_2, \ldots, y_{T_y})$.
The lengths of input- and output sequence need not be the same.
Applications of Seq2Seq models are e.g. Language Models (LM) or Machine Translation.
The Recurrent layer calculates the hidden states $h_i = f(h_{i-1}, x_i)$, where the function f() is realized by a Vanilla RNN, LSTM or GRU. The Dense layer at the output
realizes the function

$$y_i = g(h_i) \quad \forall i \in [1, T].$$
If the Dense layer at the output has a softmax-activation, and the architecture is trained to
predict the next token in the sequence, the output at each time step t is the conditional
distribution
$$p(x_t \mid x_{t-1}, \ldots, x_1).$$
In this way a language model can be implemented. Language models allow one to predict a target
word from the context words (neighbouring words).
Fig. 79 Simple Seq2Seq architecture for aligned input and output sequences. The input
sequence is processed (green) by a Recurrent Neural layer (Vanilla RNN, LSTM, GRU, etc.) and
the hidden states (blue) at the output of the Recurrent layer are passed to a dense layer with
softmax-activation. The output sequence (red) is aligned to the input sequence in the sense
that each $y_i$ corresponds to $x_i$. This also implies that both sequences have the same length.
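A minimal PyTorch sketch of such an aligned architecture is given below. It is only illustrative: the framework choice, vocabulary size and layer dimensions are assumptions, not part of the lecture.

```python
import torch
import torch.nn as nn

class AlignedSeq2Seq(nn.Module):
    """Recurrent layer + Dense softmax layer: one output token per input token."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # f(): here an LSTM
        self.out = nn.Linear(hidden_dim, vocab_size)                 # g(): Dense layer

    def forward(self, x):                                 # x: (batch, T) token ids
        h, _ = self.rnn(self.embed(x))                    # hidden states h_1, ..., h_T
        return torch.log_softmax(self.out(h), dim=-1)     # p(x_t | x_{t-1}, ..., x_1) per step

# Language-model training: the target at step t is the next token x_{t+1}
model = AlignedSeq2Seq()
tokens = torch.randint(0, 10000, (8, 20))                 # dummy batch of token sequences
log_probs = model(tokens[:, :-1])                         # predict from the prefix
loss = nn.NLLLoss()(log_probs.reshape(-1, 10000), tokens[:, 1:].reshape(-1))
```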
Encoder-Decoder Architectures
In an Encoder-Decoder architecture, the Encoder maps the input sequence $x = (x_1, x_2, \ldots, x_{T_x})$
to an intermediate representation, also called context vector, $c$. The entire information of the
sequence is compressed into this vector. The context vector is applied as input to the Decoder,
which outputs a sequence $y = (y_1, y_2, \ldots, y_{T_y})$. With this architecture the input and output
sequence need not be aligned.
There exists a plethora of different Seq2Seq Encoder-Decoder architectures. Here, we first refer
to one of the earliest architectures, introduced by Cho et al. in [CvMG+14].
As depicted in image Seq2Seq-Encoder-Decoder, the encoder processes the input sequence and
compresses the information into a fixed-length context vector $c$. For example, if the input
sequence consists of the words of a sentence, the context vector is also called sentence embedding. In
general the context vector is calculated from the hidden states $h_{e,t} = f(h_{e,t-1}, x_t)$ of the Encoder, e.g. as the final hidden state $c = h_{e,T_x}$, where the function f() is realized by a Vanilla RNN, LSTM, GRU, etc.
The Decoder predicts the next word $y_i$ from
the context vector $c$,
the previous predicted word $y_{i-1}$, and
the hidden state $h_{d,i}$ of the decoder at time $i$,
according to $h_{d,i} = k(h_{d,i-1}, y_{i-1}, c)$ and $y_i = g(h_{d,i}, y_{i-1}, c)$, where the functions g() and k() are realized by Vanilla RNN, LSTM, GRU, etc. Since g() must
output a probability distribution, it shall apply softmax-activation.
This Seq2Seq-Encoder-Decoder architecture has been proposed for Machine Translation. In this
application a single sentence of the source language is the input sequence and the
corresponding sentence in the target language is the output sequence. Translation can either be
done on character- or on word-level. On character-level the elements of the sequences are
characters, on word-level the sequence elements are words. Here, we assume translation on
word-level.
1. Encoder: Feed the input sentence in the source language $x^{(j)}$ to the Encoder. The sentence is a
sequence of words. Words are represented by their word-embedding vectors.
2. Encoder: For the current sentence at the Encoder calculate the context vector $c$.
3. Set $i := 1$, $\hat{y}_0 = START$, $h_{d,0} = 0$.
4. For all words $y_i^{(j)}$ of the target sentence: apply the error between the known target word $y_i^{(j)}$ and the output $\hat{y}_i^{(j)}$ of the decoder in
the loss function, which is minimized by adapting the network weights; then increment $i$ and continue with the next word.
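The following sketch illustrates these steps for one batch of training sentence pairs. It is a simplified illustration under assumed dimensions and token ids (START id, vocabulary sizes), using GRU cells and teacher forcing as in step 4.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID, START = 8000, 9000, 128, 256, 1   # assumed sizes/ids

src_embed = nn.Embedding(SRC_VOCAB, EMB)
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)         # f()
decoder_cell = nn.GRUCell(EMB + HID, HID)            # k(): receives y_{i-1} and c
output_layer = nn.Linear(HID, TGT_VOCAB)             # g(); softmax is inside the loss

def train_step(src_ids, tgt_ids, loss_fn=nn.CrossEntropyLoss()):
    # Steps 1/2: Encoder compresses the source sentence into the context vector c
    _, h_enc = encoder(src_embed(src_ids))            # h_enc: (1, batch, HID)
    c = h_enc[0]                                      # context vector

    # Step 3: initialize decoder state and first input token (START)
    h_d = torch.zeros_like(c)
    prev = torch.full((src_ids.size(0),), START, dtype=torch.long)
    loss = 0.0
    # Step 4: predict the target words one after another (teacher forcing)
    for i in range(tgt_ids.size(1)):
        h_d = decoder_cell(torch.cat([tgt_embed(prev), c], dim=-1), h_d)
        logits = output_layer(h_d)
        loss = loss + loss_fn(logits, tgt_ids[:, i])
        prev = tgt_ids[:, i]                          # known target word as next input
    return loss / tgt_ids.size(1)

loss = train_step(torch.randint(0, 8000, (4, 12)), torch.randint(0, 9000, (4, 10)))
```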
Drawbacks of Seq2Seq Encoder-Decoder: The Decoder estimates one word after another and
applies the estimated word at time i as an input for estimating the next word at time i + 1. As
soon as one estimate is wrong, the successive step perceives an erroneous input, which may
cause the next erroneous output and so on. Such error propagation can not be avoided in this
type of Seq2Seq Encoder-Decoder architecture. Moreover, for long sequences, the single fixed-length
context vector c encodes information from the last part of the sequence quite well, but
may have forgotten information from the early parts.
Attention
Concept of Attention
Attention is a well known concept in human recognition. Given a new input, the human brain
focuses on an essential region, which is scanned with high resolution. After scanning this region,
other relevant regions are inferred and scanned. In this way fast recognition can be realized without scanning
the entire input in detail. Examples of attention in visual recognition and in
reading are given in the images below.
Fig. 81 Attention in visual recognition: In this example attention is first focused on the
mouth. With this first perception alone the object can not be recognized. Then attention is
focused on something around the mouth. After seeing the ears the object can be recognized
to be a horse.
Fig. 82 Attention in reading: When we read horse, we expect to encounter a verb, which is
associated with horse. When we read jumped, we expect to encounter a word, which is
associated with horse and jumped. When we read hurt, we expect to encounter a word, which
is associated with jumped and hurt.
In attention layers the hidden states of all time-steps have an equal chance to contribute to the
representation of the entire sequence. The relevance of the individual elements for the entire
sequence-representation is learned.
Fig. 83 Attention layer on top of a unidirectional (top) and a bidirectional (bottom) RNN,
respectively. For each time-step a context vector c(i) is calculated as a linear combination of
all inputs over the entire sequence.
As sketched in the image above, in an attention layer a context vector $c(i)$ is
calculated for each time-step as a linear combination of all inputs over the entire sequence. The coefficients $a_{i,j}$ of the
linear combination are learned from training data. In contrast to the usual weights $w_{i,j}$ in a
neural network, these coefficients vary with the current input. A high value of $a_{i,j}$ means that for
calculating the i-th context vector $c(i)$, the j-th element of the current input is important - or:
attention is focused on the j-th input.
In an architecture like the one depicted above, the Decoder is trained to predict the probability
distribution for the next word $y_i$, given the context vector $c_i$ and all the previously predicted
words $\{y_1, \ldots, y_{i-1}\}$, with

$$c_i = \sum_{j=1}^{T_x} a_{i,j} h_{e,j}.$$

The learned coefficients $a_{i,j}$ describe how well the two tokens (words) $x_j$ and $y_i$ are aligned.
The coefficients are obtained by a softmax over alignment scores:

$$a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T_x} \exp(e_{i,k})},$$

where

$$e_{i,j} = a(h_{d,i-1}, h_{e,j})$$

is an alignment model, which scores how well the inputs around position j and the output
at position i match.
The scoring function a() can be realized in different ways [1]. E.g. it can just be the scalar
product of the two hidden states, $e_{i,j} = h_{d,i-1}^T h_{e,j}$.
Another approach is to implement the scoring function as an MLP, which is jointly trained with all
other parameters of the network. This approach is depicted in the image below. Note that the
image refers to an architecture where the Attention layer is embedded into a simple Feed-Forward
Neural Network. However, this type of scoring can also be applied in the context of a
Seq2Seq-Encoder-Decoder architecture. In order to calculate coefficient $a_j$, the j-th input $h_j$ is
passed to the input of the MLP. The output $e_j$ is then passed to a softmax activation function:

$$a_j = \frac{\exp(e_j)}{\sum_{k=1}^{T_x} \exp(e_k)}$$
Fig. 85 Scoring function a() realized by an MLP with softmax-activation at the output. Here,
the attention layer is not embedded in a Seq2Seq Encoder-Decoder architecture, but in a
Feed-Forward Neural Network. Image Source: [RE16].
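A small NumPy sketch of this MLP-based scoring is given below. It is only illustrative: the MLP weights are random here, whereas in a real network they would be trained jointly with all other parameters.

```python
import numpy as np

def softmax(e):
    e = np.exp(e - e.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T_x, dim, hidden = 5, 8, 16           # sequence length, input size, MLP hidden size (assumed)
h = rng.normal(size=(T_x, dim))       # hidden states h_1, ..., h_Tx of the lower layer

# Scoring MLP: one hidden layer with tanh, scalar output e_j per input h_j
W1, W2 = rng.normal(size=(dim, hidden)), rng.normal(size=(hidden, 1))
e = (np.tanh(h @ W1) @ W2).ravel()    # unnormalized scores e_1, ..., e_Tx
a = softmax(e)                        # attention coefficients a_1, ..., a_Tx (sum to 1)

c = a @ h                             # context vector: weighted sum of the inputs
print(a.round(3), c.shape)            # attention weights and context-vector shape (dim,)
```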
Transformer
Motivation
Deep Learning needs huge amounts of training data and correspondingly high processing effort
for training. In order to cope with this processing complexity, GPUs/TPUs must be applied.
However, GPUs and TPUs only yield higher training speed if operations can be parallelized. The
drawback of RNNs (of any type) is that the recurrent connections can not be parallelized.
Transformers [VSP+17] exploit only Self-Attention, without recurrent connections. So they can
be trained efficiently on GPUs. In this section first the concept of Self-Attention is described.
Then Transformer architectures are presented.
Self Attention
As described above, in the Attention Layer

$$e_{i,j} = a(h_{d,i-1}, h_{e,j})$$

is an alignment model, which scores how well the input sequence around position j and the
output sequence at position i match. In Self-Attention,

$$e_{i,j} = a(h_i, h_j)$$

scores the match of different positions j and i of the sequence at the input. In the image
below the calculation of the outputs $y_i$ in a Self-Attention layer is depicted. Here, each output $y_i$ is a weighted sum of all inputs of the sequence, with weights obtained by applying a softmax to the scores $e_{i,j}$.
Fig. 86 Calculation of y 1 .
Contextual Embeddings
What is the meaning of the outputs of a Self-Attention layer? To answer this question, we focus
on applications where the inputs $x_i$ to the network are sequences of words. In this case, words
are commonly represented by their embedding vectors (e.g. Word2Vec, GloVe, Fasttext, etc.). The
drawback of such word embeddings is that they are context-free. E.g. the word tree has a unique
word embedding, independent of the context (tree as natural object or tree as a special type of
graph). On the other hand, the elements $y_i$ of the Self-Attention-layer output
$y = (y_1, y_2, \ldots, y_T)$ can be considered to be contextual word embeddings! The representation $y_i$
is a contextual embedding of the input word $x_i$ in the given context.
In the Query-Key-Value formulation of Self-Attention, each input vector $x_i$ is used in three different roles:
Query: It is compared to every other vector to establish the weights for its own output $y_i$
Key: It is compared to every other vector to establish the weights for the output of the j-th
vector $y_j$
Value: It is used as part of the weighted sum to compute each output vector once the
weights have been established.
the Query vector is obtained by multiplying input vector $x_i$ with the learnable matrix $W_q$: $q_i = W_q x_i$
the Key vector is obtained by multiplying input vector $x_i$ with the learnable matrix $W_k$: $k_i = W_k x_i$
the Value vector is obtained by multiplying input vector $x_i$ with the learnable matrix $W_v$: $v_i = W_v x_i$
$$a'_{i,j} = q_i^T k_j, \qquad a_{i,j} = \mathrm{softmax}(a'_{i,j}), \qquad y_i = \sum_j a_{i,j} v_j \qquad (101)$$
Fig. 88 Calculation of Self-Attention outputs y 1 from queries, keys and values of the input-
sequence
In the calculation defined in (101), the problem is that the softmax-function is sensitive to large
input values, in the sense that for large inputs most of the softmax outputs are close to 0 and the
corresponding gradients are also very small. The effect is very slow learning. In
order to circumvent this, the inputs to the softmax are normalized:

$$a'_{i,j} = \frac{q_i^T k_j}{\sqrt{d}}, \qquad a_{i,j} = \mathrm{softmax}(a'_{i,j})$$
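The following NumPy sketch computes single-head scaled dot-product Self-Attention along exactly these formulas; the random matrices $W_q$, $W_k$, $W_v$ stand in for learned parameters.

```python
import numpy as np

def softmax(a, axis=-1):
    a = np.exp(a - a.max(axis=axis, keepdims=True))
    return a / a.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
T, d = 6, 256                                # sequence length, embedding size (assumed)
X = rng.normal(size=(T, d))                  # input sequence x_1, ..., x_T

W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values for all tokens

A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # a_{i,j}: each row i sums to 1
Y = A @ V                                    # contextual embeddings y_1, ..., y_T
print(A.shape, Y.shape)                      # (6, 6) (6, 256)
```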
In Multi-Head Self-Attention, $h$ Self-Attention heads are applied in parallel. Each head $r$ applies its own matrices $W_q^r$, $W_k^r$, $W_v^r$ and calculates its own output sequence $y^r = (y_1^r, y_2^r, \ldots, y_{T_y}^r)$.
The length of the input vectors $x_i$ is typically d = 256. A typical number of heads is h = 8. For
combining the outputs of the h heads into the overall output vector $y_i$, there exist two different
options:
Option 1:
Cut each vector $x_i$ into h parts, each of size $d_s = d/h$
Each of these parts is fed to one head
Concatenation of $y_i^1, \ldots, y_i^h$ yields $y_i$ of size d
Multiply this concatenation with matrix $W_O$, which is typically of size $d \times d$
Option 2:
Feed the entire vector $x_i$ to each head.
The matrices $W_q^r$, $W_k^r$, $W_v^r$ are each of size $d \times d$
Concatenation of $y_i^1, \ldots, y_i^h$ yields a vector of size $d \cdot h$
Multiply this concatenation with matrix $W_O$, which is typically of size $d \times (d \cdot h)$
Fig. 90 Multi-Head Self-Attention: Combination of the individual heads to the overall output.
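Continuing the sketch above, Option 2 could look as follows. This is only illustrative: h = 8 heads with random matrices are used, and $W_O$ is written here as a $(d \cdot h) \times d$ matrix, i.e. the transpose of the convention in the list above.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, h = 6, 256, 8
X = rng.normal(size=(T, d))

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = np.exp(Q @ K.T / np.sqrt(d))
    A = A / A.sum(axis=-1, keepdims=True)        # row-wise softmax
    return A @ V                                 # (T, d)

# Option 2: feed the entire x_i to each head, each W^r is of size d x d
heads = [self_attention(X, *(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)))
         for _ in range(h)]
concat = np.concatenate(heads, axis=-1)          # (T, d*h)
W_O = rng.normal(size=(d * h, d)) / np.sqrt(d * h)
Y = concat @ W_O                                 # overall output, again of size (T, d)
print(concat.shape, Y.shape)                     # (6, 2048) (6, 256)
```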
Since Self-Attention itself is invariant to the order of the tokens, the position of each token is provided explicitly by adding a positional-encoding vector to its embedding. The vectors for positional encoding are designed such that the similarity of two vectors
decreases with increasing distance between the positions of the tokens to which they are added.
This is illustrated in the image below:
Fig. 92 Positional Encoding: To each position within the sequence a unique positional-encoding
vector is assigned. As can be seen, the Euclidean distance between vectors of positions that are
further apart is larger than the distance between vectors belonging to positions
close to each other. http://jalammar.github.io/illustrated-transformer
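A common concrete choice with exactly this property is the sinusoidal encoding of [VSP+17]; a minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal positional-encoding vectors: one row per position, d values each."""
    pos = np.arange(max_len)[:, None]                  # positions 0 .. max_len-1
    i = np.arange(d // 2)[None, :]                     # dimension index
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(50, 256)
# Nearby positions yield similar vectors, far-apart positions dissimilar ones:
print(np.linalg.norm(pe[0] - pe[1]), np.linalg.norm(pe[0] - pe[40]))
```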
Example
For the two-word example sentence Thinking Machines and for the case of a single head, the
calculations done in the Self-Attention block, as specified in Image Singlehead Self-attention, are
sketched in the image below. In this example positional encoding has been omitted for the sake of
simplicity.
Fig. 94 Example: Singlehead Self-Attention for the two-word sequence Thinking Machines.
Image source: http://jalammar.github.io/illustrated-transformer
Instead of calculating the outputs $z_i$ of a single head individually, all of them can be calculated
simultaneously by matrix multiplication:

$$Z = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V,$$

where the rows of the matrices $Q$, $K$ and $V$ are the query-, key- and value-vectors of all tokens of the input sequence. Image source: http://jalammar.github.io/illustrated-transformer
A typical Encoder block is depicted in the image below. In this image the Self-Attention module
is the same as already depicted in Image Multihead Self-attention. The outputs $z_i$ of the Self-Attention
module are exactly the contextual embeddings, which have been denoted by $y_i$ in
Image Multihead Self-attention. Each of the outputs $z_i$ is passed to a Multi-Layer Perceptron
(MLP). The outputs of the MLP are the new representations $r_i$ (one for each input token). These
outputs $r_i$ constitute the inputs $x_i$ to the next Encoder block.
Fig. 98 Encoder Block - simple variant: Self-Attention Layer followed by Feed Forward
Network. Image source: http://jalammar.github.io/illustrated-transformer
The image above depicts a simple variant of an Encoder block, consisting only of Self-Attention
and a Feed Forward Neural Network. A more complex and more practical option is shown in the
image below. Here, short-cut connections from the Encoder-block input to the output of the
Self-Attention Layer are implemented. The concept of such short-cuts has been introduced and
analysed in the context of ResNet ([HZRS15]). Moreover, the sum of the Encoder-block input and
the output of the Self-Attention Layer is layer-normalized (see [BKH16]), before it is passed to
the Feed Forward Net.
Fig. 99 Encoder Block - practical variant: Short-Cut Connections and Layer Normalisation are
applied in addition to Self-Attention and Feed Forward Network. Image source:
http://jalammar.github.io/illustrated-transformer
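A PyTorch sketch of this practical Encoder-block variant is given below; the dimensions and the use of torch.nn.MultiheadAttention are implementation choices for illustration, not prescribed by the lecture.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Multi-Head Self-Attention and Feed Forward Net, each with short-cut + LayerNorm."""
    def __init__(self, d=256, heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                       # x: (batch, T, d)
        z, _ = self.attn(x, x, x)               # Self-Attention: queries = keys = values = x
        x = self.norm1(x + z)                   # short-cut connection + layer normalization
        r = self.ff(x)                          # position-wise Feed Forward Net
        return self.norm2(x + r)                # short-cut connection + layer normalization

x = torch.randn(2, 10, 256)                     # batch of 2 sequences, 10 tokens each
out = EncoderBlock()(x)                         # same shape: input to the next Encoder block
```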
Image Encoder-Decoder illustrates the modules of the Decoder block and the linking of Encoder
and Decoder. As can be seen a Decoder block integrates two types of attention:
Self-Attention in the Decoder: Like in the Encoder block, this layer calculates queries, keys
and values from the output of the previous layer. However, since Self-Attention in the
Decoder is only allowed to attend to earlier positions[2] in the output sequence, future
tokens (words) are masked out.
Encoder-Decoder-Attention: Keys and values come from the output of the Encoder stack.
Queries come from the output of the previous layer. In this way an alignment between the
input- and the output-sequence is modelled.
On top of the Decoder stack a Dense Layer with softmax-activation is applied to calculate
the most probable next word. This predicted word is attached to the decoder input sequence for
calculating the most probable word in the next time step, which is then again attached to the
input in the next time step, and so on.
In the alternative Beam Search, not only the single most probable word is predicted in each time step,
but the B most probable words are kept and applied in the input of the next time step. The
parameter B is called beam size.
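A sketch of such a greedy decoding loop is given below; `encoder`, `decoder` and the special token ids are placeholders for an actual trained model, not a concrete API.

```python
import torch

def greedy_decode(encoder, decoder, src_ids, start_id, end_id, max_len=50):
    """Iteratively predict the most probable next token and append it to the input."""
    memory = encoder(src_ids)                        # keys/values from the Encoder stack
    out_ids = [start_id]
    for _ in range(max_len):
        tgt = torch.tensor([out_ids])                # previously predicted tokens
        logits = decoder(tgt, memory)                # (1, len(out_ids), vocab_size)
        next_id = int(logits[0, -1].argmax())        # most probable next word
        out_ids.append(next_id)
        if next_id == end_id:                        # stop when end-of-sequence is predicted
            break
    return out_ids

# Beam Search would instead keep the B most probable partial sequences in each step.
```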
In the image below the iterative prediction of the tokens of the target sequence is illustrated. In
iteration i = 4 the 4-th target token must be predicted. For this the decoder takes as input the
i − 1 = 3 previous estimates and the keys and values from the Encoder stack.
Fig. 101 Prediction of the 4-th target word, given the 3 previous predictions. Image source:
http://jalammar.github.io/illustrated-transformer
BERT
BERT (Bidirectional Encoder Representations from Transformers) has been introduced in
[DCLT19]. BERT is a Transformer. As described above, Transformers often contain an Encoder-
and a Decoder-stack. However, since BERT primarily constitutes a Language Model (LM), it
consists only of an Encoder stack. When it was published in 2019, BERT achieved new state-of-the-art
performance on 11 NLP tasks, including the GLUE benchmark[3]. Pre-trained BERT models
can be downloaded, e.g. from Google’s Github repo, and easily be adapted and fine-tuned for
custom NLP tasks.
BERT’s main innovation is that it defines a Transformer, which learns a Language Model bi-directionally.
As sketched in image Comparison with GPT-1 and Elmo, previous Deep Neural Network
LMs were either uni-directional (GPT) or a shallow concatenation of two independently trained uni-directional models (ELMo).
BERT learns bi-directional relations in text, by a training approach, which is known from
Denoising Autoencoders: The input to the network is corrupted (in BERT tokens are masked
out) and the network is trained such that its output is the original (non-corrupted) input.
Fig. 102 Comparison of BERT with GPT-1 and ELMo. Image source: [DCLT19]
BERT training is separated into 2 stages: Pre-Training and Fine-Tuning. During Pre-Training, the
model is trained on unlabeled data for the tasks Masked Language Model (MLM) and Next
Sentence Prediction (NSP). Fine-Tuning starts with the parameters that have been learned in
Pre-Training. There exist different downstream tasks, such as Question-Answering, Named-Entity-Recognition
or Multi-Genre Natural Language Inference (MNLI), for which BERT’s parameters can be fine-tuned.
Depending on the downstream task, the BERT architecture must be slightly adapted for
Fine-Tuning.
Fig. 103 BERT: Pretraining on tasks Masked Language Model and Next Sentence Prediction,
followed by task-specific Fine-Tuning. Image source: [DCLT19]
In BERT, tokens are not words but word-pieces. This yields better out-of-vocabulary
robustness.
BERT Pre-Training
Masked Language Model (MLM): For this task, 15% of the input tokens are selected at random.
Since the [MASK] token is not seen during fine-tuning, not all selected tokens are replaced by this
marker. Instead,
80% of the selected tokens are replaced by the [MASK] token,
10% are replaced by a random token,
10% are kept unchanged.
These masked tokens are predicted by passing the final hidden vectors, which belong to the
masked tokens, to an output softmax over the vocabulary. The loss function, which is minimized
during training, regards only the predictions of the masked tokens and ignores the predictions of
the non-masked words. As a consequence, the model converges more slowly than directional models,
but has increased context awareness.
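The masking procedure can be sketched as follows for a list of token ids; the 15% selection rate and the 80/10/10 split follow [DCLT19], while the concrete ids are placeholders.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30000     # placeholder ids, e.g. for a WordPiece vocabulary

def mask_tokens(token_ids, p_select=0.15):
    """Return the corrupted input and the positions/labels the MLM loss is computed on."""
    corrupted, labels = list(token_ids), {}
    for pos, tok in enumerate(token_ids):
        if random.random() < p_select:               # select 15% of the tokens
            labels[pos] = tok                        # loss is computed only on these positions
            r = random.random()
            if r < 0.8:
                corrupted[pos] = MASK_ID             # 80%: replace by [MASK]
            elif r < 0.9:
                corrupted[pos] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels

corrupted, labels = mask_tokens([2023, 2003, 1037, 7953, 6251])
```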
For NSP pairs of sentences (A, B) are composed. For about 50% of these pairs the second
sentence B is a true successive sentence of A. In the remaining 50% B is a randomly selected
sentence, independent of sentence A. The BERT architecture is trained to estimate if the second
sentence at the input is a true successor of A or not. The pairs of sentences at the input of the
BERT-Encoder stack are configured as follows:
A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is
inserted at the end of each sentence.
A sentence embedding indicating Sentence A or Sentence B is added to each token.
These sentence embeddings are similar in concept to token embeddings with a vocabulary
of 2.
A positional embedding is added to each token to indicate its position in the sequence.
Fig. 104 Input of sentence pairs to BERT Encoder stack. Segment Embedding is applied to
indicate first or second sentence. Image source: [DCLT19]
For the NSP task a classifier is trained, which distinguishes successive sentences and non-
successive sentences. For this the output of the [CLS] token is passed to a binary classification
layer. The purpose of adding such Pre-Training is that many NLP tasks such as Question-
Answering (QA) and Natural Language Inference (NLI) need to understand relationships between
sentences.
BERT Fine-Tuning
For each downstream NLP task, task-specific inputs and outputs are applied to fine-tune all
parameters end-to-end. For this minor task-specific adaptations at the input- and output of the
architecture are required.
Classification tasks such as sentiment analysis are done similarly to NSP, by adding a
classification layer on top of the Transformer output for the [CLS] token.
In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question
regarding a text sequence and is required to mark the answer in the sequence. The model
can be trained by learning two extra vectors that mark the beginning and the end of the
answer.
In Named Entity Recognition (NER), the software receives a text sequence and is
required to mark the various types of entities (Person, Organization, Date, etc) that appear
in the text. The model can be trained by feeding the output vector of each token into a
classification layer that predicts the NER label.
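As an illustration of how such task-specific heads are used in practice, the following sketch loads a pre-trained BERT model with a classification head via the Hugging Face transformers library; the library, model name and number of labels are choices made for this example and are not part of the lecture.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT encoder plus a freshly initialized classification layer on [CLS]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This lecture on transformers is great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 2): scores for the two classes
print(logits.softmax(dim=-1))

# Fine-tuning would now minimize a cross-entropy loss on labeled examples,
# updating all parameters end-to-end.
```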
GPT, GPT-2 and GPT-3
The GPT variants are Autoregressive Language Models (AR LM). An AR LM predicts for a given
token sequence $(x_1, x_2, \ldots, x_k)$ the following token $x_{k+1}$. Then it predicts from
$(x_2, x_3, \ldots, x_{k+1})$ the next word $x_{k+2}$, and so on. In contrast to BERT it therefore integrates only
the previous context. However, since AR LMs can predict the next tokens from given prompts,
they are applicable for tasks like generating text or any type of sequence-to-sequence
transformation such as translation, text-to-code, etc.
Moreover, based on the fact that an AR LM is trained for text-completion, the authors of GPT-2
([RWC+19]) proposed to implement multi-task-learning at the data level. What does this mean?
Usually multi-task-learning is realized not on a data- but on an architectural level. A typical
approach is to have a common network part, e.g. the first layers, which constitutes the feature
extractor, and on top of this common part two or more task-specific architectures in parallel, e.g.
one stack of layers for classification, one stack of layers for Named-Entity-Recognition and so
on. Each task-specific part is trained with task-specific labeled data. BERT and the first GPT are
examples of this approach. The drawback is that task-specific fine-tuning still requires quite
large amounts of data.
In contrast to this architectural solution for multi-task learning, in data-level multi-task learning
only one common architecture is applied and the task-description is part of the input data.
Instead of predicting
p(output|input),
p(output|input, task)
is predicted, i.e. for one and the same input, different outputs can be generated. This is possible
if the model is an AR LM, because the input and the task-description are just sequences of
tokens. For example, the two different tasks translate English to French and translate English to
German can be implemented by providing training data of type
(translate to french, < english text >, < french text >)
and
(translate to german, < english text >, < german text >).
GPT-2 proposes to take data of this format and to train the model only with the text-completion
objective; this should suffice to learn all the underlying tasks. Since the
unsupervised and supervised objectives are the same, the global minima of the unsupervised
objective are the same as the global minima of the supervised objective. Moreover, since the
model is not trained specifically for any of the underlying tasks (translation, summarization,
question-answering, etc.), it is said to be an example of few-shot, one-shot or zero-shot
learning. In few-shot learning one just has to explain the task and provide a few examples for
this task in the inference phase, i.e. there is no task-specific weight adaptation. The concept of
the x-shot approaches is depicted in the image below. The fact that no fine-tuning on
downstream tasks is required is a step towards general intelligence.
Fig. 107 Image source: [BMR+20]. Zero-shot, one-shot and few-shot, contrasted with
traditional fine-tuning. The panels above show four methods for performing a task with a
language model. Fine-tuning is the traditional method, whereas zero-, one-, and few-shot,
which are applied in GPT-3, require the model to perform the task with only forward passes
at test time. In the few shot setting typically a few dozen examples are presented to the
model. Exact phrasings for all GPT-3 task descriptions, examples and prompts can be found
in Appendix G of the paper.
Of course, before x-shot learning can be applied, the model must be pretrained. GPT is
pretrained as an autoregressive language model from unsupervised data (large amounts of
text).
GPT-3 is evaluated on more than two dozen datasets. For each of these tasks, it is evaluated for
3 settings: zero-shot, one-shot, and few-shot.
The largest GPT-3 model consists of 175 billion parameters, which is 470 times more than BERT,
and requires a storage of 800GB. Smaller variants, which have been considered in [BMR+20] are
depicted in the table below:
Fig. 108 Image source: [BMR+20]. Sizes, architectures, and learning hyper-parameters (batch
size in tokens and learning rate) of the trained models. All models were trained for a total of
300 billion tokens
GPT ([RN18])
GPT-2 ([RWC+19])
GPT-2 showed that training on a larger dataset and having more parameters improves
the capability of the language model to understand tasks and surpasses the state-of-the-art
on many tasks in zero-shot settings.
GPT-3 ([BMR+20])
[2] The reason for this is that the Decoder calculates its output (e.g. the translated sentence) iteratively.
In iteration i the i-th output of the current sequence (e.g. the i-th translated word) is estimated. The
already estimated tokens at positions 1, 2, … , i − 1 are applied as inputs to the Decoder stack in
iteration i, i.e. future tokens at positions i + 1, … are not known at this time.