
Fall’24 AE556 AI for Aerospace Applications

Topic 7. Transformer

Oct. 16 (Th), 2024

Han-Lim Choi
Transformer Overview

Ref: https://deeplearning.cs.cmu.edu/F23/document/slides/lec19.transformersLLMs.pdf 2
Popularity of Transformer

• Since ~2018, Transformers have been growing in popularity ... and size

Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf 3
Tokens and Embedding

• Let us first begin with NLP, since it is the original application the Transformer was designed for.
• As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding layer.
• Basically, the embedding layer takes a token index (in the vocabulary) as input and looks up the corresponding row of its weight matrix, which serves as the vector representation of that word.

4
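A minimal sketch of this lookup; the vocabulary size and embedding dimension below are illustrative placeholders:

```python
import torch
import torch.nn as nn

# Minimal sketch of a token embedding layer (sizes are illustrative).
vocab_size, d_model = 10000, 512          # assumed vocabulary and embedding sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7]])    # a batch with one 3-token sentence
x = embedding(token_ids)                  # each index selects a row of the weight matrix
print(x.shape)                            # torch.Size([1, 3, 512])
```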
Embedding (Projection) for Continuous Data

• Continuous Input (Time Series, Image Synthesis): Handled by projecting the vector at each timestamp (for Time Series) or each image patch (for Image Synthesis) using an embedding layer.
• Embedding Architecture: Can vary, including options such as Linear, MLP
(Multi-Layer Perceptron), CNN (Convolutional Neural Network), etc.

5
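A minimal sketch of this projection for both cases, assuming illustrative feature, patch, and model sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch: projecting continuous inputs into d_model-sized embeddings.
d_model = 512

# Time series: one linear projection per timestamp (feature_dim is illustrative).
feature_dim = 6
series = torch.randn(1, 100, feature_dim)           # (batch, timesteps, features)
series_proj = nn.Linear(feature_dim, d_model)
series_tokens = series_proj(series)                 # (1, 100, 512)

# Images: flatten non-overlapping patches, then project each patch.
patches = torch.randn(1, 196, 16 * 16 * 3)          # 196 flattened 16x16 RGB patches
patch_proj = nn.Linear(16 * 16 * 3, d_model)
patch_tokens = patch_proj(patches)                  # (1, 196, 512)
```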
Attention Mechanism

• Originally developed for language translation: Bahdanau, D., Cho, K., &
Bengio, Y. (2014). Neural machine translation by jointly learning to align and
translate. https://arxiv.org/abs/1409.0473

"... allowing a model to automatically (soft-)search for parts of a source


sentence that are relevant to predicting a target word ..."

Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf 6
Self-Attention: High Level Picture

• Example Sentence: "The animal didn't cross the street because it was too tired."
• Self-attention associates "it" with "animal" by examining other words in the sentence.
• The model can look at all positions in the input sequence for context.
• Unlike RNNs, which process words sequentially and maintain a hidden state, Transformers use self-attention to access the entire input sequence at once.

Ref: https://jalammar.github.io/illustrated-transformer/ 7
Self-Attention: Basic Form

• Self-attention: relating different positions within a single sequence (vs. attention between input and output sequences).

Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf 8
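A sketch of this basic, parameter-free form, where the attention weights come directly from dot products between the embeddings themselves (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

# Basic self-attention with no learnable parameters (a sketch):
# every position attends to every other position of the same sequence.
x = torch.randn(5, 16)                  # 5 tokens, 16-dim embeddings (illustrative sizes)
scores = x @ x.T                        # pairwise similarities, shape (5, 5)
weights = F.softmax(scores, dim=-1)     # each row sums to 1
context = weights @ x                   # re-weighted embeddings, shape (5, 16)
```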
Self-Attention: Generating Query, Key, Value

• The previous basic version did not involve any learnable parameters, so it is not very useful for learning a language model.
• We now add 3 trainable weight matrices (linear projections) that are multiplied with the input sequence embeddings to produce the queries, keys, and values.

Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
https://jalammar.github.io/illustrated-transformer/ 9
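A sketch of the three projections, assuming illustrative dimensions (the 64-dimensional output matches a typical per-head size):

```python
import torch
import torch.nn as nn

# Sketch: three trainable projections turn each embedding into a query, key, and value.
d_model, d_k = 512, 64                  # illustrative sizes (d_k = per-head dimension)
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(5, d_model)             # 5 token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)        # each has shape (5, 64)
```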
Self-Attention: QKV Operation

• The attention score determines how much focus to place on other parts of
the input sentence as we encode a word at a certain position.

Ref: https://github.com/redherring2141/AI504 10
Self-Attention: Step-by-Step

• Example self-attention on the word "thinking"
– Generate Q, K, and V
– Calculate Attention Score using the dot product q · k
– Scale Attention Score by 1/√d_k
– Apply Softmax
– Weighted Sum of Values

Ref: https://jalammar.github.io/illustrated-transformer/ 11
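The same steps for a single query vector, as a sketch with random illustrative tensors standing in for "thinking" and the other words of the sentence:

```python
import math
import torch
import torch.nn.functional as F

# Step-by-step sketch for a single query vector (e.g., the word "thinking").
d_k = 64
q = torch.randn(d_k)                     # query for the current word
K = torch.randn(5, d_k)                  # keys of all 5 words in the sentence
V = torch.randn(5, d_k)                  # values of all 5 words

scores = K @ q                           # 1) dot products q · k_i, shape (5,)
scores = scores / math.sqrt(d_k)         # 2) scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1)      # 3) softmax over the sentence
z = weights @ V                          # 4) weighted sum of the values, shape (64,)
```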
Self-Attention: QKV operation in batch

• QKV operations in Transformers can be performed in batch (as matrix multiplications), enhancing computational efficiency.
• This increases memory complexity due to the need to store large matrices for the queries, keys, values, and attention matrices.

Ref: https://jalammar.github.io/illustrated-transformer/ 12
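The batched matrix form, softmax(QKᵀ / √d_k) V, as a sketch with illustrative sizes; the final comment notes where the quadratic memory cost comes from:

```python
import math
import torch
import torch.nn.functional as F

# Batched form: all queries attend to all keys in one pair of matrix multiplications.
n, d_k = 128, 64                          # sequence length and key dimension (illustrative)
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)
V = torch.randn(n, d_k)

attn = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (n, n) attention matrix
Z = attn @ V                                          # (n, d_k) outputs
# The (n, n) attention matrix is what drives the O(n^2) memory cost.
```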
Multi-head Attention

• Multi-head attention involves using multiple QKV matrices.

Ref: https://jalammar.github.io/illustrated-transformer/ 13
Multi-head Attention

• The tokens are processed separately in multiple (8) different attention heads.
• The output size of each head = the head dimension (d_model / number of heads).

Ref: https://jalammar.github.io/illustrated-transformer/ 14
Multi-head Attention: Combining heads’ output

• After concatenation, the output size = (number of heads) × (head dimension) = d_model.
• The concatenated self-attention output is then projected back into a d_model-sized embedding.

Ref: https://jalammar.github.io/illustrated-transformer/ 15
Multi-head Attention: The whole process

Ref: https://jalammar.github.io/illustrated-transformer/ 16
Multi-head Attention: Some practical notes

• In public repositories, you may find slightly different code implementations.
• Each set of Q, K, V matrices can be generated from the embedding before splitting into separate Q, K, V matrices for each head.
• Typically, the dimension of each head is d_model / (number of heads).
• The concatenation of the multi-head attention outputs usually has size d_model.
• The final output of the self-attention layer also has size d_model.
• Self-Attention Layer: Input Embedding Size = Output Embedding Size

17
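One possible way to put these pieces together, as a sketch rather than a reference implementation; bias handling, dropout, and masking are omitted, and all sizes are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal multi-head self-attention sketch (real code bases differ in details).
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # generate Q, K, V before splitting per head
        self.out = nn.Linear(d_model, d_model)       # final projection back to d_model

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, D = x.shape
        qkv = self.qkv(x).view(B, T, 3, self.h, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each: (batch, heads, seq, d_head)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        z = (attn @ v).transpose(1, 2).reshape(B, T, D)   # concatenate the heads back to d_model
        return self.out(z)                                # output size equals input embedding size

x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)              # torch.Size([2, 10, 512])
```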
Multi-head Attention: Visualization

• As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired".
• In a sense, the model's representation of the word "it" bakes in some of the
representation of both "animal" and "tired".

Ref: https://jalammar.github.io/illustrated-transformer/ 18
Transformer: High-level Picture

• The encoding component is a stack of encoders.
• The decoding component is a stack of the same number of decoders.

Ref: https://jalammar.github.io/illustrated-transformer/ 19
Transformer Encoder Layer

• Input (words) -> Self-attention (QKV Op) -> MLP -> Output
• The feedforward network, an MLP, processes each embedding individually, outputting the same size d_model.
• The hidden layer projects the input from size d_model up to d_ff (typically 4 × d_model), applies a ReLU activation, and then projects back to size d_model.

Ref: https://jalammar.github.io/illustrated-transformer/, https://wikidocs.net/178419 20
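A sketch of this position-wise feed-forward network, using the d_model = 512, d_ff = 2048 sizes of the original Transformer as illustrative values:

```python
import torch
import torch.nn as nn

# Position-wise feed-forward network sketch: d_model -> d_ff -> ReLU -> d_model.
d_model, d_ff = 512, 2048                 # d_ff = 4 * d_model in the original Transformer
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
x = torch.randn(2, 10, d_model)           # applied to every position independently
print(ffn(x).shape)                       # torch.Size([2, 10, 512])
```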


Transformer Encoder Layer

• The FFN then sends its output up to the next encoder layer.
• Input from previous block -> Self-attention -> MLP -> Output

Ref: https://jalammar.github.io/illustrated-transformer/ 21
Transformer: “Add” (Residual Connection)

• Each sub-layer (self-attention, feedforward network) in each encoder layer has a residual connection around it.
• As in ResNet, this helps prevent vanishing gradients in large Transformer models.

Ref: https://jalammar.github.io/illustrated-transformer/ 22
Transformer: “Norm” (Layer Normalization)

• Layer Normalization computes statistics (mean and variance) based on the summed inputs to the neurons in a single layer, rather than across different samples in a batch.
• LN ensures consistent behavior during training and inference.
• Using LN: Training stability, Independence from Batch Size, Faster
convergence.

Ref: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html 23
Transformer Encoder

• Mathematical Expression of a
transformer encoder layer.

Ref: https://jalammar.github.io/illustrated-transformer/ 24
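The expression on this slide is shown as a figure; the sketch below assumes the standard post-LN form described on the previous slides (Add & Norm around self-attention and around the FFN), with illustrative sizes and PyTorch's built-in attention module standing in for a custom one:

```python
import torch
import torch.nn as nn

# Sketch of one encoder layer in the post-LN form:
#   z   = LayerNorm(x + SelfAttention(x))
#   out = LayerNorm(z + FFN(z))
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                # self-attention: Q = K = V = x
        z = self.norm1(x + attn_out)                    # residual connection + LayerNorm
        return self.norm2(z + self.ffn(z))              # residual connection + LayerNorm

print(EncoderLayer()(torch.randn(2, 10, 512)).shape)    # torch.Size([2, 10, 512])
```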
Positional Encoding

• Self-Attention is "Set Encoding"; therefore, the Transformer Encoder is inherently a "Set Encoder," not a "Sequence Encoder."
• We need to inject positional information for the model to recognize the order of each word, or the distance between different words in the sequence.

Ref: https://jalammar.github.io/illustrated-transformer/, https://github.com/redherring2141/AI504 25


Positional Encoding

• If we assume the embedding has a dimensionality of 4, the actual positional encodings would look like this:

• Positional Encoding Requirements
– It should output a unique encoding for each time-step (word's position in a sentence).
– The distance between any two time-steps should be consistent across sentences of different lengths.
– The model should generalize to longer sentences without any extra effort; the encoding values should be bounded.
– It must be deterministic.

Ref: https://jalammar.github.io/illustrated-transformer/, https://github.com/redherring2141/AI504 26


Sinusoidal Positional Encoding

• Uses sine and cosine functions of different frequencies to encode the position of each token in the sequence.
• Effectively handles sequences of any length due to its continuous nature, making it suitable for tasks involving variable input sizes without retraining.

Ref: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ 27
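A sketch of the standard sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), with illustrative sizes:

```python
import math
import torch

# Sinusoidal positional encoding (formulation from "Attention Is All You Need").
def sinusoidal_pe(max_len=100, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1).float()                     # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))                    # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                            # (max_len, d_model)

x = torch.randn(1, 100, 512)               # token embeddings
x = x + sinusoidal_pe(100, 512)            # inject position information by addition
```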
Learnable Positional Encoding

• We can also make PE learnable.
• Encodings are learned during training, allowing the model to adapt positional encodings based on the specific data and task requirements.
• Can potentially lead to better performance on specific tasks by learning optimal positional representations tailored to the nuances of the dataset.

28
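A minimal sketch of a learnable positional encoding; the maximum length and sizes are illustrative:

```python
import torch
import torch.nn as nn

# Learnable positional encoding sketch: one trainable vector per position.
max_len, d_model = 100, 512
pos_embedding = nn.Parameter(torch.zeros(1, max_len, d_model))   # trained with the rest of the model

x = torch.randn(1, 100, d_model)           # token embeddings
x = x + pos_embedding[:, :x.size(1)]       # add the learned encodings to the embeddings
```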
Transformer Decoder

• The encoder output acts as Key and Value in encoder-decoder attention across all layers, granting the decoder access to the entire encoder output.
• In training, the decoder uses Masked Self-Attention to ensure outputs depend only on previous tokens, with its output used as the Query for encoder-decoder attention.


29
Masked Attention (Causal Attention)

• Encoder
– Full (bi-directional) self-attention.
– Attention at a token can look at every token in the sequence.
• Decoder
– Limited (masked, causal, or uni-directional) self-attention.
– A token can only attend to itself and to earlier tokens.
– In the time-dependent sense, it cannot attend to future tokens.

Ref: https://github.com/redherring2141/AI504 30
Masked Attention (Causal Attention)

• How do we transform fully-connected attention into causal attention?
• Apply a strictly upper-triangular mask of −∞ to the unnormalized attention scores.
• After applying Softmax, the attention scores in the upper-triangular part become 0, representing no attention to future tokens (see the sketch below).

Ref: http://www.peterbloem.nl/blog/transformers, https://github.com/redherring2141/AI504 31
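A sketch of this masking step with illustrative sizes; the strictly upper-triangular scores are set to −∞ before the softmax:

```python
import math
import torch
import torch.nn.functional as F

# Causal masking sketch: block attention to future tokens before the softmax.
n, d_k = 5, 64
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)

scores = Q @ K.T / math.sqrt(d_k)                          # (n, n) unnormalized attention
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))           # -inf on the strictly upper triangle
attn = F.softmax(scores, dim=-1)                           # upper-triangular weights become 0
Z = attn @ V
```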


Encoder-Decoder Attention

• Encoder-Decoder Attention: Attends to the encoder output, using the same internal structure as self-attention but with different inputs.
• Key and Value Generation: Applied to the encoder output (memory).
• Query Generation: Applied to the output of masked self-attention.
• Intuition: Functions like querying the relevance of the encoder output with respect to the current output token.

32
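A single-head sketch of this cross-attention, assuming illustrative sizes; in a real decoder layer it would be multi-headed and share its structure with self-attention:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder-decoder (cross) attention sketch: queries from the decoder,
# keys and values from the encoder output ("memory").
d_model = 512
W_q = nn.Linear(d_model, d_model, bias=False)   # applied to the masked self-attention output
W_k = nn.Linear(d_model, d_model, bias=False)   # applied to the encoder output
W_v = nn.Linear(d_model, d_model, bias=False)   # applied to the encoder output

memory = torch.randn(1, 20, d_model)            # encoder output (source length 20)
dec_x = torch.randn(1, 7, d_model)              # decoder-side representations (target length 7)

Q, K, V = W_q(dec_x), W_k(memory), W_v(memory)
attn = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_model), dim=-1)   # (1, 7, 20)
Z = attn @ V                                                             # (1, 7, 512)
```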
Decoder output: NLP application

• NLP: A classification problem repeated at every output position -> project the decoder output linearly to the number of words in the vocabulary, then apply softmax.

Ref: https://jalammar.github.io/illustrated-transformer/ 33
Decoder output: Handle continuous output

• Continuous Output (Time Series, Image Synthesis): Handled as a multi-point regression; project the decoder output linearly to the number of features.
• Bottleneck Architecture: Can vary, including Linear, MLP, CNN, etc.

34
Transformer Training: All at Once

• For a translation task, use a Seq2Seq setup similar to an RNN:
– Encode the sentence in the source language using the encoder.
– Decoder input: start with <BOS> followed by the target sentence (Teacher Forcing).
– Expected output: the target sentence followed by <EOS>.

35
Transformer Training: All at Once

• Use the cross-entropy loss function. Gradient backpropagation is performed throughout the entire model.
• High Training Efficiency: Transformers can process entire sequences in parallel, significantly enhancing training efficiency.

36
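A hedged sketch of one teacher-forcing training step, using PyTorch's built-in nn.Transformer as a stand-in for the model described above; the token ids, sizes, and embedding/output layers are placeholders, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Teacher-forcing training step sketch (placeholder sizes and random token ids).
d_model, vocab_size = 512, 10000
model = nn.Transformer(d_model=d_model, batch_first=True)
src_emb = nn.Embedding(vocab_size, d_model)
tgt_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 12))          # source sentence
tgt = torch.randint(0, vocab_size, (1, 9))           # <BOS> + target sentence (teacher forcing)
labels = torch.randint(0, vocab_size, (1, 9))        # target sentence + <EOS>

tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))   # causal mask for the decoder
out = model(src_emb(src), tgt_emb(tgt), tgt_mask=tgt_mask)      # whole sequence in parallel
loss = F.cross_entropy(lm_head(out).transpose(1, 2), labels)    # cross-entropy over the vocabulary
loss.backward()                                                 # backprop through the entire model
```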
Transformer Inference: Step-by-Step

• Attention Vector Preparation: The output from the top encoder is transformed into sets of attention vectors K (Keys) and V (Values) for use in encoder-decoder attention.

Ref: https://jalammar.github.io/illustrated-transformer/, https://machinelearningmastery.com/inferencing-the-transformer-model/ 37


Transformer Inference: Step-by-Step

• Sequential Decoding Process:
– Start with the start token and apply masked self-attention to generate the first decoder prediction.
– Take the predicted word, concatenate it to the previous sequence, and input this extended sequence into masked self-attention again to obtain the next word.
– Repeat the process by appending each new prediction to the sequence and processing it through masked self-attention until the sequence is complete.

38
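A greedy decoding sketch that reuses the placeholder pieces from the training sketch above; the <BOS>/<EOS> ids and maximum length are illustrative, and a real implementation would cache the encoder output instead of recomputing it every step:

```python
import torch
import torch.nn as nn

# Step-by-step greedy decoding sketch (illustrative special-token ids).
BOS, EOS, MAX_LEN = 1, 2, 20

@torch.no_grad()
def greedy_decode(model, src_emb, tgt_emb, lm_head, src):
    src_in = src_emb(src)                                     # encoder input (re-encoded each step here)
    tokens = torch.tensor([[BOS]])                            # start with the start token
    for _ in range(MAX_LEN):
        tgt_mask = model.generate_square_subsequent_mask(tokens.size(1))
        out = model(src_in, tgt_emb(tokens), tgt_mask=tgt_mask)
        next_token = lm_head(out[:, -1]).argmax(dim=-1, keepdim=True)   # predict the next word
        tokens = torch.cat([tokens, next_token], dim=1)       # append it and feed the sequence back in
        if next_token.item() == EOS:                          # stop once the end token appears
            break
    return tokens
```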
Transformer family tree

• There are currently three main types of Transformer models:
– Encoder-Only
– Encoder-Decoder
– Decoder-Only
• Nowadays, the most popular architecture is the Decoder-Only model due to its empirically better performance.

39
BERT: Architecture

• Bidirectional Encoder Representations from Transformers (BERT)
• BERT is an encoder-only model composed of stacked transformer encoders.
• It utilizes Token Embedding, Positional Embedding, and a Trainable Segment Embedding.
• The CLS token represents the entire input and is used primarily for classification tasks.

40
BERT Pretraining: Masked Language Modeling

• Random Masking of Tokens: In BERT's MLM, 15% of tokens in a sequence are replaced with a [MASK] token, challenging the model to predict the hidden words.
• Contextual Prediction: The model predicts the original tokens behind the masks using the surrounding context, enhancing BERT's deep understanding of language.

41
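A simplified masking sketch with illustrative ids; BERT itself additionally leaves some selected tokens unchanged or replaces them with random tokens rather than always using [MASK]:

```python
import torch

# Sketch of 15% random-token masking for masked language modeling (simplified).
MASK_ID, IGNORE = 103, -100                       # illustrative [MASK] id and ignore index
input_ids = torch.randint(1000, 2000, (1, 32))    # a toy token sequence

labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15         # choose ~15% of the positions
labels[~mask] = IGNORE                            # loss is computed only on the masked positions
input_ids[mask] = MASK_ID                         # replace the chosen tokens with [MASK]
# The model is then trained to predict `labels` from `input_ids` with cross-entropy.
```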
GPT-1: Architecture

• Introduced before BERT but has had more successors.
• GPT-1 is a decoder-only model consisting of stacked transformer decoders.
• Although termed as "decoder-only," GPT-1's architecture closely resembles that of a transformer encoder due to its use of masked self-attention, eliminating the need for encoder-decoder attention.

42
GPT-2 and GPT-3

• GPT-2:
– Increased Scale: 1.5 billion parameters for deeper complexity.
– Expanded Training Data: Trained on a diverse, larger internet-based dataset.
– Improved Generalization: Exhibits strong performance across various language
tasks without specific fine-tuning.
• GPT-3:
– Unprecedented Scale: Features 175 billion parameters, enhancing linguistic
capabilities.
– Vast Dataset: Trained on a comprehensive mix of internet text, books, and
Wikipedia.
– Advanced Learning Capabilities: Capable of few-shot and zero-shot learning from
minimal prompts.
– Controlled Usage: Accessible through an API to mitigate potential misuse.

43
GPT Pretraining

• Unsupervised Pretraining (in LLM context): GPT models undergo pretraining via unsupervised learning, trained on a vast text corpus without specific task labels, enabling broad language understanding.
• Next Word Prediction: Pretraining involves an autoregressive objective, where the model predicts the next word based on all preceding words, capturing intricate contextual relationships in text.

44
Visual Transformer (ViT)

• Introduced in the paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale."
• The first paper to successfully train a Transformer encoder on ImageNet.
• Attained very good results, comparable or superior to traditional convolutional architectures.

45
Visual Transformer (ViT)

Image Processing in Vision Transformers:
– Separate images into patches and
linearly project the features of
flattened patches.
– Inject a learnable class token into
the sequence of patches.
– Follow with positional encoding to
retain information about the
location of each patch.
– Extract the final output from the
first position, which is the class
token position.
– Process this output through an MLP
(Multi-Layer Perceptron) classifier
head for final classification.

46
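A sketch of this input pipeline with ViT-Base-like sizes (224×224 images, 16×16 patches, 768-dimensional embeddings) used purely as illustrative values; the Transformer encoder itself is elided:

```python
import torch
import torch.nn as nn

# ViT-style input sketch: patchify -> linear projection -> prepend class token
# -> add positional encoding -> read the class-token position for classification.
B, C, H, W, P, d_model, n_classes = 1, 3, 224, 224, 16, 768, 10   # illustrative sizes
img = torch.randn(B, C, H, W)

# Split into non-overlapping P x P patches and flatten each patch.
patches = img.unfold(2, P, P).unfold(3, P, P)                            # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)    # (B, 196, 768)

proj = nn.Linear(C * P * P, d_model)
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, 196 + 1, d_model))
head = nn.Linear(d_model, n_classes)

x = torch.cat([cls_token.expand(B, -1, -1), proj(patches)], dim=1) + pos_embed  # (B, 197, 768)
# ... x would pass through the Transformer encoder here ...
logits = head(x[:, 0])                    # classify from the class-token position
```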
Example: Basic trajectory prediction; Encoder-Only

• Similar to a Many-to-one RNN.
• Architecture: Transformer Encoder with CLS-token prediction and an MLP regression bottleneck.
• Input: all previous steps.
• Output: the next step.
• Aim: minimize the distance between predicted states and true states.
• MSE loss function: L = (1/N) Σ_i ‖ŝ_i − s_i‖²  (see the sketch below)

47
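A hedged sketch of this encoder-only setup; the state dimension, model sizes, CLS-token readout, and random data are illustrative assumptions, not the course's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder-only trajectory prediction sketch (illustrative sizes and random data).
state_dim, d_model = 4, 128                       # e.g., 2D position + velocity
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
embed = nn.Linear(state_dim, d_model)             # project each past state to d_model
cls = nn.Parameter(torch.zeros(1, 1, d_model))    # prepended CLS token
head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, state_dim))

past = torch.randn(8, 20, state_dim)              # batch of 8 trajectories, 20 past steps each
x = torch.cat([cls.expand(8, -1, -1), embed(past)], dim=1)
pred_next = head(encoder(x)[:, 0])                # read the CLS position -> predicted next state
loss = F.mse_loss(pred_next, torch.randn(8, state_dim))   # MSE against the true next state
```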
Example: Basic trajectory prediction; Decoder-Only

• Architecture: Transformer Decoder (an encoder with masked attention) with an MLP regression bottleneck.
• Input: all previous steps.
• In a causal Transformer, the last element represents the whole sequence.
• Output: the next step.
• Aim: minimize the distance between predicted states and true states.
• MSE loss function: L = (1/N) Σ_i ‖ŝ_i − s_i‖²

48
Example: Basic trajectory prediction; Inference

• Multi-step prediction can be achieved by repeatedly predicting one step, appending the predicted value to the input sequence, and using the updated sequence to predict the next step.

• Step-by-Step Process:
1. Predict one step using the current sequence.
2. Append the predicted value to the input sequence.
3. Use the updated sequence to predict the next step.
4. Repeat steps 1-3 until the desired number of steps is reached.

49
