AE556_2024_Topic7_Transformer
Topic 7. Transformer
Han-Lim Choi
Transformer Overview
Ref: https://deeplearning.cs.cmu.edu/F23/document/slides/lec19.transformersLLMs.pdf
Popularity of Transformer
• Since ~2018, Transformers have been growing in popularity ... and size
Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
Tokens and Embedding
• Let us first consider the NLP setting, since that is the application the Transformer was originally designed for.
• As in NLP applications in general, we begin by turning each input word into a vector using an embedding layer.
• The embedding layer takes a token index (into the vocabulary) as input and outputs the corresponding row of its weight matrix, which is the vector representation of that word (see the sketch below).
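A minimal PyTorch sketch of this lookup; the vocabulary size, embedding dimension, and token indices below are illustrative, not values from the slides:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # illustrative sizes

# nn.Embedding stores one d_model-dimensional vector (one weight-matrix row) per token index.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[12, 7, 431, 9]])   # a toy "sentence" of 4 token indices
word_vectors = embedding(token_ids)           # shape: (1, 4, 512)
```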
Embedding (Projection) for Continuous Data
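For continuous inputs (e.g., trajectory states instead of word tokens), the lookup table is typically replaced by a learned linear projection into the model dimension. A hedged sketch; the 4-dimensional state and d_model = 512 are illustrative:

```python
import torch
import torch.nn as nn

state_dim, d_model = 4, 512              # illustrative sizes
input_proj = nn.Linear(state_dim, d_model)

states = torch.randn(1, 10, state_dim)   # 1 trajectory, 10 time steps, 4-dim states
embedded = input_proj(states)            # shape: (1, 10, 512)
```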
Attention Mechanism
• Originally developed for language translation: Bahdanau, D., Cho, K., &
Bengio, Y. (2014). Neural machine translation by jointly learning to align and
translate. https://arxiv.org/abs/1409.0473
Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
Self-Attention: High Level Picture
Ref: https://jalammar.github.io/illustrated-transformer/
Self-Attention: Basic Form
Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
Self-Attention: Generating Query, Key, Value
• The previous basic version did not involve any learnable parameters, so it is not very useful for learning a language model.
• We now add three trainable weight matrices (linear projections) that are multiplied with the input sequence embeddings (a sketch follows the references below).
Ref: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
https://jalammar.github.io/illustrated-transformer/
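A minimal sketch of the three projections; d_model = 512 and d_k = 64 follow the original Transformer paper and are used here only for illustration:

```python
import torch
import torch.nn as nn

d_model, d_k = 512, 64             # illustrative sizes (original-paper values)

# The trainable weight matrices W_Q, W_K, W_V as linear layers (no bias).
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(1, 6, d_model)     # one sentence of 6 token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)   # each: (1, 6, 64)
```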
Self-Attention: QKV Operation
• The attention score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position (a sketch follows the reference below).
Ref: https://github.com/redherring2141/AI504
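A sketch of this score-and-sum computation (unmasked scaled dot-product attention); the function name and toy sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Basic (unmasked) attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq) attention scores
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V, weights

# Toy example: 1 sentence, 6 tokens, head dimension 64.
Q, K, V = (torch.randn(1, 6, 64) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)      # out: (1, 6, 64)
```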
Self-Attention: Step-by-Step
Ref: https://jalammar.github.io/illustrated-transformer/
Self-Attention: QKV operation in batch
Ref: https://jalammar.github.io/illustrated-transformer/
Multi-head Attention
Ref: https://jalammar.github.io/illustrated-transformer/
Multi-head Attention
• The tokens are processed separately by multiple (8, in the original Transformer) attention heads.
• The output size of each head equals the head dimension (d_model divided by the number of heads in the standard setup); a combined code sketch follows the "whole process" slide below.
Ref: https://jalammar.github.io/illustrated-transformer/
Multi-head Attention: Combining heads’ output
Ref: https://jalammar.github.io/illustrated-transformer/
Multi-head Attention: The whole process
Ref: https://jalammar.github.io/illustrated-transformer/
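A hedged sketch of the whole pipeline (project, split into heads, attend per head, concatenate, and mix with W_O). The class name is mine, and 8 heads with d_model = 512 follow the original paper rather than these slides:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)      # combines the heads' outputs

    def forward(self, x):                           # x: (batch, seq, d_model)
        B, T, _ = x.shape
        def split(t):                               # (B, T, d_model) -> (B, h, T, d_head)
            return t.view(B, T, self.h, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)      # (B, h, T, T)
        heads = F.softmax(scores, dim=-1) @ V                          # (B, h, T, d_head)
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.d_head)
        return self.W_o(concat)                                        # (B, T, d_model)

mha = MultiHeadSelfAttention()
y = mha(torch.randn(2, 10, 512))                    # -> (2, 10, 512)
```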
Multi-head Attention: Some practical notes
Multi-head Attention: Visualization
• As we encode the word "it", one attention head focuses most on "the animal", while another focuses on "tired".
• In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
Ref: https://jalammar.github.io/illustrated-transformer/
Transformer: High-level Picture
Ref: https://jalammar.github.io/illustrated-transformer/
Transformer Encoder Layer
• Input (words) -> Self-attention (QKV op) -> MLP -> Output
• The feed-forward network, an MLP, processes each embedding individually and outputs a vector of the same size d_model.
• Its hidden layer projects the input from size d_model up to d_ff, applies a ReLU activation, and then projects back down to d_model (a sketch follows the reference below).
• The FFN then sends its output upward to the next encoder.
• Input from previous block -> Self-attention -> MLP -> Output
Ref: https://jalammar.github.io/illustrated-transformer/
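A minimal sketch of this position-wise FFN; d_model = 512 and d_ff = 2048 are the original paper's values and are used here only as placeholders:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048            # illustrative sizes

# Applied to every position independently: d_model -> d_ff -> ReLU -> d_model.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 10, d_model)      # (batch, seq, d_model)
y = ffn(x)                           # same shape: (2, 10, d_model)
```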
Transformer: “Add” (Residual Connection)
Ref: https://jalammar.github.io/illustrated-transformer/
Transformer: “Norm” (Layer Normalization)
Ref: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
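Minimal usage of the layer referenced above: each token's d_model-dimensional vector is normalized (over the last dimension) to zero mean and unit variance, then scaled and shifted by learned parameters:

```python
import torch
import torch.nn as nn

d_model = 512                       # illustrative size
norm = nn.LayerNorm(d_model)        # normalizes over the last dimension

x = torch.randn(2, 10, d_model)     # (batch, seq, d_model)
y = norm(x)                         # per-token mean ~0 and std ~1 before scale/shift
```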
Transformer Encoder
• Mathematical expression of a transformer encoder layer (see the equations after the reference below).
Ref: https://jalammar.github.io/illustrated-transformer/
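Assuming the standard post-norm formulation of the original Transformer, one encoder layer can be written as:

```latex
\begin{aligned}
z &= \mathrm{LayerNorm}\big(x + \mathrm{MultiHead}(x)\big) \\
y &= \mathrm{LayerNorm}\big(z + \mathrm{FFN}(z)\big), \qquad
\mathrm{FFN}(z) = \max(0,\ zW_1 + b_1)\,W_2 + b_2
\end{aligned}
```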
Positional Encoding
Ref: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
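The sinusoidal encoding derived in the referenced post (and used in the original Transformer) is:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)
```

The encoding is added element-wise to the token embeddings before the first encoder layer.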
Learnable Positional Encoding
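Instead of the fixed sinusoids, the position vectors can themselves be learned (as in BERT and GPT). A minimal sketch; the maximum length and sizes are illustrative:

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 512                       # illustrative sizes
pos_embedding = nn.Embedding(max_len, d_model)    # one learned vector per position

x = torch.randn(1, 10, d_model)                   # 10 token embeddings
positions = torch.arange(10).unsqueeze(0)         # [[0, 1, ..., 9]]
x = x + pos_embedding(positions)                  # add positional information
```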
Transformer Decoder
• The decoder block contains a masked self-attention sublayer (detailed on the following slides).
Masked Attention (Causal Attention)
• Encoder
– Full (bi-directional) self-attention.
– Each token can attend to every token in the sequence.
• Decoder
– Limited (masked, causal, or uni-directional) self-attention.
– A token can attend only to itself and to earlier tokens.
– In time-dependent terms, it cannot attend to future tokens (see the mask sketch after the reference below).
Ref: https://github.com/redherring2141/AI504
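A minimal sketch of the causal mask: score entries above the diagonal (queries looking at future keys) are set to -inf before the softmax, so their attention weights become zero. Sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

T, d_k = 5, 64                                     # illustrative sizes
Q, K, V = (torch.randn(1, T, d_k) for _ in range(3))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (1, T, T)

# True above the diagonal = "query t would attend to a future key".
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)                # row t attends only to positions <= t
out = weights @ V
```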
Masked Attention (Causal Attention)
Decoder output: NLP application
Ref: https://jalammar.github.io/illustrated-transformer/
Decoder output: Handle continuous output
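A hedged sketch of both kinds of output head: a linear layer plus softmax over the vocabulary for the NLP case, and a plain linear regression head for continuous outputs. All sizes and names are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size, state_dim = 512, 10000, 4   # illustrative sizes
h = torch.randn(1, 10, d_model)                  # decoder hidden states

# NLP: project to vocabulary logits; softmax gives a distribution over next tokens.
lm_head = nn.Linear(d_model, vocab_size)
next_token_probs = lm_head(h).softmax(dim=-1)    # (1, 10, vocab_size)

# Continuous output (e.g., the next trajectory state): a regression head,
# typically trained with an MSE-type loss instead of cross-entropy.
reg_head = nn.Linear(d_model, state_dim)
next_state = reg_head(h)                         # (1, 10, 4)
```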
Transformer Training: All at Once
Transformer Training: All at Once
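A hedged sketch of the "all at once" idea for a decoder-only language model: the whole sequence goes through the network in a single pass, the causal mask keeps each position from seeing the future, and position t is trained to predict token t+1 (teacher forcing). Using nn.TransformerEncoder with a causal mask as the decoder stack, and all sizes, are my implementation choices, not values from the slides:

```python
import torch
import torch.nn as nn

vocab_size, d_model, T = 1000, 128, 16                   # illustrative sizes

embed = nn.Embedding(vocab_size, d_model)                # (positional encoding omitted)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)     # causal mask makes it decoder-like
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (4, T + 1))        # a batch of training sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]          # shift by one (teacher forcing)

causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
h = decoder(embed(inputs), mask=causal)                  # every position in one pass
logits = lm_head(h)                                      # (4, T, vocab_size)

loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                          # one gradient step covers all positions
```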
Transformer Inference: Step-by-Step
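Inference, by contrast, is autoregressive: starting from a start token, the model is run repeatedly and its own prediction is appended to the input each step. A minimal greedy-decoding sketch, reusing the hypothetical embed / decoder / lm_head modules from the training sketch above:

```python
import torch

@torch.no_grad()
def generate(start_token, max_new_tokens=10):
    seq = torch.tensor([[start_token]])                 # (1, 1); grows one token per step
    for _ in range(max_new_tokens):
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = decoder(embed(seq), mask=causal)            # re-run on the whole prefix
        next_token = lm_head(h[:, -1]).argmax(dim=-1)   # greedy pick from the last position
        seq = torch.cat([seq, next_token.unsqueeze(1)], dim=1)
    return seq
```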
Transformer family tree
BERT: Architecture
BERT Pretraining: Masked Language Modeling
GPT-1: Architecture
GPT-2 and GPT-3
• GPT-2:
– Increased Scale: 1.5 billion parameters, roughly 10x larger than GPT-1.
– Expanded Training Data: Trained on a larger, more diverse internet-based dataset.
– Improved Generalization: Strong performance across various language tasks without task-specific fine-tuning.
• GPT-3:
– Unprecedented Scale: 175 billion parameters, over 100x larger than GPT-2.
– Vast Dataset: Trained on a comprehensive mix of internet text, books, and Wikipedia.
– Advanced Learning Capabilities: Capable of few-shot and zero-shot learning from minimal prompts.
– Controlled Usage: Accessible through an API to mitigate potential misuse.
GPT Pretraining
Vision Transformer (ViT)
Vision Transformer (ViT)
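The core of ViT is to cut the image into fixed-size patches, flatten each patch, and project it linearly into the model dimension so that it can be treated like a token. A hedged sketch using a strided convolution; patch size 16 and d_model = 768 match ViT-Base and are used only for illustration:

```python
import torch
import torch.nn as nn

patch, d_model = 16, 768
# Conv2d with kernel = stride = patch size is equivalent to flattening each
# non-overlapping 16x16 patch and applying one shared linear projection.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)            # one 224x224 RGB image
tokens = patch_embed(img)                    # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch "tokens"
# A learnable [CLS] token and positional embeddings are then added before the
# sequence is fed to a standard Transformer encoder.
```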
Example: Basic trajectory prediction; Encoder-Only
Example: Basic trajectory prediction; Decoder-Only
Example: Basic trajectory prediction; Inference
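A hedged, minimal sketch of the decoder-only variant of this example: past states are linearly embedded, processed with causal self-attention, and a regression head predicts the next state; at inference the model is rolled out autoregressively on its own outputs. The class name, sizes, and MSE loss are illustrative assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Decoder-only next-state predictor for continuous trajectories (illustrative)."""
    def __init__(self, state_dim=4, d_model=128, nhead=8, num_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(state_dim, d_model)      # continuous-input "embedding"
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, state_dim)     # regression head
        # (positional encoding omitted for brevity)

    def forward(self, states):                            # states: (batch, T, state_dim)
        T = states.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=states.device), diagonal=1)
        h = self.backbone(self.in_proj(states), mask=causal)
        return self.out_proj(h)                           # next-state prediction at each step

model = TrajectoryDecoder()
traj = torch.randn(8, 20, 4)                              # 8 trajectories, 20 time steps
pred = model(traj[:, :-1])                                # inputs: steps 0..18
loss = nn.functional.mse_loss(pred, traj[:, 1:])          # targets: steps 1..19 (teacher forcing)

# Inference: roll out autoregressively from an observed prefix.
with torch.no_grad():
    seq = traj[:1, :5]                                    # first 5 observed states
    for _ in range(10):
        nxt = model(seq)[:, -1:]                          # last position = next-state prediction
        seq = torch.cat([seq, nxt], dim=1)
```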