Transformer
Course Instructors
Dr. Satish Kumar Singh, Associate Professor, IIIT Allahabad (Email: [email protected])
Dr. Shiv Ram Dubey, Assistant Professor, IIIT Allahabad (Email: [email protected])
Self-attention layer
[Figure: the self-attention layer maps inputs X1, X2, X3 to outputs Y1, Y2, Y3 via query vectors Q, key vectors K, value vectors V, a softmax over the similarity scores (Softmax(↑)), and a weighted sum of the values (Product(→), Sum(↑))]
• Query vectors: Q = X W_Q
• Key vectors: K = X W_K
• Value vectors: V = X W_V
• Similarity scores: E_{i,j} = Q_i · K_j / √D (D is the dimensionality of the keys)
• Attention weights: A = softmax(E), applied over the key dimension
• Outputs: Y_i = Σ_j A_{i,j} V_j, or Y = A V
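A minimal NumPy sketch of this computation (not from the lecture); the weight shapes are illustrative and the √D scaling follows the scaled dot-product formulation above.

```python
# Self-attention over a sequence of N input vectors (rows of X).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q                 # query vectors, (N, D)
    K = X @ W_K                 # key vectors,   (N, D)
    V = X @ W_V                 # value vectors, (N, D)
    D = K.shape[-1]
    E = Q @ K.T / np.sqrt(D)    # E[i, j] = Q_i · K_j / sqrt(D)
    A = softmax(E, axis=-1)     # attention weights, each row sums to 1
    return A @ V                # Y_i = sum_j A[i, j] * V_j

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))                           # N = 3 inputs of dimension 8
W_Q, W_K, W_V = (rng.standard_normal((8, 16)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)             # (3, 16)
```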
Self-attention in SAGAN:
• Similarity scores: s_{ij} = f(x_i)^T g(x_j) (f and g play the role of keys and queries)
• Attention weights: β_{j,i} = exp(s_{ij}) / Σ_i exp(s_{ij}), i.e. how much to attend to location i while synthesizing the feature at location j
• Output: o_j = v(Σ_i β_{j,i} h(x_i)) (h plays the role of values)
H. Zhang, I. Goodfellow, D. Metaxas, A. Odena. Self-Attention Generative Adversarial Networks. ICML 2019
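A sketch of this SAGAN-style attention over the locations of a flattened CNN feature map; the matrices W_f, W_g, W_h, W_v stand in for the paper's 1x1 convolutions, and all sizes are illustrative assumptions.

```python
import numpy as np

def sagan_attention(x, W_f, W_g, W_h, W_v):
    # x: (L, C) features at L spatial locations of a feature map
    f, g, h = x @ W_f, x @ W_g, x @ W_h
    s = f @ g.T                                    # s[i, j] = f(x_i)^T g(x_j)
    s = s - s.max(axis=0, keepdims=True)           # stabilize the softmax over i
    beta = np.exp(s) / np.exp(s).sum(axis=0, keepdims=True)   # beta[i, j] holds beta_{j,i}
    return (beta.T @ h) @ W_v                      # o_j = v(sum_i beta_{j,i} h(x_i))

rng = np.random.default_rng(1)
L, C = 64, 32                                      # e.g. an 8x8 feature map with 32 channels
x = rng.standard_normal((L, C))
o = sagan_attention(x,
                    rng.standard_normal((C, C // 8)),   # W_f ("key" projection)
                    rng.standard_normal((C, C // 8)),   # W_g ("query" projection)
                    rng.standard_normal((C, C // 2)),   # W_h ("value" projection)
                    rng.standard_normal((C // 2, C)))   # W_v (output projection)
print(o.shape)                                     # (64, 32)
```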
Masked self-attention layer
• The decoder should not “look ahead” in the output sequence
• Masking: set E_{i,j} = −∞ whenever key position j comes after query position i, so the corresponding attention weights A_{i,j} become 0 after the softmax
[Figure: masked self-attention over the decoder inputs <START>, "This", "is", producing outputs Y1, Y2, Y3; similarity entries above the diagonal are replaced by −∞ and the corresponding attention weights are 0]
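A sketch of the masking step, reusing the self-attention computation from earlier: similarities for future positions are set to −∞ so their attention weights vanish after the softmax.

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D = K.shape[-1]
    E = Q @ K.T / np.sqrt(D)                            # similarity scores
    future = np.triu(np.ones_like(E, dtype=bool), k=1)  # True where key j comes after query i
    E = np.where(future, -np.inf, E)                    # masked scores: exp(-inf) = 0
    E = E - E.max(axis=-1, keepdims=True)               # numerical stability
    A = np.exp(E) / np.exp(E).sum(axis=-1, keepdims=True)
    return A @ V                                        # Y_i depends only on X_1 .. X_i

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 8))                         # e.g. embeddings of <START>, "This", "is"
W_Q, W_K, W_V = (rng.standard_normal((8, 16)) for _ in range(3))
print(masked_self_attention(X, W_Q, W_K, W_V).shape)    # (3, 16)
```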
Transformer architecture: Details
• Encoder: N transformer blocks
• Decoder: N transformer blocks
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
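A minimal PyTorch sketch of one of the "N transformer blocks" in the encoder, assuming the standard layout from Vaswani et al. (multi-head self-attention and a feed-forward network, each with a residual connection and layer normalization); all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)                # self-attention: queries = keys = values = x
        x = self.norm1(x + self.drop(a))         # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])   # "N transformer blocks"
print(encoder(torch.randn(2, 10, 512)).shape)                  # torch.Size([2, 10, 512])
```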
Different ways of processing sequences
[Figure: three architectures applied to the same input sequence: an RNN (hidden states h1 … h4 over inputs x1 … x4), a 1D convolutional network, and a self-attention/Transformer layer (outputs Y1, Y2, Y3 from inputs X1, X2, X3)]
Self-supervised language modeling with transformers
1. Download a lot of text from the internet
2. Train a transformer using a suitable pretext task
3. Fine-tune the transformer on the desired NLP task
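A sketch of steps 2-3 in practice using the Hugging Face transformers library (not part of the lecture): a transformer already pretrained on a pretext task is loaded and its weights are fine-tuned on a downstream task; the checkpoint name and the two-class sentiment task are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 2 has already been done for us: load a transformer pretrained on a pretext task.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 3: fine-tune on a (tiny, illustrative) labeled downstream dataset.
batch = tok(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss    # cross-entropy on the classification head
loss.backward()
optimizer.step()
print(float(loss))
```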
Bidirectional Encoder Representations from Transformers (BERT)
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of the tokens in the input sentences; the goal is to reconstruct them using bidirectional context
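A sketch of the MLM input corruption described above; whitespace tokenization and the plain [MASK]-only replacement are simplifications (the full BERT recipe also sometimes substitutes a random token or keeps the original).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)          # the model is trained to predict this token
        else:
            corrupted.append(tok)
            targets.append(None)         # no reconstruction loss at unmasked positions
    return corrupted, targets

sentence = "the man went to the store to buy a gallon of milk".split()
corrupted, targets = mask_tokens(sentence)
print(corrupted)
print(targets)
```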
BERT: Pretext tasks
• Next sentence prediction (NSP): predict the likelihood that sentence B belongs after sentence A
• Useful for Question Answering and Natural Language Inference tasks
• In the training data, 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence (labeled as NotNext)
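A sketch of how IsNext/NotNext training pairs could be assembled from a sentence-segmented corpus; the toy documents and sampling details are illustrative assumptions.

```python
import random

def make_nsp_pairs(documents, num_pairs, rng=random.Random(0)):
    pairs = []
    for _ in range(num_pairs):
        doc = rng.choice(documents)
        i = rng.randrange(len(doc) - 1)
        a = doc[i]
        if rng.random() < 0.5:
            b, label = doc[i + 1], "IsNext"                          # B really follows A
        else:
            b, label = rng.choice(rng.choice(documents)), "NotNext"  # B is a random sentence
        pairs.append((a, b, label))
    return pairs

docs = [["Sentence A1.", "Sentence A2.", "Sentence A3."],
        ["Sentence B1.", "Sentence B2."]]
for a, b, label in make_nsp_pairs(docs, 4):
    print(label, "|", a, "->", b)
```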
BERT: More detailed view
BERT: Downstream tasks
• Textual entailment (Source: J. Hockenmaier)
• Question answering: find the span in the paragraph that contains the answer (Source: SQuAD v1.1 paper)
Scaling up transformers

Model             | Layers | Hidden dim. | Heads | Params | Data   | Training
Transformer-Base  | 12     | 512         | 8     | 65M    | –      | 8x P100 (12 hours)
Transformer-Large | 12     | 1024        | 16    | 213M   | –      | 8x P100 (3.5 days)
BERT-Base         | 12     | 768         | 12    | 110M   | 13 GB  | 4x TPU (4 days)
BERT-Large        | 24     | 1024        | 16    | 340M   | 13 GB  | 16x TPU (4 days)
XLNet-Large       | 24     | 1024        | 16    | ~340M  | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa           | 24     | 1024        | 16    | 355M   | 160 GB | 1024x V100 GPU (1 day)
GPT-2             | 48     | 1600        | ?     | 1.5B   | 40 GB  | –
Megatron-LM       | 72     | 3072        | 32    | 8.3B   | 174 GB | 512x V100 GPU (9 days)
Turing-NLG        | 78     | 4256        | 28    | 17B    | ?      | 256x V100 GPU

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Yang et al., XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019
Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019
Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism, 2019
A. Radford et al., Improving language understanding with unsupervised learning, 2018
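A back-of-the-envelope check on the Params column for the BERT rows, assuming the standard BERT layout (feed-forward width 4x the hidden dimension, ~30K WordPiece vocabulary, biases and layer norms ignored); the formula is an approximation, not taken from the papers.

```python
# Rough parameter count: per layer, ~4*d^2 for the attention projections plus ~8*d^2 for the
# feed-forward network (width 4d), plus token and position embedding tables.
def approx_params(layers, d, vocab=30522, max_pos=512):
    per_layer = 4 * d * d + 2 * d * (4 * d)
    embeddings = vocab * d + max_pos * d
    return layers * per_layer + embeddings

for name, layers, d in [("BERT-Base", 12, 768), ("BERT-Large", 24, 1024)]:
    print(f"{name}: ~{approx_params(layers, d) / 1e6:.0f}M parameters")
# BERT-Base: ~109M, BERT-Large: ~334M -- close to the 110M / 340M in the table
```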
GPT-2 and GPT-3
• Key idea: if the model and training datasets are big enough, the model can adapt to new tasks without fine-tuning
GPT-2: A. Radford et al., Language models are unsupervised multitask learners, 2019
GPT-3: T. Brown et al., Language models are few-shot learners, arXiv 2020
GPT-3
• Few-shot learning: in addition to the task description, the model sees a few examples of the task
[Figure: GPT-3 few-shot examples, with three articles provided as training examples; gray text shows the human prompts and boldface shows the GPT-3 completions. Example tasks: use a new word in a sentence, correct grammar, generate poems]
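A sketch of how a few-shot prompt like the examples above can be assembled: a task description, a few solved examples, and an unsolved query for the model to complete; the wording is invented for illustration, not taken from the GPT-3 paper.

```python
def few_shot_prompt(task_description, examples, query):
    lines = [task_description, ""]
    for original, corrected in examples:
        lines += [f"Original: {original}", f"Corrected: {corrected}", ""]
    lines += [f"Original: {query}", "Corrected:"]      # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Correct the grammar of each sentence.",
    [("She no went to the market.", "She didn't go to the market."),
     ("Him and me goes to school.", "He and I go to school.")],
    "They was happy about the results.")
print(prompt)
```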
Transformers: Outline
• Transformer architecture
• Attention models
• Implementation details
• Transformer-based language models
• BERT
• GPT and other models
• Applications of transformers in vision
Vision-and-language BERT (ViLBERT)
• Inputs: image regions and features from detector output, together with the sentence tokens
• Vision-language co-attention between the image and language streams
• Pretext tasks: predict the object class of a masked-out region; predict whether the image and sentence go together
J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019
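A minimal sketch of the co-attention idea (not ViLBERT's exact architecture): each stream forms queries against the other stream's keys and values, so the language features are conditioned on the image regions and vice versa; dimensions and head counts are illustrative.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # queries come from one stream, keys and values from the other
        t, _ = self.txt_attends_img(text_feats, image_feats, image_feats)
        v, _ = self.img_attends_txt(image_feats, text_feats, text_feats)
        return t, v

text = torch.randn(2, 12, 512)        # 12 word tokens per sentence
regions = torch.randn(2, 36, 512)     # 36 detected image regions (features from a detector)
t, v = CoAttentionBlock()(text, regions)
print(t.shape, v.shape)               # torch.Size([2, 12, 512]) torch.Size([2, 36, 512])
```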
Detection Transformer (DETR)

Image GPT
https://openai.com/blog/image-gpt/

Vision Transformer (ViT)
Dosovitskiy, Alexey, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", International Conference on Learning Representations, 2021
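A sketch of the "image as 16x16 words" idea behind ViT: the image is cut into fixed-size patches and each patch is linearly projected to a token embedding that a standard transformer encoder can consume; patch size and embedding width follow the ViT-Base defaults, the rest is illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch=16, in_ch=3, d_model=768):
        super().__init__()
        # a stride-`patch` convolution is equivalent to flattening each patch and projecting it linearly
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (batch, 3, H, W)
        x = self.proj(x)                        # (batch, d_model, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (batch, num_patches, d_model): a sequence of "words"

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                             # torch.Size([1, 196, 768]): 14x14 patches of 16x16 pixels
```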
ACKNOWLEDGEMENT
• Deep Learning, Stanford University