
Tackling CNN and RNN Issues

Computer Vision and Biometrics Lab (CVBL)


Department of Information Technology
Indian Institute of Information Technology Allahabad

Course Instructors
Dr. Satish Kumar Singh, Associate Professor, IIIT Allahabad (Email: [email protected])
Dr. Shiv Ram Dubey, Assistant Professor, IIIT Allahabad (Email: [email protected])

Teaching Assistants (TAs)


Mr. Neeraj Baghel, Research Scholar, IIIT Allahabad
Ms. Priyam Pandey, Research Scholar, IIIT Allahabad
Mr. Haresh Kumar, MTech Student, IIIT Allahabad
Mr. Umang Sorathiya, MTech Student, IIIT Allahabad
The content (text, images, and graphics) used in these slides is adapted from many sources for academic purposes. Broadly, the sources have been given due credit appropriately. However, some original primary sources may have been missed. The authors of this material do not claim any copyright over such material.
Self-Attention and Transformer

Many slides adapted from J. Johnson


Different ways of processing sequences

RNN
• Works on ordered sequences
• Pro: Good at long sequences: the last hidden vector encapsulates the whole sequence
• Con: Not parallelizable: need to compute hidden states sequentially

1D convolutional network
• Works on multidimensional grids
• Con: Bad at long sequences: need to stack many conv layers for outputs to “see” the whole sequence
• Pro: All outputs can be computed in parallel

Self-Attention and Transformer
• Works on sets of vectors
[Figure: attention computation — queries Q and keys K give similarities E, Softmax(↑) gives weights A, then Product(→) and Sum(↑) with values V produce outputs Y]
Outline
• Transformer architecture
• Attention models
• Implementation details
• Transformer-based language models
• BERT
• GPT and Other models
• Applications of transformers in vision
Attention is all you need
• Neural Machine Translation (NMT) architecture using only point-wise processing and attention (no recurrent units or convolutions)

Encoder: receives entire input sequence and outputs encoded sequence of the same length
Decoder: predicts next token conditioned on encoder output and previously predicted tokens

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017
Attention is all you need
• Neural Machine Translation (NMT) architecture using only point-wise processing and attention (no recurrent units or convolutions)
• More efficient and parallelizable than recurrent or convolutional
architectures, faster to train, better accuracy

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017
Key-Value-Query attention model

• The decoder generates a query describing what it wants to focus on
• Compute dot products between the query and the keys generated by the encoder, giving alignment scores between the source tokens and the query
• Feed the scores into a softmax to create the attention weights
• Sum the values generated by the encoder, weighted by the attention weights

[Figure: the encoder processes inputs X1…X4; decoder queries attend over the encoder outputs to produce Y1…Y4]
Key-Value-Query attention model

• Key vectors: K = X·W_K
• Value vectors: V = X·W_V
• Query vectors: Q (provided as input, e.g., generated by the decoder)
• Similarities: scaled dot-product attention
  E_{i,j} = (Q_i · K_j) / √D
  (D is the dimensionality of the keys)
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: Y_i = Σ_j A_{i,j} V_j, or Y = A·V
Self-attention layer

• One query per input vector: Q = X·W_Q
• Key vectors: K = X·W_K
• Value vectors: V = X·W_V
• Similarities: scaled dot-product attention
  E_{i,j} = (Q_i · K_j) / √D
  (D is the dimensionality of the keys)
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: Y_i = Σ_j A_{i,j} V_j, or Y = A·V
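A minimal PyTorch sketch of the self-attention layer above (no batching, no masking; all shapes and dimensions are illustrative). With the similarity matrix indexed here as E[query, key], the softmax over keys corresponds to dim=-1.

import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    """X: (N, D_in); W_Q, W_K: (D_in, D); W_V: (D_in, D_V)."""
    Q = X @ W_Q                          # one query per input vector: (N, D)
    K = X @ W_K                          # keys:   (N, D)
    V = X @ W_V                          # values: (N, D_V)
    E = Q @ K.T / K.shape[-1] ** 0.5     # scaled dot-product similarities: (N, N)
    A = F.softmax(E, dim=-1)             # attention weights over the keys
    return A @ V                         # outputs Y = A·V: (N, D_V)

# Example: three input vectors of dimension 8, attention dimension 4
X = torch.randn(3, 8)
W_Q, W_K, W_V = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)
Y = self_attention(X, W_Q, W_K, W_V)     # Y has shape (3, 4)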


Recall: Self-attention GAN

• Queries f(x_i), keys g(x_j), values h(x_i)
• Alignment scores: s_{i,j} = f(x_i)^T g(x_j)
• Attention weights: β_{j,i} = exp(s_{i,j}) / Σ_i exp(s_{i,j})
  (how much to attend to location i while synthesizing the feature at location j)
• Output: o_j = v( Σ_i β_{j,i} h(x_i) )
H. Zhang, I. Goodfellow, D. Metaxas, A. Odena. Self-Attention Generative Adversarial Networks. ICML 2019
Masked self-attention layer

• The decoder should not “look ahead” in the output sequence
• Mask out attention to future positions: set E_{i,j} = −∞ for every key index j greater than the query index i, so that after the softmax the corresponding attention weights A_{i,j} become 0
• Example: given inputs <START>, “This”, “is”, the outputs are “This”, “is”, …

[Figure: the 3×3 similarity grid with entries for future keys set to −∞ before the softmax and to 0 afterwards]
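A short PyTorch sketch of the masking step (sizes are illustrative): similarities for future positions are set to −∞ before the softmax, so the corresponding attention weights become exactly 0.

import torch
import torch.nn.functional as F

N = 3
E = torch.randn(N, N)                                     # E[i, j]: similarity of query i with key j
future = torch.triu(torch.ones(N, N), diagonal=1).bool()  # True where key index j > query index i
E = E.masked_fill(future, float('-inf'))                  # block attention to future tokens
A = F.softmax(E, dim=-1)                                  # masked entries become 0; row i attends only to keys <= i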
Transformer architecture: Details

[Figure: the encoder-decoder architecture from Vaswani et al., with N transformer blocks on the encoder side and N on the decoder side]

A. Vaswani et al., Attention is all you need, NeurIPS 2017

Attention mechanisms
• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder
Multi-head attention
• Run h attention models in parallel on top of different linearly projected versions of Q, K, V; concatenate and linearly project the results
• Intuition: enables the model to attend to different kinds of information at different positions
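A small sketch using PyTorch’s built-in nn.MultiheadAttention (dimensions are illustrative): h heads attend to different linear projections of Q, K, V, and their outputs are concatenated and projected back to the embedding dimension.

import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)   # (batch, sequence, embedding)
y, attn = mha(x, x, x)                   # self-attention: Q = K = V = x
print(y.shape)                           # torch.Size([1, 10, 512])
print(attn.shape)                        # torch.Size([1, 10, 10]): attention weights averaged over heads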
Transformer blocks
• A Transformer is a sequence of transformer blocks
• Vaswani et al. (base model): 6 encoder and 6 decoder blocks, embedding dimension = 512, 8 attention heads
• Add & Norm: residual connection followed by layer normalization
• Feedforward: two linear layers with a ReLU in between, applied independently to each vector
• Attention is the only interaction between inputs!
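A rough sketch of one encoder-style block following this description (post-norm layout as in Vaswani et al.; hyperparameters are illustrative):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(           # two linear layers with a ReLU, applied per position
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)          # attention: the only interaction between positions
        x = self.norm1(x + a)              # Add & Norm
        return self.norm2(x + self.ff(x))  # Add & Norm

x = torch.randn(1, 10, 512)
print(TransformerBlock()(x).shape)         # torch.Size([1, 10, 512])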
Positional encoding
• To give the transformer information about the ordering of tokens, add a function of position (based on sines and cosines) to every input position

[Figure: positional encoding values as a function of position and embedding dimension]
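A sketch of the sinusoidal encoding from Vaswani et al.: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)), added element-wise to the input embeddings.

import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)     # even dimensions
    pe[:, 1::2] = torch.cos(pos / div)     # odd dimensions
    return pe                              # added to the token embeddings

pe = positional_encoding(max_len=100, d_model=512)   # shape (100, 512)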


Results

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Different ways of processing sequences

RNN
• Works on ordered sequences
• Pro: Good at long sequences: the last hidden vector encapsulates the whole sequence
• Con: Not parallelizable: need to compute hidden states sequentially

1D convolutional network
• Works on multidimensional grids
• Con: Bad at long sequences: need to stack many conv layers for outputs to “see” the whole sequence
• Pro: Highly parallel: each output can be computed in parallel

Self-Attention and Transformer
• Works on sets of vectors
• Pro: Good at long sequences: after one self-attention layer, each output “sees” all inputs!
• Pro: Highly parallel: each output can be computed in parallel
• Con: Very memory-intensive
Outline
• Transformer architecture
• Attention models
• Implementation details
• Transformer-based language models
• BERT
• GPT and Other models
Self-supervised language modeling with transformers
1. Download a lot of text from the internet
2. Train a transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task

Self-supervised language modeling with transformers
1. Download a lot of text from the internet
2. Train a transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task
Bidirectional Encoder Representations from Transformers (BERT)

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding, EMNLP 2018
BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of tokens in input sentences, goal is to
reconstruct them using bidirectional context
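A toy PyTorch sketch of MLM input preparation (the token ids and MASK_ID value below are made up for illustration; BERT additionally keeps or randomizes a fraction of the selected tokens):

import torch

MASK_ID = 103                                    # hypothetical id of the [MASK] token
tokens = torch.randint(1000, 30000, (1, 16))     # a fake batch of token ids
selected = torch.rand(tokens.shape) < 0.15       # ~15% of positions
inputs = tokens.masked_fill(selected, MASK_ID)   # model sees [MASK] at selected positions
labels = tokens.masked_fill(~selected, -100)     # -100 = ignore index in PyTorch cross-entropy
# The model is trained to predict `labels` from `inputs` using bidirectional context.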

BERT: Pretext tasks
• Next sentence prediction (NSP): predict the likelihood that sentence B belongs after sentence A
• Useful for Question Answering and Natural Language Inference tasks
• In the training data, 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence (labeled as NotNext).
BERT: More detailed view

• WordPiece tokenization (from GNMT)
• Trained on Wikipedia (2.5B words) + BookCorpus (800M words)
BERT: Downstream tasks

Textual entailment

Source: J. Hockenmaier

Entailment, textual equivalence and similarity


BERT: Downstream tasks

Sentiment classification, linguistic acceptability


BERT: Downstream tasks

Find span in paragraph that contains the answer Source: SQuAD v1.1 paper
BERT: Downstream tasks


Named entity recognition


Outline
• Transformer architecture
• Attention models
• Implementation details
• Transformer-based language models
• BERT
• GPT and Other models
Other language models

Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)

Vaswani et al., Attention is all you need, NeurIPS 2017


Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, EMNLP 2018
Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)

Yang et al., XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019
Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019
Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB

Radford et al., Language models are unsupervised multitask learners, 2019


Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)

~$430,000 on Amazon AWS!

Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism, 2019
Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU

Microsoft, Turing-NLG: A 17-billion parameter language model by Microsoft, 2020


Scaling up transformers
Model Layers Hidden dim. Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB 4x TPU (4 days)
BERT-Large 24 1024 16 340M 13 GB 16x TPU (4 days)
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12288 96 175B 694GB ?

~$4.6M, 355 GPU-years



Brown et al., Language Models are Few-Shot Learners, arXiv 2020


OpenAI GPT (Generative Pre-Training)
• Pre-training task: next token prediction

A. Radford et al., Improving language understanding with unsupervised learning, 2018
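A minimal sketch of the next-token-prediction objective (the logits below are a stand-in for the output of a decoder-only transformer with causal masking; sizes are illustrative): each position is trained with cross-entropy to predict the following token.

import torch
import torch.nn.functional as F

vocab_size, seq_len = 50000, 12
tokens = torch.randint(0, vocab_size, (1, seq_len))   # fake token ids
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model output

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),           # predictions at positions 0 .. T-2
    tokens[:, 1:].reshape(-1))                        # targets are the next tokens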
GPT-2 and GPT-3
• Key idea: if the model and training datasets are big enough,
model can adapt to new tasks without fine-tuning

GPT-2: A. Radford et al., Language models are unsupervised multitask learners, 2019
GPT-3: T. Brown et al., Language models are few-shot learners, arXiv 2020
GPT-3
• Key idea: if the model and training datasets are big enough,
model can adapt to new tasks without fine-tuning
• Few-shot learning: In addition to the task description, the
model sees a few examples of the task

T. Brown et al., Language models are few-shot learners, arXiv 2020


GPT-3
• Key idea: if the model and training datasets are big enough,
model can adapt to new tasks without fine-tuning
• One-shot learning: In addition to the task description, the
model sees a single example of the task

T. Brown et al., Language models are few-shot learners, arXiv 2020


GPT-3
• Key idea: if the model and training datasets are big enough,
model can adapt to new tasks without fine-tuning
• Zero-shot learning: The model sees the task description
and no training examples

T. Brown et al., Language models are few-shot learners, arXiv 2020


Task: Generate news article
Gray: human prompts,
boldface: GPT-3
completions

(Three articles
provided as training
examples)
Task: Use new word in sentence
Gray: human prompts,
boldface: GPT-3
completions
Task: Correct grammar
Gray: human prompts,
boldface: GPT-3
completions
Task: Generate poems
Transformers: Outline
• Transformer architecture
• Attention models
• Implementation details
• Transformer-based language models
• BERT
• GPT and Other models
• Applications of transformers in vision
Vision-and-language BERT
• Image regions and features from detector output
• Vision-language co-attention
• Pretext tasks: predict the object class of a masked-out region; predict whether an image and sentence go together

J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-
Language Tasks, NeurIPS 2019
Detection Transformer (DETR)

N. Carion et al., End-to-end object detection with transformers, arXiv 2020


Image transformer

N. Parmar et al., Image transformer, ICML 2018


Image GPT

https://openai.com/blog/image-gpt/

M. Chen et al., Generative pretraining from pixels, ICML 2020


Image GPT

M. Chen et al., Generative pretraining from pixels, ICML 2020


Image GPT

M. Chen et al., Generative pretraining from pixels, ICML 2020


Vision Transformer

Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale."
International Conference on Learning Representations. 2020.
ACKNOWLEDGEMENT
• Deep Learning, Stanford University

• Introduction to Deep Learning, University of Illinois at Urbana-Champaign

• Introduction to Deep Learning, Carnegie Mellon University

• Convolutional Neural Networks for Visual Recognition, Stanford University

• Natural Language Processing with Deep Learning, Stanford University

• NVIDIA Deep Learning Teaching Kit

• And many more…
