
Introduction to Transformers

Lecture-2
CAP6412 Spring 2024
Mubarak Shah
[email protected]



Contents
• Basics
• What is a Transformer?
• Self-Attention
• Query, Key, Value
• Position encoding
• Encoder-Decoder
• BERT
• Vision Transformers
• VIT
• SWIN
Transformer
• Models long-range dependencies between input sequence elements
• Supports parallel processing of the sequence, in contrast to RNNs (e.g., LSTMs)
• Can process multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks
• Typically pre-trained using pretext tasks on large-scale (unlabeled) datasets
• Demonstrates excellent scalability to very large networks and huge datasets

• Example: GPT-4 (Generative Pre-trained Transformer)



Vision Applications
• Recognition tasks (e.g., image classification, object detection, action recognition, and segmentation)
• Generative modeling
• Multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding)
• Video processing (e.g., activity recognition, video forecasting)
• Low-level vision (e.g., image super-resolution, image enhancement, and colorization)
• 3D analysis (e.g., point cloud classification and segmentation)



Natural Language Processing

• BERT (Bidirectional Encoder Representations from Transformers),


• GPT (Generative Pre-trained Transformer) v1-4,
• RoBERTa (Robustly Optimized BERT Pre-training)
• T5 (Text-to-Text Transfer Transformer)



Transformer Basics
• It consists of Encoder and Decoder Blocks
• Main components of each block:
• Self-Attention
• Layer Normalization
• Feed Forward Network



Transformer

[Figure: the Transformer architecture, showing the encoder and decoder stacks.]



[Slides 8–13: figure-only walk-through of the Transformer block (including an add & normalize step); slide figures courtesy of the AI Bites YouTube channel: https://www.youtube.com/c/AIBites]


Self-Attention
• So far, no learning is involved: the attention weights are computed directly from the input features, with no learnable parameters yet.



Self-Attention of Image Features

Attention Map



Self-Attention (Matrix Explanation)

Attention Map



Self-Attention

[Figure (matrix view of the attention map): for 9 tokens of dimension 3, the input X (9x3) is multiplied by 3x3 weight matrices W_Q, W_K, W_V to give Q, K, V (each 9x3); Q K^T yields the 9x9 attention map, whose softmax weights V.]
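To make the matrix picture concrete, below is a minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes (9 tokens of dimension 3) mirror the figure above; the weight matrices are random stand-ins for learned parameters, and the helper names are ours, not from the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (n_tokens, d_model); W_q/W_k/W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (n_tokens, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n_tokens, n_tokens)
    attn = softmax(scores, axis=-1)              # attention map, rows sum to 1
    return attn @ V, attn                        # weighted values and the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 3))                      # 9 tokens, dimension 3 (as in the figure)
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)                          # (9, 3) (9, 9)
```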


Transformers (Attention is all you need 2017)
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin,
“Attention is all you need,” in NeurIPS, 2017. (102,618 citations)
• Two valuable sources:
• http://nlp.seas.harvard.edu/2018/04/03/attention.html
• https://jalammar.github.io/illustrated-transformer/ (slides come from this source)

• Slides from Ming Li, University of Waterloo, CS 886 Deep Learning and NLP, 02 Attention and Transformers

Transformer


An Encoder Block: same structure, different parameters


Encoder

The feed-forward network (FFN) is applied to each word independently; hence it can be parallelized across positions.

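As a small illustration of that point, here is a sketch of the position-wise feed-forward network: the same two-layer MLP (with a ReLU) is applied to every token independently, which is why it parallelizes trivially across positions. The sizes d_model = 512 and d_ff = 2048 follow the original paper; the weights below are random stand-ins.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer MLP (with ReLU) to each token independently.
    x: (n_tokens, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 512, 2048, 9
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (9, 512): one output per token
```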

Self Attention

• First, we create three vectors by multiplying the input embedding x_i with three learned matrices:

• q_i = x_i W^Q
• k_i = x_i W^K
• v_i = x_i W^V


Self Attention

Next, we calculate a score that determines how much focus to place on other parts of the input.


Self Attention

Formula:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k = 64 is the dimension of the key vectors. For the two-word example, the output at the first position is z_1 = 0.88 v_1 + 0.12 v_2.


Multiple heads
1. It expands the model’s ability to focus on different positions.
2. It gives the attention layer multiple “representation subspaces”.

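A minimal sketch of multi-head attention along these lines: each head projects the input into its own subspace, attends there, and the per-head outputs are concatenated and projected back with W_O. The head count and dimensions below are illustrative, and the weights are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q/W_k/W_v: (h, d_model, d_k); W_o: (h*d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):         # one set of projections per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        heads.append(A @ V)                        # each head attends in its own subspace
    return np.concatenate(heads, axis=-1) @ W_o    # concatenate heads, project back to d_model

rng = np.random.default_rng(0)
n, d_model, h, d_k = 9, 512, 8, 64                 # illustrative sizes (d_k = d_model / h)
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # (9, 512)
```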


Representing the input order (positional encoding)

• The Transformer is permutation invariant: on its own it has no notion of token order.

• The Transformer therefore adds a position vector to each input embedding.

• These vectors follow a specific pattern that the model learns.

• The learned pattern helps the model
• determine the position of each word, or
• the distance between different words.


Representing the input order (positional encoding)

Position Encoding
• Positional encodings can also be learned, just like any other model parameters


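For reference, here is a small sketch of the fixed sinusoidal positional encoding from the original Transformer paper (sine on even dimensions, cosine on odd ones); the sequence length and model dimension below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]                     # (n_positions, 1)
    dims = np.arange(d_model)[None, :]                              # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                           # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                           # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=512)
print(pe.shape)   # (50, 512); added element-wise to the input embeddings
```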

Add and Normalize


Layer Normalization (Ba, Kiros & Hinton, 2016)

Layer normalization normalizes the inputs across the features, i.e., each token's feature vector is normalized independently.
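A minimal sketch of layer normalization with the usual learnable gain and bias; each token's feature vector is normalized on its own, so the operation does not depend on the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance,
    then apply a learnable gain (gamma) and bias (beta).
    x: (n_tokens, d_model); gamma, beta: (d_model,)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(9, 512))
gamma, beta = np.ones(512), np.zeros(512)
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1)[:3], out.std(axis=-1)[:3])   # ~0 mean, ~1 std per token
```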

The complete transformer

[Figure: in the complete Transformer, the encoder output provides the keys K and values V to the decoder's encoder-decoder attention, while the queries Q come from the decoder.]

BERT (Stack Encoder Blocks)
BERT (Bidirectional Encoder Representations from Transformers)

• BERT jointly encodes the right and left context of a word in a sentence to improve the learned feature representations
• BERT is trained on two pretext tasks in a self-supervised manner
• Masked Language Model (MLM)
• Mask a fixed percentage (15%) of the words in a sentence and predict these masked words
• In predicting the masked words, the model learns the bidirectional context (a toy masking sketch follows below).
• Next Sentence Prediction (NSP)
• Given a pair of sentences A and B, the model predicts a binary label, i.e., whether the pair is valid (consecutive in the original document) or not
• The pair is formed such that B is the actual next sentence (following A) 50% of the time, and a random sentence the other 50% of the time.

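To illustrate the MLM pretext task, here is a toy masking sketch. The 15% selection below is simplified: real BERT additionally replaces some selected tokens with random tokens or leaves them unchanged (an 80/10/10 split), and operates on WordPiece tokens rather than whitespace-split words.

```python
import numpy as np

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy masked-language-model corruption: select ~15% of positions,
    replace them with [MASK], and return (corrupted tokens, targets).
    The model is then trained to predict the original token at each target position."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    targets = {}                                   # position -> original token to predict
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets[i] = tokens[i]
            tokens[i] = mask_token
    return tokens, targets

corrupted, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(corrupted, targets)
```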


BERT
GPT (Stack Decoder Blocks)
BERT and GPT

https://www.youtube.com/shorts/BEt_BACGw6g
Vision Transformers
Mubarak Shah
[email protected]



Vision Transformer (VIT)
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M.
Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words:
Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929,
2020. (ICLR 2021; 26,938 citations!)



Vision Transformer (VIT)
• Naive application of self-attention to images (attending over all pixel pairs) requires very high computation
• Divide an image into 16x16 patches (tokens)
• Transformers need to be trained on large datasets
• VIT attains excellent results when pre-trained on JFT-300M:
• 88.55% on ImageNet,
• 90.72% on ImageNet-ReaL,
• 94.55% on CIFAR-100,
• 77.63% on the VTAB suite of 19 tasks



Vision Transformer (VIT)



Divide image into 16x16 patches

Slide credit: Piotr Mazurek


Generate embedding for each patch

Slide credit: Piotr Mazurek
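A minimal sketch of the two steps above: split the image into 16x16 patches, flatten each patch, and project it with a linear layer to get one embedding per patch. The projection matrix is a random stand-in for the learned patch embedding; real ViT then prepends a CLS token and adds positional embeddings (next slides).

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=768, rng=None):
    """image: (H, W, C). Returns (n_patches, d_model) patch embeddings."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    patches = []
    for y in range(0, H, patch_size):
        for x in range(0, W, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size].reshape(-1))
    patches = np.stack(patches)                              # (n_patches, patch_size*patch_size*C)
    W_embed = rng.normal(size=(patches.shape[1], d_model))   # learned in practice; random here
    return patches @ W_embed

img = np.random.default_rng(1).random((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)    # (196, 768): 14x14 patches of size 16x16
```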


CLS (Classification) Token

CLS

Slide credit: Piotr Mazurek


Complete VIT

Slide credit: Piotr Mazurek


VIT Model Variants



Results



Attention



Vision Transformer (VIT)



SWIN
• Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows", ICCV 2021 (Marr Prize). 13,035 citations



SWIN

• Adapting the Transformer from language to vision is challenging

• Differences between the language and vision domains:
• Large variations in the scale of visual entities
• Much higher resolution of pixels in images compared to words in text
• To address these differences, SWIN proposes a hierarchical Transformer whose representation is computed with shifted windows
• Shifted windowing limits self-attention to local windows while still allowing cross-window connections (a small windowing sketch follows below)

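A toy sketch of the windowing idea: partition the feature map into non-overlapping windows (self-attention is then computed inside each window), and cyclically shift the map before the next block so that the new windows straddle the previous window boundaries. Shapes and window size below are illustrative, and real Swin additionally masks attention across the wrapped-around regions.

```python
import numpy as np

def window_partition(x, window_size):
    """x: (H, W, C) feature map -> (num_windows, window_size*window_size, C) token groups.
    Self-attention is computed independently inside each window."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, shift):
    """Roll the feature map so the next block's windows straddle the old window
    boundaries, giving cross-window connections."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

feat = np.random.default_rng(0).normal(size=(56, 56, 96))    # stage-1 feature map, C=96
windows = window_partition(feat, window_size=7)
shifted_windows = window_partition(cyclic_shift(feat, shift=3), window_size=7)
print(windows.shape, shifted_windows.shape)                  # (64, 49, 96) each
```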


Hierarchical Feature Maps and Local Attention



SWIN
• 87.3% top-1 accuracy on ImageNet-1K
• Dense prediction tasks such as
• Object detection (58.7 box AP and 51.1 mask AP on COCO)
• Semantic segmentation (53.5 mIoU on ADE20K)
• Performance surpasses the previous state-of-the-art by
• +2.7 box AP and +2.6 mask AP on COCO, and
• +3.2 mIoU on ADE20K



SWIN Architecture

C = 96, 128, or 192, depending on the model variant



Self-Attention within each window and shifted windows



Transformer Blocks



Different Configurations



Results



Results



Results



Summary
• VIT is the first Vision Transformer, but it is trained on a huge dataset of 300M images (JFT-300M)

• SWIN employs window attention

• SWIN performs well on other tasks: object detection, semantic segmentation

