
Introduction to Transformers

Lecture-2
CAP6412 Spring 2024
Mubarak Shah
[email protected]



Contents
• Basics
• What is a Transformer?
• Self-Attention
• Query, Key, Value
• Position encoding
• Encoder-Decoder
• BERT
• Vision Transformers
• VIT
• SWIN
Transformer
• Models long-range dependencies between input sequence elements
• Supports parallel processing of the sequence, in contrast to RNNs (e.g., LSTMs)
• Can process multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks
• Typically pre-trained using pretext tasks on large-scale (unlabeled) datasets
• Demonstrates excellent scalability to very large networks and huge datasets

• Example: GPT-4 (Generative Pre-trained Transformer)



Vision Applications
• Recognition tasks (e.g., image classification, object detection, action recognition, and segmentation)
• Generative modeling
• Multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding)
• Video processing (e.g., activity recognition, video forecasting)
• Low-level vision (e.g., image super-resolution, image enhancement, and colorization)
• 3D analysis (e.g., point cloud classification and segmentation)



Natural Language Processing

• BERT (Bidirectional Encoder Representations from Transformers),


• GPT (Generative Pre-trained Transformer) v1-4,
• RoBERTa (Robustly Optimized BERT Pre-training)
• T5 (Text-to-Text Transfer Transformer)



Transformer Basics
• It consists of Encoder and Decoder Blocks
• Main components of each block:
• Self-Attention
• Layer Normalization
• Feed Forward Network



Transformer

[Figure: the Transformer architecture, showing the encoder and decoder stacks.]



[Slides 8–13: figure-only walk-through of the Transformer block (including an add & normalize step); slide figures courtesy of the AI Bites YouTube channel: https://www.youtube.com/c/AIBites]


Self-Attention
• So far, no learning is involved: the attention weights are computed directly from the input features, with no learnable parameters yet.



Self-Attention of Image Features

Attention Map



Self-Attention (Matrix Explanation)

Attention Map



Self-Attention

[Figure (matrix view of the attention map): for 9 tokens of dimension 3, the input X (9x3) is multiplied by 3x3 weight matrices W_Q, W_K, W_V to give Q, K, V (each 9x3); Q K^T yields the 9x9 attention map, whose softmax weights V.]
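To make the matrix picture concrete, below is a minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes (9 tokens of dimension 3) mirror the figure above; the weight matrices are random stand-ins for learned parameters, and the helper names are ours, not from the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (n_tokens, d_model); W_q/W_k/W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (n_tokens, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n_tokens, n_tokens)
    attn = softmax(scores, axis=-1)              # attention map, rows sum to 1
    return attn @ V, attn                        # weighted values and the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 3))                      # 9 tokens, dimension 3 (as in the figure)
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)                          # (9, 3) (9, 9)
```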


Transformers (Attention is all you need 2017)
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin,
“Attention is all you need,” in NeurIPS, 2017. (102,618 citations)
• Two valuable sources:
• http://nlp.seas.harvard.edu/2018/04/03/attention.html
• https://jalammar.github.io/illustrated-transformer/ (slides come from this source)

• Slides from Ming Li, University of Waterloo, CS 886 Deep Learning and NLP, 02 Attention and Transformers

Transformer


An Encoder Block: same structure, different parameters


Encoder

The feed-forward network (FFN) is applied to each word independently; hence it can be parallelized across positions.

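As a small illustration of that point, here is a sketch of the position-wise feed-forward network: the same two-layer MLP (with a ReLU) is applied to every token independently, which is why it parallelizes trivially across positions. The sizes d_model = 512 and d_ff = 2048 follow the original paper; the weights below are random stand-ins.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer MLP (with ReLU) to each token independently.
    x: (n_tokens, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 512, 2048, 9
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (9, 512): one output per token
```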

Self Attention

• First, we create three vectors by multiplying the input embedding x_i with three learned matrices:

• q_i = x_i W^Q
• k_i = x_i W^K
• v_i = x_i W^V


Self Attention

Next, we calculate a score that determines how much focus to place on other parts of the input.


Self Attention

Formula:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k = 64 is the dimension of the key vectors. For the two-word example, the output at the first position is z_1 = 0.88 v_1 + 0.12 v_2.


Multiple heads
1. It expands the model’s ability to focus on different positions.
2. It gives the attention layer multiple “representation subspaces”.

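A minimal sketch of multi-head attention along these lines: each head projects the input into its own subspace, attends there, and the per-head outputs are concatenated and projected back with W_O. The head count and dimensions below are illustrative, and the weights are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q/W_k/W_v: (h, d_model, d_k); W_o: (h*d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):         # one set of projections per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        heads.append(A @ V)                        # each head attends in its own subspace
    return np.concatenate(heads, axis=-1) @ W_o    # concatenate heads, project back to d_model

rng = np.random.default_rng(0)
n, d_model, h, d_k = 9, 512, 8, 64                 # illustrative sizes (d_k = d_model / h)
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # (9, 512)
```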


Representing the input order (positional encoding)

• The Transformer is permutation invariant: on its own it has no notion of token order.

• The Transformer therefore adds a position vector to each input embedding.

• These vectors follow a specific pattern that the model learns.

• The learned pattern helps the model
• determine the position of each word, or
• the distance between different words.


Representing the input order (positional encoding)

Position Encoding
• Positional encodings can also be learned, just like any other model parameters


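For reference, here is a small sketch of the fixed sinusoidal positional encoding from the original Transformer paper (sine on even dimensions, cosine on odd ones); the sequence length and model dimension below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]                     # (n_positions, 1)
    dims = np.arange(d_model)[None, :]                              # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                           # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                           # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=512)
print(pe.shape)   # (50, 512); added element-wise to the input embeddings
```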

Add and Normalize


Layer Normalization (Ba, Kiros & Hinton, 2016)

Layer normalization normalizes the inputs across the features, i.e., each token's feature vector is normalized independently.
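A minimal sketch of layer normalization with the usual learnable gain and bias; each token's feature vector is normalized on its own, so the operation does not depend on the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance,
    then apply a learnable gain (gamma) and bias (beta).
    x: (n_tokens, d_model); gamma, beta: (d_model,)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(9, 512))
gamma, beta = np.ones(512), np.zeros(512)
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1)[:3], out.std(axis=-1)[:3])   # ~0 mean, ~1 std per token
```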

The complete transformer

[Figure: in the complete Transformer, the encoder output provides the keys K and values V to the decoder's encoder-decoder attention, while the queries Q come from the decoder.]

BERT (Stack Encoder Blocks)
BERT (Bidirectional Encoder Representations from Transformers)

• BERT jointly encodes the right and left context of a word in a sentence to improve the learned feature representations
• BERT is trained on two pretext tasks in a self-supervised manner
• Masked Language Model (MLM)
• Mask a fixed percentage (15%) of the words in a sentence and predict these masked words
• In predicting the masked words, the model learns the bidirectional context (a toy masking sketch follows below).
• Next Sentence Prediction (NSP)
• Given a pair of sentences A and B, the model predicts a binary label, i.e., whether the pair is valid (consecutive in the original document) or not
• The pair is formed such that B is the actual next sentence (following A) 50% of the time, and a random sentence the other 50% of the time.

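To illustrate the MLM pretext task, here is a toy masking sketch. The 15% selection below is simplified: real BERT additionally replaces some selected tokens with random tokens or leaves them unchanged (an 80/10/10 split), and operates on WordPiece tokens rather than whitespace-split words.

```python
import numpy as np

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy masked-language-model corruption: select ~15% of positions,
    replace them with [MASK], and return (corrupted tokens, targets).
    The model is then trained to predict the original token at each target position."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    targets = {}                                   # position -> original token to predict
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets[i] = tokens[i]
            tokens[i] = mask_token
    return tokens, targets

corrupted, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(corrupted, targets)
```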


BERT
GPT (Stack Decoder Blocks)
BERT and GPT

https://www.youtube.com/shorts/BEt_BACGw6g
Vision Transformers
Mubarak Shah
[email protected]



Vision Transformer (VIT)
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M.
Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words:
Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929,
2020. (ICLR 2021; 26,938 citations!)



Vision Transformer (VIT)
• Naive application of self-attention to images (attending over all pixel pairs) requires very high computation
• Divide an image into 16x16 patches (tokens)
• Transformers need to be trained on large datasets
• VIT attains excellent results when pre-trained on JFT-300M:
• 88.55% on ImageNet,
• 90.72% on ImageNet-ReaL,
• 94.55% on CIFAR-100,
• 77.63% on the VTAB suite of 19 tasks



Vision Transformer (VIT)



Divide image into 16x16 patches

Slide credit: Piotr Mazurek


Generate embedding for each patch

Slide credit: Piotr Mazurek
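A minimal sketch of the two steps above: split the image into 16x16 patches, flatten each patch, and project it with a linear layer to get one embedding per patch. The projection matrix is a random stand-in for the learned patch embedding; real ViT then prepends a CLS token and adds positional embeddings (next slides).

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=768, rng=None):
    """image: (H, W, C). Returns (n_patches, d_model) patch embeddings."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    patches = []
    for y in range(0, H, patch_size):
        for x in range(0, W, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size].reshape(-1))
    patches = np.stack(patches)                              # (n_patches, patch_size*patch_size*C)
    W_embed = rng.normal(size=(patches.shape[1], d_model))   # learned in practice; random here
    return patches @ W_embed

img = np.random.default_rng(1).random((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)    # (196, 768): 14x14 patches of size 16x16
```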


CLS (Classification) Token

CLS

Slide credit: Piotr Mazurek


Complete VIT

Slide credit: Piotr Mazurek


VIT Model Variants



Results



Attention



Vision Transformer (VIT)



SWIN
• Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows", ICCV 2021 (Marr Prize). 13,035 citations



SWIN

• Adapting the Transformer from language to vision is challenging

• Differences between the language and vision domains:
• Large variations in the scale of visual entities
• Much higher resolution of pixels in images compared to words in text
• To address these differences, SWIN proposes a hierarchical Transformer whose representation is computed with shifted windows
• Shifted windowing limits self-attention to local windows while still allowing cross-window connections (a small windowing sketch follows below)

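A toy sketch of the windowing idea: partition the feature map into non-overlapping windows (self-attention is then computed inside each window), and cyclically shift the map before the next block so that the new windows straddle the previous window boundaries. Shapes and window size below are illustrative, and real Swin additionally masks attention across the wrapped-around regions.

```python
import numpy as np

def window_partition(x, window_size):
    """x: (H, W, C) feature map -> (num_windows, window_size*window_size, C) token groups.
    Self-attention is computed independently inside each window."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, shift):
    """Roll the feature map so the next block's windows straddle the old window
    boundaries, giving cross-window connections."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

feat = np.random.default_rng(0).normal(size=(56, 56, 96))    # stage-1 feature map, C=96
windows = window_partition(feat, window_size=7)
shifted_windows = window_partition(cyclic_shift(feat, shift=3), window_size=7)
print(windows.shape, shifted_windows.shape)                  # (64, 49, 96) each
```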


Hierarchical Feature Maps and Local Attention



SWIN
• 87.3% top-1 accuracy on ImageNet-1K
• Dense prediction tasks such as
• Object detection (58.7 box AP and 51.1 mask AP on COCO)
• Semantic segmentation (53.5 mIoU on ADE20K)
• Performance surpasses the previous state-of-the-art by
• +2.7 box AP and +2.6 mask AP on COCO, and
• +3.2 mIoU on ADE20K



SWIN Architecture

C = 96, 128, or 192, depending on the model variant



Self-Attention within each window and shifted windows



Transformer Blocks



Different Configurations



Results



Results



Results



Summary
• VIT is the first Vision Transformer, but it is trained on a huge dataset of 300M images (JFT-300M)

• SWIN employs window attention

• SWIN performs well on other tasks: object detection, semantic segmentation

