Computer Vision 12: Vision Language Models (1)
2024 - 2025
VISION LANGUAGE MODELS
UTRECHT UNIVERSITY
RONALD POPPE
VISION LANGUAGE MODELS
VLMs are multimodal LLMs that incorporate images/video and text
• They can deal with visual input
• Provide textual answers to textual queries about them
• DeepSeek v3 is text-only: it merely extracts text from images (so not a VLM)
VISION LANGUAGE MODELS
Common tasks are image captioning and visual question answering (VQA)
• Image captioning: providing a description of the visual input
• VQA: answer a textual question, using the image as reference
VLMS ARE TRANSFORMERS
Recall that transformers are sequence-to-sequence models
• Can deal with sequential input such as text and images
• Can deal with sequential output such as text and images
VLMS ARE TRANSFORMERS
Previous lecture:
• We looked at vision encoder
• We ignored sequential output
This lecture:
• We combine text and images
• We look at sequential text output
Next lecture:
• We look at image generation
VLMS ARE TRANSFORMERS
Two topics:
• Aligning text and visual information
• Producing sequential text output
Components of both modules are similar
• Usage is somewhat different
IMAGE CAPTIONING
Let’s first focus on image captioning
• Image used as input
• Text (caption) as output
ALIGNING TEXT & IMAGE
ALIGN TEXT AND IMAGE
Goal of encoder is to encode input in a meaningful way
• Ensure that “similar” inputs have “similar” encodings
• Implicitly deals with invariances
Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
CLIP is trained on 400M image-caption pairs
• Training task was to match caption to image
Result is:
• Trained text encoder
• Trained visual encoder
Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
Contrastive learning doesn’t use fixed labels
• Distinction between “same” or “other”
• Goal is to bring encodings of “same” items closer, and “other” further apart
• Can be used in supervised or self-supervised settings
Many variants:
• Triplet loss
• InfoNCE
• Etc.
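A minimal InfoNCE sketch in PyTorch, assuming L2-normalised embeddings for N matching pairs: each pair is the positive, every other item in the batch acts as a negative (the function name and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # z_a, z_b: (N, D) embeddings of the two views/modalities, already L2-normalised
    logits = z_a @ z_b.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)                 # pull positives together, push others apart
```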
CLIP
CLIP training repeatedly uses batches of N <image, caption> pairs
When we have an image and a corresponding caption
• Both encoders should map to the same embedding
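A sketch of one such training step under these assumptions; image_encoder and text_encoder are placeholder callables (not the actual CLIP API) mapping their inputs to the same D-dimensional space, and the loss is applied symmetrically over the N x N similarity matrix of the batch:

```python
import torch
import torch.nn.functional as F

def clip_step(images, captions, image_encoder, text_encoder, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)    # (N, D) image embeddings
    txt = F.normalize(text_encoder(captions), dim=-1)   # (N, D) caption embeddings
    logits = img @ txt.t() / temperature                # (N, N): pair (i, i) is the match
    targets = torch.arange(img.size(0), device=img.device)
    loss_i = F.cross_entropy(logits, targets)           # match image i to caption i
    loss_t = F.cross_entropy(logits.t(), targets)       # match caption i to image i
    return (loss_i + loss_t) / 2
```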
ZERO-SHOT CLASSIFICATION
We can use this model for zero-shot classification
• Use a set of text options
• Encode image and text options
• Select option with best similarity
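A sketch of this zero-shot procedure, again with placeholder encoder names; each class name is wrapped in a prompt and the label whose text embedding is most similar to the image embedding is selected:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    prompts = [f"a photo of a {name}" for name in class_names]     # text options
    img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
    txt = F.normalize(text_encoder(prompts), dim=-1)               # (C, D)
    sims = (img @ txt.t()).squeeze(0)                              # (C,) similarities
    return class_names[sims.argmax().item()]                       # best-matching option
```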
PRODUCING OUTPUT
PRODUCING OUTPUTS
In the domain of NLP, a word is a token
• A sentence is a sequence of tokens
• Encoder produces embedding of entire sentence
MLP block is also similar to that in decoder
• Linearly project the embedding to the input size
DECODER BLOCK
Encoder-decoder attention module uses the encoder output (as keys and values)
In NLP:
• First input is the special <start> token
• First output is the first word
• Second input is the first output word
• Etc.
• Final output is <eos> (end of sentence) token
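A sketch of this autoregressive loop (greedy variant); decoder is a placeholder that maps the encoder output and the tokens generated so far to next-token logits, and start_id / eos_id are the assumed special-token ids:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, encoder_output, start_id, eos_id, max_len=50):
    tokens = [start_id]                        # first input is the <start> token
    for _ in range(max_len):
        inp = torch.tensor([tokens])           # (1, T): everything produced so far
        logits = decoder(encoder_output, inp)  # (1, T, vocab_size)
        next_id = logits[0, -1].argmax().item()
        tokens.append(next_id)                 # each output becomes the next input
        if next_id == eos_id:                  # stop at <eos>
            break
    return tokens[1:]                          # generated tokens, without <start>
```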
EXAMPLE
NLP translation example, first time step
EXAMPLE
NLP translation example, next time steps
MASKING
When tokens have been output, they are used as inputs too
• Increasing token length
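A sketch of the causal (look-ahead) mask that enforces this during training; positions above the diagonal are marked True, i.e. not allowed to be attended (the convention used by nn.MultiheadAttention / nn.Transformer boolean masks):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # (seq_len, seq_len) boolean mask: True above the diagonal = blocked
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))  # token t may only attend to tokens <= t
```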
Nukrai et al., "CapDec: Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP 2022
QUESTIONS?
ASSIGNMENT 4 WALK-THROUGH
ASSIGNMENT
Object detection of cats and dogs
• Deal with images of varying sizes
• Build object detection network
• Train object detector
• Test object detector
• Report performance
Per cell:
• Bounding box positions (4)
• Objectness score (1)
• Class scores (2 – cat, dog)
Total: 7 * 7 * 7 = 343
• Encoded as vector
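A sketch of this encoding; the grid size and number of values per cell follow the slide, while the ordering of the seven values within a cell is an assumption to be kept consistent in your own code:

```python
import torch

S = 7                               # 7 x 7 grid
VALUES_PER_CELL = 4 + 1 + 2         # box (4) + objectness (1) + class scores (2)

output = torch.zeros(S, S, VALUES_PER_CELL)
print(output.numel())               # 7 * 7 * 7 = 343
flat = output.reshape(-1)           # encoded as a single 343-dimensional vector
```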
ASSIGNMENT
When training the network:
• Use the five losses from the original YOLO v1 paper
• Choose your hyperparameters
• Implement early stopping to prevent overfitting (see the sketch below)
Report the mAP of this network. Report any differences in your architecture
and training procedure
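A minimal early-stopping sketch for the training procedure above; train_one_epoch and evaluate are assumed callables from your own code, and patience and the checkpoint path are hyperparameters you choose:

```python
import torch

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5, ckpt_path="best.pt"):
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                                # one pass over the training set
        val_loss = evaluate()                            # loss (or -mAP) on the validation set
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), ckpt_path)    # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                   # no improvement for `patience` epochs
                break
    return best_val
```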
CHOICE TASKS
CHOICE 3: Pretrain the backbone (max 15) instead of training the detection
network from scratch
• Pretrain the backbone on the cat-dog-classification dataset (Kaggle)
• Adapt the network to focus on recognition rather than detection
• Initialize your original detection network using the pretrained checkpoint (see the sketch after this slide)
• Finetune your network for detection using the dog-cat-detection dataset
Report:
• mAP per fold
• Discuss differences between folds
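A sketch of initialising the detector from the pretrained classification checkpoint; the file name is an assumption, the checkpoint is assumed to be a plain state dict, and only parameters whose names and shapes match the detector are copied:

```python
import torch

def load_pretrained_backbone(detector, ckpt_path="backbone_cls.pt"):
    pretrained = torch.load(ckpt_path, map_location="cpu")   # state dict of the classifier
    own_state = detector.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own_state and v.shape == own_state[k].shape}
    own_state.update(matched)                                # copy backbone weights only
    detector.load_state_dict(own_state)                      # detection head stays randomly initialised
    return detector
```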
CHOICE TASKS
CHOICE 5: Evaluate your network on a video of a cat and a dog (max 10)
• Search carefully if there are any videos online with cats and dogs
• Pick one
• Spending hours watching videos does not count as “education time”
• Process every k-th frame for a total of at least 50 frames (see the sketch after this list)
• Make a video with detection bounding boxes per class
• No need to manually annotate the ground truth
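A sketch of processing every k-th frame with OpenCV; detector is a placeholder for your trained network (returning boxes as (x1, y1, x2, y2, label)), and k, the frame budget, the output frame rate and the output path are assumptions:

```python
import cv2

def process_video(video_path, detector, k=10, max_frames=50, out_path="out.mp4"):
    cap = cv2.VideoCapture(video_path)
    writer, processed, idx = None, 0, 0
    while processed < max_frames:
        ok, frame = cap.read()
        if not ok:                                   # end of video
            break
        if idx % k == 0:                             # keep only every k-th frame
            for (x1, y1, x2, y2, label) in detector(frame):
                cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
                cv2.putText(frame, label, (int(x1), int(y1) - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            if writer is None:
                h, w = frame.shape[:2]
                writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), 10, (w, h))
            writer.write(frame)
            processed += 1
        idx += 1
    cap.release()
    if writer is not None:
        writer.release()
```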
Assignment 4
• Deadline Sunday March 30, 23:00