
COMPUTER VISION

2024 - 2025
VISION LANGUAGE MODELS
UTRECHT UNIVERSITY
RONALD POPPE
VISION LANGUAGE MODELS
VLMs are multimodal LLMs that incorporate images/video and text
• They can deal with visual input
• Provide textual answers to textual queries about them

Increasingly, they can also provide visual responses


• Topic of next lecture
VISION LANGUAGE MODELS
Popular VLMs:
• OpenAI GPT-4.5 Preview: text and image
• Google Gemini 2.0 Pro: text, audio and image
• Anthropic Claude 3.7 Sonnet: text and image
• Alibaba Qwen 2.5 Max: text, images, audio, video

• DeepSeek v3: text, only extracts text from images (so not a VLM)
VISION LANGUAGE MODELS
Common tasks are image captioning and visual question answering (VQA)
• Image captioning: providing a description of the visual input
• VQA: providing an answer to a question, with the image as reference
VLMS ARE TRANSFORMERS
Recall that transformers are sequence-to-sequence models
• Can deal with sequential input such as text and images
• Can deal with sequential output such as text and images
VLMS ARE TRANSFORMERS
Previous lecture:
• We looked at vision encoder
• We ignored sequential output

This lecture:
• We combine text and images
• We look at sequential text output

Next lecture:
• We look at image generation
VLMS ARE TRANSFORMERS
Two topics:
• Aligning text and visual information
• Producing sequential text output

Requires that we bring back the transformer decoder


TRANSFORMERS
Transformers have an encoder and a decoder
• Information flows from the encoder to the decoder

Components of both modules are similar
• Their usage is somewhat different

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017



IMAGE CAPTIONING
Let’s first focus on image captioning
• Image used as input
• Text (caption) as output

Encoder output: embedded token sequence

• Embedding should be shared with decoder


• Requires that we align text and image

ALIGNING TEXT & IMAGE
ALIGN TEXT AND IMAGE
Goal of encoder is to encode input in a meaningful way
• Ensure that “similar” inputs have “similar” encodings
• Implicitly deals with invariances

What is similar depends on what you focus on


• Semantic vs. appearance
ALIGN TEXT AND IMAGE
Text similarities are vastly different from image similarities
• Still, we have to align their embeddings

Requires that we learn how image variations are reflected in text


• And vice versa, how text variations are reflected in images
• Can only be learned in a supervised manner
CLIP
CLIP: contrastive language-image pretraining
• Jointly trains text and image encoder
• Both are transformer encoders
• Uses contrastive learning

Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
CLIP is trained on 400M image-caption pairs
• Training task was to match caption to image

Result is:
• Trained text encoder
• Trained visual encoder

Both share encoding space

Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
Contrastive learning doesn’t use fixed labels
• Distinction between “same” or “other”
• Goal is to bring encodings of “same” items closer, and “other” further apart
• Can be used in supervised or self-supervised settings

Many variants:
• Triplet loss
• InfoNCE
• Etc.
CLIP
CLIP training repeatedly uses batches of N <image, caption> pairs

Text is processed by the text encoder
• Produces text embedding Ti

Images are processed by the image encoder
• Produces image embedding Ii
CLIP
Goal is to match embeddings between modalities
• Dot-product similarity
• Loss for off-diagonal “matches”
• Loss backpropagated to encoders
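A minimal sketch of this contrastive objective (an InfoNCE-style loss over one batch of N pairs); the temperature value and the helper name are assumptions, not the exact CLIP configuration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) outputs of the image and text encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) dot-product similarities; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions pulls matching pairs together
    # and pushes the off-diagonal "matches" apart
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```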
CLIP
After training, text and image modalities are aligned!

When we have an image and a corresponding text
• Both encoders should map to the same embedding
ZERO-SHOT CLASSIFICATION
We can use this model for zero-shot classification
• Use a set of text options
• Encode image and text options
• Select option with best similarity
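A sketch of such zero-shot classification, assuming the openai/clip package is installed; the checkpoint name, prompt template, and class list are only examples:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["cat", "dog", "car"]

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
texts = clip.tokenize([f"a photo of a {c}" for c in classes])

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # Pick the text option with the highest similarity to the image
    probs = (image_feat @ text_feat.T).softmax(dim=-1)
    print(classes[probs.argmax().item()])
```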
PRODUCING OUTPUT
PRODUCING OUTPUTS
In the domain of NLP, a word is a token
• A sentence is a sequence of tokens
• Encoder produces embedding of entire sentence

Decoder also produces a sentence


• Translation
• Answer
• Etc.
DECODER BLOCK
Decoder consists of L blocks

Each block has three modules:


• Self-attention module
• Encoder-decoder attention module
• MLP module

Input processing is similar to encoder


• Embedding and positional encoding

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017


DECODER BLOCK
Input processing is similar to the encoder
• Embedding and positional encoding

Output of the decoder is produced by a classifier head
• Linear projection (FC layer)
• Softmax output

In the context of NLP


• Output is a class label that corresponds to a word
• Typically many classes…
DECODER BLOCK
Self-attention module is similar to that in the encoder
• Apply MSA on the input, with a residual connection

MLP module is also similar to that in the encoder
• Linearly project the embedding to the input size
DECODER BLOCK
Encoder-decoder attention module uses the encoder output

Similar to self-attention but:
• K and V come from the encoder
• Q comes from the decoder embedding
• Attends to similarities in both

Residual connection used to bypass attention
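A minimal sketch of such an encoder-decoder (cross-)attention module, assuming a single attention head and omitting dropout and layer normalization for brevity:

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # queries from the decoder
        self.k_proj = nn.Linear(d_model, d_model)  # keys from the encoder
        self.v_proj = nn.Linear(d_model, d_model)  # values from the encoder

    def forward(self, decoder_emb, encoder_out):
        Q = self.q_proj(decoder_emb)                # (N, T_dec, d)
        K = self.k_proj(encoder_out)                # (N, T_enc, d)
        V = self.v_proj(encoder_out)                # (N, T_enc, d)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        attn = scores.softmax(dim=-1)               # attends to encoder positions
        return decoder_emb + attn @ V               # residual connection bypasses attention
```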


DECODER
The decoder processes and outputs one token at a time

In NLP:
• First input is the special <start> token
• First output is the first word
• Second input is the first output word
• Etc.
• Final output is <eos> (end of sentence) token
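A sketch of this token-by-token loop with greedy selection; model.decode, start_id, eos_id, and max_len are hypothetical names rather than an existing API:

```python
import torch

@torch.no_grad()
def greedy_decode(model, encoder_output, start_id, eos_id, max_len=50):
    tokens = [start_id]                              # first input: the <start> token
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)      # (1, T) tokens generated so far
        logits = model.decode(inp, encoder_output)   # (1, T, vocab), hypothetical call
        next_id = logits[0, -1].argmax().item()      # most likely next token
        tokens.append(next_id)                       # feed it back in as input
        if next_id == eos_id:                        # stop at <eos>
            break
    return tokens
```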
EXAMPLE
NLP translation example, first time step
EXAMPLE
NLP translation example, next time steps
MASKING
When tokens have been output, they are used as inputs too
• Increasing token length

But tokens are not supposed to look into the future


• Solved by masking out the upper-triangular half of the self-attention matrix (the future positions)
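One common way to build such a look-ahead mask (a sketch; it assumes the mask is added to the attention scores before the softmax):

```python
import torch

def causal_mask(T):
    # (T, T) additive mask: 0 where attention is allowed, -inf above the diagonal
    return torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# Usage inside self-attention, with scores of shape (..., T, T):
#   attn = (scores + causal_mask(T)).softmax(dim=-1)
```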
TRAINING
TRAINING
Different options to train the image (and text) encoder:
• Single-encoder model
• Dual-encoder model
• Encoder-decoder model
TRAINING
Training a text decoder after learning image/text encoders
• Sufficient for image captioning

Nukrai et al., "CapDec: Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP 2022
QUESTIONS?
ASSIGNMENT 4 WALK-THROUGH
ASSIGNMENT
Object detection of cats and dogs
• Deal with images of varying sizes
• Build object detection network
• Train object detector
• Test object detector
• Report performance

Counts for 15% of total grade


• 70 points in regular tasks
• 30 points in choice tasks
ASSIGNMENT
Kaggle dog and cat detection dataset
• https://www.kaggle.com/datasets/andrewmvd/dog-and-cat-detection
• 3686 images
• 2 classes (cat=0, dog=1)
ASSIGNMENT
A notebook is available to help you get started:
• https://colab.research.google.com/drive/1oyRBec77YBynOWLmEPIXEzyO3R4BrIUF
• Provides data loading and image transformation functions
• Visualizes samples of the dataset
ASSIGNMENT
Images have different sizes
• Need to resize
• Need to transform bounding boxes accordingly
• Check https://pytorch.org/vision/0.9/transforms.html

We resize the inputs to 112 x 112 x 3


• Choose how to do the resizing
• Adapt the bounding boxes
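A sketch of one possible way to resize an image and rescale its bounding boxes together, assuming boxes are given as (xmin, ymin, xmax, ymax) in pixels; the torchvision transforms linked above are an alternative:

```python
import torchvision.transforms.functional as TF

def resize_with_boxes(image, boxes, new_size=(112, 112)):
    # image: PIL image; boxes: list of [xmin, ymin, xmax, ymax] in pixels
    w, h = image.size
    new_h, new_w = new_size
    image = TF.resize(image, list(new_size))   # plain resize (changes the aspect ratio)
    sx, sy = new_w / w, new_h / h
    boxes = [[xmin * sx, ymin * sy, xmax * sx, ymax * sy]
             for xmin, ymin, xmax, ymax in boxes]
    return image, boxes
```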
ASSIGNMENT
Object detection typically uses a powerful network
• E.g., YOLOv1 has 45M parameters
• We will train a much smaller model with ~1M parameters

Input size 112 x 112 x 3, total 12 layers (with input layer)


• Four times a convolution 3x3 layer followed by a 2x2 pooling layer
• Final convolution 3x3 layer
• Fully connected layer with 512 neurons
• Fully connected layer with 343 neurons
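A sketch of a network with this layout; the channel widths are an assumption chosen to end up near ~1M parameters and are not prescribed by the assignment:

```python
import torch.nn as nn

class SmallDetector(nn.Module):
    # Four times (conv 3x3 + pool 2x2), a final conv 3x3, then FC 512 and FC 343
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 112 -> 56
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 28 -> 14
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),                   # 7 x 7 x 32
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 32, 512), nn.ReLU(),
            nn.Linear(512, 343),   # 7 x 7 x (4 + 1 + 2) outputs
        )

    def forward(self, x):
        return self.head(self.features(x))
```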
ASSIGNMENT
Output is a fully connected layer similar to YOLO v1
ASSIGNMENT
Specifically, we use a 7 x 7 spatial grid
• Max B=1 object per cell

Per cell:
• Bounding box positions (4)
• Objectness score (1)
• Class scores (2 – cat, dog)

Total: 7 * 7 * 7 = 343
• Encoded as vector
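To make the encoding concrete, a sketch that reshapes the 343-dimensional output into the 7 x 7 grid and splits each cell vector; the ordering of box, objectness and class entries within a cell is an assumption that you must keep consistent in your own loss and decoding code:

```python
def decode_output(pred):
    # pred: (batch, 343) network output
    grid = pred.view(-1, 7, 7, 7)      # (batch, 7, 7, 7): last dim is the per-cell vector
    boxes = grid[..., 0:4]             # bounding box positions (assumed order)
    objectness = grid[..., 4:5]        # objectness score
    class_scores = grid[..., 5:7]      # cat / dog scores
    return boxes, objectness, class_scores
```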
ASSIGNMENT
When training the network:
• Use the five losses from the original YOLO v1 paper
• Choose your hyperparameters
• Implement early stopping to prevent overfitting

Training will take longer than for Assignment 3


• Make informed changes to your network
• To see if your network trains at all, you don’t have to check many epochs
ASSIGNMENT
When testing your network:
• Report mAP by varying the objectness score threshold from 0 to 1
• Report a confusion matrix. Make sure you can identify
misclassifications (and not only missed/false detections)
• No need to implement non-maximum suppression
ASSIGNMENT
Reporting:
1. Description of the architecture and parameters
2. Description of the way you processed images and bounding boxes
3. Overview of your training hyperparameters
4. Graphs for loss components and accuracy, over the training epochs
5. Training, validation and test mAP of your model
6. Confusion matrix for the best objectness threshold
7. Link to your model's weights
8. A list of the choice tasks you implemented (with additional reporting)
CHOICE TASKS
CHOICE 1: Improve the architecture to provide better results (max 10)
• Only eligible with meaningful and impactful changes
• No simple changes such as changing the number or size of kernels, or
adding dropout

Report the mAP performance of the improved model


CHOICE TASKS
CHOICE 2: Replace the backbone with a pretrained ResNet (max 10)
• Convolution layers before the first FC layer can be treated as the backbone.
Replace all layers before the first FC layer with ResNet-18 (without FC layers)
• Initialize ResNet-18 with the checkpoint pretrained on ImageNet (check out models.resnet18)
• Finetune the whole model on the cat-dog detection dataset. You need to
change the input size from 112 to 224 to adapt to ResNet-18.

Report the mAP of this network. Report any differences in your architecture
and training procedure
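A sketch of one way to wire this up; the head sizes mirror the original network, the weights argument follows recent torchvision versions, and the exact wiring is a starting point rather than a prescribed solution:

```python
import torch.nn as nn
import torchvision.models as models

def build_resnet_detector():
    resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Drop the average pool and FC layer; with 224 x 224 input this
    # backbone produces a 512 x 7 x 7 feature map
    backbone = nn.Sequential(*list(resnet.children())[:-2])
    head = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, 343),
    )
    return nn.Sequential(backbone, head)
```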
CHOICE TASKS
CHOICE 3: Pretrain the backbone (max 15) instead of training the detection
network from scratch
• Pretrain the backbone on the cat-dog-classification dataset (Kaggle)
• Adapt the network to focus on recognition rather than detection
• Initialize your original detection network using the pretrained checkpoint
• Finetune your network for detection using the dog-cat-detection dataset

Report the mAP and loss of this network after finetuning


CHOICE TASKS
CHOICE 4: Instead of having a fixed validation set, implement k-fold
cross-validation (max 5)
• Use k=5, but realize that this increases your training time 5-fold

Report:
• mAP per fold
• Discuss differences between folds
CHOICE TASKS
CHOICE 5: Evaluate your network on a video of a cat and a dog (max 10)
• Search carefully if there are any videos online with cats and dogs
• Pick one
• Spending hours watching videos does not count as “education time”
• Process every kth frame for a total of >=50 frames
• Make a video with detection bounding boxes per class
• No need to manually annotate the ground truth

No need to report any metrics, just send the video
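A sketch of processing every k-th frame with OpenCV and writing an annotated video; draw_detections is a placeholder for your own inference and box-drawing code:

```python
import cv2

def annotate_video(in_path, out_path, k=10, max_frames=50):
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    writer, processed, idx = None, 0, 0
    while processed < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % k == 0:
            frame = draw_detections(frame)   # placeholder: run model, draw boxes per class
            if writer is None:
                h, w = frame.shape[:2]
                writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
            writer.write(frame)
            processed += 1
        idx += 1
    cap.release()
    if writer is not None:
        writer.release()
```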


CHOICE TASKS
CHOICE 6: Perform at least 3 data augmentation techniques (max 10)
• Choose your techniques wisely
• Max 5 points if you only use techniques that are not geometric
• For rotation, flipping, mirroring, scaling, cropping: adapt the bounding boxes accordingly (check https://pytorch.org/vision/0.9/transforms.html)

Report mAP and explain how the performance is affected
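As an example of a geometric augmentation that needs box adaptation, a sketch of a random horizontal flip, again assuming (xmin, ymin, xmax, ymax) pixel coordinates:

```python
import random
import torchvision.transforms.functional as TF

def random_hflip(image, boxes, p=0.5):
    # image: PIL image; boxes: list of [xmin, ymin, xmax, ymax] in pixels
    if random.random() < p:
        w, _ = image.size
        image = TF.hflip(image)
        # Mirror the x coordinates; xmin and xmax swap roles after flipping
        boxes = [[w - xmax, ymin, w - xmin, ymax]
                 for xmin, ymin, xmax, ymax in boxes]
    return image, boxes
```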


CHOICE TASKS
CHOICE 7: Show examples of misdetections (max 10)
• Provide and discuss at least 3 samples of each:
• Cat → dog misclassifications
• Dog → cat misclassifications
• False positive detections (None → cat/dog)
• False negatives (Cat/dog → None)
GOOD LUCK!
FINALLY…
ASSIGNMENT
Assignment 3
• Deadline Sunday March 16, 23:00

Assignment 4
• Deadline Sunday March 30, 23:00

Assignment support session Tuesday at 15:15


• Use the assignments channel!
NEXT LECTURE
Next lecture:
• Tuesday March 18, 13:15-15:00, KBG-Atlas (!)
• Image and video generation
MATERIALS
Background materials:
• Contrastive Language-Image Pretraining (CLIP)
• Gan et al., “Vision-Language Pre-training”, 2022
