Computer Vision 12: Vision Language Models (1)
2024 - 2025
VISION LANGUAGE MODELS
UTRECHT UNIVERSITY
RONALD POPPE
VISION LANGUAGE MODELS
VLMs are multimodal LLMs that incorporate images/video and text
• They can deal with visual input
• Provide textual answers to textual queries about them
• DeepSeek v3 is text-only: it merely extracts text from images (so not a VLM)
VISION LANGUAGE MODELS
Common tasks are image captioning and visual question answering (VQA)
• Image captioning: providing a description of the visual input
• VQA: answer a textual question, using the image as reference
VLMS ARE TRANSFORMERS
Recall that transformers are sequence-to-sequence models
• Can deal with sequential input such as text and images
• Can deal with sequential output such as text and images
VLMS ARE TRANSFORMERS
Previous lecture:
• We looked at vision encoder
• We ignored sequential output
This lecture:
• We combine text and images
• We look at sequential text output
Next lecture:
• We look at image generation
VLMS ARE TRANSFORMERS
Two topics:
• Aligning text and visual information
• Producing sequential text output
Components of both modules are similar
• Usage is somewhat different
IMAGE CAPTIONING
Let’s first focus on image captioning
• Image used as input
• Text (caption) as output
ALIGNING TEXT & IMAGE
ALIGN TEXT AND IMAGE
Goal of encoder is to encode input in a meaningful way
• Ensure that “similar” inputs have “similar” encodings
• Implicitly deals with invariances
Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
CLIP is trained on 400M image-caption pairs
• Training task was to match caption to image
Result is:
• Trained text encoder
• Trained visual encoder
Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021
CLIP
Contrastive learning doesn’t use fixed labels
• Distinction between “same” or “other”
• Goal is to bring encodings of “same” items closer, and “other” further apart
• Can be used in supervised or self-supervised settings
Many variants:
• Triplet loss
• InfoNCE
• Etc.
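A minimal InfoNCE sketch in PyTorch, assuming L2-normalised embeddings for N matching pairs: each pair is the positive, every other item in the batch acts as a negative (the function name and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # z_a, z_b: (N, D) embeddings of the two views/modalities, already L2-normalised
    logits = z_a @ z_b.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)                 # pull positives together, push others apart
```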
CLIP
CLIP training repeatedly uses batches of N <image, caption> pairs
When we have an image and a corresponding caption
• Both encoders should map to the same embedding
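A sketch of one such training step under these assumptions; image_encoder and text_encoder are placeholder callables (not the actual CLIP API) mapping their inputs to the same D-dimensional space, and the loss is applied symmetrically over the N x N similarity matrix of the batch:

```python
import torch
import torch.nn.functional as F

def clip_step(images, captions, image_encoder, text_encoder, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)    # (N, D) image embeddings
    txt = F.normalize(text_encoder(captions), dim=-1)   # (N, D) caption embeddings
    logits = img @ txt.t() / temperature                # (N, N): pair (i, i) is the match
    targets = torch.arange(img.size(0), device=img.device)
    loss_i = F.cross_entropy(logits, targets)           # match image i to caption i
    loss_t = F.cross_entropy(logits.t(), targets)       # match caption i to image i
    return (loss_i + loss_t) / 2
```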
ZERO-SHOT CLASSIFICATION
We can use this model for zero-shot classification
• Use a set of text options
• Encode image and text options
• Select option with best similarity
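A sketch of this zero-shot procedure, again with placeholder encoder names; each class name is wrapped in a prompt and the label whose text embedding is most similar to the image embedding is selected:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    prompts = [f"a photo of a {name}" for name in class_names]     # text options
    img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
    txt = F.normalize(text_encoder(prompts), dim=-1)               # (C, D)
    sims = (img @ txt.t()).squeeze(0)                              # (C,) similarities
    return class_names[sims.argmax().item()]                       # best-matching option
```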
PRODUCING OUTPUT
PRODUCING OUTPUTS
In the domain of NLP, a word is a token
• A sentence is a sequence of tokens
• Encoder produces embedding of entire sentence
MLP block is also similar to that in decoder
• Linearly project the embedding to the input size
DECODER BLOCK
Encoder-decoder attention module uses the encoder output (as keys and values)
In NLP:
• First input is the special <start> token
• First output is the first word
• Second input is the first output word
• Etc.
• Final output is <eos> (end of sentence) token
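A sketch of this autoregressive loop (greedy variant); decoder is a placeholder that maps the encoder output and the tokens generated so far to next-token logits, and start_id / eos_id are the assumed special-token ids:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, encoder_output, start_id, eos_id, max_len=50):
    tokens = [start_id]                        # first input is the <start> token
    for _ in range(max_len):
        inp = torch.tensor([tokens])           # (1, T): everything produced so far
        logits = decoder(encoder_output, inp)  # (1, T, vocab_size)
        next_id = logits[0, -1].argmax().item()
        tokens.append(next_id)                 # each output becomes the next input
        if next_id == eos_id:                  # stop at <eos>
            break
    return tokens[1:]                          # generated tokens, without <start>
```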
EXAMPLE
NLP translation example, first time step
EXAMPLE
NLP translation example, next time steps
MASKING
When tokens have been output, they are used as inputs too
• Increasing token length
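A sketch of the causal (look-ahead) mask that enforces this during training; positions above the diagonal are marked True, i.e. not allowed to be attended (the convention used by nn.MultiheadAttention / nn.Transformer boolean masks):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # (seq_len, seq_len) boolean mask: True above the diagonal = blocked
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))  # token t may only attend to tokens <= t
```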
Nukrai et al., "CapDec: Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP 2022
QUESTIONS?
ASSIGNMENT 4 WALK-THROUGH
ASSIGNMENT
Object detection of cats and dogs
• Deal with images of varying sizes
• Build object detection network
• Train object detector
• Test object detector
• Report performance
Per cell:
• Bounding box positions (4)
• Objectness score (1)
• Class scores (2 – cat, dog)
Total: 7 * 7 * 7 = 343
• Encoded as vector
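A sketch of this encoding; the grid size and number of values per cell follow the slide, while the ordering of the seven values within a cell is an assumption to be kept consistent in your own code:

```python
import torch

S = 7                               # 7 x 7 grid
VALUES_PER_CELL = 4 + 1 + 2         # box (4) + objectness (1) + class scores (2)

output = torch.zeros(S, S, VALUES_PER_CELL)
print(output.numel())               # 7 * 7 * 7 = 343
flat = output.reshape(-1)           # encoded as a single 343-dimensional vector
```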
ASSIGNMENT
When training the network:
• Use the five losses from the original YOLO v1 paper
• Choose your hyperparameters
• Implement early stopping to prevent overfitting (see the sketch below)
Report the mAP of this network. Report any differences in your architecture
and training procedure
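A minimal early-stopping sketch for the training procedure above; train_one_epoch and evaluate are assumed callables from your own code, and patience and the checkpoint path are hyperparameters you choose:

```python
import torch

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5, ckpt_path="best.pt"):
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                                # one pass over the training set
        val_loss = evaluate()                            # loss (or -mAP) on the validation set
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), ckpt_path)    # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                   # no improvement for `patience` epochs
                break
    return best_val
```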
CHOICE TASKS
CHOICE 3: Pretrain the backbone (max 15) instead of training the detection
network from scratch
• Pretrain the backbone on the cat-dog-classification dataset (Kaggle)
• Adapt the network to focus on recognition rather than detection
• Initialize your original detection network using the pretrained checkpoint (see the sketch after this slide)
• Finetune your network for detection using the dog-cat-detection dataset
Report:
• mAP per fold
• Discuss differences between folds
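A sketch of initialising the detector from the pretrained classification checkpoint; the file name is an assumption, the checkpoint is assumed to be a plain state dict, and only parameters whose names and shapes match the detector are copied:

```python
import torch

def load_pretrained_backbone(detector, ckpt_path="backbone_cls.pt"):
    pretrained = torch.load(ckpt_path, map_location="cpu")   # state dict of the classifier
    own_state = detector.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own_state and v.shape == own_state[k].shape}
    own_state.update(matched)                                # copy backbone weights only
    detector.load_state_dict(own_state)                      # detection head stays randomly initialised
    return detector
```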
CHOICE TASKS
CHOICE 5: Evaluate your network on a video of a cat and a dog (max 10)
• Search carefully if there are any videos online with cats and dogs
• Pick one
• Spending hours watching videos does not count as “education time”
• Process every k-th frame for a total of at least 50 frames (see the sketch after this list)
• Make a video with detection bounding boxes per class
• No need to manually annotate the ground truth
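A sketch of processing every k-th frame with OpenCV; detector is a placeholder for your trained network (returning boxes as (x1, y1, x2, y2, label)), and k, the frame budget, the output frame rate and the output path are assumptions:

```python
import cv2

def process_video(video_path, detector, k=10, max_frames=50, out_path="out.mp4"):
    cap = cv2.VideoCapture(video_path)
    writer, processed, idx = None, 0, 0
    while processed < max_frames:
        ok, frame = cap.read()
        if not ok:                                   # end of video
            break
        if idx % k == 0:                             # keep only every k-th frame
            for (x1, y1, x2, y2, label) in detector(frame):
                cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
                cv2.putText(frame, label, (int(x1), int(y1) - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            if writer is None:
                h, w = frame.shape[:2]
                writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), 10, (w, h))
            writer.write(frame)
            processed += 1
        idx += 1
    cap.release()
    if writer is not None:
        writer.release()
```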
Assignment 4
• Deadline Sunday March 30, 23:00