Lec6 Video Understanding

The document discusses various methods and architectures for video understanding and classification, including techniques like single-frame CNNs, late and early fusion, and 3D CNNs. It also covers advanced topics such as optical flow, two-stream networks, self-attention mechanisms, and the use of vision transformers for video analysis. Additionally, it addresses the challenge of temporal action localization in untrimmed video sequences.

Computer Vision

Topic 6. Video Understanding

Huynh Trung Hieu

Industrial University of Ho Chi Minh City


Video Classification

Recognize actions in a video, e.g. Swimming, Running, Jumping, Eating, Standing.

[Diagram: a video clip is mapped to a single action label; a second diagram shows deciding, for each step of a longer video, whether a given action occurs (yes/no).]
Problem: Videos are big!

Videos are about 30 frames per second (fps).

Size of uncompressed video (3 bytes per pixel):

- SD (640 x 480): ~1.5 GB per minute
- HD (1920 x 1080): ~10 GB per minute

Input video: T x 3 x H x W

Solution: train on short clips with low fps and low spatial resolution,
e.g. T = 16, H = W = 112 (3.2 seconds at 5 fps, 588 KB).
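
A quick back-of-the-envelope check of these numbers in plain Python (assuming 3 bytes per pixel and 30 fps, as stated above):

```python
# Uncompressed video size: frames * height * width * 3 bytes per pixel.
def video_bytes(frames, height, width, channels=3):
    return frames * height * width * channels

fps = 30
sd_minute = video_bytes(fps * 60, 480, 640)     # ~1.66e9 bytes  -> ~1.5 GB per minute
hd_minute = video_bytes(fps * 60, 1080, 1920)   # ~1.12e10 bytes -> ~10 GB per minute
clip = video_bytes(16, 112, 112)                # 602,112 bytes  -> ~588 KB

print(f"SD minute: {sd_minute / 2**30:.2f} GiB")
print(f"HD minute: {hd_minute / 2**30:.2f} GiB")
print(f"16-frame 112x112 clip: {clip / 2**10:.0f} KiB")
```
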
Training on clips

Raw video: Long, high FPS

Training: Train model to classify short clips with low FPS

Testing: Run model on different clips, average predictions


Video Classification: Single-Frame CNN

➢ Simple idea: train a normal 2D CNN to classify video frames independently!
➢ Average the predicted probabilities at test time.
➢ Often a very strong baseline for video classification.

[Diagram: each frame passes through the same 2D CNN and is classified independently (e.g. "Running" for every frame).]
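
A minimal sketch of this baseline, assuming an off-the-shelf torchvision ResNet-18 and a 16-frame 112x112 clip (both illustrative choices, not prescribed by the slides):

```python
import torch
import torchvision

# Single-frame baseline: classify every frame independently with a 2D CNN,
# then average the predicted probabilities over the clip at test time.
cnn = torchvision.models.resnet18(num_classes=5)   # 5 action classes, as in the example above

clip = torch.randn(16, 3, 112, 112)                # T x 3 x H x W, frames treated as a batch
frame_probs = cnn(clip).softmax(dim=-1)            # T x C per-frame class probabilities
video_probs = frame_probs.mean(dim=0)              # average over frames
prediction = video_probs.argmax().item()
```
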


Video Classification: Late Fusion (with FC layers)

➢ Intuition: get the high-level appearance of each frame, and combine them.
➢ Run a 2D CNN on each frame, concatenate the features, and feed them to an MLP.

[Diagram: Input T x 3 x H x W -> 2D CNN on each frame -> frame features T x D x H' x W' -> flatten to clip features (T·D·H'·W') -> MLP -> class scores C.]
Video Classification: Late Fusion (with pooling)

➢ Intuition: get the high-level appearance of each frame, and combine them.
➢ Run a 2D CNN on each frame, average-pool the features over space and time, and feed them to a linear layer.

[Diagram: Input T x 3 x H x W -> 2D CNN on each frame -> frame features T x D x H' x W' -> average pooling over space and time -> clip features D -> Linear -> class scores C.]

➢ Problem: hard to compare low-level motion between frames.
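
A hedged sketch of both late-fusion variants, assuming a ResNet-18 backbone and illustrative shapes (for a 112x112 input the frame features come out as 512 x 4 x 4):

```python
import torch
import torch.nn as nn
import torchvision

# Shared per-frame 2D CNN backbone (drop ResNet's avgpool and fc head).
backbone = nn.Sequential(*list(torchvision.models.resnet18().children())[:-2])

T, num_classes = 16, 5
clip = torch.randn(T, 3, 112, 112)                  # T x 3 x H x W
feats = backbone(clip)                              # T x D x H' x W'  (here 16 x 512 x 4 x 4)

# Late fusion with pooling: average over time and space, then a linear classifier.
pooled = feats.mean(dim=(0, 2, 3))                              # clip features: D
scores_pool = nn.Linear(feats.shape[1], num_classes)(pooled)    # class scores: C

# Late fusion with FC layers: flatten and concatenate all frame features, then an MLP.
mlp = nn.Sequential(nn.Flatten(start_dim=0),
                    nn.Linear(feats.numel(), 256), nn.ReLU(),
                    nn.Linear(256, num_classes))
scores_mlp = mlp(feats)                             # class scores: C
```
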
Video Classification: Early Fusion

➢ Intuition: compare frames in the very first conv layer; after that, the rest of the network is a standard 2D CNN.

[Diagram: Input T x 3 x H x W -> reshape to 3T x H x W -> the first 2D convolution collapses all temporal information -> output D x H x W -> standard 2D CNN -> class scores C.]

➢ Problem: one layer of temporal processing may not be enough!
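
A minimal sketch of early fusion with illustrative shapes: the frames are stacked along the channel axis, so the very first 2D convolution is the only layer that mixes temporal information.

```python
import torch
import torch.nn as nn

T, H, W, D = 16, 112, 112, 64
clip = torch.randn(T, 3, H, W)

x = clip.reshape(1, 3 * T, H, W)      # T x 3 x H x W  ->  1 x 3T x H x W
first_conv = nn.Conv2d(in_channels=3 * T, out_channels=D, kernel_size=3, padding=1)
x = first_conv(x)                     # 1 x D x H x W: all temporal information is fused here
# ...the rest of the network is a standard 2D CNN over a D x H x W feature map.
```
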
Video Classification: 3D CNN

➢ Intuition: use 3D versions of convolution and pooling to slowly fuse temporal information over the course of the network.

[Diagram: Input T x 3 x H x W -> 3D CNN -> class scores C.]
2D Conv (Early Fusion) vs 3D Conv (3D CNN)

2D Conv (Early Fusion):
- Input: (Cin · T) x H x W (the clip is a 3D grid over (T, H, W) with a Cin-dim feature at each point)
- Weight: Cout x (Cin · T) x 3 x 3 (Cout different filters)
- Output: Cout x H x W (each filter slides over x and y only, giving a 2D grid with a Cout-dim feature at each point)

[Diagram: filters sliding over a 224 x 224 spatial grid.]

Problem: no temporal shift-invariance! The network needs to learn separate filters for the same motion at different times in the clip. How can it recognize, say, green-to-gray transitions anywhere in space and time?
3D Conv (3D CNN):
- Input: Cin x T x H x W (a 3D grid over (T, H, W) with a Cin-dim feature at each point)
- Weight: Cout x Cin x 3 x 3 x 3 (Cout different filters)
- Output: Cout x T x H x W (each filter also slides over time, giving a 3D grid over (T, H, W) with a Cout-dim feature at each point)

[Diagram: filters sliding over a T x 224 x 224 spatio-temporal grid.]

Temporal shift-invariance comes for free, since each filter slides over time: the same green-to-gray transition can be recognized anywhere in space and time.
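
The same contrast at the shape level in PyTorch (sizes are illustrative and smaller than the slide's 224 x 224 to keep the example light):

```python
import torch
import torch.nn as nn

Cin, Cout, T, H, W = 3, 64, 8, 112, 112

# Early fusion: one 2D conv over the stacked frames; time is collapsed immediately.
conv2d = nn.Conv2d(Cin * T, Cout, kernel_size=3, padding=1)
out2d = conv2d(torch.randn(1, Cin * T, H, W))       # 1 x Cout x H x W

# 3D conv: the kernel also slides over time, so it is temporally shift-invariant.
conv3d = nn.Conv3d(Cin, Cout, kernel_size=3, padding=1)
out3d = conv3d(torch.randn(1, Cin, T, H, W))        # 1 x Cout x T x H x W
```
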
C3D: The VGG of 3D CNNs

➢ A 3D CNN that uses all 3x3x3 conv and 2x2x2 pooling (except Pool1, which is 1x2x2).

➢ The released model was pretrained on Sports-1M; many people used it as a video feature extractor.

➢ Problem: 3x3x3 conv is very expensive!
  ✓ AlexNet: 0.7 GFLOP
  ✓ VGG-16: 13.6 GFLOP
  ✓ C3D: 39.5 GFLOP (2.9x VGG!)

Tran et al, “Learning Spatiotemporal Features with 3D Convolutional Networks”, ICCV 2015
Early Fusion vs Late Fusion vs 3D CNN

Karpathy et al, “Large-scale Video Classification with Convolutional Neural Networks”, CVPR 2014
Tran et al, “Learning Spatiotemporal Features with 3D Convolutional Networks”, ICCV 2015
Measuring Motion: Optical Flow

Optical flow gives a displacement field F between images I_t and I_t+1. It tells where each pixel will move in the next frame:

F(x, y) = (dx, dy), so that I_t+1(x + dx, y + dy) = I_t(x, y)

[Diagram: image at frame t, image at frame t+1, and the resulting horizontal flow dx and vertical flow dy.]

Simonyan and Zisserman, “Two-stream convolutional networks for action recognition in videos”, NeurIPS 2014
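
A hedged sketch of computing dense optical flow between two consecutive frames with OpenCV's Farneback method (file names are placeholders; the parameter values are common defaults, not taken from the slides):

```python
import cv2

prev = cv2.cvtColor(cv2.imread("frame_t.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t_plus_1.png"), cv2.COLOR_BGR2GRAY)

# Arguments: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
dx, dy = flow[..., 0], flow[..., 1]   # F(x, y) = (dx, dy): per-pixel displacement to frame t+1
```
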
Separating Motion and Appearance: Two-Stream Networks

Simonyan and Zisserman, “Two-stream convolutional networks for action recognition in videos”, NeurIPS 2014
Self-Attention Layer

Inputs: features x (shape: N x D).

- Project the inputs to queries q, keys k, and values v.
- Compute similarities e between queries and keys; a softmax over them gives attention weights a.
- Outputs are context vectors c (shape: Dv), each a weighted sum of the values: c_j = Σ_i a_{i,j} · v_i.

[Diagram: input vectors x_0 … x_N-1 produce keys and values; queries are compared with the keys, the softmax gives attention weights a, and a multiply-and-add over the values v_0 … v_N-1 yields the context vectors c_0 … c_M.]
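
A minimal sketch of the computation described above, with single projection matrices and illustrative sizes (no multiple heads or learned modules):

```python
import torch
import torch.nn.functional as F

N, D, Dv = 8, 64, 64
x = torch.randn(N, D)                          # inputs: N x D
Wq, Wk, Wv = (torch.randn(D, Dv) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv               # queries, keys, values: N x Dv each
e = q @ k.t() / Dv ** 0.5                      # similarities e_{i,j} (scaled dot products)
a = F.softmax(e, dim=-1)                       # attention weights a_{i,j}
c = a @ v                                      # context vectors: each row is a weighted sum of the values
```
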
Spatio-Temporal Self-Attention (Nonlocal Block)

Input features: C x T x H x W (e.g. from a 3D CNN).

- 1x1x1 convs produce queries, keys, and values with C' channels each.
- Queries and keys give attention weights over all spatio-temporal positions (softmax over the pairwise similarities).
- The attention weights are multiplied with the values, a final 1x1x1 conv maps back to C channels, and a residual connection adds the result to the input features.

Wang et al, “Non-local neural networks”, CVPR 2018
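
A simplified PyTorch sketch of a nonlocal block along these lines (the C' = C // 2 bottleneck and the plain softmax pairwise function are assumptions; the paper explores several variants):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        Cp = C // 2
        self.theta = nn.Conv3d(C, Cp, 1)   # queries
        self.phi = nn.Conv3d(C, Cp, 1)     # keys
        self.g = nn.Conv3d(C, Cp, 1)       # values
        self.out = nn.Conv3d(Cp, C, 1)     # project back to C channels

    def forward(self, x):                  # x: B x C x T x H x W
        B, C, T, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x THW x C'
        k = self.phi(x).flatten(2)                     # B x C' x THW
        v = self.g(x).flatten(2).transpose(1, 2)       # B x THW x C'
        attn = torch.softmax(q @ k, dim=-1)            # attention weights over all space-time positions
        y = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
        return x + self.out(y)                         # residual connection
```
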


Spatio-Temporal Self-Attention (Nonlocal Block)

[Diagram: an input clip passes through 3D CNN stages, with nonlocal blocks inserted between them.]

We can add nonlocal blocks into existing 3D CNN architectures.

But what is the best 3D CNN architecture?

Wang et al, “Non-local neural networks”, CVPR 2018


Inflating 2D Networks to 3D (I3D)

There has been a lot of work on architectures for images. Can we reuse image architectures for video?

Idea: take a 2D CNN architecture and replace each 2D Kh x Kw conv/pool layer with a 3D Kt x Kh x Kw version.

[Diagram: an Inception module. 2D version: parallel 1x1, 3x3, and 5x5 convs plus a 3x3 max-pool branch (with 1x1 convs for dimensionality reduction), concatenated. Inflated 3D version: the same module with 1x1x1, 3x3x3, and 5x5x5 convs and a 3x3x3 max-pool.]

Carreira and Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, CVPR 2017
Inflating 2D Networks to 3D (I3D)

We can use the weights of a pretrained 2D conv to initialize the inflated 3D conv: copy the Cin x Kh x Kw kernel Kt times along the temporal dimension and divide by Kt.

This gives the same result as the 2D conv when the video input is "constant", i.e. the same image duplicated Kt times.

[Diagram: a 2D kernel (Cin x Kh x Kw) applied to an image (3 x H x W) gives the same output as the inflated 3D kernel (Cin x Kt x Kh x Kw) applied to the image duplicated Kt times (3 x Kt x H x W).]

Carreira and Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, CVPR 2017
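
A small sketch of the inflation trick on a single weight tensor (shapes are illustrative):

```python
import torch

def inflate_2d_to_3d(w2d, Kt):
    """Repeat a 2D kernel Kt times along a new temporal axis and divide by Kt,
    so a constant video (every frame identical) gives the same response as the 2D conv."""
    # w2d: Cout x Cin x Kh x Kw  ->  Cout x Cin x Kt x Kh x Kw
    return w2d.unsqueeze(2).repeat(1, 1, Kt, 1, 1) / Kt

w2d = torch.randn(64, 3, 7, 7)          # e.g. a pretrained 2D conv kernel
w3d = inflate_2d_to_3d(w2d, Kt=5)       # 64 x 3 x 5 x 7 x 7
```
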
Vision Transformers for Video
Factorized attention: Attend over space / time

Bertasius et al, “Is Space-Time Attention All You Need for Video Understanding?”, ICML 2021
Arnab et al, “ViViT: A Video Vision Transformer”, ICCV 2021
Neimark et al, “Video Transformer Network”, ICCV 2021
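
A rough sketch of factorized space-time attention over a B x T x N x D token grid (illustrative shapes; real models such as TimeSformer or ViViT also use class tokens, positional embeddings, and MLP blocks):

```python
import torch
import torch.nn as nn

B, T, N, D = 2, 8, 196, 768                      # batch, time steps, patches per frame, embed dim
tokens = torch.randn(B, T, N, D)

spatial_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

x = tokens.reshape(B * T, N, D)                  # attend over the N patches within each frame
x, _ = spatial_attn(x, x, x)
x = x.reshape(B, T, N, D).transpose(1, 2).reshape(B * N, T, D)   # attend over T for each patch
x, _ = temporal_attn(x, x, x)
x = x.reshape(B, N, T, D).transpose(1, 2)        # back to B x T x N x D
```
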
Vision Transformers for Video
Pooling module: Reduce number of tokens

Fan et al, “Multiscale Vision Transformers”, ICCV 2021


Li et al, “MViTv2: Improved Multiscale Vision Transformers for Classification and Detection”, CVPR 2022
Vision Transformers for Video

Li et al, “MViTv2: Improved Multiscale Vision Transformers for Classification and Detection”, CVPR 2022
Temporal Action Localization

Given a long untrimmed video sequence, identify the frames corresponding to different actions (e.g. intervals of Running and Jumping).

We can use an architecture similar to Faster R-CNN: first generate temporal proposals, then classify them.

Chao et al, “Rethinking the Faster R-CNN Architecture for Temporal Action Localization”, CVPR 2018
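
A very rough illustration of the two-stage idea, using fixed sliding-window proposals and a generic clip classifier (a hypothetical callable); the learned proposal mechanism of Chao et al. is considerably more involved:

```python
import torch

def temporal_proposals(num_frames, window=64, stride=16):
    """Fixed-length sliding windows over the untrimmed video, as stand-in proposals."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

def localize(video, clip_classifier, threshold=0.5):
    """video: T x 3 x H x W; clip_classifier maps a 1 x T' x 3 x H x W clip to class logits."""
    detections = []
    for start, end in temporal_proposals(video.shape[0]):
        probs = clip_classifier(video[start:end].unsqueeze(0)).softmax(-1)[0]
        score, label = probs.max(0)
        if score > threshold:
            detections.append((start, end, int(label), float(score)))
    return detections
```
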
Handwashing recognition