Lec6 Video Understanding
Recognize actions: Swimming, Running, Jumping, Eating, Standing
[Figure: Video -> Recognize actions -> Step 1, Step 2, Step 3, Step 4, Step 5, Step 6 (Yes / No)]
Problem: Videos are big!
Input:
Tx3xHxW
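The size problem can be made concrete with simple arithmetic on the Tx3xHxW shape. The sketch below is only an illustration: the frame rate, resolution, and one byte per value are assumed numbers, not figures from the slides.

```python
def raw_video_bytes(T, H, W, channels=3, bytes_per_value=1):
    """Uncompressed size in bytes of a T x 3 x H x W clip."""
    return T * channels * H * W * bytes_per_value

# Assumed example: one minute at 30 frames per second, 640x480 RGB, uint8.
T = 30 * 60
size = raw_video_bytes(T, H=480, W=640)
print(f"{size / 1e9:.2f} GB")  # about 1.66 GB for a single minute
```

This is why video models usually work on short clips at reduced frame rate and resolution rather than on raw full-length video.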
Video Classification: Late Fusion (with pooling)
Input:
Tx3xHxW
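As a rough sketch of late fusion with pooling: each frame is encoded independently, the per-frame features are averaged over time, and a linear layer produces class scores. The function `frame_features` is a hypothetical stand-in for a 2D CNN, and all sizes are assumed for illustration.

```python
import numpy as np

def frame_features(frame, D):
    """Hypothetical stand-in for a 2D CNN: maps one 3xHxW frame to a D-vector."""
    # Mean color per channel, tiled up to D values; a real model would be learned.
    return np.resize(frame.mean(axis=(1, 2)), D)

def late_fusion_scores(video, W_cls, b_cls):
    """video: T x 3 x H x W. Returns C class scores."""
    D = W_cls.shape[1]
    feats = np.stack([frame_features(f, D) for f in video])  # T x D
    pooled = feats.mean(axis=0)                               # average over time
    return W_cls @ pooled + b_cls                             # linear layer -> C

rng = np.random.default_rng(0)
T, H, W, D, C = 4, 16, 16, 8, 5
video = rng.random((T, 3, H, W))
W_cls, b_cls = rng.standard_normal((C, D)), np.zeros(C)
scores = late_fusion_scores(video, W_cls, b_cls)
```

In this sketch the scores do not change if the frames are reversed or shuffled, because averaging over time discards frame order.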
Video Classification: Early Fusion
➢ Intuition: Compare frames with very first conv layer,
after that normal 2D CNN
Problem: One layer of temporal processing may not be enough!
Class scores: C
Rest of the network is standard 2D CNN
First 2D convolution collapses all temporal information:
Input: 3TxHxW
Output: DxHxW
Reshape: 3TxHxW
Input:
Tx3xHxW
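The reshape and first convolution of early fusion can be sketched in NumPy as follows. The helper `conv2d_same`, the 3x3 kernel size, and the tensor sizes are assumptions chosen for illustration; only the shapes Tx3xHxW, 3TxHxW, and DxHxW come from the slide.

```python
import numpy as np

def conv2d_same(x, w):
    """x: Cin x H x W, w: Cout x Cin x K x K (odd K). Zero padding, stride 1."""
    Cout, Cin, K, _ = w.shape
    p = K // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((Cout, H, W))
    for i in range(K):
        for j in range(K):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

rng = np.random.default_rng(0)
T, H, W, D = 4, 8, 8, 6
video = rng.random((T, 3, H, W))            # Input: T x 3 x H x W
stacked = video.reshape(T * 3, H, W)        # Reshape: 3T x H x W
w = rng.standard_normal((D, 3 * T, 3, 3))   # each filter spans all T frames
out = conv2d_same(stacked, w)               # Output: D x H x W
```

After this layer there is no time axis left, so every later layer sees an ordinary D-channel image.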
Video Classification: 3D CNN
➢ Intuition: Use 3D versions of convolution and pooling to
slowly fuse temporal information over the course of the
network
Class scores: C
3D CNN
Input:
Tx3xHxW
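A minimal NumPy sketch of a 3D convolution, assuming no padding, stride 1, and made-up sizes. Unlike the early-fusion layer, the kernel slides over time as well as space, so the output keeps a (shorter) temporal axis that later layers can continue to fuse.

```python
import numpy as np

def conv3d_valid(x, w):
    """x: Cin x T x H x W, w: Cout x Cin x Kt x Kh x Kw. No padding, stride 1."""
    Cout, Cin, Kt, Kh, Kw = w.shape
    _, T, H, W = x.shape
    To, Ho, Wo = T - Kt + 1, H - Kh + 1, W - Kw + 1
    out = np.zeros((Cout, To, Ho, Wo))
    for t in range(Kt):
        for i in range(Kh):
            for j in range(Kw):
                patch = x[:, t:t + To, i:i + Ho, j:j + Wo]
                out += np.einsum("oc,cthw->othw", w[:, :, t, i, j], patch)
    return out

rng = np.random.default_rng(0)
video = rng.random((8, 3, 16, 16))          # T x 3 x H x W
x = video.transpose(1, 0, 2, 3)             # 3 x T x H x W
w = rng.standard_normal((4, 3, 3, 3, 3))    # 4 filters, each 3 x 3 x 3 x 3
out = conv3d_valid(x, w)                    # 4 x 6 x 14 x 14
```

Because the same filter is applied at every time step, dropping the first input frame simply drops the first output step in this sketch.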
2D Conv (Early Fusion) vs 3D Conv (3D CNN)
[Figure: input volumes with H=224, W=224]
Tran et al, “Learning Spatiotemporal Features with 3D Convolutional Networks”, ICCV 2015
Early Fusion vs Late Fusion vs 3D CNN
Karpathy et al, “Large-scale Video Classification with Convolutional Neural Networks”, CVPR 2014
Tran et al, “Learning Spatiotemporal Features with 3D Convolutional Networks”, ICCV 2015
Measuring Motion: Optical Flow
Optical flow gives a displacement field F between images $I_t$ and $I_{t+1}$
[Figure: image at frame t, with horizontal flow dx and vertical flow dy]
Simonyan and Zisserman, “Two-stream convolutional networks for action recognition in videos”, NeurIPS 2014
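One common way to read the displacement field is that a pixel at (x, y) in frame t appears at (x + dx, y + dy) in frame t+1. The toy check below assumes a constant motion and wrap-around borders so the relation holds exactly; real flow is estimated from the images and varies per pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 6, 8
I_t = rng.random((H, W))

# Assumed toy motion: every pixel moves 2 columns right and 1 row down.
dx = np.full((H, W), 2)   # horizontal flow
dy = np.full((H, W), 1)   # vertical flow
I_t1 = np.roll(I_t, shift=(1, 2), axis=(0, 1))   # wrap-around keeps the toy exact

# Look up frame t+1 at the displaced positions (x + dx, y + dy).
ys, xs = np.mgrid[0:H, 0:W]
warped = I_t1[(ys + dy) % H, (xs + dx) % W]
```

Here `warped` reproduces `I_t`, which is the sense in which the pair (dx, dy) describes the motion between the two frames.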
Self Attention layer
Inputs:
Features: x (shape: NxD)
Queries: q0, q1, ..., qM
Outputs:
Context vector: c (shape: Dv)
$c_j = \sum_i a_{i,j} v_i$ (Mul+add)
c0, c1, ..., cM
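A minimal NumPy sketch of this layer. The projection matrices `Wq`, `Wk`, `Wv`, the scaling by the square root of the query size, and all sizes are assumptions (the usual scaled dot-product form); the slide itself only gives the input shape and the weighted sum over values.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: N x D features. Returns N context vectors of size Dv and the weights."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = q @ k.T / np.sqrt(q.shape[1])        # e[j, i]: query j against key i
    e = e - e.max(axis=1, keepdims=True)     # stabilize the softmax
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)     # a[j, i] plays the role of a_{i,j}
    return a @ v, a                          # c_j = sum_i a_{i,j} v_i

rng = np.random.default_rng(0)
N, D, Dq, Dv = 5, 8, 4, 6
x = rng.random((N, D))
Wq, Wk = rng.standard_normal((D, Dq)), rng.standard_normal((D, Dq))
Wv = rng.standard_normal((D, Dv))
c, a = self_attention(x, Wq, Wk, Wv)
```

Each row of `a` sums to 1, so every context vector is a weighted average of the value vectors.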
Spatio-Temporal Self-Attention (Nonlocal Block)
3D CNN -> Features: CxTxHxW
Queries: 1x1 conv -> C’xHxW
Keys: 1x1 conv -> C’xHxW
Values: 1x1 conv -> C’xHxW
Attention Weights: Queries x Keys (Transpose), Softmax -> (HxW)x(HxW)
Attention Weights x Values -> C’xHxW
1x1 conv -> CxHxW
+ Residual connection
Nonlocal Block
[Figure: Input clip -> 3D CNN -> Nonlocal Block -> 3D CNN -> Nonlocal Block -> 3D CNN. Each block takes Features (CxTxHxW), computes Keys and Values (C’xHxW) with 1x1 conv, applies Softmax, and adds the result back to the Features.]
Inflating 2D Networks to 3D (I3D)
There has been a lot of work on architectures for images. Can we reuse image architectures for video?
Idea: take a 2D CNN architecture. Replace each 2D Kh x Kw conv/pool layer with a 3D KtxKhxKw version.
[Figure: Inception Module. Previous layer -> four branches (1x1 conv; 1x1 conv -> 3x3 conv; 1x1 conv -> 5x5 conv; 3x3 MaxPool -> 1x1 conv) -> Concatenate]
3D conv kernel: CinxKtxKhxKw
Input: 3xKtxHxW
Output: 1xHxW
Carreira and Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, CVPR 2017
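The cited I3D paper also reuses trained 2D weights by copying each 2D kernel Kt times along the time axis and dividing by Kt. The sketch below checks the consequence on a "constant" video (one image repeated Kt times): the inflated 3D kernel gives the same response as the 2D kernel. Helper names and sizes are hypothetical, and the convolution uses no padding, so the output here is 1x(H-2)x(W-2) rather than 1xHxW.

```python
import numpy as np

def conv2d_valid(x, w):
    """x: Cin x H x W, w: Cin x Kh x Kw. One output channel, no padding."""
    Cin, Kh, Kw = w.shape
    Ho, Wo = x.shape[1] - Kh + 1, x.shape[2] - Kw + 1
    out = np.zeros((Ho, Wo))
    for i in range(Kh):
        for j in range(Kw):
            out += np.einsum("c,chw->hw", w[:, i, j], x[:, i:i + Ho, j:j + Wo])
    return out

def inflate(w2d, Kt):
    """Cin x Kh x Kw -> Cin x Kt x Kh x Kw: copy along time, divide by Kt."""
    return np.repeat(w2d[:, None], Kt, axis=1) / Kt

def conv3d_one_step(clip, w3d):
    """clip: Cin x Kt x H x W with exactly Kt frames, so the output has 1 time step."""
    Kt = w3d.shape[1]
    return sum(conv2d_valid(clip[:, t], w3d[:, t]) for t in range(Kt))[None]

rng = np.random.default_rng(0)
Kt, H, W = 3, 10, 10
image = rng.random((3, H, W))
w2d = rng.standard_normal((3, 3, 3))                    # Cin x Kh x Kw
constant_clip = np.repeat(image[:, None], Kt, axis=1)   # 3 x Kt x H x W
out3d = conv3d_one_step(constant_clip, inflate(w2d, Kt))
out2d = conv2d_valid(image, w2d)
```

Dividing by Kt is what keeps the response unchanged: the Kt identical frames each contribute 1/Kt of the original 2D output.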
Vision Transformers for Video
Factorized attention: Attend over space / time
Bertasius et al, “Is Space-Time Attention All You Need for Video Understanding?”, ICML 2021
Arnab et al, “ViViT: A Video Vision Transformer”, ICCV 2021
Neimark et al, “Video Transformer Network”, ICCV 2021
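A rough sketch of the factorization idea: attend among the patches of each frame, then among the time steps at each patch location. This is a simplified illustration, not the exact layer of any cited paper; queries, keys, and values are the tokens themselves (no learned projections), and the sizes are assumed.

```python
import numpy as np

def attend(x):
    """Self-attention over the second-to-last axis; x: ... x N x D.
    Queries, keys, and values are all x itself (no learned projections)."""
    e = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    e = e - e.max(axis=-1, keepdims=True)
    a = np.exp(e)
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ x

def factorized_attention(tokens):
    """tokens: T x S x D, with S spatial patches per frame."""
    x = attend(tokens)               # space: S x S weights inside each frame
    x = np.swapaxes(x, 0, 1)         # S x T x D
    x = attend(x)                    # time: T x T weights at each location
    return np.swapaxes(x, 0, 1)      # back to T x S x D

T, S, D = 8, 196, 16
tokens = np.random.default_rng(0).random((T, S, D))
out = factorized_attention(tokens)

joint_pairs = (T * S) ** 2                  # every token attends to every token
factorized_pairs = T * S * S + S * T * T    # space pass plus time pass
```

With these assumed sizes, joint attention compares 2,458,624 token pairs while the factorized version compares 319,872.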
Vision Transformers for Video
Pooling module: Reduce number of tokens
Li et al, “MViTv2: Improved Multiscale Vision Transformers for Classification and Detection”, CVPR 2022
Temporal Action Localization
Given a long untrimmed video sequence, identify
frames corresponding to different actions
Running
Jumping
Chao et al, “Rethinking the Faster R-CNN Architecture for Temporal Action Localization”, CVPR 2018
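To make the task output concrete, the sketch below turns per-frame labels into action segments with start and end frames. It only illustrates the output format; it is not the proposal-based method of the cited paper, and the labels are made up.

```python
from itertools import groupby

def frames_to_segments(labels, background=None):
    """Turn per-frame labels into (label, first_frame, last_frame) segments."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != background:
            segments.append((label, start, start + length - 1))
        start += length
    return segments

# Assumed per-frame predictions for a short untrimmed clip (None = no action).
labels = [None, "Running", "Running", "Running", None, None, "Jumping", "Jumping"]
print(frames_to_segments(labels))
# [('Running', 1, 3), ('Jumping', 6, 7)]
```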
Handwashing recognition