0% found this document useful (0 votes)

19 views

spatiotemporal-scene-graphs-for-video-QA

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering presents a novel approach to enhance video QA by leveraging a (2.5+1)D scene graph representation that captures spatio-temporal information across video frames. This method transforms 2D frames into a pseudo-3D structure, allowing for the creation of static and dynamic sub-graphs, which improves reasoning efficiency and reduces redundancy in graph nodes. Experiments demonstrate that this approach significantly accelerates training and inference while achieving superior performance on video QA tasks compared to existing methods.

Uploaded by

similoluwa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views

spatiotemporal-scene-graphs-for-video-QA

Uploaded by

similoluwa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

(2.

5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Anoop Cherian Chiori Hori Tim K. Marks Jonathan Le Roux

Mitsubishi Electric Research Labs (MERL), Cambridge, MA
{cherian, chori, tmarks, leroux}@merl.com
arXiv:2202.09277v2 [cs.CV] 26 Mar 2022

Abstract et al. 2015; Krishna et al. 2017; Dornadula et al. 2019), per-
haps because of the possibility for an easier disentanglement
Spatio-temporal scene-graph approaches to video-based rea- of the scene objects relevant to the given question. Such a
soning tasks, such as video question-answering (QA), typi-
disentanglement naturally leads to a graph representation of
cally construct such graphs for every video frame. These ap-
proaches often ignore the fact that videos are essentially se- the scene, usually called a scene graph (Johnson et al. 2015).
quences of 2D “views” of events happening in a 3D space, Using such a graph representation allows one to use the
and that the semantics of the 3D scene can thus be carried powerful machinery of graph neural networks for VQA, and
over from frame to frame. Leveraging this insight, we pro- has demonstrated significant promise (Li and Jiang 2019; Li
pose a (2.5+1)D scene graph representation to better cap- et al. 2019; Pan et al. 2020; Geng et al. 2021).
ture the spatio-temporal information flows inside the videos. While visual scene graphs were originally proposed for
Specifically, we first create a 2.5D (pseudo-3D) scene graph image-based tasks, there have been direct adaptations of this
by transforming every 2D frame to have an inferred 3D struc-
ture using an off-the-shelf 2D-to-3D transformation module,
data structure for video-based reasoning problems (Geng
following which we register the video frames into a shared et al. 2021; Chatterjee et al. 2021; Pan et al. 2020; Herzig
(2.5+1)D spatio-temporal space and ground each 2D scene et al. 2019). Usually, in such problems, scene graphs are
graph within it. Such a (2.5+1)D graph is then segregated into constructed for every video frame, followed by an inter-
a static sub-graph and a dynamic sub-graph, corresponding to frame representation learning to produce holistic video level
whether the objects within them usually move in the world. features for reasoning. However, having scene graphs for ev-
The nodes in the dynamic graph are enriched with motion ery frame may be redundant and could even become compu-
features capturing their interactions with other graph nodes. tationally detrimental for longer video sequences. Taking a
Next, for the video QA task, we present a novel transformer- step back, we note that videos are essentially 2D views of
based reasoning pipeline that embeds the (2.5+1)D graph a 3D space in which various events happen temporally, and
into a spatio-temporal hierarchical latent space, where the
sub-graphs and their interactions are captured at varied gran-
representing the scene in a 4D spatio-temporal space could
ularity. To demonstrate the effectiveness of our approach, thus potentially avoid such representational redundancies.
we present experiments on the NExT-QA and AVSD-QA Furthermore, object properties such as permanence (Sham-
datasets. Our results show that our proposed (2.5+1)D rep- sian et al. 2020) could be handled more effectively in a 3D
resentation leads to faster training and inference, while our space, as each object (that is visible in some video frame)
hierarchical model showcases superior performance on the gets a location therein, thereby disentangling the camera
video QA task versus the state of the art. views from its spatial location (Tung et al. 2020; Girdhar
and Ramanan 2019). Using such a 3D representation thus
would provide a natural way to avoid occlusions, which is a
Introduction significant problem when working with 2D scene graphs.
Recent advances in deep learning have made it possible to Motivated by the above insight, we explore a novel spatio-
think beyond individual domains, such as computer vision temporal scene graph representation, where the graph nodes
and natural language processing, and consider tasks that are not grounded on individual video frames, instead are
are at their intersections. Visual question answering (VQA) mapped to a shared 3D world coordinate frame. While there
is one such task that has witnessed a significant attention are approaches in computer vision that could produce such
lately (Antol et al. 2015; Anderson et al. 2018; Wu et al. a common 3D world (Hartley and Zisserman 2004), such
2017; Jang et al. 2017; Geng et al. 2021; Chen et al. 2020; methods usually assume: i) that the scene is static, without
Ghosh et al. 2019). While earlier approaches to this task dynamic objects, ii) that the camera calibration information
used holistic visuo-textual representations (Dang et al. 2021; is known, or iii) that multiple overlapping views of the same
Antol et al. 2015), it was found that decomposing a visual scene are available; none of which may exist for arbitrary
scene into its constituents (and their relationships) provided (internet) videos typically used in VQA tasks. Fortunately,
a better reasoning pipeline (Anderson et al. 2018; Johnson there have been several recent advancements in 3D recon-
struction from 2D images, such as (Ranftl et al. 2019; Fu
et al. 2018); these works take as input an image and pro- Scene graphs for QA: Since the seminal work of (Johnson
duces a realistic pseudo-3D structure for the image scene, et al. 2015) in using scene graphs as a rich representation of
typically called a 2.5D image. For every video frame, we an image scene, there have been extensions of this idea for
leverage such a 2.5D reconstruction to impart an approxi- video QA and captioning tasks (Herzig et al. 2019; Wang
mate 3D location for each graph node, thereby producing a et al. 2018; Jang et al. 2017; Tsai et al. 2019b; Girdhar et al.
(2.5+1)D spatio-temporal scene graph. 2019). Spatio-temporal scene graphs are combined with a
A technical challenge with the above (2.5+1)D scene knowledge distillation objective for video captioning in (Pan
graph representation is that each graph is still specific to a et al. 2020). Similarly, video scene graphs are combined
video frame, and is not registered to a shared space. Such with multimodal Transformers for video dialogs and QA
a registration is confounded by the fact that objects in the in (Geng et al. 2021). In (Jiang et al. 2019), a graph align-
scene may move from frame to frame. To this end, we pro- ment framework is proposed that uses graph co-attention be-
pose to: (i) split the (2.5+1)D scene graph into a static 2.5D tween visual and language cues for better video QA reason-
sub-graph and a dynamic (2.5+1)D sub-graph, depending on ing. In (Fan et al. 2019), a multi-step reasoning pipeline is
whether the class of the underlying scene graph node usually presented that attends to visual and textual memories. We
moves in scenes (e.g., a person class is dynamic, while a ta- note that scene graphs have been explored for various action
ble class is considered static), (ii) merge the graph nodes cor- recognition tasks as well. For example, video action graphs
responding to the static sub-graph based on their 3D spatial are presented in (Bar et al. 2020; Rashid, Kjellstrom, and
proximity across frames, thereby removing the node redun- Lee 2020; Wang and Gupta 2018). Action Genome (Ji et al.
dancy, and (iii) retain the nodes of the dynamic sub-graph 2020; Cong et al. 2021) characterizes manually annotated
from the original scene graph. As the dynamic sub-graph spatio-temporal scene graphs for action recognition. In con-
nodes are expected to not only capture the frame-level se- trast to these prior methods, we seek a holistic and poten-
mantics, but to also potentially involve object actions (e.g., tially minimal representation of a video scene via pseudo-3D
person picking a bottle), we enrich each dynamic graph node scene graphs for the QA task.
with motion features alongside its object-level feature rep- 3D scene graphs: Very similar to our motivations towards a
resentation. Thus, our proposed (2.5+1)D scene graph rep- comprehensive scene representation, 3D scene graphs have
resentation approximately summarizes the spatio-temporal been proposed in (Armeni et al. 2019). However, their focus
activity happening in a scene in a computationally efficient is on efficient annotation and collection of such graphs from
framework. 3D sensors. Similarly, more recent efforts such as (Zhang
The use of such a (2.5+1)D graph representation allows et al. 2021; Wu et al. 2021) are also targeted at improvising
for developing rich inference schemes for VQA tasks. For the efficiency of constructing a 3D scene graph from RGBD
example, to capture the interaction of a person with a static scans, while our focus is on constructing pseudo-3D graphs
object in the scene, the inference algorithm needs to attend leveraging recent advancements in 2D-to-3D methods. We
to regions in the (2.5+1)D graph where the spatio-temporal note that while precise 3D scene graphs may be important
proximity between the respective graph nodes is minimized. for several tasks such as robot navigation or manipulation,
Leveraging this intuition, we propose a hierarchical latent they need not be required for reasoning tasks such as what
embedding of the (2.5+1)D graph where the graph edges are we consider in this paper, and for such tasks approximate
constructed via varied spatio-temporal proximities, thereby 3D reasoning may be sufficient. We also note that 3D graphs
capturing the latent embeddings of the graph at multiple lev- have been explored for video prediction tasks in (Tung et al.
els of granularity. We use such a graph within a Transformer 2020), however in a very controlled setting.
reasoning pipeline (Vaswani et al. 2017) and conditioned on Graph Transformers: Similar to our contribution, connec-
the VQA questions to retrieve the correct answer. tions between graphs and Transformers have been explored
To validate the effectiveness of our approach, we present previously. For example, (Choromanski et al. 2021) has ex-
experiments on two recent video QA datasets, namely: plored long-range attention using kernelized Transformers,
(i) the NExT-QA dataset (Xiao et al. 2021) and (ii) the while (Tsai et al. 2019a) presents a kernel view of Trans-
QA task of the audio-visual scene aware dialog (AVSD) former attention, and Bello presents a long range attention
dataset (Alamri et al. 2019a). Our results on these datasets using lambda layers that captures both position and content
show that our proposed framework leads to about 4x speed interactions (Bello 2021). While there are some similarities
up in training, while pruning 25–50% graph nodes, and between these works and ours in the use of kernels and po-
showcases superior QA accuracy against recent state-of-the- sitional details in computing the similarity matrix within a
art methods. Transformer, our objective and goals are entirely different
from these works. Specifically, our proposed architecture is
to represent a pseudo-3D scene at multiple levels of spatio-
Related Work temporal granularity for a reasoning task, which is entirely
We note that visual question answering has been a very ac- different from the focus of these prior works.
tive research area in the recent times, and thus interested
readers may refer to excellent surveys, such as (Teney et al. Proposed Method
2018; Wu et al. 2017). In the following, we restrict our lit- In this section, we first present our setup for constructing
erature review to prior methods that are most similar to our (2.5+1)D scene graphs for a given video sequence, then
contributions. explain our hierarchical spatio-temporal Transformer-based
Question Embedding
Candidate
table Answers
table
bed

Hierarchical
Answer
chair (2.5+1)D
Selection
laptop person Transformer

cup
person
person
Input video frames (2.5+1)D transformation Static and Dynamic Graphs Motion Dyn. node Static node

Figure 1: A schematic illustration of our proposed (2.5+1)D video QA reasoning pipeline.

graph reasoning pipeline. See Fig. 1 for an overview of our frames, we will have mn graph nodes.1 However, as alluded
framework. to above, several of these graph nodes may be redundant,
thus motivating us to propose our (2.5+1)D scene graphs.
Problem Setup (2.5+1)D Scene Graphs
We assume that we have access to a set of N training video Suppose D : Rh×w×3 → Rh×w×4 denotes a neural network
sequences, S = {S1 , S2 , . . . , SN }, where the i-th video model that takes as input an RGB image and produces as
consists of ni frames. In the following, we eliminate the sub- output an RGBD image, where the depth is estimated. For a
scripts for simplicity, and use S to denote a generic video video frame (image) I, further let dI : R2 → R3 map a 2D
sequence from S that has n frames. We assume that each pixel location (x, y) to a respective 3D coordinate, denoted
video S is associated with at least one question Q, which is p = (x, y, z). To implement D, we use an off-the-shelf pre-
an ordered tuple of words from a predefined vocabulary (tok- trained 2D-to-3D deep learning framework. While there are
enized and embedded suitably). We define the task of video several options for this network (Fu et al. 2018; Li et al.
QA as that of retrieving a predicted answer, Apred , from a 2020), we use the MiDAS model (Ranftl et al. 2019), due
collection of ` possible answers, A = {A1 , A2 , . . . , A` }. Of to its ease of use and its state-of-the-art performance in esti-
these ` answers, we denote the ground-truth answer as Agt . mating realistic depth for a variety of real-world scenes. For
We propose to represent S as a (2.5+1)D spatio-temporal a scene graph node v ∈ Vt extracted from video frame t (im-
scene graph. The details of this process are described in the age It ), let bboxv denote the centroid of the node’s detected
following subsections. bounding box. To enrich the scene graph with (2.5+1)D
spatio-temporal information, we expand the representation
of node v to include depth and time by updating the tuple
2D Scene Graph Construction for v to be (fvo , cv , bboxv , pv , t), where pv = dIt (bboxv )
can be interpreted as the 3D centroid of the bounding box.
Let G = (V, E) be a scene graph representation of a video We denote the enriched graph as G3.5D .
sequence S of length n frames, where V = V1 ∪V2 ∪· · ·∪Vn
denotes the set of nodes, each Vt denotes the subset of nodes Static and Dynamic Sub-graphs
associated with frame t, and E ⊆ V × V denotes the set
While the nodes in G3.5D are equipped with depth, they are
of graph edges (which are computed as part of our hierar-
still grounded in every video frame, which is potentially
chical Transformer framework explained later). To construct
wasteful. This is because many of these nodes may corre-
the scene graph G, we follow the standard pipeline using an
spond to objects in the scene that seldom move in the real
object detector. Specifically, we first extract frames from the
world. If we can identify such objects, then we can prune
video sequence and pass each frame as input to a Faster R-
their redundant scene graph nodes. To this end, we segre-
CNN (FRCNN) object detection model (Ren et al. 2015).
gated the Visual Genome classes into two distinct categories,
The FRCNN implementation that we use is pre-trained on
namely (i) a category Cs of static scene objects, such as ta-
the Visual Genome dataset (Anderson et al. 2018) and can
ble, tree, sofa, television, etc., and (ii) a category Cd of dy-
thus detect 1601 object classes, which include a broad array
namic objects, such as people, mobile, football, clouds, etc.
of daily-life indoor and outdoor objects. In every frame, the
While the visual appearance of a static object may change
FRCNN model detects m objects, each of which is repre-
sented by a graph node v that contains a tuple of FRCNN 1
While this may appear not too big a memory footprint, note
outputs (fvo , cv , bboxv ), where fvo is the object’s neural rep- that each visual feature fvo is usually a 2048D vector. Thus, with
resentation, cv is its label in the Visual Genome database, m = 36, videos of length n ≈ 50 frames, and a batch size of 64,
and bboxv denotes its bounding box coordinates relative we would need about 15GB of GPU memory for forward propaga-
to the respective frame. Thus, for a video sequence with n tion alone.
from frame to frame, we assume that its semantics do not Algorithm 1: Identifying common ancestors for merging
change and are sufficient for reasoning about the object in for v1 ∈ V1s do
the QA task. However, such an assumption may not hold for ancestor(v1 ) := v1
a dynamic object, such as a person who may interact with for t = 2 to n do
various objects in the scene, and each interaction needs to for vt ∈ Vts do
be retained for reasoning. Using this class segregation, we if match(vt ) exists then
split the graph G3.5D into two distinct scene graphs, Gs and
ancestor(vt ) := ancestor match(vt )
Gd , corresponding to whether the object label cv of a node
v ∈ V belongs to Cs or Cd , respectively.
Our next subgoal is to register G3.5D in a shared 3D space.
There are two key challenges in such a registration, namely Motion Features
that (i) the objects in the scene may move, and (ii) the cam-
era may move. These two problems can be tackled easily
if we extract registration features only from the static sub- To recap, so far we have segregated the graph G3.5D into
graph nodes of the frames. Specifically, if there is camera Gs and Gd , where the nodes of Gs have been pruned and
motion, then one may find a frame-to-frame 3D projection registered into a common shared 3D space to form an up-
matrix using point features, and then use this projection ma- dated graph Gs0 , while the spatial locations of the nodes in
trix to spatially map all the graph nodes (including the dy- the dynamic graph Gd have been updated via transforma-
namic nodes) into a common coordinate frame. tion matrices produced from Gs0 into the same coordinate
While this setup is rather straightforward, we note that the frame. An important step missing in our framework is that
objects in the static nodes are only defined by their bound- the dynamic sub-graph that we have constructed so far is still
ing boxes, which are usually imprecise. Thus, to merge two essentially a series of graph nodes produced from FRCNN,
static nodes, we first consider whether the nodes are from which is a static object detection model – that is, the nodes
frames that are sufficiently close in time, with the same are devoid of any action features that are perhaps essential
object labels, and with the intersection over union (IoU) in capturing how a dynamic node acts within itself and on
of their bounding boxes above a threshold γ. Two nodes its environment (defined by the static objects). To this end,
vt , vt0 ∈ Gs , from frames with timestamps t 6= t0 (where we propose to incorporate motion features into the nodes
|t − t0 | < δ) are candidates for merging if the following of the dynamic graph. Specifically, we use the I3D action
criterion C is met: recognition neural network (Carreira and Zisserman 2017),
pre-trained on the Kinetics-400 dataset to produce convolu-
C(vt , vt0 ) := cvt = cvt0 ∧IoU(bboxvt , bboxvt0 ) > γ. (1) tional features from short temporal video clips. These fea-
If a static node vt has multiple candidate nodes in the pre- tures are then ROI-pooled using the (original) bounding
vious δ frames that satisfy criterion (1), the candidate with boxes associated with the dynamic graph nodes. Suppose
the nearest 3D centroid is selected as the matching node that fvat = ROIPool(I3D(st ), bboxvt ), where st denotes a short
will be merged: video clip around the t-th video frame of a video S, then we
match(vt ) = arg min pvt − pvt0 , augment the FRCNN feature vector by concatenating the ob-
s
vt0 ∈ Vt−δ s
∪ · · · ∪ Vt−1 ject and action features as fvoa ← fvo || fva , for all v ∈ Vd ,
where || denotes feature concatenation.
such that C(vt , vt0 ) = 1
(2)
where Vts = {vt ∈ Vt | vt ∈ Gs } denotes the set of all Hierarchical Graph Embedding
static nodes from frame t. Since (2) chooses the best match
from the past δ frames, rather than just from frame t − 1, it Using our (2.5+1)D scene graph thus constructed, we are
can tolerate more noise in the estimates of the depth and the now ready to present our video QA reasoning setup. As the
bounding boxes associated with the graph nodes. questions in a QA task may need reasoning at various levels
We can apply this matching process recursively in order of abstraction (e.g., Q: what is the color of a person’s shirt?
to determine larger equivalence classes of matched nodes to A: red, Q: why did the boy cry?: A: Because the ball hit him,
be merged, where an equivalence class is defined as the set etc.), we decided to design our reasoning pipeline so that it
of all nodes that share a single common ancestor. We ac- can capture such a hierarchy. To set the stage, let us review a
complish this by looping over the frames t in temporal or- few basics on Transformers in our context. In the sequel, we
der, where for each node vt for which match(v t ) exists, we assume the set of nodes in G3.5D is given by: V 0 = Vs0 ∪ Vd .
assign ancestor(vt ) = ancestor match(vt ) . This proce- 0

dure is detailed in Algorithm 1. Finally, for each ancestor, all Transformers: Suppose F ∈ Rr×|V | denotes a matrix of
nodes that share that ancestor are merged into a single node. features computed from the static and dynamic graph nodes
The feature fvo associated with the new node v is obtained of a video S via projecting their original features into latent
by averaging the features from all of the nodes that were spaces of dimensionality r using multi-layer perceptrons,
merged into it. We use the 3D coordinate p of the parent MLPs and MLPd ; i.e., F = MLPs (fVos0 ) || MLPd (fVoad ).
0
node for all the child nodes that are merged into it. Let Gs0 If QiF , KiF , VFi ∈ Rrk ×|V | denote the i-th k-headed query,
denote the new reduced version of Gs after each equivalence key, and value embeddings of F respectively, where rk =
class of matched nodes has been merged into one node. r/k, then a multi-headed self-attention Transformer encoder
produces features F 0 given by:
>
!
k QiF KiF
F 0 := || softmax √ VFi . (3)
i=1 rk 𝑀𝐿𝑃
Add & Norm
(2.5+1)D-Transformer: We note that in a standard Trans-
former described in (3), each output feature in F 0 is a mix- Feed Forward
Nx
𝑖 = 1,2, … , 𝜂
ture embedding of several input features, as decided by the Add & Norm
similarity computed within the softmax. Our key idea in
(2.5+1)D-Transformer is to use a similarity defined by the 𝑀𝐿𝑃$
spatio-temporal proximity of the graph nodes as character- Multi-head Multi-head
ized by our (2.5+1)D scene graph. For two nodes v1 , v2 ∈ Attention Kernel Attention
V 0 , let a similarity kernel κ be defined as: 𝜎#$ , 𝜎%$
! Q K V V 𝑝! 𝑝!"
2
kpv1 − pv2 k ktv1 − tv2 k1
κ(v1 , v2 |σS , σt ) = exp − − ,
σS2 σT
(4)
capturing the spatial-temporal proximity between v1 and v2
for scales σS and σT for spatial and temporal cues, respec-
tively. Then, our (2.5+1)D-Transformer is given by:
Node Features spatio-temporal positions
k (2.5+1)D scene graph
0 0 0
F3.5D := || softmax K(V , V |σS , σT )VFi , (5)
i=1 Figure 2: The architecture of the proposed Hierarchical
(2.5+1)D-Transformer for encoding (2.5+1)D scene graphs.
where we use K to denote the spatio-temporal kernel matrix
The left module in red (N×) is the standard Transformer.
constructed on V 0 using (4) between every pair of nodes.
Such a similarity kernel merges features from nodes in the
graph that are spatio-temporally nearby – such as for exam- question. This step precedes attending the encoded ques-
H
ple, a person interacting with an object, or the dynamics of tions on F3.5D via a multi-headed cross-attention Trans-
objects in Gd . Further, the kernel is computed on a union of former, followed by average pooling to produce question-
the graph nodes in Gs0 and Gd , and thus directly captures the Q
conditioned features F3.5D . In this case, the source to the
interactions between the static and dynamic graphs. H
cross-Transformer is the set of F3.5D features, while the tar-
Hierarchical (2.5+1)D-Transformer: Note that our get sequence corresponds to the self-attended question em-
(2.5+1)D-Transformer in (5) captures the spatio-temporal beddings.
features at a single granularity as defined by σS and σT .
However, we can improve this representation towards a Training Losses: To predict an answer Apred for a given
hierarchical abstraction of the scene graph at multiple gran- video S and a question Q, we use the question-conditioned
(2.5+1)D-Transformer features produced in the previous
ularities. Let σSj , σTj , j = 1, . . . , η be a set of scale, and let
step and compute its similarities with the set of candidate
MLPj , j = 1, . . . , η be a series of multilayer perceptrons,
answers. Specifically, the predicted answer is defined as
then combining (3) and (5), we define our hierarchical Q >
(2.5+1)D-Transformer producing features F3.5D H
as: Apred = softmax(F3.5D λ(A)), where λ(A) represents
embeddings of candidate answers. For training the model,
we use cross-entropy loss between Apred and the ground
η
X k truth answer Agt . Empirically, we find that rather than com-
H
F3.5D = MLP || softmax K(V 0, V 0 |σSj ,σTj )VFi . puting the cross-entropy loss against one of ` answers, if we
j i=1
j=1 compute the loss against b×` answers produced via concate-
(6) nating all answers in a batch, that produced better gradients
and training. Such a concatenation is usually possible as the
In words, (6) computes spatial-temporal kernels at vari- text answers for various questions are often different.
ous bandwidths and merges the respective scene graph node
features and embeds them into a hierarchical representation
space via the MLPs. In practice, we find that it is useful to
combine the kernel similarity in (6) with the feature similar- Experiments
ity in (3) and add (6) with (3) (after an MLP) to produce the
final graph features. Figure 2 shows the architecture of the In this section, we provide experiments demonstrating the
proposed Transformer. empirical benefits of our proposed representation and infer-
Question Conditioning: For the video QA task, we first ence pipeline. We first review the datasets used in our ex-
use a standard Transformer architecture in (3) to produce periments, following which we describe in detail our setup,
multi-headed self-attention on the embeddings of the given before presenting our numerical and qualitative results.
Datasets Text Features: For the NExT-QA dataset, we use the pro-
We used two recent video QA datasets for evaluating our vided BERT features for every question embedding. These
task, namely NExT-QA (Xiao et al. 2021) and AVSD- are 768D features, which we project into 256D latent space
QA (Alamri et al. 2019a). to be combined with our visual features. Each candidate an-
NExT-QA Dataset is a very recent video question answer- swer is concatenated with the question, and BERT features
ing dataset that goes beyond traditional VQA tasks, and in- are computed before matching them with the visual features
corporates a significant number of why and how questions, for selecting the answer. For NExT-QA, we also augment the
that often demand higher level abstractions and semantic BERT features with the recent CLIP features (Radford et al.
reasoning about the videos. The dataset consists of 3,870 2021) that are known to have better vision-language align-
training, 570 validation, and 1,000 test videos. The dataset ment. For AVSD-QA, we used the provided implementation
provides 34,132, 4,996, and 8,564 multiple choice questions to encode the question and the answers using an LSTM into
in the training, validation, and test sets respectively, and the a 128D feature space. We used the same LSTM to encode
task is to select one of the five candidate answers. As the the dialog history and the caption features; these features are
test video labels are with-held for online evaluation, we re- then combined with the visual features using multi-headed
port performances on the validation set in our experiments. shuffled Transformers as suggested in (Geng et al. 2021).
We use the code provided by the authors of (Xiao et al. 2021) Evaluation Protocol: We used the classification accuracy
for our experiments, which we modified to incorporate our on NExT-QA, while we use mean retrieval rank on the
Transformer pipeline. AVSD-QA dataset; the latter measure ranks the correct an-
AVSD-QA Dataset is a variant of the Audio-Visual Scene swer among the selections made by an algorithm and reports
Aware Dialog dataset (Alamri et al. 2019b), repurposed for the mean rank over the test set. Thus, a lower mean rank sug-
the QA task. The dataset consists of not only QA pairs, but gests better performance.
also provides a human generated conversation history, and Training Details: We use an Adam optimizer for training
captions for each video. In the QA version of this dataset, both the models. For NExT-QA, we used a learning rate of
the task is to use the last question from dialog history about 5e-5 as suggested in the paper with a batch size of 64 and
the video to select an answer from one of a hundred candi- trained for 50 epochs, while AVSD-QA used a learning rate
date answers. The dataset consists of 11,816 video clips and of 1e-3 and a batch size of 100, and trained for 20 epochs.
118,160 QA pairs, of which we follow the standard splits Hyperparameters: There are two key hyperparameters in
to use 7,985, 1,863, and 1,968 clips for training, validation, our model, namely (i) the number of spatial abstraction
and test. We report the performances on the test set. For this levels in the hierarchical Transformer, and (ii) the band-
dataset, we used an implementation that is shared by the au- widths for the spatio-temporal kernels. We found that for
thors of (Geng et al. 2021) and incorporated our modules. the NExT-QA dataset, a four layer hierarchy with σS ∈
{0.01, 0.1, 1, 10} showed the best results, while for AVSD-
Experimental Setup and Training QA, we used σS ∈ {1, 10}. As for the temporal scale, we di-
vided the frame index tv by the maximum number of video
Visual Features: As mentioned earlier, we use public imple-
frames in the dataset (making the temporal span of the video
mentations for constructing our scene graphs and (2.5+1)D
to be in the unit interval), and used σT = σS . We found that
graphs, these steps being done offline. Specifically, the video
using a larger number of hierarchical levels did not change
frames are sub-sampled at fixed 0.5 fps for constructing the
the performance for NExT-QA, while it showed slightly in-
scene graphs using Faster RCNN, and each frame is pro-
ferior performance on AVSD-QA. For the Transformer, we
cessed by a MiDAS pre-trained model2 for computing the
used a 4-headed attention for NExT-QA, and a 2-headed at-
RGBD images. The FRCNN and depth images are then
tention for AVSD-QA.
combined in a pre-processing stage for pruning the scene
graph nodes as described earlier. Out of 1600 object classes
Results
in the Visual Genome dataset, we classified 1128 of the
classes as dynamic and used those for constructing the dy- In this section, we provide numerical results of our approach
namic scene graph. Next, we used the I3D action recogni- against state of the art, as well as analyze the contribution of
tion model (Carreira and Zisserman 2017) to extract motion each component in our setup.
features from the dynamic graph nodes. For this model, we State-of-the-art Comparisons: In Tables 1 and 2, we com-
used the videos at their original frame rate, but averaged the pare the performance of our full (2.5+1)D-Transformer
spatio-temporal volumes via conditioning on the pruned FR- pipeline against recent state-of-the-art methods. Notably,
CNN bounding boxes for every dynamic object anchored at on NExT-QA we compare with methods that use spatio-
the frame corresponding to the frame rate used in the ob- temporal models for VQA such as spatio-temporal reason-
ject detection model. This setup produced 2048D features ing (Jang et al. 2019), graph alignment (Jiang and Han
for the static graph nodes and (2048+1024)D features for 2020), hierarchical relation models (Le et al. 2020), against
the dynamic graph nodes. These features are then separately which our proposed model shows a significant ∼4% im-
projected into a latent space of 256 for NExT-QA and 128 provement, clearly showing benefits. On AVSD-QA, as pro-
dimensions for AVSD-QA datasets on which the Transform- vided in Table 2, we compare against the state of the art
ers operate. STSGR model (Geng et al. 2021), as well as older multi-
modal Transformers (Le et al. 2019), outperforming them in
2
https://pytorch.org/hub/intelisl midas v2/ the mean rank of the retrieved answer. We found that when
Table 1: NExT-QA: Comparisons to the state of the art. Table 3: Ablation study on NExT-QA and AVSD-QA. Be-
Results for the various competitive methods are taken low, Txr is the standard Transformer and I3D+FRCNN is
from (Xiao et al. 2021). the averaged I3D and FRCNN features per frame (no graph),
Method Accuracy (%)↑ V(2+1)D Txr: without depth.
NExT-QA AVSD-QA
Spatio-Temporal VQA (Jang et al. 2019) 47.94
Co-Memory-QA (Gao et al. 2018) 48.04 Method Acc (%)↑ mean rank↓
Hier. Relation n/w (Le et al. 2020) 48.20 No dynamic graph 52.49 5.97
Multi-modal Attn VQA (Fan et al. 2019) 48.72 No static graph 53.00 6.03
graph-alignment VQA (Jiang and Han 2020) 49.74 No I3D 52.65 6.09
No hier. kernel 52.90 5.97
(2.5+1)D-Transformer (ours) 53.40 No ans. augment (AA) 49.98 5.92
No question condition (QC) 50.39 5.96
Table 2: AVSD-QA: Comparisons to the state of the art. The Full Model 53.40 5.84
prior results are taken from (Geng et al. 2021).
Method Mean Rank ↓ Table 4: Ablation study on NExT-QA by removing modules
in our pipeline.
Question Only (Alamri et al. 2019a) 7.63
Multimodal Transformers (Hori et al. 2019) 7.23 # Ablation Accuracy (%)↑
Question + Video (Alamri et al. 2019a) 6.86 1 Txr + I3D + FRCNN + QC 47.90
MTN (Le et al. 2019) 6.85 2 (1) + AA 49.80
ST Scene Graphs (Geng et al. 2021) 5.91 3 Txr + V(2+1)D Txr + AA + QC 52.40
4 Txr + V(2.5+1)D Txr + AA + QC 53.40
(2.5+1)D-Transformer (ours) 5.84 5 (4) using all nodes (no pruning) 53.50

Table 5: Ablation study on NExT-QA using different spatio-

our AVSD-QA model is combined with other text cues (such temporal hierarchies defined by the kernel bandwidth σ.
as dialog history and captions), the mean rank improves to Hier. levels bandwidths σ Accuracy
nearly 1.4, suggesting a significant bias between the ques- 1-level 0.01 52.13
2-levels {0.01, 0.1} 52.58
tions and the text-cues. Thus, we restrict our analysis only 3-levels {0.01, 0.1, 1.0} 52.97
to using the visual features. 4-levels {0.01, 0.1, 1.0, 10.0} 53.20
5-levels {0.01, 0.1, 1.0, 10, 20.0} 53.00
Ablation Studies
In Table 3, we provide an ablation study on the importance Table 6: Computational benefits of the proposed approach.
of each component in our setup on both the datasets. Our The numbers indicate the average number of graph nodes
results show that without the static or the dynamic sub- per video sequence in each dataset.
graphs, the performance drops. Without I3D features, the AVSD-QA NExT-QA
performance drops significantly for both the datasets, un-
Full graph 502.43 656.30
derlining the importance of motion features in the graph
Static graph 97.26 68.68
pipeline. We find that without the hierarchical Transformer,
Dynamic graph 136.10 430.83
the performance drops from 53.4→52.9 on NExT-QA, and
5.84→5.97 on AVSD-QA. Further, the trick of using aug- % node reduction 53.6 23.9
mented answers in the learning process as described in the
section on Training Losses seems to help improve the train-
ing of the models. We also evaluate the importance of ques-
tion conditioning, which appears to contribute to the final Computational Benefits: In Table 4 last row, we further
performance. As our proposed pipeline is sequential in na- show ablations on NExT-QA dataset, when the full set of
ture, we may also study the performance via removing indi- graph nodes are used for inference. As expected in this case,
vidual modules from the pipeline. In Table 4 rows 1-4, we the performance improves mildly, however our experiments
show the results of this experiment. Our results show that our show that the time taken for every training iteration in this
proposed Transformer module leads to nearly 3% improve- case slows down 4-fold (from ∼1.5 s per iteration to ∼6 s on
ment, and using the pseudo-depth improves by a further 1%. a single RTX6000 GPU). In Table 6, we compare the number
In Table 5, we ablate on the performance of the hierarchy of nodes in the static and dynamic graphs, and compare it to
in the (2.5+1)D-Transformer. Specifically, we show results the total number of nodes in the unpruned graph for both the
for various bandwidths and their combinations on the NExT- datasets. As the results show, our method prunes nearly 54%
QA dataset. The results suggest that including more band- of graph nodes on AVSD-QA dataset and 24% on NExT-
widths (hierarchies) leads to better modeling of the video QA. We believe the higher pruning ratio for AVSD-QA is
scene and better performance. We use only the (2.5+1)D- perhaps due to the fact that most of its videos do not contain
Transformer for this experiment without using the combi- shot-switches and use a stationary camera, which is not the
nation with the standard Transformer to clearly separate out case with NExT-QA.
the benefits. Question-Category Level Performance: In Table 7, we
Table 7: Comparison of our answer selection on various categories in NExT-QA dataset against other recent methods. The
numbers for competing methods are taken from (Xiao et al. 2021).
Method Why (W) How (H) Avg. (W+H) Prev&Next (P&N) Present (P) Avg. (P&N+P) Count (C) Location (L) Other (O) Avg. (C+L+O) Overall
STVQA, IJCV’19 45.37 43.05 44.76 47.52 51.73 49.26 43.50 65.42 53.77 55.86 47.94
CoMem, CVPR’18 46.15 42.61 45.22 48.16 50.38 49.07 41.81 67.12 51.80 55.34 48.04
HCRN, CVPR’20 46.99 42.90 45.91 48.16 50.83 49.26 40.68 65.42 49.84 53.67 48.20
HME, CVPR’19 46.52 45.24 46.18 47.52 49.17 48.20 45.20 73.56 51.15 58.30 48.72
HGA, AAAI’20 46.99 44.22 46.26 49.53 52.49 50.74 44.07 72.54 55.41 59.33 49.74
Ours 52.39 48.36 51.33 50.91 54.28 52.30 46.02 77.08 58.31 62.58 53.4
% improvement +5.4 +3.12 +5.07 +1.38 +1.79 +1.56 +0.82 +3.52 +2.91 +3.25 +3.66

First frame 3D rendered frame Static scene graph Dynamic scene graph

Figure 3: An example illustration of (2.5+1)D scene graphs produced by our method. The figure shows a video frame from the
NExT-QA dataset, its pseduo-3D rendering, and the (2.5+1)D static and dynamic graphs computed on all frames of the video.
What does the man in grey do after sitting down int the middle?
A1: talking on phone
A2: take the pipe GT: smell burger
A3: smiling Ours: smell burger
A4: smell burger HGA: take the pipe
A5: cross his legs
Where is the baby while him was fed milk?
A1: mobile
A2: in lady’s arm GT: in lady’s arm
A3: pillow Ours: in lady’s arm
A4: baby trolley HGA: living room
A5: living room

why did he get up ?

GT answer: the man got up to start cleaning the plate
Our answer: he stands up so he can go over to the stove
Our rank = 4 STSGR rank = 20

Does she pick anything up from off the couch ?

GT answer: yes , she folds clothes that are on the couch
Our answer: yes , she folds clothes that are on the couch
Our rank = 1 STSGR rank = 11

Figure 4: Qualitative responses: First two rows show the results on the NExT-QA dataset and the last two on the AVSD-QA
dataset. We compare our results to those produced by HGA (Fan et al. 2019) on NExT-QA, and STSGR (Geng et al. 2021) on
AVSD-QA datasets.

compare the performance of our (2.5+1)D approach on var- Figures 5, 6, 7, and 8 for more results.
ious question categories in the NExT-QA dataset. Specifi-
cally, the dataset categorizes its questions into 7 reasoning Conclusions
classes: (i) why, (ii) how, (iii) previous&next, (iv) present In this paper, we presented a novel (2.5+1)D representa-
(v) counting, (vi) spatial location related, and (vii) all other tion for the task of video question answering. We use 2.5D
questions. From Table 7, we see that our proposed repre- pseudo-depth of scene objects to be disentangled in 3D
sentation fares well in all the categories against the state of space, allowing the pruning of redundant detections. Using
the art. More interestingly, our method works significantly the 3D setup, we further disentangled the scene into a set
outperforms to the next best scheme HGA (Jiang and Han of dynamic objects that interact within themselves or with
2020), by more than 5% on why-related questions and 3.5% the environment (defined by static nodes); such interactions
on location-related questions, perhaps due to better spatio- are characterized in a latent space via spatio-temporal hierar-
temporal localization of the objects in the scenes as well as chical transformers, that produce varied abstractions of the
the spatio-temporal reasoning. scene at different scales. Such abstractions are then com-
Qualitative Results: Figure 3 gives an example static- bined with text queries in the video QA task to select an-
dynamic scene graph pair on a scene from NExT-QA. In swers. Our experiments demonstrate state-of-the-art results
Figure 4, we present qualitative QA results and compare on two recent and challenging real-world datasets.
against the responses produced by two recent methods. See
1021_5061117640 Rendered first frame
RGB first frame Depth first frame

Static Graph Dynamic Graph

1006_11871253306 Rendered first frame

RGB first frame Depth first frame

Dynamic Graph
Static Graph

Figure 5: RGB images, depth images produced using (Ranftl et al. 2019), the depth rendered RGB images, the static scene
graphs, and the dynamic scene graphs for the respective videos. See below for the questions and answers for these videos.
1015_13884293626 Rendered first frame
RGB first frame Depth first frame

Dynamic Graph
Static Graph

1023_10109097475 Rendered first frame

RGB first frame Depth first frame

Static Graph Dynamic Graph

Figure 6: RGB images, depth images produced using (Ranftl et al. 2019), the depth rendered RGB images, the static scene
graphs, and the dynamic scene graphs for the respective videos. See below for the questions and answers for these videos.
1. What does the man in grey do after sitting down int the middle?
4991_5061117640

A1: talking on phone

A2: take the pipe
A3: smiling GT: smell burger
A4: smell burger Ours: smell burger
A5: cross his legs HGA: take the pipe

2. How did the men make the game more challenging for the children?
4970_ 11871253306

A1: cry
A2: punch air
A3: hold the boy GT: kicked the ball away
A4: touching the controls Ours: kicked the ball away
A5: kicked the ball away HGA: touching the controls

3. What is the possible relationship between the lady in purple

4949_ 13884293626

and man in brown?

A1: husbandwifefriends
A2: nurse GT: teacher student
A3: family member Ours: teacher student
A4: teacher student HGA: family member
A5: husband wife
4. How many planes are involved in the video?
1165_ 10109097475

A1: six
A2: four GT: one
A3: three
Ours: one
A4: one
HGA: three
A5: eight

5. why is the man unable to ride the vehicle?

1111_2976913210

A1: stuck GT: stuck

A2: fell off from the vehicle Ours: stuck
A3: a barrier in front of him HGA: stuck
A4: met with dead end
A5: plastic trapped among the wheels

6. Where is the baby while him was fed milk?

A1: mobile
1059_ 632907781

A2: in lady’s arm

A3: pillow GT: in lady’s arm
A4: baby trolley Ours: in lady’s arm
A5: living room HGA: living room

7. Where is this video taken?

1046_ 7088595057

A1: playground
A2: stage
GT: paddock
A3: paddock Ours: paddock
A4: forest HGA: construction site
A5: construction site

8. How many elephants are there?

0040_ 2835125654

A1: one
A2: five GT: two
A3: two Ours: three
A4: four HGA: two
A5: three

9. Why did the book drop?

1025_ 6163877860

A1: open the cup

GT: baby kicked it
A2: baby kicked it
A3: the girl in pink slipped
Ours: baby kicked it
A4: safety
HGA: the girl in pink slipped
A5: lady pushed it too hard

10. What does the cat do when it first sat up?

A1: pooping
1111_ 8990144134

A2: eat GT: hit the bird

A3: examine the bag Ours: cross the ledge
A4: hit the bird HGA: eat
A5: cross the ledge

Figure 7: Qualitative responses from the NExT-QA dataset for various types of questions.
#1: PVP3C
Q: does it look more like a home or public place ?
GT answer: it looks like a public place , like dorm , apartment
Our answer: it looks like a public place , like dorm , apartment
Our rank = 1 STSGR rank = 4

#2: CWERM
Q: why did he get up ?
GT answer: the man got up to start cleaning the plate
Our answer: he stands up so he can go over to the stove
Our rank = 4 STSGR rank = 20

#3: J7DQX
Q: what is the man carrying ?
GT answer: he has a bottle of water , a cloth and a small
Our answer: he has a bottle of water , a cloth and a small
Our rank = 1 STSGR rank = 22

#4: 008UL
Q: does she pick anything up from off the couch ?
GT answer: yes , she folds clothes that are on the couch
Our answer: yes , she folds clothes that are on the couch
Our rank = 1 STSGR rank = 11

#5: 3ZPGZ
Q: why is he standing at the fridge ?
GT answer: to take pictures of the fridge from various
Our answer: to take pictures of the fridge from various
Our rank = 1 STSGR rank = 13

#6: PHCZQ
Q: where does he grab it from ?
GT answer: the same place he drops the other one off camera
Our best answer: he has it from the beginning of the video
Our correct answer rank = 5 STSGR rank = 9

#7: 4YNPN
Q: can you tell why he stands there ?
GT answer: i cannot tell why he is standing still after closing the door
Our best answer: no , it looks like he spends most of his time trying
to open the one package
Our correct answer rank = 3 STSGR rank = 8
#8: TH0BF
Q: is standing by the shelves in the beginning , or does he walk over
to the shelves ?
GT answer: he is already standing there
Our best answer: he is carrying plates and a mirror and seems to
leave the room
Our correct answer rank = 3 STSGR rank = 8
#9: SHGOT
Q: does she enter from somewhere else or is she there the entire time ?
GT answer: the only person is a young woman
Our answer: she seems to already be there , but doesn 't walk in front of
the camera for a few seconds
Our rank = 8 STSGR rank = 18

#10: HLITK
Q: who is in the room ?
GT answer: there is only one
Our answer: a man , the person holding the broom is a man
Our rank = 25 STSGR rank = 2

Figure 8: Qualitative responses from the AVSD-QA dataset for various types of questions.
References Dornadula, A.; Narcomey, A.; Krishna, R.; Bernstein, M.;
Alamri, H.; Cartillier, V.; Das, A.; Wang, J.; Cherian, A.; and Li, F.-F. 2019. Visual relationships as functions: En-
Essa, I.; Batra, D.; Marks, T. K.; Hori, C.; Anderson, P.; et al. abling few-shot scene graph prediction. In Proceedings of
2019a. Audio visual scene-aware dialog. In Proceedings of the IEEE/CVF International Conference on Computer Vi-
the IEEE/CVF Conference on Computer Vision and Pattern sion Workshops.
Recognition, 7558–7567. Fan, C.; Zhang, X.; Zhang, S.; Wang, W.; Zhang, C.; and
Huang, H. 2019. Heterogeneous memory enhanced mul-
Alamri, H.; Hori, C.; Marks, T. K.; Batra, D.; and Parikh, D.
timodal attention model for video question answering. In
2019b. Audio Visual Scene-aware dialog (AVSD) Track for
Proceedings of the IEEE/CVF conference on computer vi-
Natural Language Generation in DSTC7. In AAAI workshop
sion and pattern recognition, 1999–2007.
on the 7th edition of Dialog System Technology Challenge
(DSTC7). Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; and Tao, D.
2018. Deep ordinal regression network for monocular depth
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.;
estimation. In Proceedings of the IEEE/CVF Conference on
Gould, S.; and Zhang, L. 2018. Bottom-up and top-down at-
Computer Vision and Pattern Recognition, 2002–2011.
tention for image captioning and visual question answering.
In Proceedings of the IEEE/CVF Conference on Computer Gao, J.; Ge, R.; Chen, K.; and Nevatia, R. 2018. Motion-
Vision and Pattern Recognition, 6077–6086. appearance co-memory networks for video question answer-
ing. In Proceedings of the IEEE/CVF Conference on Com-
Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zit- puter Vision and Pattern Recognition, 6576–6585.
nick, C. L.; and Parikh, D. 2015. Vqa: Visual question an-
swering. In Proceedings of the IEEE international confer- Geng, S.; Gao, P.; Chatterjee, M.; Hori, C.; Le Roux, J.;
ence on computer vision, 2425–2433. Zhang, Y.; Li, H.; and Cherian, A. 2021. Dynamic graph rep-
resentation learning for video dialog via multi-modal shuf-
Armeni, I.; He, Z.-Y.; Gwak, J.; Zamir, A. R.; Fischer, M.; fled transformers. In Proceedings of the AAAI Conference
Malik, J.; and Savarese, S. 2019. 3D Scene Graph: A on Artificial Intelligence.
structure for unified semantics, 3d space, and camera. In
Proceedings of the IEEE/CVF International Conference on Ghosh, S.; Burachas, G.; Ray, A.; and Ziskind, A. 2019.
Computer Vision, 5664–5673. Generating natural language explanations for visual ques-
tion answering using scene graphs and visual attention.
Bar, A.; Herzig, R.; Wang, X.; Chechik, G.; Darrell, T.; and arXiv:1902.05715.
Globerson, A. 2020. Compositional video synthesis with
action graphs. arXiv preprint arXiv:2006.15327. Girdhar, R.; Carreira, J.; Doersch, C.; and Zisserman, A.
2019. Video action transformer network. In Proceedings of
Bello, I. 2021. Lambda Networks: Modeling long- the IEEE/CVF Conference on Computer Vision and Pattern
range interactions without attention. arXiv preprint Recognition, 244–253.
arXiv:2102.08602.
Girdhar, R.; and Ramanan, D. 2019. CATER: A diagnostic
Carreira, J.; and Zisserman, A. 2017. Quo vadis, action dataset for Compositional Actions and TEmporal Reason-
recognition? a new model and the kinetics dataset. In Pro- ing. arXiv preprint arXiv:1910.04744.
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 6299–6308. Hartley, R. I.; and Zisserman, A. 2004. Multiple View Ge-
ometry in Computer Vision. Cambridge University Press,
Chatterjee, M.; Le Roux, J.; Ahuja, N.; and Cherian, A. ISBN: 0521540518, second edition.
2021. Visual Scene Graphs for Audio Source Separation.
Herzig, R.; Levi, E.; Xu, H.; Gao, H.; Brosh, E.; Wang, X.;
In Proceedings of the IEEE/CVF International Conference
Globerson, A.; and Darrell, T. 2019. Spatio-temporal action
on Computer Vision, 1204–1213.
graph networks. In Proceedings of the IEEE/CVF Interna-
Chen, L.; Yan, X.; Xiao, J.; Zhang, H.; Pu, S.; and Zhuang, tional Conference on Computer Vision Workshops.
Y. 2020. Counterfactual samples synthesizing for robust vi-
Hori, C.; Alamri, H.; Wang, J.; Wichern, G.; Hori, T.;
sual question answering. In Proceedings of the IEEE/CVF
Cherian, A.; Marks, T. K.; Cartillier, V.; Lopes, R. G.; Das,
Conference on Computer Vision and Pattern Recognition,
A.; et al. 2019. End-to-end audio visual scene-aware dialog
10800–10809.
using multimodal attention-based video features. In ICASSP
Choromanski, K.; Lin, H.; Chen, H.; and Parker-Holder, J. 2019-2019 IEEE International Conference on Acoustics,
2021. Graph Kernel Attention Transformers. arXiv preprint Speech and Signal Processing, 2352–2356.
arXiv:2107.07999. Jang, Y.; Song, Y.; Kim, C. D.; Yu, Y.; Kim, Y.; and Kim, G.
Cong, Y.; Liao, W.; Ackermann, H.; Rosenhahn, B.; and 2019. Video question answering with spatio-temporal rea-
Yang, M. Y. 2021. Spatial-temporal transformer for dynamic soning. International Journal of Computer Vision, 127(10):
scene graph generation. In Proceedings of the IEEE/CVF In- 1385–1412.
ternational Conference on Computer Vision, 16372–16382. Jang, Y.; Song, Y.; Yu, Y.; Kim, Y.; and Kim, G. 2017. Tgif-
Dang, L. H.; Le, T. M.; Le, V.; and Tran, T. 2021. Hierarchi- qa: Toward spatio-temporal reasoning in visual question an-
cal Object-oriented Spatio-Temporal Reasoning for Video swering. In Proceedings of the IEEE/CVF Conference on
Question Answering. arXiv preprint arXiv:2106.13432. Computer Vision and Pattern Recognition, 2758–2766.
Ji, J.; Krishna, R.; Fei-Fei, L.; and Niebles, J. C. 2020. Ac- Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-
tion Genome: Actions as Composition of Spatio-temporal CNN: Towards real-time object detection with region pro-
Scene Graphs. In Proceedings of the IEEE/CVF Conference posal networks. Advances in Neural Information Processing
on Computer Vision and Pattern Recognition. Systems, 28: 91–99.
Jiang, P.; and Han, Y. 2020. Reasoning with heterogeneous Shamsian, A.; Kleinfeld, O.; Globerson, A.; and Chechik, G.
graph alignment for video question answering. In Proceed- 2020. Learning object permanence from video. In European
ings of the AAAI Conference on Artificial Intelligence, vol- Conference on Computer Vision, 35–50. Springer.
ume 34, 11109–11116. Teney, D.; Anderson, P.; He, X.; and Van Den Hengel, A.
Jiang, Z.; Gao, P.; Guo, C.; Zhang, Q.; Xiang, S.; and Pan, 2018. Tips and tricks for visual question answering: Learn-
C. 2019. Video object detection with locally-weighted de- ings from the 2017 challenge. In Proceedings of the IEEE
formable neighbors. In Proceedings of the AAAI Conference Conference on Computer Vision and Pattern Recognition,
on Artificial Intelligence, volume 33, 8529–8536. 4223–4232.
Tsai, Y.-H. H.; Bai, S.; Yamada, M.; Morency, L.-P.; and
Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.;
Salakhutdinov, R. 2019a. Transformer Dissection: A Unified
Bernstein, M.; and Fei-Fei, L. 2015. Image retrieval using
Understanding of Transformer’s Attention via the Lens of
scene graphs. In Proceedings of the IEEE Conference on
Kernel. arXiv preprint arXiv:1908.11775.
Computer Vision and Pattern Recognition, 3668–3678.
Tsai, Y.-H. H.; Divvala, S.; Morency, L.-P.; Salakhutdinov,
Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Car- R.; and Farhadi, A. 2019b. Video relationship reasoning us-
los Niebles, J. 2017. Dense-captioning events in videos. In ing gated spatio-temporal energy graph. In Proceedings of
Proceedings of the IEEE/CVF Conference on Computer Vi- the IEEE/CVF Conference on Computer Vision and Pattern
sion and Pattern Recognition, 706–715. Recognition, 10424–10433.
Le, H.; Sahoo, D.; Chen, N. F.; and Hoi, S. C. 2019. Multi- Tung, H.-Y. F.; Xian, Z.; Prabhudesai, M.; Lal, S.; and
modal transformer networks for end-to-end video-grounded Fragkiadaki, K. 2020. 3D-OES: Viewpoint-invariant
dialogue systems. arXiv preprint arXiv:1907.01166. object-factorized environment simulators. arXiv preprint
Le, T. M.; Le, V.; Venkatesh, S.; and Tran, T. 2020. Hi- arXiv:2011.06464.
erarchical conditional relation networks for video question Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
answering. In Proceedings of the IEEE/CVF Conference on L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
Computer Vision and Pattern Recognition, 9972–9981. tention is all you need. In Advances in Neural Information
Li, H.; Gordon, A.; Zhao, H.; Casser, V.; and Angelova, A. Processing Systems, 5998–6008.
2020. Unsupervised monocular depth learning in dynamic Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-
scenes. arXiv preprint arXiv:2010.16404. local neural networks. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
Li, L.; Gan, Z.; Cheng, Y.; and Liu, J. 2019. Relation-aware 7794–7803.
Graph Attention Network for Visual Question Answering.
arXiv:1903.12314. Wang, X.; and Gupta, A. 2018. Videos as space-time re-
gion graphs. In Proceedings of the European Conference on
Li, X.; and Jiang, S. 2019. Know more say less: Image cap- Computer Vision, 399–417.
tioning based on scene graphs. IEEE Transactions on Mul- Wu, Q.; Teney, D.; Wang, P.; Shen, C.; Dick, A.; and van den
timedia, 21(8): 2117–2130. Hengel, A. 2017. Visual question answering: A survey of
Pan, B.; Cai, H.; Huang, D.-A.; Lee, K.-H.; Gaidon, A.; methods and datasets. Computer Vision and Image Under-
Adeli, E.; and Niebles, J. C. 2020. Spatio-temporal graph standing, 163: 21–40.
for video captioning with knowledge distillation. In Pro- Wu, S.-C.; Wald, J.; Tateno, K.; Navab, N.; and Tombari,
ceedings of the IEEE/CVF Conference on Computer Vision F. 2021. SceneGraphFusion: Incremental 3D Scene Graph
and Pattern Recognition, 10870–10879. Prediction from RGB-D Sequences. In Proceedings of
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; the IEEE/CVF Conference on Computer Vision and Pattern
Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Recognition, 7515–7525.
et al. 2021. Learning transferable visual models from natural Xiao, J.; Shang, X.; Yao, A.; and Chua, T.-S. 2021. NExT-
language supervision. arXiv preprint arXiv:2103.00020. QA: Next Phase of Question-Answering to Explaining Tem-
Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; and poral Actions. In Proceedings of the IEEE/CVF Conference
Koltun, V. 2019. Towards robust monocular depth esti- on Computer Vision and Pattern Recognition, 9777–9786.
mation: Mixing datasets for zero-shot cross-dataset transfer. Zhang, C.; Yu, J.; Song, Y.; and Cai, W. 2021. Exploiting
arXiv preprint arXiv:1907.01341. Edge-Oriented Reasoning for 3D Point-based Scene Graph
Analysis. In Proceedings of the IEEE/CVF Conference on
Rashid, M.; Kjellstrom, H.; and Lee, Y. J. 2020. Action
Computer Vision and Pattern Recognition, 9705–9715.
graphs: Weakly-supervised action localization with graph
convolution networks. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision,
615–624.