
VideoBERT: A Joint Model for Video and Language Representation Learning

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid

Google Research
arXiv:1904.01766v1 [cs.CV] 3 Apr 2019

[Figure 1: example input text ("Season the steak with salt and pepper." / "Carefully place the steak to the pan." / "Flip the steak to the other side." / "Now let it rest and enjoy the delicious steak.") is fed to VideoBERT to produce output video tokens (above), and an input video token is fed to VideoBERT to produce predicted video futures (below).]

Figure 1: VideoBERT text-to-video generation and future forecasting. (Above) Given some recipe text divided into sentences, y = y_{1:T}, we generate a sequence of video tokens x = x_{1:T} by computing x*_t = argmax_k p(x_t = k | y) using VideoBERT. (Below) Given a video token, we show the top three future tokens forecasted by VideoBERT at different time scales. In this case, VideoBERT predicts that a bowl of flour and cocoa powder may be baked in an oven, and may become a brownie or cupcake. We visualize video tokens using the images from the training set closest to centroids in feature space.

Abstract

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.

1. Introduction

Deep learning can benefit a lot from labeled data [23], but this is hard to acquire at scale. Consequently there has been a lot of recent interest in “self supervised learning”, where we train a model on various “proxy tasks”, which we hope will result in the discovery of features or representations that can be used in downstream tasks (see e.g., [22]). A wide variety of such proxy tasks have been proposed in the image and video domains. However, most of these methods focus on low level features (e.g., textures) and short temporal scales (e.g., motion patterns that last a second or less). We are interested in discovering high-level semantic features which correspond to actions and events that unfold over longer time scales (e.g. minutes), since such representations would be useful for various video understanding tasks.

In this paper, we exploit the key insight that human language has evolved words to describe high-level objects and events, and thus provides a natural source of “self” supervision.

[Figure 2: example input text ("Cut the cabbage into pieces." / "Put cabbage in the wok and stir fry." / "Add soy sauce and then keep stir frying." / "Put on a plate the dish is now ready to be served.") is fed to VideoBERT to produce output video tokens (above), and an input video token is fed to VideoBERT to produce predicted video futures (below).]

Figure 2: Additional text-to-video generation and future forecasting examples from VideoBERT, see Figure 1 for details.
(Above) VideoBERT predicts video tokens given text from steps of a stir-fry recipe. (Below) Given an input video token
with raw eggs and flour in a bowl, VideoBERT predicts that the next high level steps are likely to be whisking, or adding the
mixture in a pan for different dishes.

In particular, we present a simple way to model the relationship between the visual domain and the linguistic domain by combining three off-the-shelf methods: an automatic speech recognition (ASR) system to convert speech into text; vector quantization (VQ) applied to low-level spatio-temporal visual features derived from pretrained video classification models; and the recently proposed BERT model [6] for learning joint distributions over sequences of discrete tokens.

More precisely, our approach is to apply BERT to learn a model of the form p(x, y), where x is a sequence of “visual words”, and y is a sequence of spoken words. Given such a joint model, we can easily tackle a variety of interesting tasks. For example, we can perform text-to-video prediction, which can be used to automatically illustrate a set of instructions (such as a recipe), as shown in the top examples of Figures 1 and 2. We can also perform the more traditional video-to-text task of dense video captioning [10] as shown in Figure 6. In Section 4.6, we show that our approach to video captioning outperforms the previous state-of-the-art approach of [39] on the YouCook II dataset by a large margin, increasing the BLEU-4 score from 1.42 to 4.52.

We can also use our model in a “unimodal” fashion. For example, the implied marginal distribution p(x) is a language model for visual words, which we can use for long-range forecasting. This is illustrated in the bottom examples of Figures 1 and 2. Of course, there is uncertainty about the future, but the model can generate plausible guesses at a much higher level of abstraction than other deep generative models for video, such as those based on VAEs or GANs (see e.g., [4, 5, 12, 26]), which tend to predict small changes to low level aspects of the scene, such as the location or pose of a small number of objects.

In summary, our main contribution in this paper is a simple way to learn high level video representations that capture semantically meaningful and temporally long-range structure. The remainder of this paper describes this contribution in detail. In particular, Section 2 briefly reviews related work; Section 3 describes how we adapt the recent progress in natural language modeling to the video domain; Section 4 presents results on activity recognition and video captioning tasks; and Section 5 concludes.

2. Related Work

Supervised learning. Some of the most successful approaches for video representation learning have leveraged large labeled datasets (e.g., [9, 18, 36, 7]) to train convolutional neural networks for video classification. However, it is very expensive to collect such labeled data, and the corresponding label vocabularies are often small and not capable of representing the nuances of many kinds of actions (e.g., “sipping” is slightly different than “drinking”, which is slightly different than “gulping”). In addition, these approaches are designed for representing short video clips, typically a few seconds long. The main difference to our work is that we focus on the long-term evolution of events in video, and we do not use manually provided labels.

Unsupervised learning. Recently, a variety of approaches for learning density models from video have been proposed. Some use a single static stochastic variable, which is then “decoded” into a sequence using an RNN, either using a VAE-style loss [31, 35] or a GAN-style loss [30, 16]. More recent work uses temporal stochastic variables, e.g., the SV2P model of [4] and the SVGLP model of [5].
There are also various GAN-based approaches, such as the SAVP approach of [12] and the MoCoGAN approach of [26]. We differ from this work in that we use a Markov Random Field model, without any explicit stochastic latent variables, applied to visual tokens derived from the video. Thus our model is not a generative model of pixels, but it is a generative model of features derived from pixels, which is an approach that has been used in other work (e.g., [29]).

Self-supervised learning. To avoid the difficulties of learning a joint model p(x_{1:T}), it has become popular to learn conditional models of the form p(x_{t+1:T} | x_{1:t}), where we partition the signal into two or more blocks, such as gray scale and color, or previous frame and next frame (e.g., [17]), and try to predict one from the other (see e.g., [22] for an overview). Our approach is similar, except we use quantized visual words instead of pixels. Furthermore, although we learn a set of conditional distributions, our model is a proper joint generative model, as explained in Section 3.

Cross-modal learning. The multi-modal nature of video has also been an extensive source of supervision for learning video representations, which our paper builds on. Since most videos contain synchronized audio and visual signals, the two modalities can supervise each other to learn strong self-supervised video representations, as shown in [3, 19, 20]. In this work, we use speech (provided by ASR) rather than low-level sounds as a source of cross-modal supervision.

Natural language models. We build upon recent progress in the NLP community, where large-scale language models such as ELMO [21] and BERT [6] have shown state-of-the-art results for various NLP tasks, both at the word level (e.g., POS tagging) and sentence level (e.g., semantic classification). In particular, we extend the BERT model to capture structure in both the linguistic and visual domains.

Image and video captioning. There has been much recent work on image captioning (see e.g., [11, 8, 14]), which is a model of the form p(y|x), where y is the manually provided caption and x is the image. There has also been some work on video captioning, using either manually provided temporal segmentation or estimated segmentations (see e.g., [10, 39]). We use our joint p(x, y) model and apply it to video captioning, and achieve state-of-the-art results, as we discuss in Section 4.6.

Instructional videos. Various papers (e.g., [15, 2, 10, 38, 39]) have trained models to analyse instructional videos, such as cooking. We differ from this work in that we do not use any manual labeling, and we learn a large-scale generative model of both words and (discretized) visual signals.

3. Models

In this section, we briefly summarize the BERT model, and then describe how we extend it to jointly model video and language data.

3.1. The BERT model

The BERT model, introduced in [6], can be thought of as a fully connected Markov Random Field (MRF) on a set of discrete tokens, which is trained to approximately maximize the pseudo log-likelihood, as explained in [32]. In more detail, let x = {x_1, ..., x_L} be a set of discrete tokens, x_l ∈ X. We can define a joint probability distribution over this set as follows:

p(x|θ) = (1/Z(θ)) ∏_{l=1}^{L} φ_l(x|θ) ∝ exp( ∑_{l=1}^{L} log φ_l(x|θ) )

where φ_l(x) is the l-th potential function, with parameters θ, and Z is the partition function.

The above model is permutation invariant. In order to capture order information, we can “tag” each word with its position in the sentence. The BERT model learns an embedding for each of the word tokens, as well as for these tags, and then sums the embedding vectors to get a continuous representation for each token. The log potential (energy) functions for each location are defined by

log φ_l(x|θ) = x_l^T f_θ(x_{\l})

where x_l is a one-hot vector for the l-th token (and its tag), and

x_{\l} = (x_1, ..., x_{l−1}, MASK, x_{l+1}, ..., x_L)

The function f(x_{\l}) is a multi-layer bidirectional transformer model [27] that takes an L × D_1 tensor, containing the D_1-dimensional embedding vectors corresponding to x_{\l}, and returns an L × D_2 tensor, where D_2 is the size of the output of each transformer node. See [6] for details. The model is trained to approximately maximize the pseudo log-likelihood

L(θ) = E_{x∼D} ∑_{l=1}^{L} log p(x_l | x_{\l}; θ)

In practice, we can stochastically optimize the logloss (computed from the softmax predicted by the f function) by sampling locations as well as training sentences.

BERT can be extended to model two sentences by concatenating them together. However, we are often not only interested in simply modeling the extended sequence, but rather relationships between the two sentences (e.g., is this a pair of consecutive or randomly selected sentences). BERT accomplishes this by prepending every sequence with a special classification token, [CLS], and by joining sentences with a special separator token, [SEP]. The final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation from which we predict a label for classification tasks, or which may otherwise be ignored.
[Figure 3 diagram: the masked sequence "[CLS] Place the [MASK] in the pan [>] [MASK] ... [SEP]" of text and video tokens is embedded (E_[CLS], E_Place, ..., E_v(·), E_[SEP]) and passed through VideoBERT, which outputs predictions T_1 ... T_14 for the original tokens "[CLS] Place the steak in the pan [>] ... [SEP]".]

Figure 3: Illustration of VideoBERT in the context of a video and text masked token prediction, or cloze, task. This task also
allows for training with text-only and video-only data, and VideoBERT can furthermore be trained using a linguistic-visual
alignment classification objective (not shown here, see text for details).
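
To make the masked-token (cloze) objective of Section 3.1 and the combined text-video sequences of Figure 3 concrete, here is a minimal, self-contained sketch of a single training step. It is not the authors' released implementation: the vocabulary layout, the special-token ids (including a stand-in id for [>]), and the tiny transformer configuration are illustrative assumptions made only for this example; a real setup would reuse a pretrained BERT encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative vocabulary layout (assumption): BERT WordPieces first, then the
# 20,736 "visual words" appended at the end of the lookup table.
TEXT_VOCAB = 30000
NUM_VISUAL = 20736
PAD, CLS, SEP, MASK, VIDSEP = 0, 101, 102, 103, 104   # VIDSEP stands in for [>]
VOCAB = TEXT_VOCAB + NUM_VISUAL

class TinyVideoBERT(nn.Module):
    """A toy bidirectional transformer f_theta(x_\l) over the unified vocabulary."""
    def __init__(self, d_model=128, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, VOCAB)           # softmax over text + visual words

    def forward(self, ids):                            # ids: (batch, seq_len)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.tok(ids) + self.pos(pos))
        return self.out(h)                             # (batch, seq_len, VOCAB)

# A combined sentence: [CLS] text tokens [>] visual tokens [SEP], with some
# positions replaced by [MASK]; the targets hold the original tokens there
# and -100 (ignored) everywhere else.
tokens  = torch.tensor([[CLS, 2001, 2002, MASK, VIDSEP, 30000 + 17, MASK, 30000 + 991, SEP]])
targets = torch.tensor([[-100, -100, -100, 2003, -100, -100, 30000 + 58, -100, -100]])

model = TinyVideoBERT()
logits = model(tokens)
# Cross-entropy at the masked positions is a stochastic estimate of the
# negative pseudo-log-likelihood  -sum_l log p(x_l | x_\l).
loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1), ignore_index=-100)
loss.backward()
print(float(loss))
```
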

In addition to differentiating sentences with the [SEP] token, BERT also optionally tags each token by the sentence it comes from. The corresponding joint model can be written as p(x, y, c), where x is the first sentence, y is the second, and c = {0, 1} is a label indicating whether the sentences were separate or consecutive in the source document, respectively.

For consistency with the original paper, we also add a [SEP] token to the end of the sequence, even though it is not strictly needed. So, a typical masked-out training sentence pair may look like this: [CLS] let’s make a traditional [MASK] cuisine [SEP] orange chicken with [MASK] sauce [SEP]. The corresponding class label in this case would be c = 1, indicating that x and y are consecutive.

3.2. The VideoBERT model

To extend BERT to video, in such a way that we may still leverage pretrained language models and scalable implementations for inference and learning, we decided to make minimal changes, and transform the raw visual data into a discrete sequence of tokens. To this end, we propose to generate a sequence of “visual words” by applying hierarchical vector quantization to features derived from the video using a pretrained model. See Section 4.2 for details. Besides its simplicity, this approach encourages the model to focus on high level semantics and longer-range temporal dynamics in the video. This is in contrast to most existing self-supervised approaches to video representation learning, which learn low-level properties such as local textures and motions, as discussed in Section 2.

We can combine the linguistic sentence (derived from the video using ASR) with the visual sentence to generate data such as this: [CLS] orange chicken with [MASK] sauce [>] v01 [MASK] v08 v72 [SEP], where v01 and v08 are visual tokens, and [>] is a special token we introduce to combine text and video sentences. See Figure 3 for an illustration.

While this cloze task extends naturally to sequences of linguistic and visual tokens, applying a next sentence prediction task, as used by BERT, is less straightforward. We propose a linguistic-visual alignment task, where we use the final hidden state of the [CLS] token to predict whether the linguistic sentence is temporally aligned with the visual sentence. Note that this is a noisy indicator of semantic relatedness, since even in instructional videos, the speaker may be referring to something that is not visually present.

To combat this, we first randomly concatenate neighboring sentences into a single long sentence, to allow the model to learn semantic correspondence even if the two are not well aligned temporally. Second, since the pace of state transitions for even the same action can vary greatly between different videos, we randomly pick a subsampling rate of 1 to 5 steps for the video tokens. This not only helps the model be more robust to variations in video speeds, but also allows the model to capture temporal dynamics over greater time horizons and learn longer-term state transitions. We leave investigation into other ways of combining video and text to future work.

Overall, we have three training regimes corresponding to the different input data modalities: text-only, video-only and video-text. For text-only and video-only, the standard mask-completion objectives are used for training the model. For text-video, we use the linguistic-visual alignment classification objective described above. The overall training objective is a weighted sum of the individual objectives. The text objective forces VideoBERT to do well at language modeling; the video objective forces it to learn a “language model for video”, which can be used for learning dynamics and forecasting; and the text-video objective forces it to learn a correspondence between the two domains.
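
As a concrete illustration of this weighted overall objective, a minimal sketch is given below; the weights and the loss callables are placeholders, since the paper does not spell out these hyperparameters.

```python
# Illustrative combination of the three VideoBERT training objectives.
# `mask_lm_loss` and `alignment_loss` are placeholders for a masked-token
# cross-entropy and a binary [CLS] alignment classifier, respectively.

def videobert_training_loss(text_batch, video_batch, video_text_batch,
                            mask_lm_loss, alignment_loss,
                            w_text=1.0, w_video=1.0, w_cross=1.0):
    """Weighted sum of the text-only, video-only and video-text objectives."""
    loss_text = mask_lm_loss(text_batch)       # cloze over WordPieces
    loss_video = mask_lm_loss(video_batch)     # cloze over visual words
    # Video-text pairs carry a 0/1 label: is the ASR sentence temporally
    # aligned with the visual sentence?
    loss_cross = alignment_loss(video_text_batch)
    return w_text * loss_text + w_video * loss_video + w_cross * loss_cross

# Toy usage with constant stand-in losses, just to show the bookkeeping.
total = videobert_training_loss(
    text_batch=None, video_batch=None, video_text_batch=None,
    mask_lm_loss=lambda batch: 2.3,
    alignment_loss=lambda batch: 0.7)
print(total)   # ≈ 5.3
```
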
Once we have trained the model, we can use it in a variety of downstream tasks, and in this work we quantitatively evaluate two applications. In the first application, we treat it as a probabilistic model, and ask it to predict or impute the symbols that have been MASKed out. We illustrate this in Section 4.4, where we perform “zero-shot” classification. In the second application, we extract the predicted representation (derived from the internal activations of the model) for the [CLS] token, and use that dense vector as a representation of the entire input. This can be combined with other features derived from the input to be used in a downstream supervised learning task. We demonstrate this in Section 4.6, where we perform video captioning.

4. Experiments and Analysis

In this section we describe our experimental setup, and show quantitative and qualitative results.

4.1. Dataset

Deep learning models, in both language and vision domains, have consistently demonstrated dramatic gains in performance with increasingly large datasets. For example, the “large” BERT model (which we use) was pretrained on the concatenation of the BooksCorpus (800M words) and English Wikipedia (2,500M words).

Therefore, we would like to train VideoBERT with a comparably large-scale video dataset. Since we are interested in the connection between language and vision, we would like to find videos where the spoken words are more likely to refer to visual content. Intuitively, this is often the case for instructional videos, and we focus on cooking videos specifically, since it is a well studied domain with existing annotated datasets available for evaluation. Unfortunately, such datasets are relatively small, so we turn to YouTube to collect a large-scale video dataset for training.

We extract a set of publicly available cooking videos from YouTube using the YouTube video annotation system to retrieve videos with topics related to “cooking” and “recipe”. We also filter videos by their duration, removing videos longer than 15 minutes, resulting in a set of 312K videos. The total duration of this dataset is 23,186 hours, or roughly 966 days. For reference, this is more than two orders of magnitude larger than the next largest cooking video dataset, YouCook II, which consists of 2K videos with a total duration of 176 hours [38].

To obtain text from the videos, we utilize YouTube’s automatic speech recognition (ASR) toolkit provided by the YouTube Data API [1] to retrieve timestamped speech information. The API returns word sequences and the predicted language type. Among the 312K videos, 180K have ASR that can be retrieved by the API, and 120K of these are predicted to be in English. In our experiments, while we use all videos for the video-only objective, we only use text from English ASR for VideoBERT’s text-only and video-text objectives.

We evaluate VideoBERT on the YouCook II dataset [38], which contains 2000 YouTube videos averaging 5.26 minutes in duration, for a total of 176 hours. The videos have manually annotated segmentation boundaries and captions. On average there are 7.7 segments per video, and 8.8 words per caption. We use the provided dataset split, with 1333 videos for training and 457 for validation. Note that by the time we downloaded the videos, roughly 9% had been deleted in each split, so we exclude them from the dataset. To avoid potential bias during pretraining, we also remove any videos which appear in YouCook II from our pretraining set.

4.2. Video and Language Preprocessing

For each input video, we sample frames at 20 fps, and create clips from 30-frame (1.5 seconds) non-overlapping windows over the video. For each 30-frame clip, we apply a pretrained video ConvNet to extract its features. In this work, we use S3D [34], which adds separable temporal convolutions to an Inception network [24] backbone. We take the feature activations before the final linear classifier and apply 3D average pooling to obtain a 1024-dimension feature vector. We pretrain the S3D network on the Kinetics [9] dataset, which covers a wide spectrum of actions from YouTube videos, and serves as a generic representation for each individual clip.

We tokenize the visual features using hierarchical k-means. We adjust the number of hierarchy levels d and the number of clusters per level k by visually inspecting the coherence and representativeness of the clusters. We set d = 4 and k = 12, which yields 12^4 = 20736 clusters in total. Figure 4 illustrates the result of this “vector quantization” process.

For each ASR word sequence, we break the stream of words into sentences by adding punctuation using an off-the-shelf LSTM-based language model. For each sentence, we follow the standard text preprocessing steps from BERT [6] and tokenize the text into WordPieces [33]. We use the same vocabulary provided by the authors of BERT, which contains 30,000 tokens.

Unlike language, which can be naturally broken into sentences, it is unclear how to break videos into semantically coherent segments. We use a simple heuristic to address this problem: when an ASR sentence is available, it is associated with starting and ending timestamps, and we treat video tokens that fall into that time period as a segment. When ASR is not available, we simply treat 16 tokens as a segment.

4.3. Model Pre-training

We initialize the BERT weights with a model checkpoint that was pretrained on text data. Specifically, we use the BERT_LARGE model released by the authors of [6], using the same backbone architecture: it has 24 layers of Transformer blocks, where each block has 1024 hidden units and 16 self-attention heads.

Figure 4: Examples of video sentence pairs from the pretraining videos. We quantize each video segment into a token, and then represent it by the corresponding visual centroid. For each row, we show the original frames (left) and visual centroids (right). We can see that the tokenization process preserves semantic information rather than low-level visual appearance.
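
As an illustration of the hierarchical k-means tokenization described in Section 4.2, here is a small self-contained sketch. It is a reimplementation for exposition only: random toy features stand in for the 1024-dimensional S3D clip vectors, and the depth d = 4 and branching factor k = 12 come from the text, but nothing else is taken from the authors' pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(feats, k=12, depth=4, seed=0):
    """Recursively cluster clip features; a token is the path of cluster
    indices through the hierarchy, i.e. an integer in [0, k**depth)."""
    tokens = np.zeros(len(feats), dtype=np.int64)
    centroids = np.zeros((len(feats), feats.shape[1]))   # leaf centroid per clip

    def recurse(idx, level):
        if level == depth:
            centroids[idx] = feats[idx].mean(axis=0)
            return
        if len(idx) < k:                                  # too few clips to split:
            tokens[idx] *= k ** (depth - level)           # pad remaining digits with 0
            centroids[idx] = feats[idx].mean(axis=0)
            return
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(feats[idx])
        for c in range(k):
            child = idx[km.labels_ == c]
            if len(child) == 0:
                continue
            tokens[child] = tokens[child] * k + c         # append one base-k digit
            recurse(child, level + 1)

    recurse(np.arange(len(feats)), 0)
    return tokens, centroids

# Toy stand-in for the S3D clip features (real vectors are 1024-d, one per 1.5 s clip;
# a smaller dimensionality is used here only to keep the example fast).
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(2000, 64)).astype(np.float32)
visual_words, _ = hierarchical_kmeans(clip_feats, k=12, depth=4)
print(visual_words[:10], visual_words.max())              # token ids in [0, 12**4)
```
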
We add support for video tokens by appending 20,736
entries to the word embedding lookup table for each of
our new “visual words”. We initialize these entries with
the S3D features from their corresponding cluster centroids.
The input embeddings are frozen during pretraining.
Our model training process largely follows the setup of
BERT: we use 4 Cloud TPUs in the Pod configuration with
a total batch size of 128, and we train the model for 0.5
million iterations, or roughly 8 epochs. We use the Adam
optimizer with an initial learning rate of 1e-5, and a linear
decay learning rate schedule. The training process takes
around 2 days.
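
The embedding-table extension described above can be sketched as follows. The shapes (30,000 WordPieces, 20,736 visual words, 1024-dimensional hidden states) come from the text, while the stand-in tensors, variable names, and the use of PyTorch are assumptions made only for this example.

```python
import torch
import torch.nn as nn

HIDDEN = 1024          # BERT_LARGE hidden size, matching the 1024-d S3D features
TEXT_VOCAB = 30000     # WordPiece vocabulary released with BERT
NUM_VISUAL = 20736     # 12**4 hierarchical k-means centroids

# Pretrained text embeddings (stand-in tensor here) and the visual centroids
# from Section 4.2; in practice these come from the BERT checkpoint and from
# the hierarchical k-means step, respectively.
text_embeddings = torch.randn(TEXT_VOCAB, HIDDEN)
visual_centroids = torch.randn(NUM_VISUAL, HIDDEN)

# Append 20,736 rows to the lookup table, initialized with the S3D centroids,
# and keep the input embeddings frozen during pretraining.
joint_weight = torch.cat([text_embeddings, visual_centroids], dim=0)
embedding = nn.Embedding.from_pretrained(joint_weight, freeze=True)

# Visual word i is then addressed as token id TEXT_VOCAB + i.
ids = torch.tensor([[101, 7975, TEXT_VOCAB + 17, 102]])   # e.g. [CLS] ... v_17 [SEP]
print(embedding(ids).shape)                                # torch.Size([1, 4, 1024])
```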

4.4. Zero-shot action classification


Once pretrained, the VideoBERT model can be used
for “zero-shot” classification on novel datasets, such as
YouCook II. (By “zero-shot” we mean the model is not
trained on YouCook II data nor is the model explicitly
trained with the same label ontology used in YouCook II.)
More precisely, we want to compute p(y|x) where x is
the sequence of visual tokens, and y is a sequence of words.
Since the model is trained to predict sentences, we define
y to be the fixed sentence, “now let me show you how
to [MASK] the [MASK],” and extract the verb and noun
labels from the tokens predicted in the first and second masked slots, respectively. See Figure 5 for some qualitative results.

Figure 5: Using VideoBERT to predict nouns and verbs given a video clip. See text for details. The video clip is first converted into video tokens (two are shown here for each example), and then visualized using their centroids.

For quantitative evaluation, we use the YouCook II dataset. In [37], the authors collected ground truth bounding boxes for the 63 most common objects for the validation set of YouCook II.
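
Before turning to the quantitative results in Table 1, here is a minimal sketch of the zero-shot template-filling procedure described above. The toy vocabulary and the random stand-in for the masked language model are placeholders; only the fixed template and the idea of reading the top verbs and nouns off the two masked slots come from the text.

```python
import numpy as np

# Toy vocabulary and a stand-in masked language model returning random logits;
# in the real setting this would be the pretrained VideoBERT.
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[>]", "now", "let", "me", "show",
         "you", "how", "to", "the", "cut", "fry", "bake", "cabbage", "steak", "bowl"]
TOK = {w: i for i, w in enumerate(VOCAB)}
NUM_VISUAL = 20736                                # visual words follow the text vocab

def dummy_masked_lm(token_ids):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(token_ids), len(VOCAB) + NUM_VISUAL))

def zero_shot_verb_noun(video_tokens, masked_lm=dummy_masked_lm, topk=5):
    """Fill the fixed sentence 'now let me show you how to [MASK] the [MASK]'
    given the video tokens, and read the verb / noun slots off the predictions."""
    text = "now let me show you how to [MASK] the [MASK]".split()
    ids = [TOK["[CLS]"]] + [TOK[w] for w in text] + [TOK["[>]"]] + \
          [len(VOCAB) + v for v in video_tokens] + [TOK["[SEP]"]]
    logits = masked_lm(ids)
    mask_pos = [i for i, t in enumerate(ids) if t == TOK["[MASK]"]]
    verb_slot, noun_slot = mask_pos[0], mask_pos[1]
    top = lambda p: [VOCAB[i] if i < len(VOCAB) else f"v_{i - len(VOCAB)}"
                     for i in np.argsort(-logits[p])[:topk]]
    return top(verb_slot), top(noun_slot)

verbs, nouns = zero_shot_verb_noun(video_tokens=[17, 991, 58])
print("top verbs:", verbs)
print("top nouns:", nouns)
```
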
Method Supervision verb top-1 (%) verb top-5 (%) object top-1 (%) object top-5 (%)
S3D [34] yes 16.1 46.9 13.2 30.9
BERT (language prior) no 0.0 0.0 0.0 0.0
VideoBERT (language prior) no 0.4 6.9 7.7 15.3
VideoBERT (cross modal) no 3.2 43.3 13.1 33.7

Table 1: Action classification performance on YouCook II dataset. See text for details.

Method Data size verb top-1 (%) verb top-5 (%) object top-1 (%) object top-5 (%)
VideoBERT 10K 0.4 15.5 2.9 17.8
VideoBERT 50K 1.1 15.7 8.7 27.3
VideoBERT 100K 2.9 24.5 11.2 30.6
VideoBERT 300K 3.2 43.3 13.1 33.7

Table 2: Action classification performance on YouCook II dataset as a function of pre-training data size.

However, there are no ground truth labels for actions, and many other common objects are not labeled. So, we collect action and object labels, derived from the ground truth captions, to address this shortcoming. We run an off-the-shelf part-of-speech tagger on the ground truth captions to retrieve the 100 most common nouns and 45 most common verbs, and use these to derive ground truth labels. While VideoBERT’s word piece vocabulary gives it the power to effectively perform open-vocabulary classification, it is thus more likely to make semantically correct predictions that do not exactly match the more limited ground truth. So, we report both top-1 and top-5 classification accuracy metrics, where the latter is intended to mitigate this issue, and we leave more sophisticated evaluation techniques for future work. Lastly, if there is more than one verb or noun associated with a video clip, we deem a prediction correct if it matches any of those. We report the performance on the validation set of YouCook II.

Table 1 shows the top-1 and top-5 accuracies of VideoBERT and its ablations. To verify that VideoBERT actually makes use of video inputs, we first remove the video inputs to VideoBERT, and use just the language model p(y) to perform prediction. We also use the language prior from the text-only BERT model, which was not fine-tuned on cooking videos. We can see that VideoBERT significantly outperforms both baselines. As expected, the language prior of VideoBERT is adapted to cooking sentences, and is better than the vanilla BERT model.

We then compare with a fully supervised classifier that was trained using the training split of YouCook II. We use the pre-computed S3D features (same as the inputs to VideoBERT), applying average pooling over time, followed by a linear classifier. Table 1 shows the results. As we can see, the supervised framework outperforms VideoBERT in top-1 verb accuracy, which is not surprising given that VideoBERT has an effectively open vocabulary. (See Figure 5 for an illustration of the ambiguity of the action labels.) However, the top-5 accuracy metric reveals that VideoBERT achieves comparable performance to the fully supervised S3D baseline, without using any supervision from YouCook II, indicating that the model is able to perform competitively in this “zero-shot” setting.

4.5. Benefits of large training sets

We also studied the impact of the size of the pretraining dataset. For this experiment, we take random subsets of 10K, 50K and 100K videos from the pretraining set, and pretrain VideoBERT using the same setup as above, for the same number of epochs. Table 2 shows the performance. We can see that the accuracy grows monotonically as the amount of data increases, showing no signs of saturation. This indicates that VideoBERT may benefit from even larger pretraining datasets.

4.6. Transfer learning for captioning

We further demonstrate the effectiveness of VideoBERT when used as a feature extractor. To extract features given only video inputs, we again use a simple fill-in-the-blank task, by appending the video tokens to a template sentence “now let’s [MASK] the [MASK] to the [MASK], and then [MASK] the [MASK].” We extract the features for the video tokens and the masked out text tokens, take their average and concatenate the two together, to be used by a supervised model in a downstream task.

We evaluate the extracted features on video captioning, following the experimental setup from [39], where the ground truth video segmentations from YouCook II are used to train a supervised model mapping video segments to captions. We use the same model that they do, namely a transformer encoder-decoder, but we replace the inputs to the encoder with the features derived from VideoBERT described above.
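
The feature extraction just described can be sketched as follows; the encoder call is a random stand-in for the pretrained VideoBERT, the token ids are illustrative, and the optional S3D concatenation corresponds to the “VideoBERT + S3D” row of Table 3.

```python
import numpy as np

TEXT_VOCAB = 30000
HIDDEN = 1024
MASK_ID, CLS_ID, SEP_ID, VIDSEP_ID = 103, 101, 102, 104   # illustrative ids

def dummy_videobert_hidden_states(token_ids):
    """Stand-in for the pretrained VideoBERT encoder: one 1024-d state per token."""
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=(len(token_ids), HIDDEN))

def captioning_features(video_tokens, template_ids, s3d_feature=None,
                        encoder=dummy_videobert_hidden_states):
    ids = [CLS_ID] + template_ids + [VIDSEP_ID] + \
          [TEXT_VOCAB + v for v in video_tokens] + [SEP_ID]
    states = encoder(ids)
    text_mask_pos = [i for i, t in enumerate(ids) if t == MASK_ID]
    video_pos = [i for i, t in enumerate(ids) if t >= TEXT_VOCAB]
    # Average the masked-text states and the video-token states separately,
    # then concatenate the two averages.
    feat = np.concatenate([states[text_mask_pos].mean(axis=0),
                           states[video_pos].mean(axis=0)])
    if s3d_feature is not None:            # "VideoBERT + S3D" variant
        feat = np.concatenate([feat, s3d_feature])
    return feat                            # fed to the transformer encoder-decoder

# Template: "now let's [MASK] the [MASK] to the [MASK], and then [MASK] the [MASK]"
# Only the [MASK] positions matter for this sketch; other ids are placeholders.
template = [11, 12, MASK_ID, 13, MASK_ID, 14, 15, MASK_ID, 16, 17, MASK_ID, 13, MASK_ID]
feat = captioning_features(video_tokens=[17, 991, 58, 20735], template_ids=template,
                           s3d_feature=np.zeros(HIDDEN))
print(feat.shape)                          # (3072,) = 1024 + 1024 + 1024
```
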
Method BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Zhou et al. [39] - 1.42 11.20 - -
S3D [34] 6.12 3.24 10.00 26.05 0.35
VideoBERT 6.80 4.07 10.99 27.51 0.50
VideoBERT + S3D 7.81 4.52 11.85 28.78 0.55

Table 3: Video captioning performance on YouCook II. We follow the setup from [39] and report captioning performance on
the validation set, given ground truth video segments. Higher numbers are better.

Figure 6: Examples of generated captions by VideoBERT and the S3D baseline. In the last example, VideoBERT fails to
exploit the full temporal context, since it misses the paper towel frame.

We also concatenate the VideoBERT features with average-pooled S3D features; as a baseline, we also consider using just S3D features without VideoBERT. We set the number of Transformer block layers to 2, the hidden unit size to 128, and the dropout probability to 0.4. We use 5-fold cross validation on the training split to set the hyperparameters, and report performance on the validation set. We train the model for 40K iterations with a batch size of 128. We use the same Adam optimizer as in VideoBERT pre-training, and set the initial learning rate to 1e-3 with a linear decay schedule.

Table 3 shows the results. Besides the BLEU and METEOR metrics used by [39], we also report ROUGE-L [13] and CIDEr [28]. We can see that VideoBERT consistently outperforms the S3D baseline in all metrics, especially for CIDEr. Furthermore, by concatenating the features from VideoBERT and S3D, the model achieves the best performance across all metrics.

Figure 6 shows some qualitative results. We note that the predicted word sequence is rarely exactly equal to the ground truth, which explains why the metrics in Table 3 (which measure n-gram overlap) are all low in absolute value. However, semantically the results seem reasonable.

5. Discussion and conclusion

This paper adapts the powerful BERT model to learn a joint visual-linguistic representation for video. Our experimental results demonstrate that we are able to learn high-level semantic representations, and we outperform the state-of-the-art for video captioning on the YouCook II dataset. We also show that this model can be used directly for open-vocabulary classification, and that its performance grows monotonically with the size of the training set.
This work is a first step in the direction of learning such joint representations. For many applications, including cooking, it is important to use spatially fine-grained visual representations, instead of just working at the frame or clip level, so that we can distinguish individual objects and their attributes. We envision either using pretrained object detection and semantic segmentation models, or using unsupervised techniques for broader coverage. We also want to explicitly model visual patterns at multiple temporal scales, instead of our current approach, which skips frames but builds a single vocabulary.

Beyond improving the model, we plan to assess our approach on other video understanding tasks, and on other domains besides cooking. (For example, we may use the recently released COIN dataset of manually labeled instructional videos [25].) We believe the future prospects for large scale representation learning from video and language look quite promising.

Acknowledgements. We would like to thank Jack Hessel, Bo Pang, Radu Soricut, Baris Sumengen, Zhenhai Zhu, and the BERT team for sharing amazing tools that greatly facilitated our experiments; Justin Gilmer, Abhishek Kumar, David Ross, and Rahul Sukthankar for helpful discussions. Chen would like to thank Y. M. for inspiration.

References

[1] YouTube Data API. https://developers.google.com/youtube/v3/docs/captions.
[2] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
[3] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In NeurIPS, 2016.
[4] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In ICLR, 2018.
[5] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, 2018.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
[8] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[9] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[10] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-Captioning events in videos. In ICCV, 2017.
[11] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[12] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv:1804.01523, 2018.
[13] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
[14] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.
[15] J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy. What’s cookin’? interpreting cooking videos using text, speech and vision. In NAACL, Mar. 2015.
[16] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[17] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
[18] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, Y. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. Moments in time dataset: one million videos for event understanding. TPAMI, 2019.
[19] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR, 2016.
[20] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
[22] M. A. Ranzato and A. Graves. Deep unsupervised learning. NIPS Tutorial, 2018.
[23] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[25] Y. Tang, D. Ding, Y. Rao, Y. Zheng, D. Zhang, L. Zhao, J. Lu, and J. Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019.
[26] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[28] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.
[29] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
[30] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
[31] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
[32] A. Wang and K. Cho. BERT has a mouth, and it must speak: BERT as a markov random field language model. arXiv preprint arXiv:1902.04094, 2019.
[33] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[34] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. In ECCV, 2018.
[35] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
[36] H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba. Slac: A sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374, 2017.
[37] L. Zhou, N. Louis, and J. J. Corso. Weakly-supervised video object grounding from text by loss weighting and object interaction. In BMVC, 2018.
[38] L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
[39] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. End-to-end dense video captioning with masked transformer. In CVPR, 2018.
Figure A1: Visualizations for video to text prediction. For each example, we show the key frames from the original video (top left) and the associated ASR outputs (top right); we then show the centroid images of the video tokens (bottom left) and the top predicted verbs and nouns by VideoBERT (bottom right). Note that the ASR outputs are not used to predict verbs and nouns.
Figure A2: Visualizations for video to video prediction. Given an input video token, we show the top 3 predicted video
tokens 2 steps away in the future. We visualize each video token by the centroids.
Figure A3: Visualizations for text to video prediction. In particular, we make small changes to the input text, and compare
how the generated video tokens vary. We show top 2 retrieved video tokens for each text query.
