
Towards A Better Metric for Text-to-Video Generation

Jay Zhangjie Wu1∗ Guian Fang1∗ Haoning Wu4∗ Xintao Wang3 Yixiao Ge3
Xiaodong Cun3 David Junhao Zhang1 Jia-Wei Liu1 Yuchao Gu1 Rui Zhao1
Weisi Lin4 Wynne Hsu2 Ying Shan3 Mike Zheng Shou1
1 Show Lab, 2 National University of Singapore, 3 ARC Lab, Tencent PCG, 4 Nanyang Technological University
∗ Equal contribution.
arXiv:2401.07781v1 [cs.CV] 15 Jan 2024

https://showlab.github.io/T2VScore

Abstract

Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation. The code and dataset will be open-sourced.

Figure 1. T2VScore: We measure text-conditioned generated videos from two essential perspectives: text alignment and video quality. Our proposed T2VScore achieves the highest correlation with human judgment. We encourage readers to click and play using Adobe Acrobat.

1. Introduction

Text-to-video generation marks one of the most exciting achievements in generative AI, with impressive video generative models coming out from companies [2, 3, 5, 19, 48] and the open-source community [17, 50, 56, 82]. These models, equipped with the ability to learn from vast datasets of text-video pairs, can generate creative video content ranging from simple animations to complex, lifelike scenes.

To assess text-conditioned generated videos, most existing studies employ objective metrics like Fréchet Video Distance (FVD) [54] and Video Inception Score (IS) [47] for video quality, and CLIPScore [45] for text-video alignment. However, these metrics have limitations. FVD and Video IS are unsuitable for open-domain video generation due to their Full-Reference nature. Meanwhile, the CLIP Score computes an average of per-frame text-image similarities using image CLIP models, overlooking important temporal motion changes in videos. This leads to a mismatch between these objective metrics and human perception, as evident in recent studies [38, 42].
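To make the frame-averaging behaviour of this baseline concrete, below is a minimal sketch (not the authors' implementation) of how a frame-averaged CLIP Score is commonly computed; the checkpoint name and the `frames` input are placeholders, and a production metric would batch frames rather than loop over them.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Any image CLIP checkpoint works here; this one is a common public choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_averaged_clip_score(frames, prompt: str) -> float:
    """Average per-frame text-image cosine similarity (the 'CLIP Score' baseline).

    frames: list of PIL.Image video frames; prompt: the conditioning text.
    Frame order never enters the computation, which is exactly why this metric
    is blind to temporal artifacts such as flicker or incorrect motion.
    """
    with torch.no_grad():
        text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        scores = []
        for frame in frames:
            image_inputs = processor(images=frame, return_tensors="pt")
            img_emb = model.get_image_features(**image_inputs)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            scores.append((img_emb @ text_emb.T).item())
    return sum(scores) / len(scores)
```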
Current studies also incorporate subjective user evaluations for text-to-video generation. However, conducting large-scale human evaluations is labor-intensive and, therefore, not practical for widespread, open comparisons. To address this, there is a need for fine-grained automatic metrics tailored to evaluating text-guided generated videos.

In this work, we take a significant step forward by introducing T2VScore, a novel automatic evaluator specifically designed for text-to-video generation. T2VScore assesses two essential aspects of text-guided generated videos: text-video alignment (i.e., how well does the video match the text prompt?) and video quality (i.e., how good is the quality of the synthesized video?). Two metrics are then introduced: 1) T2VScore-A evaluates the correctness of all spatial and temporal elements in the text prompt by querying the video using cutting-edge vision-language models; 2) T2VScore-Q is designed to predict a robust and generalizable quality score for text-guided generated videos via a combination of structural and training strategies.

To examine the reliability and robustness of the proposed metrics in the evaluation of text-guided generated videos, we present the Text-to-Video Generation Evaluation (TVGE) dataset. This dataset gathers extensive human opinions on two key aspects: text-video alignment and video quality, as investigated in our T2VScore. The TVGE dataset will serve as an open benchmark for assessing the correlation between automatic metrics and human judgments. Moreover, it can help automatic metrics better adapt to the domain of text-guided generated videos. Extensive experiments on the TVGE dataset demonstrate better alignment of our T2VScore with human judgment compared to all baseline metrics.

To summarize, we make the following contributions:
• We introduce T2VScore as a novel evaluator dedicated to automatically assessing text-conditioned generated videos, focusing on two key aspects: text-video alignment and video quality.
• We collect the Text-to-Video Generation Evaluation (TVGE) dataset, which is posited as the first open-source dataset dedicated to benchmarking and enhancing evaluation metrics for text-to-video generation.
• We validate the inconsistency between current objective metrics and human judgment on the TVGE dataset. Our proposed metrics, T2VScore-A and T2VScore-Q, demonstrate superior performance in correlation analysis with human evaluations, thereby serving as more effective metrics for evaluating text-conditioned generated videos.

2. Related Work

2.1. Text-to-Video Generation

Diffusion-based models have been widely explored to achieve text-to-video generation [5, 13, 16, 19, 20, 32, 56, 59, 60, 73, 80, 82, 86, 87]. VDM [20] pioneered the exploration of diffusion models for text-to-video generation, in which a 3D version of the U-Net [46] structure is explored to jointly learn spatial and temporal generation knowledge. Make-A-Video [48] proposed to learn temporal knowledge with only unlabeled videos. Imagen Video [19] built cascaded diffusion models to generate video and then spatially and temporally up-sample it in cascade. PYoCo [13] introduced a progressive noise prior model to preserve temporal correlation and achieved better performance in fine-tuning pre-trained text-to-image models for text-to-video generation. Subsequent works such as LVDM [16] further explored training a 3D U-Net in latent space to reduce training complexity and computational costs. These works can be classified as pixel-based and latent-based models, respectively. Show-1 [82] marks the first integration of pixel-based and latent-based models for video generation. It leverages pixel-based models for generating low-resolution videos and employs latent-based models to upscale them to high resolution, combining the advantages of high efficiency from latent-based models and superior content quality from pixel-based models. Recently, text-to-video generation products, such as Gen-2 [2], Pika [3], and Floor33 [1], and open-sourced foundational text-to-video diffusion models, such as ModelScopeT2V [57], ZeroScope [50], and VideoCrafter [17], have democratized video generation, garnering widespread interest from both the community and academia.

2.2. Evaluation Metrics

Image Metrics. Image-level metrics are widely utilized to evaluate the frame quality of generated videos. These include Peak Signal-to-Noise Ratio (PSNR) [63], Structural Similarity Index (SSIM) [62], Learned Perceptual Image Patch Similarity (LPIPS) [84], Fréchet Inception Distance (FID) [44], and CLIP Score [45]. Among them, PSNR [63], SSIM [62], and LPIPS [84] are mainly employed to evaluate the quality of reconstructed video frames by comparing the difference between generated frames and original frames. Specifically, PSNR [63] is the ratio between the peak signal and the Mean Squared Error (MSE) [61]. SSIM [62] evaluates brightness, contrast, and structural features between generated and original images. LPIPS [84] is a perceptual metric that computes the distance of image patches in a latent feature space. FID [44] utilizes InceptionV3 [51] to extract feature maps from normalized generated and real-world frames, and computes their mean and covariance matrices for the FID score. CLIP Score [45] measures the similarity of the CLIP features extracted from images and texts, and it has been widely employed in text-to-video generation and editing tasks [5, 15, 35, 48, 73, 82, 86].
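FID above and the video-level FVD introduced below differ only in the feature extractor (InceptionV3 for FID, I3D for FVD); both end in the same closed-form Fréchet (2-Wasserstein) distance between Gaussians fitted to real and generated features. A minimal sketch of that shared final step, assuming the features have already been extracted into NumPy arrays:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray, eps: float = 1e-6) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    real_feats, gen_feats: arrays of shape (num_samples, feature_dim),
    e.g. InceptionV3 pool features (FID) or I3D video features (FVD).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    if not np.isfinite(covmean).all():
        # Numerical safeguard used by most public implementations.
        offset = np.eye(sigma_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma_r + offset) @ (sigma_g + offset), disp=False)
    covmean = covmean.real

    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * np.trace(covmean))
```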
Video Metrics. In contrast to frame-wise metrics, video metrics focus more on the comprehensive evaluation of video quality. Fréchet Video Distance (FVD) [54] utilizes Inflated-3D ConvNets (I3D) [7] pre-trained on Kinetics [8] to extract features from videos, and computes their means and covariance matrices for the FVD score. Differently, Kernel Video Distance (KVD) [53] computes the Maximum Mean Discrepancy (MMD) [14] of the video features extracted using I3D [7] to evaluate video quality. Video Inception Score (Video IS) [47] computes the inception score of videos with features extracted from C3D [52]. Frame Consistency CLIP Score [45] calculates the cosine similarity of the CLIP image embeddings for all pairs of video frames to measure the consistency of edited videos [15, 35, 73–75, 85].

2.3. Video Quality Assessment

The state of the art in video quality assessment (VQA) has been dominated by learning-based approaches [24, 27, 67]. Typically, these approaches leverage pre-trained deep neural networks as feature extractors and use human opinions as supervision to regress these features into quality scores. Recent works [65, 66, 70] have adopted a new strategy that uses a large VQA database of natural videos [81] to learn better feature representations for VQA, and then transfers to diverse types of videos with only a few labeled videos available. This strategy has been validated as an effective way to improve prediction accuracy and robustness on relatively small VQA datasets for enhanced videos [36] and computer-generated contents [79]. In our study, we extend this strategy to evaluating text-conditioned generated videos, bringing a more reliable and generalizable quality metric for text-to-video generation.

Besides leveraging large video quality databases, several recent works [55, 68, 69, 72] have also explored adopting multi-modality foundation models, e.g. CLIP [45], for VQA. With text prompts as natural quality indicators (e.g. good/bad), these text-prompted methods show superior abilities in zero-shot or few-shot VQA settings, and robust generalization across distributions. Inspired by these studies, the proposed quality metric in T2VScore also incorporates a text-prompted structure, which proves to better align with human judgments on videos generated by novel generators that are not seen during training.

2.4. QA-based Evaluation

Recent studies have emerged around the idea of using Visual Question Answering (VQA) to test the accuracy of advanced AI models. TIFA [22] utilizes GPT-3 [6] to create questions in various areas, such as color and shape, and checks the answers with VQA systems like mPLUG [26]. VQ2A [9] makes VQA more reliable by synthesizing new data and employing high-quality negative sampling. VPEval [10] improves this process through the use of object detection and Optical Character Recognition (OCR), combining these with ChatGPT for more controlled testing. However, these methods have not yet explored videos, where both spatial and temporal elements should be evaluated. We add specific designs for the temporal domain to improve VQA for video understanding, providing a more comprehensive method for evaluating text-video alignment in both space and time.

3. Proposed Metrics

We introduce two metrics to evaluate text-guided generated videos, focusing on two essential dimensions: Text Alignment (Sec. 3.1) and Video Quality (Sec. 3.2).

3.1. Text Alignment

State-of-the-art multimodal large language models (MLLMs) have demonstrated human-level capabilities in both visual and textual comprehension and generation. Here, we introduce a framework for assessing text-video alignment using these MLLMs. An overview of our text alignment evaluation process is presented in Fig. 2.

Figure 2. Pipeline for Calculating T2VScore-A: We input the text prompt into large language models (LLMs) to generate questions and answers. Utilizing CoTracker [23], we extract the auxiliary trajectory, which, along with the input video, is fed into multimodal LLMs (MLLMs) for visual question answering (VQA). The final T2VScore-A is measured based on the accuracy of VQA. Please click and play using Adobe Acrobat.

Entity Decomposition in Text Prompt. Consider a text prompt denoted as P. Our initial step involves parsing P into distinct semantic elements, represented as e_i. We then identify the hierarchical semantic relationships among these elements, forming entity tuples {(e_i, e_j)}, where e_j is semantically dependent on e_i to form a coherent meaning. For instance, the tuple (dog, a) implies that the article "a" is associated with the noun "dog", while (cat, playing soccer) suggests that the action "playing soccer" is attributed to the "cat". Elements that exert a global influence over the entire prompt, like style or camera motion, are categorized under a global element. This structuring not only clarifies the interconnections within the elements of a text prompt but also implicitly prioritizes them based on their hierarchical significance. For instance, mismatching an element that holds a higher dependency rank would result in a more substantial penalty on the final text alignment score.

Question/Answer Generation with LLMs. Our main goal is to generate diverse questions that cover all elements of the text input evenly. Drawing inspiration from previous studies [22], for a text prompt P, we utilize large language models (LLMs) to generate question-choice-answer tuples {(Q_i, C_i, A_i)}_{i=1}^{N}, as depicted at the top of Fig. 2. Different from prior work focusing on text-image alignment, we emphasize temporal aspects, such as object trajectory and camera motion, which are unique and essential for evaluating text alignment in dynamic video contexts. We employ single-pass inference using in-context learning with GPT-3.5 [6, 64] to generate both questions and answers. We manually curate 3 examples and use them as in-context examples for GPT-3.5 to follow. The complete prompt used for generating question and answer pairs can be found in the supplementary.
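A sketch of this generation step is shown below, assuming access to an OpenAI-compatible chat endpoint; the model name, the `TASK_PROMPT` placeholder (which would hold the instruction and in-context examples from Fig. 10 in the supplementary), and the regex-based parsing of the Q/Choices/A format are illustrative choices rather than the paper's exact implementation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stand-in for the full task instruction plus the 3 curated in-context examples.
TASK_PROMPT = "..."

def generate_qca_tuples(text_prompt: str) -> list[dict]:
    """Single-pass generation of question-choice-answer tuples for one prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # stand-in for the GPT-3.5 model used in the paper
        temperature=0,
        messages=[
            {"role": "system", "content": TASK_PROMPT},
            {"role": "user", "content": f"Input: {text_prompt}"},
        ],
    )
    raw = response.choices[0].message.content

    # Parse the "Q: ... / Choices: ... / A: ..." blocks that the prompt format elicits.
    tuples = []
    for q, choices, a in re.findall(r"Q:\s*(.+?)\nChoices:\s*(.+?)\nA:\s*(.+?)(?:\n|$)", raw):
        tuples.append({
            "question": q.strip(),
            "choices": [c.strip() for c in choices.split(",")],
            "answer": a.strip(),
        })
    return tuples
```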
Video Question Answering with Auxiliary Trajectory. Most open-domain vision-language models are image-centric [11, 30, 33, 34, 77, 89], with only a few focusing on video [31, 39, 78]. These VideoLLMs often struggle with fine-grained temporal comprehension, as evidenced by their performance on benchmarks like SEED-Bench [25]. To address this, we introduce the use of auxiliary trajectories, generated by off-the-shelf point tracking models (e.g., CoTracker [23] and OmniMotion [58]), to enhance the understanding of object and camera movements. We process a video V, created by T2V models using text prompt T, alongside its tracking trajectory V_track and the question-choice pairs {(Q_i, C_i)}_{i=1}^{N} generated by LLMs. These inputs are then fed into multi-modality LLMs for question answering:

\hat{A}_i = \mathrm{VQA}(V, V_{track}, Q_i, C_i).

We define the Text-to-Video (T2V) alignment score T2VScore-A as the accuracy of the video question answering process:

\mathrm{T2VScore\text{-}A}(T, V) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{A}_i = A_i].    (1)

The T2VScore-A ranges from 0 to 1, with higher values indicating better alignment between text T and video V.
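Eq. (1) is simply question-answering accuracy. A minimal sketch is given below; `vqa_model` is a placeholder for whichever MLLM answers the questions (its exact calling convention is an assumption here), and `qca_tuples` are the question-choice-answer dicts produced by the LLM stage.

```python
from typing import Callable, Dict, List

def t2vscore_a(video, trajectory, qca_tuples: List[Dict], vqa_model: Callable[..., str]) -> float:
    """Eq. (1): the fraction of generated questions the MLLM answers correctly.

    video: decoded frames of the generated video V;
    trajectory: auxiliary point tracks V_track (e.g. from CoTracker);
    qca_tuples: dicts with 'question', 'choices', and the reference 'answer' A_i;
    vqa_model: callable returning the selected choice (A-hat_i) for one question.
    """
    if not qca_tuples:
        raise ValueError("No question-answer tuples were generated for this prompt.")
    correct = 0
    for item in qca_tuples:
        predicted = vqa_model(video, trajectory, item["question"], item["choices"])
        correct += int(predicted.strip().lower() == item["answer"].strip().lower())
    return correct / len(qca_tuples)
```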
3.2. Video Quality

In this section, we discuss the proposed video quality metric in the T2V Score. Our core principle is simple: the metric should remain effective for evaluating videos from unseen generation models that appear after this score is proposed. Under this principle, the proposed metric aims to achieve two important goals: (G1) it can more accurately assess the quality of generated videos without seeing any of them (zero-shot); (G2) when adapted to videos generated by known models, it can significantly improve generalized performance on unknown models. Both aims inspire us to drastically improve the generalization ability of the metric, via a combination of a Mix-of-Limited-Expert Structure (Sec. 3.2.1), a Progressive Optimization Strategy (Sec. 3.2.2), and List-wise Learning Objectives (Sec. 3.2.3), elaborated as follows.

[Figure 3 overview: (a) Technical Expert: FAST-VQA processor, Video Swin Transformer (T), FC(768,64), FC(64,1), producing Q_tech; (b) Semantic Expert: CLIP image processor, CLIP-ViT-Large with CLIP-Adapter, FC(768,256), FC(256,768), producing Q_sem, scored against the text prompts "good, high quality" vs. "poor, low quality"; example output T2VScore-Q: 0.57.]
Figure 3. Pipeline for Calculating T2VScore-Q: a mixture of a technical expert (a) to capture spatial and temporal technical distortions, and a text-prompted semantic expert (b).

3.2.1 Mix-of-Limited-Expert Structure

Given the hard-to-explain nature of quality assessment [84], current VQA methods that only learn from human opinions in video quality databases will more or less come with their own biases, leading to poor generalization ability [29]. Considering our goals, and inspired by existing practices [24, 70, 71], we select two evaluators with different biases as limited experts, and fuse their judgments to improve the generalization capacity of the final prediction. Primarily, we include a technical expert (Fig. 3(a)), aiming at capturing distortion-level quality. This branch adopts the structure of FAST-VQA [65], which is pre-trained on the largest VQA database, LSVQ [81], and further fine-tuned on the MaxWell [71] database that contains a wide range of spatial and temporal distortions. While the technical branch can already cover scenarios related to naturally-captured videos, generated videos are more likely to include semantic degradations, i.e. failing to generate correct structures or components of an object. Thus, we include an additional text-prompted semantic expert (Fig. 3(b)). It is based on MetaCLIP [76], and calculated via a confidence score on the binary classification between the positive prompt "good, high quality" and the negative prompt "poor, low quality". We also add an additional adapter [12] to better suit the CLIP-based evaluator to the domain of video quality assessment.

Denoting the technical score for a video V as Q_tech(V) and the text-prompted semantic score as Q_sem(V), we fuse the two independently-optimized judgements via ITU-standard perceptual-oriented remapping [4] into the T2VScore-Q:

R(s) = \frac{1}{1 + e^{-\frac{s - \mu(s)}{\sigma(s)}}}    (2)

\mathrm{T2VScore\text{-}Q}(V) = \frac{R(Q_{tech}) + R(Q_{sem})}{2}    (3)

The T2VScore-Q ranges from 0 to 1, with higher values indicating better visual quality of video V.
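The semantic expert's prompt-based confidence and the fusion of Eqs. (2)-(3) can be sketched as follows. The CLIP checkpoint below is a public stand-in for the MetaCLIP weights, the adapter and learned prefix are omitted, frame sampling is left to the caller, and µ(s)/σ(s) are taken over the set of videos being evaluated; none of this reproduces the exact released implementation.

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Public stand-in checkpoint; the paper initializes from MetaCLIP weights instead.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

POS, NEG = "good, high quality", "poor, low quality"

def semantic_quality(frames) -> float:
    """Q_sem: confidence of the positive prompt in a binary good/poor
    classification, averaged over sampled frames (adapter/prefix omitted)."""
    with torch.no_grad():
        inputs = proc(text=[POS, NEG], images=frames, return_tensors="pt", padding=True)
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[:, 0]  # P("good, high quality")
    return probs.mean().item()

def remap(scores: np.ndarray) -> np.ndarray:
    """Eq. (2): logistic remapping of raw scores using their mean and std."""
    return 1.0 / (1.0 + np.exp(-(scores - scores.mean()) / (scores.std() + 1e-8)))

def t2vscore_q(q_tech: np.ndarray, q_sem: np.ndarray) -> np.ndarray:
    """Eq. (3): per-video average of the two remapped expert judgements."""
    return (remap(q_tech) + remap(q_sem)) / 2.0
```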
3.2.2 Progressive Optimization Strategy

After introducing the structure, we discuss the optimization strategy for the T2VScore-Q (Fig. 4). In general, the training is conducted in three stages: pre-training, fine-tuning, and adaptation. The stages come with gradually smaller datasets, paired with progressively reduced optimizable parameters. For Q_tech, the optimization strategies for each stage are as follows: (1) end-to-end pre-training with the large-scale LSVQ_train dataset (28K videos); (2) multi-layer fine-tuning with the medium-scale MaxWell_train dataset (3.6K videos); (3) given that specific distortions in generated videos (see Fig. 5) are usually associated with semantics, to avoid over-fitting, the technical expert is kept frozen during the adaptation stage. For Q_sem, (1) we directly adopt official weights from MetaCLIP [76] as pre-training; (2) for the fine-tuning stage, we train a lightweight adapter [12] on MaxWell_train; (3) for adaptation, we train an additional prefix token [55, 69, 88] to robustly adapt it to the domain of generated videos.

[Figure 4 overview, stages (1) Pre-training, (2) Fine-tuning, (3) Adaptation. Technical expert (Video Swin Transformer, FAST-VQA): LSVQ (28K) with 28M optimizable parameters; MaxWell (3.6K) with 50K parameters; frozen (0 parameters) during adaptation. Semantic expert (CLIP-ViT-Large): MetaCLIP (400M) pre-trained weights (304M parameters); adapter fine-tuning on MaxWell (3.6K) with 39K parameters; learnable text prefix ([ctx]) on TVGE (1.5K) with 768 parameters. Dataset scale and optimizable parameters decrease across stages.]
Figure 4. Optimization Strategy of the Video Quality Metric, via gradually decreased scales of training datasets, and correspondingly progressively reduced optimizable parameters.

3.2.3 List-wise Learning Objectives

Plenty of existing studies [24, 29, 65, 66] have pointed out that, compared with independent scores, the rank relations among different scores are more reliable and generalizable, especially for small-scale VQA datasets [49]. Given these insights, we adopt list-wise learning objectives [28], combining a rank loss (L_rank) and a linear loss (L_lin), as the training objective for both limited experts:

L_{rank} = \sum_{i,j} \max\left( (s^i_{pred} - s^j_{pred}) \, \mathrm{sgn}(s^j_{gt} - s^i_{gt}),\ 0 \right)    (4)

L_{lin} = \left( 1 - \frac{\langle s_{pred} - \bar{s}_{pred},\ s_{gt} - \bar{s}_{gt} \rangle}{\lVert s_{pred} - \bar{s}_{pred} \rVert_2 \, \lVert s_{gt} - \bar{s}_{gt} \rVert_2} \right) / 2    (5)

L = L_{lin} + \lambda L_{rank}    (6)

where s_pred and s_gt are the lists of predicted scores and labels in a batch, respectively, \bar{s} denotes the mean of a list, and sgn denotes the sign function.
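A sketch of the combined objective in Eqs. (4)-(6), written for a batch of predicted scores and ground-truth labels in PyTorch; the weight λ is a hyper-parameter whose value is an assumption here.

```python
import torch

def rank_loss(s_pred: torch.Tensor, s_gt: torch.Tensor) -> torch.Tensor:
    """Eq. (4): pairwise hinge penalizing every (i, j) pair whose predicted
    order disagrees with the ground-truth order."""
    diff_pred = s_pred.unsqueeze(1) - s_pred.unsqueeze(0)   # s_pred[i] - s_pred[j]
    diff_gt = s_gt.unsqueeze(1) - s_gt.unsqueeze(0)         # s_gt[i] - s_gt[j]
    # sgn(s_gt[j] - s_gt[i]) == -sign(diff_gt[i, j])
    return torch.clamp(diff_pred * (-torch.sign(diff_gt)), min=0).sum()

def linear_loss(s_pred: torch.Tensor, s_gt: torch.Tensor) -> torch.Tensor:
    """Eq. (5): (1 - PLCC) / 2, i.e. one minus the cosine similarity of the
    mean-centred score lists, rescaled to [0, 1]."""
    p = s_pred - s_pred.mean()
    g = s_gt - s_gt.mean()
    plcc = torch.dot(p, g) / (p.norm() * g.norm() + 1e-8)
    return (1 - plcc) / 2

def listwise_loss(s_pred: torch.Tensor, s_gt: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    """Eq. (6): L = L_lin + lambda * L_rank (lam = 0.3 is an assumed value)."""
    return linear_loss(s_pred, s_gt) + lam * rank_loss(s_pred, s_gt)
```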
4. TVGE Dataset

Figure 5. Domain Gap with Natural Videos. The common distortions in generated videos (as in the TVGE dataset) are different from those in natural videos [71], both spatially and temporally. We encourage readers to click and play using Adobe Acrobat.

Motivation. An inalienable part of our study is to evaluate the reliability and robustness of the proposed metrics on text-conditioned generated videos. To this end, we propose the Text-to-Video Generation Evaluation (TVGE) dataset, collecting rich human opinions on the two perspectives (alignment & quality) studied in the T2V Score. On both perspectives, the TVGE can be considered first-of-its-kind: first, for the alignment perspective, it is the first dataset providing text alignment scores rated by a large crowd of human subjects; second, for the quality perspective, while there are plenty of VQA databases on natural contents [21, 71, 81], they show notably different distortion patterns (both spatially and temporally, see Fig. 5) from generated videos, resulting in a non-negligible domain gap. The proposed dataset will serve as a validation of the alignment between the proposed T2V Score and human judgments. Furthermore, it can help our quality metric better adapt to the domain of text-conditioned generated videos. Details of the dataset are as follows.

Collection of Videos. In total, 2,543 text-guided generated videos are collected for human rating in the TVGE dataset. These videos are generated by 5 popular text-to-video generation models, under a diverse prompt set as defined by EvalCrafter [37] covering a wide range of scenarios.

Subjective Studies. In the TVGE dataset, each video is independently annotated by 10 experienced human subjects from both the text alignment and the video quality perspectives. Before the annotation, we trained the human subjects (training materials are provided in the supplementary materials) and tested their annotation reliability on a subset of TVGE videos. Each video is rated on a five-point Likert-like scale for either perspective, with examples for each scale provided in the training materials for the subjects.

[Figure 6 overview: distributions of human scores per model (Floor33, Gen2, ModelScope, PIKA, ZeroScope); µ_alignment = 2.59 for text alignment, µ_quality = 2.77 for video quality.]
Figure 6. Score Distributions in TVGE, suggesting that current text-to-video generation methods generally face challenges in producing videos with either good quality or high alignment with text.

Analysis and Conclusion. In Fig. 6, we show the distributions of human-annotated quality and alignment scores in the TVGE dataset. In general, the generated videos receive lower-than-average human ratings (µ_alignment = 2.59, µ_quality = 2.77) on both perspectives, suggesting the need to continuously improve these methods to eventually produce plausible videos. Nevertheless, specific models also show decent proficiency on a single dimension, e.g. Pika gets an average score of 3.45 on video quality. Between the two perspectives, we notice a very low correlation (0.223 Spearman's ρ, 0.152 Kendall's ϕ), proving that the two dimensions are different and should be considered independently. We show more qualitative examples in the supplementary.
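Throughout the paper, agreement with human opinion is reported as rank and linear correlations between per-video metric outputs and mean opinion scores; the cross-dimension number above is computed the same way between the two sets of human ratings. A minimal sketch using SciPy:

```python
import numpy as np
from scipy import stats

def correlation_report(metric_scores, human_scores) -> dict:
    """Spearman's rho, Kendall's tau, and Pearson's rho between a metric's
    per-video scores and mean human opinion scores (or between two sets of
    human ratings, e.g. alignment vs. quality)."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    srcc, _ = stats.spearmanr(x, y)
    krcc, _ = stats.kendalltau(x, y)
    plcc, _ = stats.pearsonr(x, y)
    return {"spearman": srcc, "kendall": krcc, "pearson": plcc}
```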
5. Experiments

5.1. Text Alignment

Baselines. We compare our T2VScore-A with several standard metrics on text-video alignment, listed as follows:
• CLIP Score [18, 45]: Average text-image similarity in the embedding space of image CLIP models [45] over all video frames.
• X-CLIP Score [40]: Text-video similarity measured by a video-based CLIP model finetuned on text-video data.
• BLIP-BLEU [37]: Text-to-text similarity measured by a BLEU [43] score using BLIP-2's image captioning.
• mPLUG-BLEU: Same as BLIP-BLEU but using mPLUG-OWL2 for video captioning.

Table 1. Correlation Analysis. Correlations between objective metrics and human judgment on text-video alignment. Spearman's ρ and Kendall's τ are used for correlation calculation. The best is bold-faced, and the second-best is underlined.

Method               Model            Spearman's ρ   Kendall's τ
Traditional Metric   CLIP Score          0.343          0.236
                     X-CLIP Score        0.257          0.175
                     BLIP-BLEU           0.152          0.104
                     mPLUG-BLEU          0.059          0.040
T2VScore-A           Otter†              0.181          0.134
                     Video-LLaMA†        0.288          0.206
                     mPLUG-OWL2-V†       0.394          0.285
                     InstructBLIP∗       0.342          0.246
                     mPLUG-OWL2-I∗       0.358          0.257
                     GPT-4V∗             0.486          0.360
† via Video QA; ∗ via Image QA

Comparison with traditional metrics. We evaluate existing objective metrics using our TVGE dataset and observe a low correlation with human judgment in text-video alignment. This observation aligns with findings in recent research [38, 42] that current objective metrics are incompatible with human perception. In particular, video-based CLIP models exhibit even lower correlations than their image-based counterparts in comprehending videos. This discrepancy may be attributed to the X-CLIP score model, which has been fine-tuned exclusively on the Kinetics datasets, a scope insufficient for broad-domain video understanding. Additionally, while BLEU is a widely employed evaluation metric in NLP research, its effectiveness diminishes in text-video alignment tasks, due to the inherent challenge of accurate video captioning. Consequently, video-based models such as mPLUG-Owl-2 prove to be less helpful in this context.

Comparison on MLLMs. Our T2VScore-A is designed to be model-agnostic, which ensures it is compatible with a wide variety of multimodal large language models (MLLMs). This includes open-source models like Otter [11], Video-LLaMA [83], and mPLUG-OWL2-V [78], as well as proprietary models such as GPT-4V [41]. In our experiments, we tested T2VScore-A with both image- and video-based LLMs. We found that its performance significantly depends on the capabilities of the underlying MLLMs. Open-source image LLMs like InstructBLIP and mPLUG-OWL2-I show decent results in visual question answering. However, their limited temporal understanding makes them less effective than the more advanced open-source video LLMs like mPLUG-OWL2-V in video-based question-answering tasks. Despite this, there is still a notable performance disparity between these open-source MLLMs and GPT-4V, with GPT-4V demonstrating superior performance in video question answering. This is evidenced by its higher correlation with human judgment, outperforming other models by a significant margin.

Effect of auxiliary trajectory. We leverage the point trajectory data generated by CoTracker to enhance fine-grained temporal understanding. This approach effectively captures the subtle motion changes of both the object and the camera, which is instrumental in answering questions related to temporal dynamics. As shown in Fig. 7, models that incorporate trajectory data can accurately identify specific camera movements, such as "panning from right to left" and "rotating counter-clockwise". In contrast, models without trajectory input struggle to perceive these subtle motion changes. The numerical results in Tab. 5 and Tab. 6 further support our observation.

5.2. Video Quality

Baselines. We compare the T2VScore-Q with several state-of-the-art methods on video quality assessment:
• FAST-VQA [65]: State-of-the-art technical quality evaluator, with multiple mini-patches ("fragments") as inputs.
• DOVER [70]: State-of-the-art VQA method, consisting of FAST-VQA and an additional aesthetic branch.
• MaxVQA [71]: CLIP-based text-prompted VQA method.
We also validate the performance of multi-modality foundation models in evaluating generated video quality:
• CLIP [45, 76]: As CLIP is one of the important bases of the T2VScore-Q, it is important to see how the original zero-shot CLIP variants work on this task. The original CLIPs are evaluated under the same prompts as the proposed semantic expert: good, high quality ↔ poor, low quality.

Settings. As discussed in Sec. 3.2, we validate the effectiveness of the T2VScore-Q under two settings:
• (G1) zero-shot: In this setting, no generated videos are seen during model training. Aligning with the settings of off-the-shelf evaluators, it fairly compares the baseline methods and the proposed T2VScore-Q.
• (G2) adapted, cross-model: In this setting, we further adapt the T2VScore-Q to a part of the TVGE dataset with videos generated by one known model, and evaluate the accuracy on the other 4 unknown generation models. It is a rigorous setting to check the reliability of the proposed metric as future generation models arrive.

Comparison on the zero-shot setting. We show the comparison between the T2VScore-Q and baseline methods in Tab. 2, under the zero-shot setting without training on any generated videos. First, after our fine-tuning (stage 2, on the natural VQA dataset), the two experts that make up the T2VScore-Q have notably improved over their corresponding baselines; second, the mixture of the limited experts also results in a significant performance gain. Both improvements lead to a final improvement of more than 20% in all correlation coefficients over existing VQA approaches, demonstrating the superiority of the proposed metric. Nevertheless, without training on any T2V-VQA datasets, all zero-shot metrics are still not accurate enough to evaluate the quality of generated videos, bringing the necessity to discuss a robust and effective adaptation approach.

Table 2. Zero-shot comparison on Video Quality. Spearman's ρ, Kendall's ϕ, and Pearson's ρ are included for correlation calculation.

Metric                     Spearman's ρ   Kendall's ϕ   Pearson's ρ
FAST-VQA [65]                 0.3518        0.2405        0.3460
DOVER [70]                    0.3591        0.2447        0.3587
MaxVQA [71]                   0.4110        0.2816        0.4002
CLIP-ResNet-50 [45]           0.3164        0.2162        0.3018
CLIP-ViT-Large-14 [76]        0.3259        0.2213        0.3140
the Technical Expert          0.4557        0.3136        0.4426
the Semantic Expert           0.4623        0.3210        0.4353
T2VScore-Q (Ours)             0.5029        0.3498        0.4945
improvements                  +22.3%        +24.2%        +23.6%

Cross-model improvements of adaptation. A common concern for data-driven generated-content quality assessment is that evaluators trained on a specific set of models cannot generalize well to evaluating a novel set of models. Thus, to simulate the real-world application scenario, we abandon random five-fold splits and use rigorous cross-model settings during the adaptation stage. As shown in Tab. 3, in each setting, we only adapt the T2VScore-Q on videos generated by one of the five models in the TVGE dataset and evaluate the change in accuracy on the rest of the videos generated by the other 4 models. The table proves that the proposed prefix-tuning-based adaptation strategy can effectively generalize to unseen model sets with an average improvement of 11%, showing that the T2VScore-Q can be a reliable open-set quality metric for generated videos.

Table 3. Cross-model Improvements on Video Quality. In each setting, we adapt the T2VScore-Q with about 500 videos generated by one of the models, and test its improvement in accuracy on the rest of the videos generated by the other 4 models.

Strategy                               Spearman's ρ   Kendall's ϕ   Pearson's ρ
Evaluated on other 4 models except PIKA
  zero-shot                               0.4758        0.3311        0.4643
  Trained on PIKA, cross                  0.5467        0.3834        0.5377
Evaluated on other 4 models except Floor33
  zero-shot                               0.5467        0.3801        0.5363
  Trained on Floor33, cross               0.5923        0.4192        0.5805
Evaluated on other 4 models except ZeroScope
  zero-shot                               0.4148        0.2884        0.4330
  Trained on ZeroScope, cross             0.4561        0.3194        0.4623
Evaluated on other 4 models except ModelScope
  zero-shot                               0.4826        0.3340        0.4835
  Trained on ModelScope, cross            0.5406        0.3785        0.5368
Evaluated on other 4 models except Gen2
  zero-shot                               0.4964        0.3472        0.4920
  Trained on Gen2, cross                  0.5514        0.3895        0.5481
average cross-model gain                  +11.2%        +11.2%        +10.6%

Ablation Studies. We show the ablation experiments in Tab. 4. As shown in the table, the proposed fine-tuning (stage 2) on both experts improves their single-branch accuracy and the overall accuracy of T2VScore-Q, suggesting the effectiveness of the proposed components.

Table 4. Ablation Study. Spearman's ρ, Kendall's ϕ, and Pearson's ρ are included for correlation calculation. Components are Q_sem, Q_sem fine-tune, Q_tech, and Q_tech fine-tune.

Components in T2VScore-Q                 Spearman's ρ   Kendall's ϕ   Pearson's ρ
Q_sem only (no fine-tune)                   0.3259        0.2213        0.3140
Q_sem + fine-tune                           0.4623        0.3210        0.4353
Q_tech only (no fine-tune)                  0.3518        0.2405        0.3460
Q_tech + fine-tune                          0.4557        0.3136        0.4426
Both experts, only one fine-tuned           0.4458        0.3074        0.4409
Both experts, only one fine-tuned           0.4629        0.3197        0.4514
All components (both experts fine-tuned)    0.5029        0.3498        0.4945

6. Conclusion

In this paper, to address the shortcomings of existing text-to-video generation metrics, we introduced the Text-to-Video Score (T2VScore), a novel evaluation metric that holistically assesses video generation by considering both the alignment of the video with the input text and the video quality. Moreover, we present the TVGE dataset to better evaluate the proposed metrics. The experimental results on the TVGE dataset underscore the effectiveness of T2VScore over existing metrics, providing a more comprehensive and reliable means of assessing text-to-video generation. The proposed metric, along with the dataset, paves the way for further research and development in the field, aiming at more accurate evaluation methods for video generation.

Limitations and future work. The T2VScore-A heavily relies on multimodal large language models (MLLMs) to perform accurate Visual Question Answering. However, the current capabilities of MLLMs are not yet sufficient to achieve high accuracy, particularly in evaluating temporal dimensions. We anticipate that as MLLMs become more advanced, our T2VScore-A will also become increasingly stable and reliable.

As new open-source text-to-video models continue to emerge, we will keep track of the latest developments and incorporate their results into our TVGE dataset as part of our future efforts.
References [17] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and
Qifeng Chen. Videocrafter: A toolkit for text-to-video gen-
[1] Floor33 pictures. http://floor33.tech/. 2 eration and editing. https://github.com/AILab-
[2] Gen-2. https://research.runwayml.com/gen2. CVC/VideoCrafter, 2023. 1, 2
1, 2 [18] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras,
[3] Pika labs. https://www.pika.art/. 1, 2 and Yejin Choi. Clipscore: A reference-free evaluation met-
[4] Recommendation 500-10: Methodology for the subjective ric for image captioning, 2022. 6
assessment of the quality of television pictures. ITU-R Rec. [19] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang,
BT.500, 2000. 5 Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben
[5] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- Poole, Mohammad Norouzi, David J Fleet, et al. Imagen
horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. video: High definition video generation with diffusion mod-
Align your latents: High-resolution video synthesis with la- els. arXiv:2210.02303, 2022. 1, 2
tent diffusion models. In CVPR, 2023. 1, 2, 3 [20] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William
Chan, Mohammad Norouzi, and David J Fleet. Video dif-
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
fusion models. In NeurIPS, 2022. 2
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan-
[21] Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe. The
guage models are few-shot learners. Advances in neural in-
konstanz natural video database (konvid-1k). In QoMEX,
formation processing systems, 33:1877–1901, 2020. 3, 4
pages 1–6, 2017. 6
[7] Joao Carreira and Andrew Zisserman. Quo vadis, action [22] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Os-
recognition? a new model and the kinetics dataset. In CVPR, tendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate
2017. 3 and interpretable text-to-image faithfulness evaluation with
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action question answering. arXiv preprint arXiv:2303.11897, 2023.
recognition? a new model and the kinetics dataset. In CVPR, 3
2017. 3 [23] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia
[9] Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Neverova, Andrea Vedaldi, and Christian Rupprecht. Co-
Chen, Nan Ding, and Radu Soricut. All you may need for tracker: It is better to track together. arXiv preprint
vqa are image captions. In NAACL, 2022. 3 arXiv:2307.07635, 2023. 4, 1
[10] Jaemin Cho, Abhay Zala, and Mohit Bansal. Visual pro- [24] Bowen Li, Weixia Zhang, Meng Tian, Guangtao Zhai, and
gramming for text-to-image generation and evaluation. In Xianpei Wang. Blindly assess quality of in-the-wild videos
NeurIPS, 2023. 3 via quality-aware pre-training and motion perception. IEEE
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat TCSVT, 2022. 3, 5
Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale [25] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix-
Fung, and Steven Hoi. Instructblip: Towards general- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul-
purpose vision-language models with instruction tuning, timodal llms with generative comprehension. arXiv preprint
2023. 4, 7 arXiv:2307.16125, 2023. 4, 1
[12] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao [26] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming
Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng
Clip-adapter: Better vision-language models with feature Cao, et al. mplug: Effective and efficient vision-language
adapters. arXiv preprint arXiv:2110.04544, 2021. 5 learning by cross-modal skip-connections. arXiv preprint
arXiv:2205.12005, 2022. 3
[13] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew
[27] Dingquan Li, Tingting Jiang, and Ming Jiang. Qual-
Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-
ity assessment of in-the-wild videos. In ACM MM, page
Yu Liu, and Yogesh Balaji. Preserve your own correlation:
2351–2359, 2019. 3
A noise prior for video diffusion models. arXiv:2305.10474,
[28] Dingquan Li, Tingting Jiang, and Ming Jiang. Norm-in-norm
2023. 2
loss with faster convergence and better performance for im-
[14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bern- age quality assessment. In ACM MM, page 789–797, 2020.
hard Schölkopf, and Alexander Smola. A kernel two-sample 5
test. J Mach Learn Res, 2012. 3 [29] Dingquan Li, Tingting Jiang, and Ming Jiang. Unified qual-
[15] Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei ity assessment of in-the-wild videos with mixed datasets
Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, training. International Journal of Computer Vision, 129(4):
Mike Zheng Shou, and Kevin Tang. Videoswap: Customized 1238–1257, 2021. 5
video subject swapping with interactive semantic point cor- [30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
respondence. arXiv preprint arXiv:2312.02087, 2023. 3 Blip-2: Bootstrapping language-image pre-training with
[16] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and frozen image encoders and large language models. In ICML,
Qifeng Chen. Latent video diffusion models for high-fidelity 2023. 4
video generation with arbitrary lengths. arXiv:2211.13221, [31] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai
2022. 2 Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao.

9
Videochat: Chat-centric video understanding. arXiv preprint [46] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
arXiv:2305.06355, 2023. 4 Convolutional networks for biomedical image segmentation.
[32] Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, In MICCAI, 2015. 2
Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong [47] Masaki Saito, Shunta Saito, Masanori Koyama, and So-
Wang. Videogen: A reference-guided latent diffusion ap- suke Kobayashi. Train sparsely, generate densely: Memory-
proach for high definition text-to-video generation. arXiv efficient unsupervised training of high-resolution temporal
preprint arXiv:2309.00398, 2023. 2 gan. IJCV, 2020. 1, 3
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. [48] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An,
Improved baselines with visual instruction tuning. arXiv Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual,
preprint arXiv:2310.03744, 2023. 4 Oran Gafni, et al. Make-a-video: Text-to-video generation
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. without text-video data. In ICLR, 2023. 1, 2, 3
Visual instruction tuning. arXiv preprint arXiv:2304.08485, [49] Zeina Sinno and Alan Conrad Bovik. Large-scale study of
2023. 4 perceptual video quality. IEEE Trans. Image Process., 28(2):
[35] Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, 612–627, 2019. 5
Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, and [50] Spencer Sterling. Zeroscope. https://huggingface.
Mike Zheng Shou. Dynvideo-e: Harnessing dynamic nerf co/cerspense/zeroscope_v2_576w, 2023. 1, 2
for large-scale motion-and view-change human-centric video [51] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon
editing. arXiv preprint arXiv:2310.10624, 2023. 3 Shlens, and Zbigniew Wojna. Rethinking the inception ar-
[36] Xiaohong Liu, Xiongkuo Min, Wei Sun, et al. Ntire 2023 chitecture for computer vision. In CVPR, 2016. 2
quality assessment of video enhancement challenge, 2023. 3 [52] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,
and Manohar Paluri. Learning spatiotemporal features with
[37] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong
3d convolutional networks. In ICCV, 2015. 3
Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond
[53] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach,
Chan, and Ying Shan. Evalcrafter: Benchmarking and eval-
Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To-
uating large video generation models, 2023. 6, 7
wards accurate generative models of video: A new metric &
[38] Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng
challenges. arXiv:1812.01717, 2018. 3
Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A bench-
[54] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach,
mark for fine-grained evaluation of open-domain text-to-
Raphaël Marinier, Marcin Michalski, and Sylvain Gelly.
video generation. arXiv preprint arXiv:2311.01813, 2023.
Fvd: A new metric for video generation. In ICLR, 2019.
2, 7
1, 3
[39] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa-
[55] Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Ex-
had Shahbaz Khan. Video-chatgpt: Towards detailed video
ploring clip for assessing the look and feel of images, 2022.
understanding via large vision and language models. arXiv
3, 5
preprint arXiv:2306.05424, 2023. 4
[56] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang,
[40] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video
Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin technical report. arXiv:2308.06571, 2023. 1, 2
Ling. Expanding language-image pretrained models for gen- [57] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang,
eral video recognition, 2022. 6 Xiang Wang, and Shiwei Zhang. Modelscope text-to-video
[41] OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 7 technical report. arXiv preprint arXiv:2308.06571, 2023. 2
[42] Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, [58] Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li,
Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Shin’ichi Bharath Hariharan, Aleksander Holynski, and Noah Snavely.
Satoh. Toward verifiable and reproducible human evaluation Tracking everything everywhere all at once. arXiv preprint
for text-to-image generation. In The IEEE/CVF Conference arXiv:2306.05422, 2023. 4
on Computer Vision and Pattern Recognition, 2023. 2, 7 [59] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen
[43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap at-
Zhu. Bleu: a method for automatic evaluation of machine tention in spatiotemporal diffusions for text-to-video gener-
translation. In Proceedings of the 40th Annual Meeting of ation. arXiv:2305.10874, 2023. 2
the Association for Computational Linguistics, pages 311– [60] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou,
318, Philadelphia, Pennsylvania, USA, 2002. Association for Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu,
Computational Linguistics. 7 Peiqing Yang, et al. Lavie: High-quality video generation
[44] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On with cascaded latent diffusion models. arXiv:2309.15103,
aliased resizing and surprising subtleties in gan evaluation. 2023. 2
In CVPR, pages 11410–11420, 2022. 2, 3 [61] Zhou Wang and Alan C Bovik. Mean squared error: Love
[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya it or leave it? a new look at signal fidelity measures. IEEE
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Signal Process Mag, 2009. 2
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- [62] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P
ing transferable visual models from natural language super- Simoncelli. Image quality assessment: from error visibility
vision. In ICML, 2021. 1, 2, 3, 6, 7, 8 to structural similarity. TIP, 2004. 2

10
[63] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P [76] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang,
Simoncelli. Image quality assessment: from error visibility Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh,
to structural similarity. TIP, 2004. 2 Luke Zettlemoyer, and Christoph Feichtenhofer. Demystify-
[64] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten ing clip data. arXiv preprint arXiv:2309.16671, 2023. 5, 7,
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 8
Chain-of-thought prompting elicits reasoning in large lan- [77] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang,
guage models. Advances in Neural Information Processing Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn
Systems, 35:24824–24837, 2022. 4 of lmms: Preliminary explorations with gpt-4v (ision). arXiv
[65] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, preprint arXiv:2309.17421, 9, 2023. 4
Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast- [78] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei
vqa: Efficient end-to-end video quality assessment with frag- Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou.
ment sampling. In ECCV, 2022. 3, 5, 7, 8 mplug-owl2: Revolutionizing multi-modal large language
[66] Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, model with modality collaboration. arXiv preprint
Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin. Neigh- arXiv:2311.04257, 2023. 4, 7
bourhood representative sampling for efficient end-to-end [79] Joong Gon Yim, Yilin Wang, Neil Birkbeck, and Balu
video quality assessment, 2023. 3, 5 Adsumilli. Subjective quality assessment for youtube ugc
[67] Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, dataset. In ICIP, 2020. 3
Wenxiu Sun, Qiong Yan, and Weisi Lin. Discovqa: Temporal [80] Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang,
distortion-content transformers for video quality assessment. Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie
IEEE Transactions on Circuits and Systems for Video Tech- Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffu-
nology, 33(9):4840–4854, 2023. 3 sion over diffusion for extremely long video generation.
arXiv:2303.12346, 2023. 2
[68] Haoning Wu, Liang Liao, Chaofeng Chen, Jingwen Hou
[81] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram,
Hou, Erli Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, and
and Alan Bovik. Patch-vq: ’patching up’ the video quality
Weisi Lin. Exploring opinion-unaware video quality assess-
problem. In CVPR, 2021. 3, 5, 6
ment with semantic affinity criterion. In International Con-
ference on Multimedia and Expo (ICME), 2023. 3 [82] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui
Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng
[69] Haoning Wu, Liang Liao, Annan Wang, Chaofeng Chen,
Shou. Show-1: Marrying pixel and latent diffusion models
Jingwen Hou Hou, Erli Zhang, Wenxiu Sun Sun, Qiong Yan,
for text-to-video generation. arXiv:2309.15818, 2023. 1, 2,
and Weisi Lin. Towards robust text-prompted semantic cri-
3
terion for in-the-wild video quality assessment, 2023. 3, 5
[83] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An
[70] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- instruction-tuned audio-visual language model for video un-
wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi derstanding. arXiv preprint arXiv:2306.02858, 2023. 7
Lin. Exploring video quality assessment on user generated
[84] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht-
contents from aesthetic and technical perspectives. In ICCV,
man, and Oliver Wang. The unreasonable effectiveness of
2023. 3, 5, 7, 8
deep features as a perceptual metric. In Proceedings of the
[71] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- IEEE conference on computer vision and pattern recogni-
wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi tion, pages 586–595, 2018. 2, 5
Lin. Towards explainable video quality assessment: A [85] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xi-
database and a language-prompted approach. In ACM MM, aopeng Zhang, Wangmeng Zuo, and Qi Tian. Con-
2023. 5, 6, 7, 8 trolvideo: Training-free controllable text-to-video genera-
[72] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, tion. arXiv:2305.13077, 2023. 3
Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong [86] Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao
Yan, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng
for general-purpose foundation models on low-level vision. Shou. Motiondirector: Motion customization of text-to-
2023. 3 video diffusion models. arXiv preprint arXiv:2310.08465,
[73] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, 2023. 2, 3
Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and [87] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv,
Mike Zheng Shou. Tune-a-video: One-shot tuning of im- Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video
age diffusion models for text-to-video generation. In ICCV, generation with latent diffusion models. arXiv:2211.11018,
2023. 2, 3 2022. 2
[74] Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jin- [88] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
bin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Liu. Learning to prompt for vision-language models. Inter-
Huang, Yuanxi Sun, et al. Cvpr 2023 text guided video edit- national Journal of Computer Vision (IJCV), 2022. 5
ing competition. arXiv preprint arXiv:2310.16003, 2023. [89] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo-
[75] Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, and Yu-Gang hamed Elhoseiny. Minigpt-4: Enhancing vision-language
Jiang. Simda: Simple diffusion adapter for efficient video understanding with advanced large language models. arXiv
generation. arXiv:2308.09710, 2023. 3 preprint arXiv:2304.10592, 2023. 4

11
Towards A Better Metric for Text-to-Video Generation
Supplementary Material
7. More Details for Subjective Study
Annotation Interface. In Fig. 11, we show the annotation interface for the subjective study in the TVGE dataset. The text alignment scores and quality scores are annotated separately to avoid distraction from each other. The input text prompt is shown only for the alignment annotation (where it is necessary) and is not displayed for the quality annotation, so that the quality score can ideally reflect only the visual quality of the generated video.

Training Materials. Before annotation, we provide clear criteria with abundant examples on a 5-point Likert scale to train the annotators. For text alignment, we specifically instruct annotators to evaluate the videos based solely on the presence of each element mentioned in the text description, intentionally ignoring video quality. For video quality, we ask the subjects to focus exclusively on technical distortions. We provide five examples for each of the six common distortions: 1) noise; 2) artifacts; 3) blur; 4) unnatural motion; 5) inconsistent structure; and 6) flickering. Samples of the annotated videos can be viewed in Fig. 8.

Figure 7. Qualitative Examples for Auxiliary Trajectory. Using an auxiliary trajectory effectively enhances multimodal large language models (MLLMs) for fine-grained temporal understanding. Please click and play using Adobe Acrobat.

8. Additional Results

Effect of Auxiliary Trajectory for T2VScore-A. As mentioned in Sec. 3.1, we utilize the auxiliary point-level trajectory generated by CoTracker [23] to enhance fine-grained temporal understanding. Fig. 7 presents video samples that exhibit temporal nuances which state-of-the-art multimodal language models (MLLMs) often fail to detect. Using the trajectory as auxiliary information effectively improves the MLLMs' understanding of subtle temporal changes in camera and object motion. For instance, the snake in row 5 appears motionless, though the camera is moving. Upon ablating the auxiliary trajectory, we observe a decrease in visual question answering (VQA) accuracy from 0.58 to 0.48, as shown in Tab. 6. This reduction in VQA accuracy further leads to a diminished alignment with human judgment (see Tab. 5).

Table 5. Effect of Auxiliary Trajectory. Spearman's ρ, Kendall's ϕ, and Pearson's ρ are included for correlation calculation.

T2VScore-A              Spearman's ρ   Kendall's ϕ   Pearson's ρ
GPT-4V w/o trajectory      0.4454        0.3289        0.4416
GPT-4V                     0.4859        0.3600        0.4882

Performance of state-of-the-art MLLMs in VQA. We set up an evaluation set of 500 videos (100 prompts with 5 unique videos per prompt) sampled from our TVGE dataset. Two annotators are tasked with responding to the generated questions, and a third, more experienced annotator is assigned to verify these responses. We compare the accuracy of visual question answering (VQA) across a range of multimodal large language models (MLLMs), focusing on spatial and temporal QA. As shown in Tab. 6, current MLLMs generally demonstrate weak performance in open-domain VQA tasks, with temporal QA faring even worse. Notably, video-based MLLMs are inferior in temporal QA compared to their image-based counterparts. A similar observation is made in SEED-Bench [25], indicating significant room for further improvement in video-based MLLMs.

Table 6. Accuracy of Visual Question Answering.

Model                       Overall   Temporal QA   Spatial QA
random guess                0.2369      0.2452        0.2327
Otter†                      0.1460      0.1059        0.1636
Video-LLaMA†                0.4074      0.3459        0.4435
mPLUG-OWL2-V†               0.5305      0.4280        0.5880
InstructBLIP∗               0.5013      0.4762        0.5127
mPLUG-OWL2-I∗               0.5107      0.4333        0.5600
GPT-4V∗ (w/o trajectory)    0.4791      0.4411        0.5589
GPT-4V∗                     0.5765      0.5077        0.6308
† via Video QA; ∗ via Image QA
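The per-category numbers in Tab. 6 are plain accuracies over the verified answer set, split by whether a question targets temporal or spatial content. A small sketch, assuming each question record carries a 'category' field assigned during question generation:

```python
from collections import defaultdict
from typing import Dict, List

def vqa_accuracy_breakdown(records: List[Dict]) -> Dict[str, float]:
    """records: dicts with 'category' ('temporal' or 'spatial'),
    'predicted' (model answer), and 'answer' (verified reference)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        hit = int(r["predicted"].strip().lower() == r["answer"].strip().lower())
        for key in ("overall", r["category"]):
            total[key] += 1
            correct[key] += hit
    return {k: correct[k] / total[k] for k in total}
```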
Figure 8. Human Annotation. Generated videos and their human ratings of text alignment and video quality. The scores are the mean of
10 annotators’ ratings. Please click and play using Adobe Acrobat.

Figure 9. More Examples for T2VScore-A. We showcase more examples illustrating how T2VScore-A is computed. Please click and
play using Adobe Acrobat.

# Task Description:
The T2VScore-A is an evaluator for assessing the text alignment of video content generated from textual descriptions. It
scrutinizes the video descriptions and formulates structured questions and answers to ensure the video content aligns precisely
with the provided description.

# Task Steps and Format Specification:


## Input Processing:
On receiving a video description, the T2VScore-A decomposes it into atomic tuples, ensuring each tuple is the smallest unit of
meaning that accurately represents an aspect of the video. Each atomic tuple consists of a correlation: the first element
indicates a global or local Object of the video, and the second element specifies the attribute or detail of that Object,
including but not limited to' activity', 'attribute', 'counting', 'color', 'material', 'spatial', 'location', 'shape', 'OCR',
etc. Questions are formulated based on the atomic tuples' count and order.

## Question Generation:
Generate questions for each atomic tuple, targeting a specific video aspect indicated by the tuple. Questions should reflect the
atomicity principle, avoiding over-fragmentation or excessive aggregation of concepts.

## Answer Formulation:
Provide direct, relevant choices for each question based on the atomic tuple. Include “NONE” as an option where the existence of
an entity or attribute is uncertain. Position the answer immediately following the question for clarity.

## Response Format:
Start with a list of the derived atomic tuples from the video description. Follow with each question and its corresponding
choices. Present the correct answer immediately after each question.

## Consideration for Entity Non-Existence:


Questions related to entities must account for the possibility that the entity may not be present in the video. Hence, the
format "If there is a [entity], ..." should be used where applicable.

Input: Iron Man is walking towards the camera in the rain at night, with a lot of fog behind him. Science fiction movie, close-up
Atomic Tuples:
(entity, Iron Man)
(Iron Man, walking)
(global, towards the camera)
(global, in the rain)
(global, at night)
(global, a lot of fog behind him)
(global, Science fiction movie)
(global, close-up)
Questions and Answers:
Q: What is the name of the character in the video?
Choices: Iron Man, Captain America, Thor, Hulk
A: Iron Man
Q: What is the character doing in the video?
Choices: walking, jumping, flying, NONE
A: walking
Q: What direction is the character moving in the video?
Choices: towards the camera, away from the camera, left to right, right to left
A: towards the camera
Q: What is the weather condition in the video?
Choices: sunny, rainy, snowy
A: rainy
Q: What time of day is depicted in the video?
Choices: morning, afternoon, evening, night
A: night
Q: What is behind Iron Man in the video?
Choices: a lot of fog, a cityscape, a forest, a desert
A: a lot of fog
Q: What genre does the video belong to?
Choices: comedy, drama, science fiction, horror
A: science fiction
Q: What type of shot is used in the video?
Choices: wide shot, medium shot, close-up, extreme close-up
A: close-up

Input: 2 Dog and a whale, ocean adventure
Atomic Tuples:
(entity, dog)
(dog, 2)
(entity, whale)
(whale, a)
(global, ocean adventure)
Questions and Answers:
Q: Which type of animal appear in the video except for the whale?
Choices: dog, cat, bird, fish, NONE
A: dog
Q: How many dogs are present in the video?
Choices: 1, 2, 3, more than 3, NONE
A: 2
Q: Which type of animal appears in the video except for the dog?
Choices: whale, cat, bird, fish, NONE
A: whale
Q: How many whales are present in the video?
Choices: 1, 2, 3, more than 3, NONE
A: 1
Q: What is the theme of the video?
Choices: city adventure, forest adventure, ocean adventure, mountain adventure
A: ocean adventure

Input: There's a person, likely in their mid-twenties, with short brown hair and a tattoo on their left forearm. They're wearing a blue hoodie and seem to be checking their phone.
Atomic Tuples:
(entity, person)
(person, a)
(person, mid-twenties)
(person, short hair)
(person, brown hair)
(person, tattoo)
(tattoo, left forearm)
(person, wearing a hoodie)
(hoodie, blue)
(person, checking their phone)
Questions and Answers:
Q: What is the main entity in the video?
Choices: person, dog, cat, bird
A: person
Q: How many persons are present in the video?
Choices: 1, 2, 3, more than 3, NONE
A: 1
Q: If there is a person, what is their approximate age?
Choices: teenager, mid-twenties, thirties, forties, NONE
A: mid-twenties
Q: If there is a person, what is the length of their hair?
Choices: short, medium, long, NONE
A: short
Q: If there is a person, what is the color of their hair?
Choices: black, brown, blonde, red, NONE
A: brown
Q: If there is a person, do they have a tattoo?
Choices: yes, no
A: yes
Q: If there is a person, where is their tattoo located?
Choices: left forearm, right forearm, back, chest, NONE
A: left forearm
Q: If there is a person, what are they wearing?
Choices: hoodie, t-shirt, sweater, NONE
A: hoodie
Q: If there is a person, what color is the hoodie they are wearing?
Choices: blue, red, black, white, NONE
A: blue
Q: If there is a person, what are they doing in the video?
Choices: checking their phone, reading a book, eating, sleeping, NONE
A: checking their phone

Figure 10. Prompt for Question/Answer Generation in T2VScore-A. Top: task instruction; Bottom: in-context learning examples.
(a) Annotation Interface for Video Quality. The video is presented to subjects to be rated a quality score among [1,5].
(b) Annotation Interface for Text Alignment. The video and its text prompt are presented to subjects to be rated an alignment score among [1,5].

Figure 11. Annotation Interface for Video Quality (a) and Text Alignment (b).
