SIPaper RG
Analysis of Spatial and Temporal Information Variation for 10-Bit and 8-Bit
Video Sequences
All content following this page was uploaded by Nabajeet Barman on 27 March 2021.
Abstract—Spatial Information (SI) and Temporal Information (TI) have been used widely as an approximate estimation of video complexity. Recently, SI (and TI) have found use in many other applications such as Quality of Experience modeling, Bandwidth and Rate-distortion modeling, etc., for both traditional and non-traditional (gaming, dynamic vision sensors, etc.) videos. It is often assumed that SI and TI depend only on video content, whereas factors such as resolution, bit depth and compression also have an impact on the values of SI and TI for a specific video content. A systematic study on SI and TI for videos, investigating the effect of different video encoding and processing steps on SI and TI values, has been missing so far. Also, SI and TI calculation has been limited to 8-bit videos, while there has been increasing popularity and usage of 10-bit videos. Towards this end, we present in this paper a comprehensive evaluation of the variation of SI and TI for different 8-bit and 10-bit videos. Results and insights into the variation of SI and TI values for different encoding settings, choice of encoders, temporal pooling methods, resolution, etc. are presented in this study.

Index Terms—Spatial Information, Temporal Information, Video Streaming

I. INTRODUCTION

The measurement of scene complexity can be used to determine the expected data rate and hence the bandwidth requirement or the required compression level of diverse content types. In fact, more spatially and temporally complex videos require a higher data rate to achieve a satisfactory quality. Measuring the scene complexity plays an important role in key applications ranging from the design of video quality metrics well representative of the quality experienced by the actual users [1][2] to the clustering and classification of different video sequences [3]. The metrics to measure scene complexity are widely varied, ranging from subjective complexity measures [4] to diverse objective metrics [5]. The spatial information of an image [6], as a measure of edge energy, is one of the most widely used metrics for scene complexity estimation. Spatial Information (SI) and Temporal Information (TI), defined by ITU-T Rec. P.910 [7] as an approximate measure of video content complexity, have been widely used in the field of quality assessment, in particular for the selection of the video content to be used for the subjective tests, which should be representative of different complexity classes.

Figure 1 highlights some of the applications which use spatial and temporal information, ranging from rate-distortion modeling ([8]) to clustering and classification ([6], [9], [3]) to QoE evaluation metrics ([10], [2], [11], [1]) to data rate estimation and bandwidth modelling for information acquired by specific video sensors, such as neuromorphic sensors ([12], [13]). For instance, in [8] scene complexity information is used in terms of the spatial content of frames and temporal information calculated between consecutive frames to derive a rate-distortion model for video sequences. The authors in [10] measure the video quality objectively by utilizing the spatial content of the sequences. The authors in [2], [11] and [1] proposed machine learning based QoE models, where spatial and temporal information values are used along with other influence factors for quality estimation of gaming videos. The applications of spatial and temporal information are not limited to traditional video sequences and have found application in other fields such as neuromorphic engineering. For example, the authors in [12], [13] proposed several spatial information based models to predict the data rate output by Dynamic Vision Sensors (DVS).

[Fig. 1 diagram: "Spatial and/or Temporal Information Example Applications", branching into: Rate-distortion modelling for Scalable Video Coding [5]; Clustering and classification of content complexity for gaming and non-gaming videos [3, 6, 7]; QoE-based objective Video Quality metrics (statistical and machine learning based metrics) [8, 9, 10, 11]; Bandwidth modelling of Dynamic Vision Sensors for Visual Sensor Networks [12, 13].]
Fig. 1: Example applications of the spatial and temporal information.

A. Spatial and Temporal Information

In this section, we report the mathematical definitions of the spatial and temporal information. Let $g_h$ and $g_v$ denote horizontal and vertical gradients, respectively, of a grey-scale image, evaluated via filtering the grey-scale image with horizontal and vertical Sobel kernels. The magnitude of spatial information calculated at pixel $p$, $SI_p$, is represented as:

$$SI_p = \sqrt{g_h^2 + g_v^2}. \tag{1}$$

The SI statistics used for pooling, to characterize the Spatial Index of an image, are the mean ($SI_{mean} = \frac{1}{P}\sum_{p} SI_p$) and the standard deviation of the magnitude of spatial information ($SI_{std} = \sqrt{\frac{1}{P}\sum_{p} (SI_p - SI_{mean})^2}$), where $P$ is the number of
pixels in the image. For video sequences, ITU-T Rec. P.910 [7] defines spatial information as:

$$SI = \max_{time}\{SI_{std}\}. \tag{2}$$

According to (2), $SI_{std}$ is computed for each of the frames in the video sequence and the maximum of $SI_{std}$, among all the frames, is taken (over the whole time duration of the sequence). ITU-T Rec. P.910 [7] defines temporal information as:

$$TI = \max_{time}\{\mathrm{std}[M_p^n]\} \tag{3}$$

$$M_p^n = F_p^n - F_p^{n-1} \tag{4}$$

where $M_p^n$ is the pixel intensity difference between $F_p^n$, current frame $n$, and $F_p^{n-1}$, previous frame $n-1$. For the difference frame the standard deviation is applied across all the pixels. According to (3), the standard deviation of $M_p^n$ is computed for every frame and the maximum is taken over the entire time duration of the video sequence.

B. Contributions

The SI and TI measures, as defined by ITU-T P.910, have been widely used in the research community as an approximate measure of content complexity, but such an evaluation so far has been limited to 8-bit videos. There exists a research gap in the evaluation and analysis of SI and TI values for videos of higher bit-depth (10/12 bits, etc.). Towards this end, our evaluation of open source tools¹ revealed that these are incompatible with SI and TI calculation of videos of bit-depth higher than 8 bits. Also, a systematic study of SI and TI values associated with different choices of encoder, encoding settings, temporal pooling method, etc. is missing from the literature so far. An in-depth analysis of SI and TI values would help researchers in the design of better QoE estimation models, calculation of SI and TI for higher bit-depth videos, clustering and classification strategies, etc. and possibly find new application areas of SI and TI. Towards this end, we define the following six objectives which address some of the research gaps discussed:

1) To evaluate SI and TI values for 10-bit uncompressed video sequences.
2) To study the relationship between SI and TI values for 8-bit and for 10-bit representation of a video sequence.
3) To evaluate the effect of different temporal pooling methods (Mean, Median and Minimum), other than the currently used "maximum" pooling method, on SI and TI values and on their capability to serve as an indicator for video complexity.
4) To study the effect of different compression standards (H.264/MPEG-AVC and H.265/MPEG-HEVC) on SI and TI values.
5) To evaluate the effect of different encoding settings on the behavior of SI and TI values.
6) To study the variations of SI and TI with different resolutions.

The remainder of this paper is organized as follows. We present the evaluation methodology describing the source video sequences and encoding/processing settings in Section II. Results and observations, addressing each objective mentioned above, are presented in Section III. Section IV concludes the work along with a brief discussion of possible future works.

II. EVALUATION METHODOLOGY

A. Source Sequences

In this work, we used a total of eight pristine, uncompressed videos of 4K/UHD resolution from [14], [15] and [16]. Fig. 2 shows screenshots of the considered video sequences, highlighting the different types of content utilized in this work, and Table I summarizes the characteristics of the selected reference video sequences. In order to neglect any effect of the device used to capture the video and/or the capture settings, source video framerate and/or duration, the reference video sequences were selected from different content providers and are of different genres, duration, framerates and resolution. It is important to note that the selected content is representative of the commonly streamed video sequences on YouTube, Netflix, Amazon Prime Video, etc. Sequences C1-C4 are from different episodes of the famous Netflix video sequence Chimera depicting various activities (Bar scene, Netflix card twirl, Seaside and pier, and Toddler and fountain). Sequences C5-C7 are from the SJTU 4K dataset and depict a campfire scene at night, fountains, and runners running at a competition. The last sequence, C8, is the reconstructed 4K sequence of the famous video clip Suzie, which depicts a girl answering a telephone.

[Fig. 2 panels: (a) ChimeraEP01 (b) ChimeraEP10 (c) ChimeraEP11 (d) ChimeraEP16]

TABLE I: Summary of the eight reference video sequences
Sequence ID   Sequence      Resolution   Frame rate (fps)   Duration (s)
C1            ChimeraEP01   4096x2160    59.94              10
C2            ChimeraEP10   4096x2160    59.94              10
C3            ChimeraEP11   4096x2160    59.94              10
C4            ChimeraEP16   4096x2160    59.94              10
C5            Campfire      3840x2160    25                 12
C6            Fountains     3840x2160    25                 12
C7            Runners       3840x2160    25                 12
C8            Suzie         3840x2160    25                 9.6

B. Video Processing

Table II summarizes the video characteristics and encoding settings used in this work. We restrict our analysis to short duration video sequences of YUV planar colorspace with 4:2:0 chroma subsampling (YUV420), which is currently the most widely used chroma subsampling across all video streaming and broadcast applications. All video processing tasks such as encoding, 10-bit to 8-bit conversion, chroma subsampling conversion, etc. are done using FFmpeg². For the first four video sequences (C1-C4), both 8-bit and 10-bit versions are made available by Netflix at 59.94 fps and YUV422 pixel format. Such video sequences were cut into four sequences of approximately 10 seconds at original resolution and frame rate and were subsampled to YUV420 pixel format. The remaining four sequences (C5-C8) were already of shorter duration, 10-bit depth and different pixel formats. They were first processed to create the YUV420 pixel format, 10-bit and 8-bit depth versions.

TABLE II: Video characteristics and encoding settings
Parameter                    Value
Number of Reference Videos   4 (8-bit) + 4 (10-bit) = 8
Chroma Subsampling           YUV420
Frame rate                   59.94, 25
Encoder                      FFmpeg
Encoding Mode                CRF (23, 30), Fixed Bitrate (1, 5 Mbps)
Video Compression Standard   H.264, H.265
Preset                       Medium (default)

Fig. 3: SI vs. TI plot for 8-bit and 10-bit reference video sequences. (a) SI vs. TI plot for the eight reference 8-bit videos. (b) SI vs. TI plot for the eight reference 10-bit videos.

As defined in ITU-T Rec. P.910, SI and TI calculations are performed only on the luminance (Y) channel of the YUV colorspace. We used MATLAB to read the YUV videos and then performed all SI and TI calculations on the Y channel for all the frames of the YUV video.

To address objectives 3) - 6), the encoding of the reference video sequences was required. For brevity, and based on our findings that all selected reference video sequences exhibited the same behaviour during our initial studies addressing objectives 1) and 2), we restricted this analysis to the first four video sequences (C1-C4). Both the 8-bit and 10-bit versions of the video sequences C1-C4 were then encoded at two different encoding settings (constant bitrate and constant rate factor) using two of the most widely used video compression standards (H.264/MPEG-AVC and H.265/MPEG-HEVC). For the encoders we used the FFmpeg libraries libx264 and libx265, which are the H.264/MPEG-4 AVC and H.265/HEVC encoder wrappers, respectively. The encoded video sequences were decoded back to rawvideo (YUV) format for SI and TI calculations, as is commonly done in the literature. In order to not influence the results due to choice of other encoding settings (preset, GOP size, codec

¹ https://github.com/Telecommunication-Telemedia-Assessment/SITI/blob/master/python/siti.py, and https://github.com/slhck/siti
² https://ffmpeg.org/
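As an illustration of definitions (1)-(4) applied to the luminance plane, the following is a minimal Python sketch; it is not the MATLAB code used in this work. It assumes that each frame is already available as a list of rows of Y samples, drops the one-pixel border where the 3x3 Sobel kernels do not fit, and uses the population standard deviation. The function names (`sobel_magnitudes`, `si_ti`) and these choices are assumptions made for the example only.

```python
import math
from statistics import pstdev


def sobel_magnitudes(y):
    """Return SI_p = sqrt(g_h^2 + g_v^2) for every interior pixel of one Y plane.

    `y` is a list of rows (assumed layout); border pixels are skipped.
    """
    h, w = len(y), len(y[0])
    out = []
    for r in range(1, h - 1):
        up, mid, down = y[r - 1], y[r], y[r + 1]
        for c in range(1, w - 1):
            # Horizontal gradient: right column minus left column (weights 1, 2, 1).
            gh = (up[c + 1] + 2 * mid[c + 1] + down[c + 1]) \
                - (up[c - 1] + 2 * mid[c - 1] + down[c - 1])
            # Vertical gradient: bottom row minus top row (weights 1, 2, 1).
            gv = (down[c - 1] + 2 * down[c] + down[c + 1]) \
                - (up[c - 1] + 2 * up[c] + up[c + 1])
            out.append(math.hypot(gh, gv))
    return out


def frame_si_std(y):
    """SI_std of one frame: standard deviation of SI_p over the pixels."""
    return pstdev(sobel_magnitudes(y))


def frame_ti_std(curr, prev):
    """Standard deviation of the difference frame M^n = F^n - F^(n-1)."""
    diffs = [a - b
             for row_c, row_p in zip(curr, prev)
             for a, b in zip(row_c, row_p)]
    return pstdev(diffs)


def si_ti(frames):
    """Return (SI, TI) using the maximum over time, as in (2) and (3).

    Needs at least two frames, each at least 3x3 pixels.
    """
    si_per_frame = [frame_si_std(y) for y in frames]
    ti_per_frame = [frame_ti_std(frames[n], frames[n - 1])
                    for n in range(1, len(frames))]
    return max(si_per_frame), max(ti_per_frame)


if __name__ == "__main__":
    # Tiny synthetic "video": three 6x8 frames with a non-linear pattern.
    frames = [[[(7 * r * r + 13 * c * c + 5 * n * (r + c)) % 256
                for c in range(8)] for r in range(6)] for n in range(3)]
    print(si_ti(frames))
```

Because both the Sobel filtering and the frame difference are linear in pixel intensity, multiplying every sample by 4 (one simple way of mapping 8-bit code values onto a 10-bit range) multiplies both outputs of `si_ti` by 4. A real 10-bit source need not be an exact multiple of its 8-bit version, so this should be read as a property of the sketch rather than of any particular conversion tool.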
TABLE III: Ratio of SI_10bit to SI_8bit (R_SI) and TI_10bit to TI_8bit (R_TI) for the eight video sequences.
Sequence ID   SI_10bit   SI_8bit   R_SI   TI_10bit   TI_8bit   R_TI
C1            150.20     37.55     4.00   229.83     57.46     4.00
C2            135.30     33.83     4.00   262.37     65.59     4.00
C3            190.98     47.75     4.00   218.14     54.54     4.00
C4            195.98     48.99     4.00   216.15     54.04     4.00
C5            179.82     44.95     4.00   142.08     35.52     4.00
C6            187.63     46.89     4.00   47.25      11.82     4.00
C7            220.35     55.08     4.00   101.38     25.35     4.00
C8            113.49     28.36     4.00   123.89     30.97     4.00

also holds true when other temporal pooling methods such as minimum, mean and median are considered instead of max as defined in the ITU standard. Towards this end, we define the following:

$$SI_{Std-Min} = \min_{time}\{SI_{std}\}, \tag{5}$$

$$SI_{Std-Mean} = \mathrm{mean}_{time}\{SI_{std}\}, \text{ and} \tag{6}$$

$$SI_{Std-Median} = \mathrm{median}_{time}\{SI_{std}\}. \tag{7}$$
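The pooling variants in (5)-(7), together with the maximum used in (2), differ only in the statistic applied to the per-frame $SI_{std}$ values. The short Python sketch below makes this explicit; the helper name `pool_over_time`, the `POOLING` table and the example values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Temporal pooling functions: "max" corresponds to (2), the others to (5)-(7).
POOLING = {"max": max, "min": min, "mean": mean, "median": median}


def pool_over_time(per_frame_values, method="max"):
    """Pool per-frame SI_std (or per-frame TI) values over the whole sequence."""
    values = list(per_frame_values)
    if not values:
        raise ValueError("at least one per-frame value is required")
    if method not in POOLING:
        raise ValueError(f"unknown pooling method: {method}")
    return POOLING[method](values)


if __name__ == "__main__":
    # Hypothetical per-frame SI_std values, not measurements from this work.
    si_std_per_frame = [40.0, 38.0, 46.0, 41.0]
    for name in ("max", "min", "mean", "median"):
        print(name, pool_over_time(si_std_per_frame, name))
```

The same helper can be applied to the per-frame standard deviations of the difference frames to obtain the corresponding TI variants.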
[Figure: SI for ChimeraEP01, ChimeraEP10, ChimeraEP11 and ChimeraEP16, and SI_x265 plotted against SI_x264, for CRF = 23 and CRF = 30 and for bitrates of 1 Mbps and 5 Mbps.]