
Developments in International Video Coding Standardization After AVC, With an Overview of Versatile Video Coding (VVC)

This article provides a comprehensive overview of video coding standards jointly developed by ISO/IEC and ITU-T considering both high-efficiency video coding (HEVC) and versatile video coding (VVC).

By Benjamin Bross, Member IEEE, Jianle Chen, Senior Member IEEE, Jens-Rainer Ohm, Member IEEE, Gary J. Sullivan, Fellow IEEE, and Ye-Kui Wang

ABSTRACT | In the last 17 years, since the finalization of the first version of the now-dominant H.264/Moving Picture Experts Group-4 (MPEG-4) Advanced Video Coding (AVC) standard in 2003, two major new generations of video coding standards have been developed. These include the standards known as High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC). HEVC was finalized in 2013, repeating the ten-year cycle time set by its predecessor and providing about 50% bit-rate reduction over AVC. The cycle was shortened by three years for the VVC project, which was finalized in July 2020, yet again achieving about a 50% bit-rate reduction over its predecessor (HEVC). This article summarizes these developments in video coding standardization after AVC. It especially focuses on providing an overview of the first version of VVC, including comparisons against HEVC. Besides further advances in hybrid video compression, as in previous development cycles, the broad versatility of the application domain that is highlighted in the title of VVC is explained. Included in VVC is the support for a wide range of applications beyond the typical standard- and high-definition camera-captured content codings, including features to support computer-generated/screen content, high dynamic range content, multilayer and multiview coding, and support for immersive media such as 360° video.

KEYWORDS | Compression; H.265; H.266; High Efficiency Video Coding (HEVC); Joint Video Experts Team (JVET); Moving Picture Experts Group (MPEG); standards; versatile supplemental enhancement information (VSEI); Versatile Video Coding (VVC); video; video coding; Video Coding Experts Group (VCEG); video compression.

Manuscript received March 9, 2020; revised October 31, 2020; accepted November 29, 2020. Date of publication January 19, 2021; date of current version August 20, 2021. This work was supported by Fraunhofer-Gesellschaft. (Corresponding author: Benjamin Bross.)
Benjamin Bross is with the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, 10587 Berlin, Germany (e-mail: [email protected]).
Jianle Chen is with Qualcomm Inc., San Diego, CA 92121 USA.
Jens-Rainer Ohm is with the Institute for Communications Engineering, RWTH Aachen University, 52062 Aachen, Germany.
Gary J. Sullivan is with Microsoft Corporation, Redmond, WA 98052 USA.
Ye-Kui Wang is with Bytedance Inc., San Diego, CA 92130 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/JPROC.2020.3043399
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

In 2013, the first version of the High Efficiency Video Coding (HEVC) standard was finalized [1], providing about a 50% bit-rate reduction compared with its predecessor, the H.264/MPEG-4 Advanced Video Coding (AVC) standard [2]. Both standards were jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). AVC itself had provided about 50% bit-rate reduction compared with the H.262/MPEG-2 Video standard, which had been produced a decade earlier and was also a joint project of the same organizations [3]–[5]. Now, as of July 2020, VCEG and MPEG have also finalized the Versatile Video Coding (VVC) standard [6], aiming at yet another 50% bit-rate reduction and providing a range of additional functionalities. The VVC standard is accompanied by an associated metadata specification called the versatile supplemental enhancement information (VSEI) standard [7].
Currently, along with the major increases in the reach and speed of broadband Internet services, the share of video in global data traffic is already about 80% and is continuing to grow [8]. In addition, the proportion of household TV sets with 4k (3840 × 2160) resolution is steadily growing, and these higher resolution TVs require higher quality video content in order to reach their full potential. Although practically every 4k TV is equipped with an HEVC decoder to play back high-quality 4k video, the data rates necessary to deliver that content are still rather high, stretching the limits of broadband capacity. This illustrates the need for even more efficient compression than the current HEVC standard can provide—a need now further addressed by VVC.

In addition to its high compression performance, VVC was designed to facilitate efficient coding for a wide range of video content and applications, including the following:
1) video beyond the standard and high definitions, including even higher resolution (up to 8k or larger), high dynamic range (HDR), and wide color gamut;
2) computer-generated or screen content, as occurs especially in computer desktop screen sharing and online gaming applications;
3) 360° video for immersive and augmented reality.

Furthermore, the first version of VVC includes flexible mechanisms for resolution adaptivity, region-based access, layered coding scalability, coding of various chroma sampling formats, and flexible bitstream handling, such as the extraction and merging of regions from different coded video bitstreams.

The remainder of this article is organized as follows. Section II lays out the motivation, scope, and common basic hybrid video coding design of the major standards. Section III briefly reviews the HEVC standard and its extensions. The most recent advances in video coding technology, as incorporated in the VVC standard, are described in Section IV. Section V presents coding efficiency results comparing VVC and HEVC to each other and to AVC. Finally, this article is concluded with an outlook in Section VI.

II. VIDEO CODING STANDARDS

Modern video coding standards have been developed to efficiently transmit and store digital video with a variety of requirements on bit rate, picture quality, delay, random accessibility, complexity, and so on. The support for the following applications is of particular importance.
1) Real-time conversational services, for example, video telephony, video conferencing, screen sharing, and cloud gaming, where low delay/latency and reasonable complexity are key requirements (an application recently brought to the forefront by the COVID-19 pandemic);
2) Live broadcast, for example, TV over satellite, cable, and terrestrial transmission channels, where the focus is on picture quality, constant or moderately varying channel bit rate, moderate delay, and frequent random access points for channel tuning-in and channel switching;
3) Video on demand, for example, video streaming over Internet protocol (IP), where the picture quality, bit rate, and adaptation to transmission channels matter most;
4) Capture, streaming, and storage by digital cameras, for example, as used in smartphones, drones, action cameras, security cameras, and professional camera systems.

Fig. 1. Scope of a video coding standard (only the decoder).

End-to-end video compression technology involves, at the source, an encoder to compress the video into a bitstream and, at the sink, a decoder to decompress the bitstream for consumption. The combination of an encoder and a decoder is commonly referred to as a codec. However, the term is somewhat misleading since encoders and decoders are typically implemented as entirely separate products, and in most applications, the number of encoders is very different from the number of decoders. As depicted in Fig. 1, video coding standards have been specifying only the format of the coded data and the operation of the decoder. This includes the structure and syntax of the bitstream and the processes required to reconstruct the decoded video from it, but not the operations performed by an encoder.

Having the decoder standardized ensures interoperability with all compliant decoder devices while allowing encoders to be designed and operated under application-specific constraints on efficiency, computational complexity, power consumption, latency, and other considerations. For example, in a real-time communication scenario, any particular encoder is unlikely to have the time or computing resources to test all possible coding modes and may, thus, sacrifice some coding efficiency for lower latency and/or complexity. The types and degrees of such algorithmic optimizations are deliberately left outside the scope of the standard.

All video coding standards since H.261 in 1988 [9] have been based on the so-called hybrid video coding principle, which is illustrated in Fig. 2. The term hybrid refers to the combination of two means to reduce redundancy in the video signal, that is, prediction and transform coding with quantization of the prediction residual.
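To make this prediction-plus-residual-coding loop concrete, the following minimal Python sketch walks one block through the stages of Fig. 2, with the decoder-side reconstruction modeled inside the encoder. The function names, the orthonormal DCT, and the uniform quantizer are illustrative assumptions of this sketch, not the normative design of any of the standards discussed here; only the QP-to-step-size rule follows the HEVC-style convention.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis; real codecs use integer approximations.
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def code_block(original, prediction, qp):
    """One pass of the hybrid loop for a single NxN block."""
    n = original.shape[0]
    d = dct_matrix(n)
    step = 2 ** ((qp - 4) / 6)            # HEVC-style QP-to-step-size rule (assumption)
    residual = original - prediction      # prediction stage (inter or intra)
    coeffs = d @ residual @ d.T           # separable 2-D transform of the residual
    levels = np.round(coeffs / step)      # quantization: the only lossy step
    # Decoder model inside the encoder: scaling and inverse transform.
    recon = prediction + d.T @ (levels * step) @ d
    return levels, recon                  # levels go to entropy coding; recon feeds the loop

block = np.array([[52, 55, 61, 66]] * 4, dtype=float)
pred = np.full((4, 4), 56.0)
levels, recon = code_block(block, pred, qp=22)
```

In a real encoder, the reconstructed block would additionally pass through the in-loop filters discussed below before being stored as a reference for later pictures.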


Fig. 2. Block diagram of a hybrid video encoder, including the modeling of the decoder within the encoder.

Although prediction and transforms reduce redundancy in the video signal by decorrelation, quantization decreases the data of the transform coefficient representation by reducing their precision, ideally by removing only imperceptible details; in such a case, it serves to reduce irrelevance in the data. This hybrid video coding design principle is also used in the two most recent standards, HEVC and VVC. For a more detailed review of the previous standards, spanning from H.120 [10] to AVC and also including H.261, MPEG-1 Video [11], H.262/MPEG-2 Video [12], H.263 [13], and MPEG-4 Visual [14], the reader is referred to [3].

Referring to Fig. 2, a modern hybrid video coder can be characterized by the following building blocks.

Block partitioning is used to divide the image into smaller blocks for the operation of the prediction and transform processes. The first hybrid video coding standards used a fixed block size, typically 16 × 16 samples for the luma prediction regions and 8 × 8 for the transforms. Starting with H.263, and especially starting with AVC, partitioning became a major part of the design focus. Over the subsequent generations, block partitioning has evolved to become more flexible by adding more and different block sizes and shapes to enable adaptation to the local region statistics. In the prediction stage, this allows an encoder to trade off high accuracy for the prediction (using small blocks) versus a low data rate for the side or prediction information to be signaled (using large blocks). For the coding of residual differences, small blocks enable the coding of fine detail, whereas the large ones can code smooth regions very efficiently. With increasing possibilities for partitioning a picture into blocks, the complexity of an encoder that needs to test the possible combinations and decide which to select also increases compared with a fixed size or limited partitioning set. However, fast partitioning algorithms and advances in computing power have allowed recent standards to provide a high degree of flexibility. AVC, HEVC, and VVC all employ tree-based partitioning structures with multiple depth levels and the blocks as leaf nodes, and VVC additionally provides the ability to use nonrectangular partitions.

Motion-compensated or inter-picture prediction takes advantage of the redundancy that exists between (hence "inter") pictures of a coded video sequence (CVS). A key concept is block-based motion compensation, where the picture is divided into blocks, and for each block, a corresponding area from a previously decoded picture, that is, the reference picture, is used as a prediction for the current block. Assuming that the content of a block moves between pictures with translational motion, the displacement between the current block and the corresponding area in the reference picture is commonly referred to by a 2-D translational motion vector (MV). Finding the best correspondence is typically done at the encoder by a block-matching search that is referred to as motion estimation. The encoder then signals the estimated MV data to the decoder.
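A minimal sketch of such a block-matching search is shown below, using the sum of absolute differences (SAD) as the matching criterion over a small search window. The criterion, the window size, and the exhaustive scan are illustrative choices only, since encoders are free to use any search strategy.

```python
import numpy as np

def motion_estimate(cur, ref, bx, by, n=16, search=8):
    """Find the MV minimizing SAD for the n x n block at (bx, by)."""
    block = cur[by:by + n, bx:bx + n].astype(np.int64)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > ref.shape[1] or y + n > ref.shape[0]:
                continue  # candidate area must lie inside the reference picture
            sad = np.abs(block - ref[y:y + n, x:x + n].astype(np.int64)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best  # the (dx, dy) pair that would be signaled as the MV
```

Real encoders replace this exhaustive integer search with fast hierarchical strategies and then refine the result to fractional-sample positions using interpolation filters, as discussed next.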


H.261 used only integer-valued MVs, and this concept of translational motion compensation was later generalized by using fractional-sample MV accuracy with interpolation (with half-sample precision in MPEG-1 and MPEG-2 videos and quarter-sample from MPEG-4 Visual onward), averaging two predictions from one temporally preceding and one succeeding picture (bidirectional prediction in MPEG-1 and MPEG-2 videos), or from multiple reference pictures with arbitrary relative temporal positions (in standards since AVC). Moreover, the usage of multiple reference pictures from different temporal positions enables hierarchical prediction structures inside a group of pictures (GOP), which further improves coding efficiency. However, when succeeding pictures are used, a structural delay is introduced by requiring a different ordering of the pictures for coding and display [15]. The most recent standard, VVC, even goes beyond the translational motion model by approximating affine motion and using another motion estimation process for motion refinement at the decoder side.

Intra-picture prediction exploits the spatial redundancy that exists within a picture (hence "intra") by deriving the prediction for a block from already coded/decoded, spatially neighboring reference samples. This kind of prediction in the spatial sample domain was introduced with AVC, whereas previous standards used a simplified transform-domain prediction. In AVC, three different types of prediction modes are employed: "DC," planar, and angular, all of them using neighboring samples of previously decoded blocks that are to the left and/or above the block to be predicted. The first, the so-called DC mode, averages the neighboring reference samples and uses this value as a prediction for the entire block, that is, for every sample. The second, that is, the planar mode, models the samples to be predicted as a plane by position-dependent linear combinations of the reference samples. As the third option, the angular modes interpolate the reference samples along a specific direction/angle. For example, the vertical angular mode just copies the above reference samples along each column. HEVC extended these modes, for example, by increasing the number of angles from 8 to 33, whereas the most recent VVC standard not only further extended the number of modes but also incorporates new methods, such as a matrix-based intra-picture prediction (MIP), which was designed using machine learning [16]. Similar to motion information in inter-picture prediction, the encoder signals the estimated prediction information, that is, the intra-picture prediction mode, to the decoder.

Transformation decorrelates a signal by transforming it from the spatial domain to a transformed domain (typically a frequency domain), using a suitable transform basis. Hybrid video coding standards apply a transform to the prediction residual (regardless of whether it comes from inter- or intra-picture prediction), that is, the difference between the prediction and the original input video signal, as shown in Fig. 2. In the transform domain, the essential information typically concentrates into a small number of coefficients. At the decoder, the inverse transform needs to be applied to reconstruct the residual samples. One example of a transform basis is the Karhunen–Loève transform (KLT), which is considered an optimal decorrelator but depends on correlation characteristics of the input signal that are ordinarily not known at the decoder. Another example is the discrete cosine transform (DCT), which has been used since H.261 for hybrid video compression and is also used in the well-known JPEG image compression standard (which was designed around the same time as H.261) [17]. The DCT decorrelates about as well as the KLT for highly correlated autoregressive sources and is easier to compute. In later standards, starting with H.263 version 3 and AVC, integer-based reduced-complexity transforms are used that are often informally called DCTs, although a true DCT uses trigonometric basis functions involving irrational numbers and supports additional factorizations. In order to account for different statistics in the source signal, it can be beneficial to choose between multiple transforms, as in HEVC and VVC. Furthermore, applying an additional transform on the transform coefficients, as in VVC, can further decorrelate the signal.

Quantization aims to reduce the precision of an input value or a set of input values in order to decrease the amount of data needed to represent the values. In hybrid video coding, the quantization is typically applied to individual transformed residual samples, that is, to transform coefficients, resulting in integer coefficient levels. As can be seen in Fig. 2, this process is applied at the encoder. At the decoder, the corresponding process is known as inverse quantization or simply as scaling, which restores the original value range without regaining the precision. The precision loss makes quantization the primary element of the block diagram for hybrid video coding that introduces distortion. Quantization together with scaling can be seen as a rounding operation with a step size controlling the precision. In recent video coding standards, the step size is derived from a so-called quantization parameter (QP) that controls the fidelity and bit rate. A larger step size (larger QP) lowers the bit rate but also deteriorates the quality, which, for example, results in video pictures exhibiting blocking artifacts and blurred details. Typically, each sample is quantized independently, which is referred to as scalar quantization. In contrast to this, vector quantization processes a set of samples jointly, for example, by mapping a block onto a vector from a codebook. At least from the decoder perspective, all recent video coding standards prior to HEVC have employed only scalar quantization. HEVC includes a trick known as sign data hiding that can be viewed as a form of vector quantization, and VVC introduces dependent quantization (DQ), which can be interpreted as a kind of sliding-block vector quantization because the quantization of a sample depends on the states of previous samples. Advanced techniques for optimized encoding with prior standards can also be viewed as vector quantization while appearing to be scalar quantization from the decoder perspective.

Entropy coding assigns codewords to a discrete-valued set of source symbols by taking into account their statistical properties, that is, relative frequency. All recent video coding standards use variable-length coding (VLC) tables that assign shorter codewords to symbols with a higher frequency of occurrence in order to approach the entropy. The way to design codeword tables in earlier standards was based on Huffman coding (with minor adjustments).
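One family of such structured VLCs, the exponential-Golomb codes, is used for many header syntax elements in AVC, HEVC, and VVC. The short sketch below of the unsigned order-0 variant illustrates the principle (it is not a transcription of any specification text): small, frequent values receive short codewords, and the code is defined for arbitrarily large values without a stored table.

```python
def exp_golomb_encode(value: int) -> str:
    """Unsigned order-0 exponential-Golomb codeword, as a bit string."""
    x = value + 1
    prefix_len = x.bit_length() - 1           # number of leading zeros
    return "0" * prefix_len + format(x, "b")  # zero prefix + binary of (value + 1)

def exp_golomb_decode(bits: str) -> int:
    zeros = len(bits) - len(bits.lstrip("0"))     # count the leading zeros
    return int(bits[zeros:2 * zeros + 1], 2) - 1  # read zeros + 1 more bits

# 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...
assert [exp_golomb_encode(v) for v in range(4)] == ["1", "010", "011", "00100"]
assert exp_golomb_decode("00100") == 3
```

CABAC, introduced with AVC and discussed next, goes beyond such static codes by adaptively modeling symbol probabilities.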


VLC is typically applied to encode and decode the vast majority of the data, including control data, motion data, and coefficient levels. AVC further improved the VLC scheme for coefficient level coding by using a context-adaptive VLC (CAVLC). A context is determined by the value or a combination of values of previous symbols, which can be used to switch to a VLC table designed for that context. Furthermore, AVC was the first video coding standard that introduced context-adaptive binary arithmetic coding (CABAC) as a second, more efficient entropy coding method. CABAC still uses VLC tables to map symbols, such as the coefficient levels, to binary strings (codewords). However, the binary strings are not written directly to the bitstream; instead, each bit in the binary string is further coded using binary arithmetic coding with context-adaptive probability models. Due to its high efficiency, CABAC has become the sole entropy coding method in the succeeding HEVC and VVC standards.

In-loop filtering is a filtering process (or combination of such processes) that is applied to the reconstructed picture, as illustrated in Fig. 2, where the reconstructed picture is the combination of the reconstructed residual signal (which includes quantization error) and the prediction. The reconstructed picture after in-loop filtering can be stored and used as a reference for inter-picture prediction of subsequent pictures. The name in-loop filtering is motivated by this impact on other pictures inside the hybrid video coding prediction loop. The main purpose of the filtering is to reduce visual artifacts and decrease reconstruction errors. H.263 version 2 is the first standard that used a deblocking in-loop filter, which became a core feature in version 1 of AVC. This filter was designed to be adaptive to the quantization fidelity, so it can attenuate the blocking artifacts introduced by the quantization of block-based prediction residuals while preserving sharp edges in the picture content. HEVC adds a second in-loop filtering stage called sample adaptive offset filtering, which is a nonlinear filter applied after deblocking to attenuate ringing and banding artifacts. In the emerging VVC standard, an adaptive loop filter (ALF) was introduced as a third filter, where, typically, the filter coefficients are determined by minimizing the reconstruction error using a Wiener filter optimization approach. Moreover, in VVC, another process known as luma mapping with chroma scaling (LMCS) can also be applied before the others in the in-loop processing stage.

The next two sections describe the recent developments made over earlier hybrid video coding designs for the HEVC standard and, in more detail, for the most recent VVC standard.

III. HIGH EFFICIENCY VIDEO CODING

The first version of the HEVC standard was finalized in January 2013 and approved as ITU-T H.265 and ISO/IEC 23008-2. At that time, new types of digital video and applications had been emerging. These include picture resolutions beyond HD, such as 4k/UHD, as well as wider color gamut and HDR, both of which require an increased bit depth from 8 to 10 bits per color component sample. At the same time, other formats, such as interlace-scanned video, became less relevant due to advances in display technology (with digital flat panels replacing analog cathode-ray tube displays). While AVC incorporates block-level features optimized for interlaced video, HEVC does not burden decoders with additional complexity for this and, instead, only provides a basic, yet efficient, picture-level method to encode interlaced video using the same set of block-level coding tools as for progressive-scan video.

A. First Version

The first version (v1) of HEVC generalized and improved hybrid video coding beyond the concepts of AVC, with a focus on higher resolutions and improved coding efficiency in general. The following provides an overview of the main features for each part of the hybrid video coding design and a brief description of its high-level picture partitioning and the interfaces to systems and transport layers. For a more detailed description of HEVC and a discussion of its coding efficiency, the reader is referred to [18] and [19].

1) Block Partitioning: As previously mentioned, HEVC introduces a flexible, quadtree-based partitioning scheme that includes larger block sizes. This partitioning scheme is characterized by the following elements.

Coding tree units and quadtree-based block partitioning: In AVC, as well as in previous standards since H.261, a macroblock represents the basic processing unit for further segmentation for the prediction and subsequent transform steps of the hybrid coding scheme. The size of the macroblock, which is the maximum size used in prediction, is fixed to 16 × 16 luma samples. Color video has three color component planes, so, in addition to the luma samples, the macroblock also has two blocks of chroma samples, which typically have half the width and half the height of the luma block—a sampling format known as the 4:2:0 chroma format. Other, less widely used formats are 4:4:4, in which the chroma planes have the same resolution as luma, and 4:2:2, in which the chroma has half the width but the same height as the luma. Monochrome video has only a single component plane and is sometimes called 4:0:0. With increasing picture resolution, homogeneous areas can cover areas larger than this, and the 16 × 16 size prevents such areas from being coded efficiently. Hence, increasing the maximum block size becomes important for coding higher-resolution video. In HEVC, the macroblock is replaced by the coding tree unit (CTU). The picture area that a CTU covers is selected by the encoder for the entire CVS and can be set to 16 × 16, 32 × 32, or 64 × 64 luma samples. The CTU constitutes the root of a coding quadtree that splits each CTU area recursively into four smaller square areas. The recursive splitting is signaled efficiently by sending a series of binary-valued splitting flags until a leaf node indication or a maximum allowed splitting depth is reached.
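The flag-driven recursion can be sketched as follows. The read_flag callback is a hypothetical stand-in for a real entropy decoder, and the size/depth limits are illustrative constants (HEVC derives them from sequence-level syntax), but the control flow mirrors the quadtree signaling just described.

```python
MIN_CU, MAX_DEPTH = 8, 3  # illustrative limits; HEVC signals them at sequence level

def parse_coding_quadtree(read_flag, x, y, size, depth=0, leaves=None):
    """Recursively parse split flags, collecting leaf CUs as (x, y, size)."""
    if leaves is None:
        leaves = []
    # A split flag is only present when both splitting and not splitting are possible.
    if size > MIN_CU and depth < MAX_DEPTH and read_flag():
        half = size // 2
        for dy in (0, half):                 # depth-first over the four quadrants
            for dx in (0, half):
                parse_coding_quadtree(read_flag, x + dx, y + dy,
                                      half, depth + 1, leaves)
    else:
        leaves.append((x, y, size))          # leaf node: one CU covers this area
    return leaves

# Five flags split a 64x64 CTU once and keep the four 32x32 children whole.
flags = iter([1, 0, 0, 0, 0])
assert parse_coding_quadtree(lambda: next(flags), 0, 0, 64) == \
    [(0, 0, 32), (32, 0, 32), (0, 32, 32), (32, 32, 32)]
```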


In HEVC (and VVC), a unit contains blocks of samples and syntax to code them. Consequently, a CTU for nonmonochrome video contains three coding tree blocks (CTBs), one for each color component.

Coding unit (CU) is the leaf of the coding quadtree and defines whether the corresponding area of the picture is predicted using inter- or intra-picture prediction. CU block sizes can range in powers of 2 from the maximum CTU area of 64 × 64 luma samples to the minimum block size of 8 × 8. An example quadtree partitioning of CTUs into CUs is shown in Fig. 3. It can be seen that flat, homogeneous areas of the picture are covered by large blocks, whereas details and structures with edges are approximated using smaller blocks.

Fig. 3. HEVC quadtree-based block partitioning with white lines for CTUs and blue lines for CUs.

Prediction unit (PU) is the result of a potential further split of a CU for the purpose of having different sets of prediction data, that is, motion information or an intra-picture prediction mode, for different parts of the CU. For CUs coded in an inter-picture prediction mode, eight different splitting modes are defined, as depicted in Fig. 4. This allows motion-compensated prediction with different rectangular shapes, even with narrow ones, that is, when one side is more than twice as large as the other, which was not possible in AVC. These narrow modes are referred to as asymmetric motion partitioning. For intra-picture coded CUs, only a quad split into PUs is allowed. However, intra-picture coded PUs only define the intra-picture prediction mode, whereas the size of the prediction is defined by the transform size as described below.

Fig. 4. The eight different modes in HEVC for partitioning a CU into PUs.

Transform unit (TU) and residual quadtree transform (RQT) are used to further split a CU for the purpose of transforming the prediction residual using another nested quadtree partitioning with the CU as root and the TUs as leaves. While the most efficient AVC profile (its high profile) defines 4 × 4 and 8 × 8 (integerized) DCTs, the RQT in HEVC further allows larger transform sizes for the DCT, that is, 16 × 16 and 32 × 32. This additional flexibility enables an encoder to adapt to varying space–frequency characteristics for the DCT. Because each TU has three color components, each TU contains three transform blocks (TBs). TBs in HEVC are always square and have widths and heights that are powers of 2. However, for inter-picture coded CUs, a single TU can span over multiple PUs, for example, two rectangular PUs. This is not allowed in previous standards, such as AVC, where the transform size is always a subset of the prediction size. For intra-picture coded CUs, the prediction is actually performed at the TU level, with the prediction of each TU relying on neighboring samples in another TU that first needs to be reconstructed, and the reconstruction requires performing both the prediction and inverse transform for the neighbor block. Since the splitting of a CU into smaller TUs also increases the correlation between the smaller blocks and the neighboring reference samples, the TU-based prediction process also brings additional coding efficiency for intra-picture prediction.

2) Motion-Compensated or Inter-picture Prediction: Motion-compensated prediction in HEVC, as in AVC, uses a translational motion model with luma MVs in quarter-luma-sample precision, with the ability to reference multiple stored reference pictures using either uniprediction (a motion-compensated prediction generated using one MV and one reference picture) or biprediction (a prediction generated by averaging the predictions from using two MVs and reference pictures in this manner). Beyond this, HEVC includes the following improvements.

Higher quality interpolation filtering is achieved by introducing longer filters and removing an intermediate rounding step. In AVC, quarter-sample values for luma are calculated by applying a six-tap filter to generate the neighboring half-sample value or values, rounding the results to the sample bit depth, and then averaging two neighboring values. For chroma samples, AVC only applies two-tap filtering for all positions. HEVC introduces a consistent separable interpolation process without intermediate rounding for all positions, using an eight-tap filter for the luma half-sample positions and a seven-tap filter for the quarter-sample positions. For chroma fractional positions, different (four-tap) filters are used.
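The following sketch applies a 1-D eight-tap filter horizontally to form luma half-sample positions. The coefficients shown are the HEVC luma half-sample filter taps; the border extension and the single-stage normalization/clipping are simplifications of this sketch (a real implementation keeps intermediate values at higher precision for the subsequent vertical pass).

```python
import numpy as np

# HEVC luma half-sample filter taps (sum = 64, i.e., 6-bit normalization).
HPEL = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=np.int64)

def interp_half_horizontal(row, bit_depth=8):
    """Half-sample values between each pair of luma samples in one row."""
    pad = np.pad(row.astype(np.int64), (3, 4), mode="edge")  # simple border extension
    out = np.empty(len(row), dtype=np.int64)
    for i in range(len(row)):
        acc = int(np.dot(HPEL, pad[i:i + 8]))                # 8 taps around position i + 1/2
        out[i] = np.clip((acc + 32) >> 6, 0, (1 << bit_depth) - 1)
    return out
```

In the standard itself, the horizontal and vertical stages are chained without the intermediate rounding to the sample bit depth that AVC used, which is precisely the improvement described above.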


Improved motion data coding is realized by predicting MV values using a list of predictors and block merging that derives the complete motion information based on neighboring motion data. Typically, the components of an MV are differentially coded using an MV prediction (MVP) and an MV difference (MVD). In AVC, a single MVP is derived using either median or directional prediction from up to four already coded, spatially neighboring MVs. HEVC replaces the implicit derivation by explicitly signaling one of two potential MVP candidates that are derived from five spatially neighboring and two temporally colocated MVs, where "temporally colocated" refers to MVs used when coding a corresponding location in a particular previously decoded picture. This use of explicit signaling to select among MVP candidates is known as advanced MVP (AMVP). In both AVC and HEVC, MVP-based motion data coding still requires an indication of whether uniprediction or biprediction is applied and, for each MV, an indication of which stored reference picture it refers to. Two reference picture lists (RPLs) are constructed for inter-picture referencing purposes, called list 0 and list 1, where one picture from one list is used for performing uniprediction, and one picture from list 0 and one from list 1 are used for biprediction. A reference picture in such a list is selected by an index into the list called a reference picture index. The so-called direct or skip modes in AVC do not signal any motion data; instead, the MVs and reference indices are derived from spatially and temporally neighboring blocks. The skip mode in unipredictive slices derives the list 0 MV from the MVP, and the list 0 reference picture index is 0, referring to the first reference picture in the list. In bipredictive slices, the spatial direct or skip modes derive list 0 and list 1 MVs and reference picture indices from spatially neighboring blocks, whereas the temporal direct or skip modes derive list 0 and list 1 MVs and reference indices from the temporally colocated block. The selection of the skip mode further indicates that the current block does not have a coded residual. HEVC replaces the direct and skip modes by introducing block merging, which derives motion data from one of five merging candidates. The candidates are derived from spatially or temporally neighboring blocks, and only a merge index is signaled to select among the merging candidates. This creates regions of equal motion data, enabling regions with equal motion to be coded jointly across block boundaries from different quadtree branches. The combination of AMVP and the merge mode is quite effective at establishing a coherent motion representation in the decoded video. The skip mode in HEVC applies block merging without coded residual data.

3) Intra-Picture Prediction: In principle, HEVC intra-picture prediction employs the same types of modes as in AVC, namely, DC, planar, and directional angular modes. The more flexible block structures with larger block sizes allow for the following main improvements.

Increased number of angles: From eight angles in AVC to 33 in HEVC for the directional prediction, exploiting the increased number of reference samples available with larger block sizes. The increase comes from adding bottom-left to top-right diagonal directions and using a finer resolution of angles, with a denser coverage around the horizontal and vertical directions. The prediction accuracy is also improved by using bilinear interpolation between the reference sample positions with 1/32 sample precision.

Improved most probable mode (MPM) coding is motivated by the increased number of prediction modes. In AVC, the prediction mode can be either signaled using a flag indicating to use the mode inferred from neighbors as the MPM or with a fixed-length code to select among the less probable modes. HEVC extends the MPM concept by constructing a list of three MPMs from the modes of the neighboring blocks to the left and above the current block. An MPM index indicates which MPM is selected, and in case a non-MPM mode is selected, a fixed-length code indicates one of the remaining 32 modes.

4) Transform and Quantization: As mentioned earlier, the introduction of the coding quadtree with nested RQT allows variable power-of-2 transform sizes from 4 × 4 to 32 × 32. As in AVC, integer transforms are applied to avoid implementation-dependent precision issues. The 2-D core transforms in HEVC are integer approximations of scaled DCT basis functions, realized by applying 1-D transforms sequentially for rows and columns. The basis functions for all four DCT-based integer transforms have been designed such that they can be extracted from those of the 32-point transform by subsampling. Besides these new DCT-based integer transforms, the following additional transform-related features are introduced in HEVC.

Discrete sine transform (DST) replaces the DCT for prediction residuals resulting from directional intra-picture prediction when the block size is 4 × 4. It was found that the DST provides better energy compaction in cases where the prediction error increases with increasing distance from one of the block boundaries, which is typically the case for intra-picture prediction due to increasing distance from the reference boundary. Like the DCT, the DST is also simplified and incorporated as a 2-D separable transform. Its bases are integer approximations of scaled DST basis functions. Due to the limited compression benefit for larger block sizes and the associated implementation complexity, the DST is restricted to 4 × 4 luma TBs.

Transform skip is another mode that skips the transform step and, instead, directly quantizes and codes the residual samples in the spatial domain. For certain video signals, such as computer-generated screen content with sharp edges, applying a transform can, sometimes, decrease the coding efficiency. Skipping the transform for such content addresses this issue and can also avoid "ringing" artifacts.

Transform and quantization bypass allows an encoder to skip both the transform and quantization to enable mathematically lossless end-to-end coding. A CU-level flag controls this mode, thereby enabling efficient regionwise lossless coding.
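Stepping back to the core transforms introduced at the start of this section, the sketch below shows their separable row–column structure using what is, to the best of our knowledge, the HEVC 4-point integer core matrix; the scaling schedule is collapsed into a single rounded shift per direction and should not be read as the exact normative process.

```python
import numpy as np

# 4-point integer core transform matrix (rows = basis functions).
T4 = np.array([[64,  64,  64,  64],
               [83,  36, -36, -83],
               [64, -64, -64,  64],
               [36, -83,  83, -36]], dtype=np.int64)

def forward_4x4(residual):
    """Separable 2-D transform: 1-D transform of columns, then of rows."""
    tmp = T4 @ residual.astype(np.int64)           # vertical pass
    return (tmp @ T4.T + (1 << 11)) >> 12          # horizontal pass + simplified scaling

def inverse_4x4(coeffs):
    tmp = T4.T @ coeffs.astype(np.int64)
    return (tmp @ T4 + (1 << 11)) >> 12            # approximate reconstruction
```

The 8-, 16-, and 32-point matrices embed this one by subsampling, which is what allows a single datapath to serve all four transform sizes.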


Sign data hiding is used to conditionally skip the signaling of the sign bit of the first nonzero coefficient in a 4 × 4 subblock. The sign bit is inferred from the parity of the sum of the coefficient amplitudes when it is not coded. To implement this, the encoder needs to select one coefficient and alter its amplitude in cases where the parity does not indicate the correct sign of the first coefficient.

5) Entropy Coding: The higher coding efficiency of the AVC CABAC entropy coding method compared with CAVLC motivated the decision to have CABAC as the only entropy coding method in HEVC. The basic CABAC design is the same as in AVC, with the following improvements:
1) increased parsing throughput by reducing intersymbol dependencies, especially for parallel-processing hardware architectures;
2) memory reduction by reducing the number of contexts used to store and adapt probability models;
3) improved transform coefficient coding with coefficient scanning and context modeling designed for larger block sizes to increase the coding efficiency.

6) In-Loop Filtering: The in-loop filtering from AVC was kept in HEVC (with a slightly modified deblocking filter), and a nonlinear in-loop filter was added as an additional filtering stage, as follows.

Parallel-processing-friendly deblocking is enabled in HEVC by aligning the horizontal and vertical block edges, to which the deblocking filter is applied, on an 8 × 8 grid, in contrast to the 4 × 4 grid used in AVC. Given the maximum filtering extent of four samples on each side of an edge, each 8 × 8 block can be filtered in parallel.

Sample adaptive offset (SAO) is introduced in HEVC and consists of two selectable nonlinear filters that are designed to attenuate different artifacts in the reconstructed picture after deblocking. Both filters involve classifying samples and applying amplitude mapping functions that add or subtract offsets to the samples that belong to the same class. The first one, called edge offset, aims to attenuate ringing artifacts. Edge offset classifies each sample into one of five categories (flat area, local minimum, left or right edge, or local maximum) for four gradients (horizontal, vertical, and two diagonals). The second one, called band offset, is designed to attenuate banding artifacts. It subdivides the range of sample values (e.g., 0–255 for 8-bit video) into 32 equally spaced bands. For four consecutive bands, a band-specific offset value is added to each sample inside each of the four bands. The gradient direction for edge offset, the first of the four consecutive bands for band offset, and the four offset values are estimated at the encoder and signaled on a CTU basis.

7) Systems and Transport Interfaces: HEVC inherited the basic systems and transport interface designs from AVC. These include the network abstraction layer (NAL) data unit syntax structuring, the hierarchical syntax relationships, the video usability information (VUI) and supplemental enhancement information (SEI) message mechanisms, and the video buffering model based on a hypothetical reference decoder (HRD). The hierarchical syntax and data unit structures consist of sequence parameter sets (SPSs), multipicture-level picture parameter sets (PPSs), slice-level header syntax, and lower level coded slice data. In the following, the systems and transport interface aspects in HEVC v1 that are essentially different from AVC are briefly summarized. An overview of the AVC designs on these aspects can be found in [3]. More details on the HEVC designs of these aspects can be found in [20]. For simplicity in this description, "HEVC" means HEVC v1, unless otherwise stated.

Random access support: Random access refers to starting the decoding of a bitstream from a picture that is not the first picture in the bitstream in decoding order. To support tuning in and channel switching in broadcast/multicast and multiparty video conferencing, seeking in local playback and streaming, and stream adaptation in streaming, the bitstream needs to include relatively frequent random access points, which are typically intra-picture coded pictures but may also be inter-picture coded pictures (e.g., in the case of gradual decoding refresh (GDR), as further discussed in the following).

HEVC includes the signaling of intra random access point (IRAP) pictures in the NAL unit header through NAL unit types. Three types of IRAP pictures are supported, namely, instantaneous decoder refresh (IDR), clean random access (CRA), and broken link access (BLA) pictures. IDR pictures constrain the inter-picture prediction structure to not reference any picture before the current GOP and are conventionally referred to as closed-GOP random access points. CRA pictures are less restrictive by allowing certain pictures to reference pictures that precede the current GOP, all of which are discarded in the case of random access. CRA pictures are conventionally referred to as open-GOP random access points. BLA pictures usually originate from splicing together two bitstreams or parts thereof at a CRA picture, for example, during stream switching. To enable better systems usage of IRAP pictures, altogether six different NAL unit types are defined to signal the properties of the IRAP pictures, which can be used to enable various types of bitstream access points, such as those defined in the ISO base media file format (ISOBMFF) [21], which are used for random access support in dynamic adaptive streaming over HTTP (DASH) [22].

Video parameter set (VPS): A new type of parameter set, called the VPS, was introduced in HEVC. Although introduced in HEVC v1, the VPS is especially useful to provide a "big picture" of the characteristics of a multilayer bitstream, including what types of operation points are provided, the profile, tier, and level (PTL) of the operation points, layer dependence information, and so on.

Temporal scalability support: HEVC supports temporal scalability (e.g., for extracting lower frame-rate video from a high-frame-rate bitstream) by signaling a temporal ID variable in the NAL unit header and imposing a restriction that pictures of a particular temporal sublayer cannot be used for inter-picture prediction referencing by pictures of a lower temporal sublayer.
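This NAL-unit-header design makes frame-rate adaptation a pure filtering operation, as the sketch below illustrates by dropping NAL units above a target sublayer. The NALUnit record and its field names are illustrative simplifications of the actual syntax, not the normative structures.

```python
from dataclasses import dataclass

@dataclass
class NALUnit:            # simplified stand-in for a parsed NAL unit
    temporal_id: int      # sublayer identifier carried in the NAL unit header
    payload: bytes

def extract_sub_bitstream(nal_units, target_tid):
    """Keep only NAL units belonging to sublayers <= target_tid."""
    return [nu for nu in nal_units if nu.temporal_id <= target_tid]

# Because higher sublayers never serve as references for lower ones, a
# 60-frames/s stream with sublayers 0..2 remains decodable at, say, tid 0.
```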


A subbitstream extraction process is also specified, with a requirement that each subbitstream extraction output must be a conforming bitstream. Media-aware network elements (MANEs) can use the temporal ID in the NAL unit header for stream adaptation purposes based on temporal scalability.

Profile, tier, and level: In order to restrict the feature set to be supported for specific applications, video coding standards define so-called profiles. HEVC v1 defines the following three profiles: 1) the main profile, which is restricted to support only the 4:2:0 chroma format and a bit depth of 8 bits per sample; 2) the Main 10 profile, which is based on the main profile with the supported bit depth extended to 10 bits per color component; and 3) the main still picture profile, which is also based on the main profile but restricted to have only one picture in a bitstream. In addition to profiles, HEVC also defines so-called levels and tiers. A level imposes restrictions on the bitstream based on the values of syntax elements and their arithmetic combinations, for example, as combinations of spatial resolution, bit rate, frame rate, and picture buffering capacity. The AVC and HEVC level specifications are generally similar in spirit, with a couple of notable differences: 1) a smaller number of levels is specified in HEVC than in AVC, particularly for the levels with lower picture resolution limits, and 2) the highest supported frame rate for operation with picture sizes that are relatively small is 172 frames/s for AVC in most levels, while, for HEVC, this is increased to 300 frames/s. Both of these differences are in response to the general trend of video picture resolutions and frame rates becoming higher as time passes. The concept of tiers was newly introduced in HEVC, mainly to establish high bit-rate capabilities for video contribution applications that require higher quality than video distribution applications.

Hypothetical reference decoder: AVC specifies a buffer flow model using picture-based HRD operations with a picture being contained in an access unit (AU) with specified timing. In HEVC, for improved support of ultralow-delay applications, an alternative mode of HRD operation was introduced, which operates on smaller units of data. It specifies a conforming behavior for encoders to send only part of a picture as a decoding unit (DU) with accompanying timing information before the encoding of the remaining areas of the same picture, as well as for decoders to be able to use the timing of DUs to start decoding the received areas before receiving the remaining parts of the picture.

8) High-Level Picture Partitioning: In AVC, the coded macroblocks of a picture are grouped together in slices, each of which can be decoded independent of the other slices in the same picture. When introduced, one of the main purposes of slices was for maximum transfer unit (MTU) size matching for improved channel loss resilience, although they could be useful for parallel encoding as well. In HEVC, the basic slice concept was kept, with slices that group together consecutive CTUs in raster-scan order. The more complex slice concepts of flexible macroblock ordering and arbitrary slice ordering have not been widely embraced by industry and, thus, were not carried over from AVC. Instead, new concepts have been introduced to HEVC, which mainly facilitate parallel processing (an important feature given that HEVC is designed for higher-resolution videos).

Tiles represent an alternative, rectangular grouping of CTUs to divide a picture into tile rows and tile columns. The tiles in a picture are processed in raster-scan order, and the CTUs in each tile are processed in raster-scan order within the tile before the CTUs in the next tile are processed. A slice can either contain an integer number of complete tiles such that all the tiles share the same slice header (SH) information, or a tile can contain an integer number of slices, with each of these slices being a subset of the tile. The original intent of tiles was enabling parallel encoding and decoding for higher-resolution video [23]. However, with emerging 360° immersive videos, tiles turned out to also be useful for omnidirectional video streaming when used in combination with encoder restrictions and metadata [24]. If an encoder restricts the MVs that it uses to avoid referring to any regions of the reference pictures that are outside of a particular set of tiles, the slices containing these tiles can still be decodable if this set of tiles is extracted from each picture in the bitstream. Such a set is known as a motion-constrained tile set (MCTS). Recent system-level functionalities, especially for immersive videos, have made extensive use of MCTSs.

Wavefront parallel processing (WPP) allows multiple CTU rows to be processed in parallel for decoding (or encoding). When WPP is enabled, the internal state of the CABAC context variables is not carried over to the start of a CTU row from the right-most CTU in the previous row, but rather from the second CTU in the previous row. This allows the decoder (or encoder) to start processing the next row with a two-CTU offset [25]. It should be noted that the WPP term does not appear in the HEVC specification since it is a matter of implementation choice whether the decoder (and/or encoder) actually takes advantage of the feature's parallelism opportunity; in the standard, this is called entropy coding synchronization.

Dependent slice segments have been introduced to provide a separate framing of a coded slice into multiple NAL units. A slice is split into an initial, independent slice segment that contains a full SH and subsequent dependent slice segments that each contain an abbreviated SH [20]. Dependent slice segments are particularly useful for MTU size matching in systems that limit the maximum amount of data in an NAL unit, or in combination with WPP, where each CTU row can be packed and transmitted in a dependent slice segment.
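A hedged sketch of the wavefront schedule that WPP enables: row r may process CTU column c once row r − 1 has advanced two CTUs further, which is when the inherited CABAC state and the top-right neighbor data are available. The progress-tracking structure is an assumption of this illustration, not part of the standard.

```python
def wpp_ready(row, col, progress):
    """progress[r] = number of CTUs already completed in CTU row r."""
    if row == 0:
        return progress[0] == col              # first row runs unconstrained, left to right
    return progress[row] == col and progress[row - 1] >= col + 2

# Example with 4 CTU columns: after row 0 finishes CTUs 0 and 1,
# row 1 may start CTU 0 while row 0 continues with CTU 2.
progress = [2, 0, 0]
assert wpp_ready(1, 0, progress)
assert not wpp_ready(2, 0, progress)
```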


B. Extensions

The first version of HEVC was limited to video signals in 4:2:0 chroma format with up to 10 bits per sample and was optimized for consumer-oriented applications with 2-D, single-layer camera-captured content in the Y′CbCr color space. In October 2014, the second version (v2) of HEVC was finalized, in which the format range extensions (RExt) add support for more demanding higher quality applications [26], the multilayer extensions for scalability [27], and 3-D multiview video coding [28]. The third version (v3) of HEVC was finalized in February 2015 and added support for combined coding of 3-D video with depth maps [28]. In February 2016, the last major extension, for the coding of screen content material [29], was added in the fourth version (v4). A short summary of these extensions is given in the following.

1) Range Extensions (RExt): The main goal of the HEVC range extensions was to extend the 4:2:0 8–10-bit consumer-oriented scope of HEVC v1 by supporting high-quality distribution broadcast (4:2:0, 12 bit), contribution (4:2:2, 10 and 12 bit), production and high-fidelity content acquisition (4:4:4, 16 bit, RGB, high bit rate), medical imaging (4:0:0 monochrome, 12–16 bit, near-lossless), alpha channels and depth maps (4:0:0 monochrome, 8 bit), high-quality still pictures (4:4:4, 8–16 bit, arbitrarily high picture size), and many other applications. The modifications introduced by RExt can be divided into the following three categories.

Video format modifications to support chroma formats beyond 4:2:0 and bit depths beyond 10 bits per sample have been kept to a minimum. Here, a rather conservative approach to support the 4:2:2 and 4:4:4 chroma formats without diverging unnecessarily from HEVC v1 was chosen. The modifications include the extension of TB partitioning with existing syntax and transform logic, as well as the adjustment of the intra-picture prediction angles to support the nonsquare rectangular blocks occurring in the 4:2:2 chroma format. For higher bit depths, only the SAO and interpolation precision are extended.

Coding efficiency improvements for extended formats, lossless, and near-lossless coding are achieved by means of modified HEVC v1 tools, as well as by introducing new tools. From HEVC v1, mainly the transform skip mode was extended to larger block sizes and coupled with a modified residual coding (with a separate CABAC context model and residual rotation). Apart from that, RExt includes three new tools to increase coding efficiency: adaptive chroma QP offset allows more flexibility in chroma quantization, cross-component prediction (CCP) exploits remaining statistical redundancies between luma and chroma channels for 4:4:4 video by predicting the chroma spatial residuals from luma using a linear model, and residual differential pulse code modulation (RDPCM) aims to reduce remaining redundancies in the spatial residual signal when the transform is skipped.

Precision and throughput optimizations for very high bit rates and bit depths are achieved mainly by two methods. First, extended precision for the transform coefficients and inverse transform processing enables efficient coding with high bit depths. Second, a modification of CABAC allows multiple coded bits to be decoded with a single bit-masking and shift operation and can be enabled for increasing the CABAC parsing throughput at very high bit rates.

2) Scalable HEVC Extensions (SHVC): In HEVC v2, the temporal scalability from v1 is extended by spatial, quality, bit depth, and color gamut scalability, as well as combinations of these. The scalability is based on a multilayer architecture that relies on multiple single-layer HEVC v1 decoders, that is, it does not modify block-level decoding tools. The reconstruction of a higher enhancement layer from a lower layer, for example, reconstructing UHD from an HD base layer for spatial scalability, is enabled through picture referencing with added interlayer reference picture-processing modules, including texture and motion resampling and color mapping. On the one hand, this allows reusing HEVC v1 decoder cores; on the other hand, implementing an SHVC-compliant decoder with this architecture increases processing requirements by needing multiple HEVC v1 cores plus the additional modules.

3) Multiview (MV-HEVC) and 3-D Extensions (3-D-HEVC): Based on the same multilayer design introduced in HEVC v2 together with the scalable extension, the multiview and 3-D extensions significantly improve the coding of 3-D video compared with simulcast or frame packing with HEVC v1. Similar to the AVC multiview extension, in MV-HEVC (in v2 of HEVC), each view of a picture is coded in a separate layer with interlayer prediction. 3-D-HEVC (in v3 of HEVC) extends this by coding the view plus its depth map, which allows rendering additional intermediate views. Especially for the depth map coding, statistical dependencies between video texture and depth maps are exploited. This introduces new block-level coding tools, which requires new decoder cores for 3-D-HEVC compared with HEVC v1.

4) Screen Content Coding (SCC) Extensions: Applications such as screen sharing and gaming are mainly based on computer-generated or mixed content. All video coding standards up to HEVC v1 had been mainly designed for camera-captured video, which results in suboptimal exploitation of the different signal characteristics present in screen content. These characteristics are exploited in HEVC SCC (in version 4 of HEVC) by introducing new tools, including intra-picture block copy (IBC), palette mode, adaptive color transform (ACT), and adaptive MV resolution (AMVR). Further detail on these tools is provided in Section IV-B7, as VVC contains a rather similar design for these aspects.

1472 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

was developed and approved at the same time [7]. For A. Standardization and Development
HEVC and AVC, these aspects are specified directly within
the same video coding standard that specifies the cod- The development of VVC can be split into two phases,
ing tools. Apart from achieving major bit-rate savings which are summarized in the following. The first phase was
over its HEVC and AVC predecessors for camera-content the exploration phase, which started in 2015, primarily
video sequences, VVC was designed to provide and focusing on investigating the potential for increased cod-
improve functionalities and coding efficiency for a ing efficiency without as much consideration of practical
broadened range of existing and emerging applications, complexity constraints. The exploration phase provided
including: evidence that technology with sufficient compression capa-
bility beyond HEVC existed, justifying the start of the
1) Video beyond the standard and high defini- official standardization phase (the second phase) spanning
tions is greatly improved by using more flexible from 2018 to 2020. This phase targeted to maintain and
and larger block structures (see Section IV-B1) for even increase the coding efficiency while taking implemen-
higher resolutions and by a luma adaptive deblock- tation and complexity aspects into full consideration and
ing filter designed for HDR video characteristics fulfilling a broadened range of application scope.
(see Section IV-B6). Furthermore, profiles that sup-
port chroma formats beyond 4:2:0, such as 4:2:2 and 1) Exploration Phase (2015–2017): The need for even
4:4:4, are defined already in the first version of VVC more efficient compression than the current HEVC stan-
(see Section IV-C8). dard motivated ITU-T VCEG and ISO/IEC MPEG is study-
2) Computer-generated or screen content motivated ing the potential in 2014 and to join forces again in
the inclusion of techniques derived from the HEVC October 2015 for exploring coding technology beyond
SCC extensions, such as IBC block-level differen- HEVC in a new team called the Joint Video Explo-
tial pulse code modulation (BDPCM), ACT, palette ration Team (JVET). Based, initially, on VCEG key tech-
mode coding, and full-sample adaptive MV precision, nical area (KTA) software that began being developed in
as well as an alternative residual coding for transform January 2015, by the end of 2017, the JVET had developed
skip modes (see Section IV-B7). the joint exploration model (JEM) software codebase [30],
3) Ultralow-delay streaming is facilitated by built-in which demonstrated up to 30% bit-rate reduction com-
GDR handling that can avoid bit rate peaks intro- pared with HEVC.
duced by intra-picture coded pictures and vir- The coding efficiency improvements achieved in this
tual boundaries for improved support of GDR exploration effort were considered sufficient evidence to
(see Section IV-C1). issue a formal Joint Call for Proposals (CfP) for new video
4) Adaptive streaming with resolution changes ben- coding technology in October 2017, and it was agreed that,
efits from reference picture resampling (RPR) (see once the drafting of a formal standard began, the joint
Section IV-C6), which allows switching resolutions team would be renamed to reflect its change of mission,
within a CVS by resampling reference pictures to becoming the Joint Video Experts Team, without changing
the picture resolution of the current picture for the its JVET abbreviation.
purpose of inter-picture prediction.
2) Standardization Phase (2018–2020): The CfP attra-
5) 360◦ video for immersive and augmented
cted the submission of proposals from 32 organizations
reality applications is efficiently coded by the
for the coding of three categories of video content: stan-
motion-compensated prediction that can wrap
dard dynamic range (SDR), HDR, and 360◦ video [31].
around picture boundaries, by disabling in-loop
An independent subjective evaluation conducted in
filtering across virtual boundaries (see Section IV-B8)
April 2018 showed that all submissions were superior in
and by subpictures with boundary padding
terms of subjective quality to HEVC in most test cases
(see Section IV-C5).
and that several submissions were superior to the tech-
6) Multilayer coding is supported already in the
nology previously explored in the JEM framework in a
first version of VVC using a complexity-constrained,
relevant number of cases. Starting with the analysis of
single-layer-friendly approach that enables temporal,
the best-performing proposals among all the submissions,
spatial, and quality scalabilities, as well as multiview
the VVC development started in April 2018 with the first
coding (see Section IV-C7).
draft of the specification document and test model soft-
In the following, the initial steps toward establishing a ware. After a large number of coding tools had been on
new standardization project with compression efficiency the table from the CfP, it was decided to start with a
beyond HEVC, as well as a short review of the VVC “clean slate” approach. This first draft only included an
standard development, are covered in Section IV-A. Then, advanced quadtree with multitype tree (QT+MTT) block
the novel coding tools in VVC that contributes to the over- partitioning, which was identified as a common element
all bit-rate savings are described in Section IV-B. Finally, among almost all proposals, and its implementation would
advances and novelties in the systems and transport inter- heavily affect the design of all other block-based cod-
faces are presented in Section IV-C. ing tools. On top of that, more coding tools from the

Vol. 109, No. 9, September 2021 | P ROCEEDINGS OF THE IEEE 1473


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

CfP responses and new ones were studied extensively in Table 1 Overview of Coding Tools in HEVC and VVC

“core experiments” with regard to coding efficiency and


implementation complexity. In cases where a reasonable
tradeoff between coding efficiency and complexity was
found, additional tools were then adopted to the VVC
design.

B. Coding Tools
VVC applies the classic block-based hybrid video cod-
ing architecture known from its predecessors. Although
the same framework is applied, novel tools are included
in each basic building block to further improve the
compression.
Table 1 provides an overview of the coding tools in
HEVC version 1 and VVC version 1. In the following,
the VVC tools will be explained in more detail.
1) Block Partitioning: In VVC, the QT+MTT scheme
using quaternary splits followed by binary and ternary
splits for its partitioning structure replaces the quadtree
with multiple partition unit types that were used in HEVC,
that is, it removes the concept of splitting a CU into PUs
and TUs and provides a more flexible CU partitioning.
Rectangular PU shapes are replaced by rectangular CU
shapes resulting from binary and ternary tree splits.
The RQT-based TU partitioning is removed as well, and
multiple TUs in a CU can only occur from an implicit
split of CUs that have a larger size than the maximum
transform length and from CUs with intra sub-partitions
(see Section IV-B3). Furthermore, the maximum CTU
size is increased to 128 × 128 luma samples, and the
maximum supported transform length is increased to 64.
This tree-based CU partitioning scheme forms the block
partitioning structure for VVC, together with sometimes
using a separate tree for the chroma components and
easing implementation with the concept of virtual pipeline
data units, as will be further described in the following.
Coding quadtree with multitype tree: A CTU is first par-
titioned by a quadtree structure. Then, the quadtree tree
leaf nodes can be further partitioned by a multitype tree
structure. There are four splitting types in the multitype
tree structure: vertical binary splitting, horizontal binary
splitting, vertical ternary splitting, and horizontal ternary
splitting. The multitype tree leaf nodes are called CUs,
and unless the CU is too large for the maximum trans-
form length, this segmentation is used for the prediction
and transform processing without any further partition-
ing. This means that, in most cases, the CU, PU, and
TU have the same block size in the QT+MTT coding
block structure. Other than when the CU is too large
for the maximum transform size, exceptions also occur dotted edges represent multitype tree partitioning with
when intra sub-partitions (see Section IV-B3) or subblock either binary or ternary splits. The size of the CU may
transforms (SBTs) (see Section IV-B4) are employed. This be as large as the CTU or as small as 4 × 4 in units of
also means that VVC supports nonsquare TBs in addition luma samples. The QT+MTT partitioning provides a very
to square ones. Fig. 5 shows a CTU divided into multiple flexible block structure to adapt to the local character-
CUs with a QT+MTT coding block structure, where the istics, as can be seen in the example overlay in Fig. 6.
solid block edges represent quadtree partitioning and the Furthermore, at the leaf node of the multitype tree, there

1474 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

Fig. 7. Disallowed ternary splitting and binary splitting in VVC


when the luma coding block width or height is 128 to enable
Fig. 5. Example of quadtree with nested multitype tree coding
64 × 64 VPDU operation.
block structure.

is an option to further split a CU into two nonrectangular blocks of all three components, whereas a CU in an inter-
prediction block partitions in the case of inter-picture pre- picture coded slice always consists of coding blocks of all
diction, selecting one of 64 geometric partitioning modes three color components.
(see Section IV-B2). Local dual-tree: In typical video encoder and decoder
Chroma separate tree: In VVC, the coding tree scheme implementations, the average processing throughput drops
supports the ability for luma and chroma to use separate when many small blocks (more specifically, small intra-
partitioning tree structures. For inter-picture coded slices, picture coded blocks since these need to be decoded
the luma and chroma CTBs in one CTU have to share sequentially) are present in the coded picture. In the
the same coding tree structure. However, for intra-picture single-coding tree structure, a CU can be as small as 4×4 in
coded slices, the luma and chroma can have separate trees. units of luma samples, which results in 2×2 chroma coding
When the separate tree mode is applied, the luma CTB blocks if the video uses 4:2:0 sampling. To avoid small
is partitioned into CUs by one QT+MTT structure, and chroma blocks, a local dual-tree structure is used. With the
the chroma CTBs are partitioned into CUs by another local dual-tree design, chroma intra-picture coded coding
QT+MTT structure. This means that, when the video is blocks with a size of less than 16 chroma samples or with
not monochrome, a CU in an intra-picture coded slice may 2×N sizes are prevented by using a separate tree locally for
consist of a coding block of the luma component only, the chroma when necessary to prevent such small chroma
coding blocks of two chroma components only, or coding blocks.
Virtual pipeline data units (VPDUs) are block units in
a picture that needs to be held in memory for processing
while decoding. In hardware decoders, successive VPDUs
can be processed by operating multiple pipeline stages at
the same time. The VPDU size would be roughly propor-
tional to the memory buffering size in most pipeline stages,
so it is important to keep the VPDU size reasonably small.
In the VVC QT+MTT scheme, ternary tree and binary tree
splits for CUs with the size of 128×128 luma samples could
have led to a VPDU size that was considered too difficult
to support. In order to keep the VPDU size at 64 × 64 luma
samples, normative partitioning restrictions (with syntax
signaling modification) are applied, disallowing certain
splits for CUs with width or height equal to 128, as shown
by dashed lines in Fig. 7. The VPDU concept was used to
establish these implementation-oriented split restrictions
but is not explicitly discussed in the standard.

2) Motion-Compensated or Inter-Picture Prediction:


VVC retains and enhances many of the inter-picture
prediction features from HEVC, including the two most
important motion information coding methods described
earlier: AMVP and the merge mode. Furthermore,
Fig. 6. Example of partitioning using the QT+MTT scheme in VVC. HEVC’s eight-tap high-precision motion compensation

Vol. 109, No. 9, September 2021 | P ROCEEDINGS OF THE IEEE 1475


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

interpolation filter (IF) for luma fractional positions and the CUs coded with affine mode, geometric partitioning,
four-tap IF for chroma fractional positions are also used. or subblock-based TMVP, the associated motion informa-
On top of these core features, new coding tools are intro- tion is added to the table in a first-in-first-out (FIFO)
duced in VVC for increasing the efficiency of inter-picture manner. The HMVP table size is 6.
prediction. VVC introduces subblock-based motion inher- The pairwise average MVP candidate is generated by
itance, in which the current CU is divided into subblocks averaging the MVs of the first two candidates in the exist-
with equal size (8 × 8 luma samples) and the MV for ing merge candidate list. The averaged MVs are calculated
each subblock is derived based on temporally colocated separately for each RPL. When the merge list is not full
blocks in a reference picture. Merge mode with additional after the pairwise average merge candidate is added, zero
MVD coding is added to further enhance the efficiency of MVPs are appended at the end until the maximum merge
the merge mode. A local CU-based affine motion model is candidate number is encountered.
used to represent higher-order motion, such as scaling and Subblock-based temporal MVP (SBTMVP): TMVP in
rotation, where only one set of parameters is coded per CU, merge mode inherits one set of motion information from
while the motion compensation is performed individually a temporal colocated CU. The SBTMVP method in VVC
per 4 × 4 subblock using six-tap IFs. VVC also increases allows inheriting the motion information from the colo-
the MV precision to 1/16 luma sample in some modes to cated picture at a finer granularity, that is, in units of
improve the prediction efficiency for video content with 8 × 8 subblocks. This requires storing the MVs of the
locally varying and nontranslational motion, such as in the colocated picture on an 8 × 8 luma sample grid (in contrast
case of the affine mode, while HEVC uses only quarter- to a 16 × 16 grid in HEVC). SBTMVP attains MVPs for the
luma-sample precision. On top of the higher precision MV subblocks within the current CU in two steps. In the first
representations, a block-level AMVR method is applied to step, the motion displacement to determine the colocated
customize the balance between the prediction quality and CU is set to the MV of the neighboring CU to the left
the bit cost overhead for MV signaling. The geometric if it uses the colocated picture as its reference picture.
partitioning mode splits a CU into two nonrectangular Otherwise, it is set to (0, 0). In the second step, the MVP for
partitions to better match motion at object boundaries. The each subblock is derived from the MV of its corresponding
biprediction with CU-level weights (BCW) mode extends subblock inside the colocated CU from the first step.
simple averaging to allow weighted averaging of the two Merge with MVD (MMVD): The VVC merge mode is
prediction signals at the block level. To further improve the extended by allowing signaling an MMVD, which only
prediction quality, decoder-side MV refinement (DMVR) allows a small number of difference values and, therefore,
and bidirectional optical flow (BDOF) are introduced, has less bit overhead than AMVP. When one of the first
which improves the motion compensation without increas- two merge candidates is selected for a CU, an MVD can be
ing bit overhead. Finally, VVC provides a mode for combin- signaled to further refine the MV. A set of MVD ranges are
ing inter-picture and intra-picture prediction to form the predefined, and an index is signaled to indicate how far
final prediction. the final MV can deviate from the predicted MV.
For a CU coded in merge mode, a merge candidate list Symmetric MVD (SMVD): When the motion of the
is constructed, and an index is signaled to specify which current block is on a constant motion trajectory between a
candidate MVP is used to form the prediction. In VVC, temporally past and a temporally future reference picture
the merge candidate list consists of five types of candi- in display order, corresponding MVs and reference picture
dates in the order: 1) MVPs from spatial neighboring CUs; indices tend to be symmetrical. SMVD exploits this to save
2) temporal MVP (TMVP) from colocated CUs; 3) history- bits for MVDs and reference picture index signaling. When
based MVP from an FIFO table; 4) pairwise average MVP; SMVD is applied for a CU, only the MVD for list 0 is
and 5) zero MVs. The length of the merge list is signaled signaled. The MVD for list 1 is set to the reverse of the list
in SPS, where the maximum allowed length is 6. The way 0 MVD, and the list 0 and list 1 reference picture indices
MVs from spatial neighboring CUs and colocated CUs are are implicitly derived at the slice level.
used is identical to the way that these are handled in the Adaptive MV resolution (AMVR): In inter-picture pre-
HEVC merge candidate list. diction, MVs with higher resolution, that is, higher frac-
History-based MV prediction (HMVP) provides can- tional sample position accuracy, usually lead to better
didates beyond the local spatial–temporal neighborhood prediction and, thus, smaller residual energy. However,
to allow usage of MV information from CUs that are more bits are required to represent the MVs with higher
more remote. The HMVP candidates can be used in both accuracy. In the HEVC SCC extension, the precision of the
merge and AMVP candidate list construction processes. MVs is switchable at the slice level between a quarter of
The motion information of previously coded blocks is a luma sample as in HEVC v1 and integer luma sample
stored in a table of MVP candidates for the current CU. precision. The benefit of being able to select integer luma
The table with multiple HMVP candidates is maintained sample precision is clear for SCC (e.g., for computer desk-
during the encoding/decoding process and is reset (all top screen sharing), where the motion in the computer
candidates removed) when a new CTU row is encountered. graphics synthesis is often using only integer sample dis-
Whenever there is an inter-picture coded CU, excluding placements. In such an instance, the integer-only option

1476 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

avoids wasting bits on sending fractional precision that BCW is only applied to CUs with 256 or more luma samples
is not needed. However, to enable a more flexible adap- (i.e., CU width times CU height is greater than or equal to
tation for camera-captured video and mixed content and 256). If all reference pictures are temporally preceding the
screen content, a CU-level AMVR scheme is supported in current picture in display order, for example, for low-delay
VVC. MVDs of a CU with translational motion in AMVP applications, all five weights are used. Otherwise, only
mode can be coded in units of quarter luma samples, three weights w ∈ {3, 4, 5} are used.
half luma samples, integer luma samples, or four luma Combined inter-/intra-picture prediction (CIIP): In
samples. For the affine AMVP mode, MVDs can be switched VVC, when a CU is coded in merge mode, an additional
among quarter, integer, or 1/16 luma samples. In the flag is signaled to indicate whether a CIIP mode is applied
case of IBC (see Section IV-B7), the precision of the block to the current CU. The CIIP mode can be applied to a CU
displacement vectors can either be an integer or four containing at least 64 luma samples when both the CU
luma samples. In order to ensure that the final MV (i.e., width and CU height are less than 128 luma samples. As its
the sum of the MVP and MVD) uses the same precision as name indicates, the CIIP prediction combines an inter-
the MVD, the MVP is rounded to the indicated precision. picture prediction signal with an intra-picture prediction
With CU-level switching of MV resolution, a good tradeoff signal. The intra-picture prediction signal is generated
between prediction quality and MV bit overhead can be using the planar mode. The intra-picture and inter-picture
achieved. The CU-level MV resolution indication is condi- prediction signals are combined using weighted averaging,
tionally signaled if the current CU has at least one nonzero where the weight value is calculated depending on the
MVD component. When half-luma-sample MV accuracy is coding modes of the top and left neighboring blocks.
used in AMVP mode, a six-tap smoothing IF (SIF) is used Decoder-side MV refinement (DMVR) is used to
instead of the eight-tap IF from HEVC. improve the accuracy of the MVs of the merge mode.
Geometric Partitioning Mode (GPM) enables motion It searches candidate MVs around the initial MVs in list
compensation on nonrectangular partitions of blocks as 0 and list 1 and, like SMVD, is used only with temporally
one variant of the merge mode in VVC. When this mode bidirectional prediction. The DMVR searching process con-
is used, a CU is split into two partitions by a geometrically sists of an integer sample MV offset search and a fractional
located straight line, and two merge indices (one for sample MV refinement process. The integer sample MV
each partition) are further signaled. In total, 64 different searching calculates the distortion between each pair of
partition layouts are supported by geometric partitioning candidate reference blocks in list 0 and list 1, and the
for each possible CU size from 8 × 8 to 64 × 64, excluding search range is ±2 integer luma samples from the ini-
8 × 64 and 64 × 8. The location of the splitting line is tial MVs. The fractional sample refinement is derived by
mathematically derived from the angle and offset para- using a parametric error surface approximation instead of
meters of a specific partition. Each part of a geometric using additional searching with distortion measurement
partition in the CU is inter-picture predicted using its comparisons. When the width or height of a CU is larger
own motion, and only uniprediction is allowed for each than 16 luma samples, the CU is split, and DMVR is
partition, that is, each part has one MV and one refer- processed for each 16 × 16 block separately. The refined
ence picture index. The uniprediction motion constraint MVs are used to generate the inter-picture prediction
is applied to ensure that, as in conventional biprediction, samples and are also used in TMVP for the coding of
only two motion-compensated predictions need to be com- subsequent pictures. However, the original MVs are used
puted for each CU. After predicting each of the parts, in the deblocking process and are also used in spatial MVP
the sample values are combined using a blending process- for subsequent CU coding to ease potential pipelining in
ing with adaptive weights along the geometric partition hardware implementations.
edge. Bi-directional optical flow before (BDOF) is another
Biprediction with CU-level weights (BCW): In HEVC, technique for improving temporally bidirectional motion
the biprediction signal is generated by averaging two representation and is used to refine the biprediction signal
prediction signals obtained from two reference pictures of a CU at the 4×4 subblock level. It is applied to CUs coded
and/or using two MVs. Weighted averaging of the two either in the merge mode or the AMVP mode. Similar to
prediction signals is supported in HEVC but with a PROF for affine motion, the BDOF refinement is based
somewhat cumbersome scheme that required establishing on the optical flow concept and assumes homogeneous
weights at the slice level and using the reference picture motion of an object within the current CU. For each
index to control the weight selection. In VVC, this legacy 4 × 4 subblock, a motion difference relative to CU MVs
explicit-weighted prediction scheme is kept and extended is calculated by minimizing the difference between the list
with CU-level syntax control for weighted averaging. Five 0 and list 1 prediction subblocks using the cross-correlation
weights are allowed in this weighted averaging bipredic- and autocorrelation of the horizontal and vertical gradients
tion, w ∈ {−2, 3, 4, 5, 10}/8. For each bipredicted CU, for each prediction sample. The motion difference together
the weight w is determined in one of two ways: 1) for a with the prediction sample gradients is then used to adjust
nonmerge CU, the weight index is signaled after the MVD the bipredicted sample values in the 4 × 4 subblock.
or 2) for a merge CU, the weight index is inferred from Affine motion: In HEVC, only a translational motion
neighboring blocks based on the merge candidate index. model is applied in motion-compensated prediction, which
Vol. 109, No. 9, September 2021 | P ROCEEDINGS OF THE IEEE 1477
Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

cannot efficiently represent many kinds of motion, for


example, zoom in/out, rotation, perspective shifts, and
other nontranslational motion effects that often occur in
the real-world video. In VVC, a CU-based affine motion
mode is introduced to represent nontranslational motion
more efficiently. The affine motion model for a CU is
described by MVs of two control points located at the top-
left and top-right corners (a four-parameter model) or MVs
of three control points located at the top-left, top-right, and
bottom-left corners (a six-parameter model). In the four-
and six-parameter affine AMVP modes, the control-point
MVs for the current CU are signaled in the bitstream. Sim-
ilar to the merge mode for translational motion, the affine
merge mode in VVC directly inherits the affine motion Fig. 8. Wide-angular intra-picture prediction for an example
model from a neighboring block. In this mode, the control- 8 × 4 nonsquare block.
point MVs of the current CU are derived based on the
motion information of the neighboring CUs. To balance
the complexity of the motion-compensated prediction of
the affine mode against the accuracy of the affine motion tion modes, and employs an “MPM” list with six candidates
representation, the affine motion model is approximated to efficiently code the selection among the 67 choices.
using translational motion for each 4 × 4 luma subblock, In HEVC, 33 angular prediction directions are defined from
where the translational MV is computed as the displace- 45◦ to −135◦ in a clockwise direction. In VVC, the angular
ment of the center of the subblock, calculated according precision is basically doubled to produce 65 angles within
to the affine motion model and rounded to 1/16 sample that same range, and another 28 “wide-angle” predic-
fractional accuracy. A set of six-tap IFs, instead of eight-tap tion modes beyond this angular range can be used for
filters, is used in order to reduce the computational and nonsquare blocks. Fig. 8 illustrates an example for an
memory bandwidth complexities. The motion compensa- 8 × 4 (W × H ) block where angular prediction modes
tion IFs are applied to generate the prediction of the referencing beyond 2H + 1 samples from the shorter side
4 × 4 luma subblock with the derived MV. The motion to the left (close to 45◦ ) are replaced with wide-angle
compensation for the chroma components also uses 4 × 4 prediction modes referencing up to 2W + 1 samples from
subblocks. For 4:2:0 video, the MV of a 4 × 4 chroma the longer side above (beyond −135◦ ). There are 14 such
subblock is calculated as the average of the MVs of the selectable wide angles when W > H and another 14 for
top-left and bottom-right 4 × 4 luma subblocks in the H > W , bringing the total number of wide angles to
corresponding 8 × 8 luma region. 28. The replaced modes are signaled using the origi-
Prediction refinement with optical flow (PROF): To nal mode indices, which are adaptively remapped to the
achieve a finer granularity of motion compensation, PROF indices of wide angular modes depending on the block size
can additionally be applied to refine each luma prediction after parsing. The total number of intra-picture prediction
subblock, targeting the effect of samplewise motion com- modes for any particular block size is constant, that is,
pensation. Each prediction sample in a luma subblock is 67, and the mode coding method is the same for all
refined by adding a difference derived based on a simpli- block shapes, but the addition of the wide angles for the
fied optical flow equation using the horizontal and vertical nonsquare blocks brings the total number of supported
gradients of each prediction sample and sample-based directions to 93 and, thus, brings the total number of
MVD relative to the centered subblock MV. PROF is not modes to 95.
applied to chroma samples. Two sets of four-tap interpolation filters IFs with
different frequency cutoffs and 1/32-sample precision
3) Intra-Picture Prediction: The samples of an intra- are used to generate the prediction samples located at
picture coded block are predicted from reference samples fractional-sample positions for the angular modes. The two
in neighboring blocks to the left and above the current sets of four-tap IFs replace lower precision linear interpola-
block, which has previously been decoded (prior to in-loop tion as in HEVC, where one is a DCT-based IF (DCTIF) and
filtering) in the same picture. HEVC uses 35 intra-picture the other one is a four-tap SIF. The DCTIF is constructed
prediction modes, including planar, reference sample aver- in the same way as the one used for chroma compo-
aging (also referred to as the DC mode), and 33 directional nent motion compensation in both HEVC and VVC. The
angular modes. VVC expands the possibilities with tools SIF is obtained by convolving the two-tap linear IF with
further described in the following. [1 2 1]/4 filter. The selection of the IF depends on the
93 intra-picture directional prediction angles: For block size and the angular distance to the horizontal and
each luma coding block size, VVC offers a set of 65 vertical modes. In general, the sharpening DCTIF is applied
directional angular modes, plus the DC and planar predic- more for smaller blocks and for the modes around the

1478 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

horizontal and vertical directions where the correlation


between the reference and original samples tend to be
higher. For nonfractional diagonal angles and selected
wide angles for blocks with more than 32 samples, luma
reference samples are smoothed using a [1 2 1]/4 filter.
Compared with HEVC, where an additional strong smooth-
ing can be applied depending on the “flatness” of the
Fig. 9. MIP process.
reference samples, this is a simplification. Furthermore,
reference sample smoothing is applied only to integer-
slope modes in luma blocks so that it is not cascaded
with interpolation filtering, which is applied to fractional
and decoder without explicit signaling. For 4:2:0 video,
slope modes.
four neighboring chroma samples at specific locations and
Position-dependent prediction combination (PDPC)
their corresponding downsampled luma samples are used
further modifies the prediction of the planar, DC, horizon-
in the derivation process, and three CCLM modes are
tal, vertical, the bottom-left angular mode and its eight
defined based on the locations of the reference samples
adjacent angular modes, and the top-right angular mode
for the derivation of the model parameters.
and its eight adjacent angular modes. PDPC invokes a
Intra Sub-Partition (ISP) mode divides a luma CU
combination of prediction with unfiltered boundary refer-
vertically or horizontally into two or four subpartitions
ence samples and prediction with filtered boundary ref-
depending on the block size. In this mode, all subpartitions
erence samples. The final prediction sample is a linear
share the coding mode information, while the prediction
combination of the initial prediction sample and the ref-
and transform are processed separately. The minimum
erence samples with the combination weights dependent
block size for ISP is 4 × 8 or 8 × 4, and the maximum
on prediction modes and sample location.
block size is 64 × 64. If the block size is 4 × 8 or 8 × 4,
Multiple reference line (MRL) prediction uses more
the corresponding block is divided into two subpartitions.
reference lines besides the nearest spatial neighboring
Otherwise, it is divided into four subpartitions. Each sub-
reconstructed samples for intra-picture prediction. In this
partition corresponds to a TB, with each TB having at least
mode, instead of using the nearest line of neighboring
16 samples.
samples as the reference line for intra-picture prediction,
samples from two other lines (a reference line two lines 4) Transforms and Quantization: In HEVC, an integer
away and a reference line three lines away) can be approximation of the DCT type-II transform is used as the
used. major transform applied to residual signals with square
Matrix-based intra-picture prediction (MIP) is a block sizes from 4 × 4 to 32 × 32, and as an exception,
newly added prediction mode in VVC. It was first pro- an integer approximation of the DST type-VII transform is
posed as a neural-network-based prediction but was later applied for 4 × 4 intra-picture prediction residual blocks.
simplified to use a matrix multiplication and an indexed The conventional uniform reconstruction quantizer design
table of matrices [16]. For predicting the W × H samples for scalar quantization of the transformed residual can be
of a rectangular block, MIP performs the following three extended in HEVC by sign data hiding. To achieve better
steps, as shown in Fig. 9: 1) averaging is applied to energy compaction of the residual signals and further
one left column of H reconstructed neighboring boundary reduce the quantization error of the transformed coeffi-
samples and one top line of W reconstructed neighbor- cients, VVC introduces new tools, which will be reviewed
ing boundary samples to get the reduced (downsampled) in the following.
boundary samples bdryred ; 2) a subsequent matrix–vector Non-square transforms are supported for the non-
multiplication with a matrix Ai and an offset vector bi square TBs in VVC by applying different length transform
generates the intermediate prediction signal pred red ; and kernels in horizontal and vertical directions. The maximum
3) linear interpolation generates the prediction signal pred transform size is extended to 64 × 64 to have better energy
by upsampling pred red . The matrix coefficients for each MIP compaction for the residual signals of large-sized smooth
mode i are pretrained with 8-bit precision. Overall, 16 areas.
16×4 matrices, eight 16×8 matrices, and six 64×7 matrices Multiple transform selection (MTS) is used for resid-
are specified for MIP. ual coding for both inter-picture and intra-picture coded
Cross-component linear model (CCLM) prediction blocks. It provides the ability to select among a predefined
modes are a prediction method specifically for chroma subset of (integerized) sinusoidal transforms that include
components to exploit cross-component redundancy, DCT type-II, DST type-VII, and DCT type-VIII transforms
in which the chroma samples at positions (x, y) are for CUs with both width and height smaller than or equal
predicted based on the reconstructed luma samples to 32. As shown in Table 2, five combinations of horizontal
recY (x, y) of the same CU by using a linear model and vertical transform kernels can be signaled as the
predC (x, y) = α recY (x, y)+β, where the CCLM parameters (encoder-side) primary transform for a CU. To reduce the
(α and β) are derived the same way in both the encoder complexity of large-size DST type-VII and DCT type-VIII

Vol. 109, No. 9, September 2021 | P ROCEEDINGS OF THE IEEE 1479


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

Table 2 Mapping of MTS Modes to Transform Kernels flag is signaled to indicate whether the whole residual
block or only a subpart of it is coded. In the former case,
inter-MTS information is further parsed to determine the
transform type of the CU. In the latter case, a part of the
residual block is coded with an inferred primary transform
type, and the other part of it is zeroed out. The part
with coded residual can be one-half or one-quarter the
size of the CU and can be located in the left, right, top,
or bottom region of the CU, which results in a total of eight
computation, for blocks with size (width or height, or both SBT modes.
width and height) equal to 32, only the coefficients within Adaptive chroma QP offset allows extending block-
the 16 × 16 lower frequency region are retained, and based quantization control for luma, which is similar in
the high-frequency transform coefficients are zeroed out spirit as the one introduced in HEVC version 2 by the
for these transforms. For the TBs with size (width or range extensions. Block-level QP control is widely used
height, or both width and height) equal to 64, only DCT in practical implementation for rate control and perceptu-
type-II is used, where only the coefficients within the ally optimized encoding approaches. In addition to signal
32 × 32 lower frequency region are retained and the high- luma QP changes for an area of blocks (quantization
frequency transform coefficients are zeroed out. In case a groups), chroma QPs are derived from the luma QP of
low-complexity encoder does not have the resources to test the colocated block via lookup tables. To support a wide
and signal the MTS, an implicit MTS can be used as an range of transfer functions and color formats, the lookup
alternative. In that case, a combination of DCT type-II and tables are defined by piecewise linear mapping functions
DST type-VII is derived based on the width and the height that are determined by an encoder and coded in the
of the current TB. SPS. Furthermore, VVC extends the range of QP values
Low-frequency non-separable transform (LFNST) can from 0 to 63 + 6∗ (BitDepth−8) in order to achieve low
be applied to the low-frequency components of the primary bit rates.
transform to better exploit the directionality characteristics Dependent Quantization (DQ) refers to an approach in
particularly of intra-picture coded CUs with DCT type-II as which the set of available reconstruction values for a given
the primary transform. It is applied between the forward transform coefficient depends on the reconstruction values
primary transform and quantization at the encoder side that were selected for transform coefficients that precede it
and between the inverse quantization scaling and inverse in scanning order. The main effect of this approach, in com-
primary transform at the decoder side. In LFNST, a 4 × 4 or parison to conventional independent scalar quantization
8 × 8 nonseparable transform is applied according to the as used in HEVC, is that the average distortion between
TB size. The 4 × 4 LFNST is applied to the low-frequency an input vector given in an M-dimensional vector space
transform coefficients of the TBs with width or height, (all transform coefficients in a TB) and the closest recon-
or both width and height equal to 4, and the 8 × 8 LFNST struction vector can be globally reduced. The approach
is applied for low-frequency transformed coefficients of of dependent scalar quantization in VVC is realized by:
the TBs with both width and height greater than 4. All 1) defining two scalar quantizers, denoted by Q0 and Q1,
transform coefficients outside the 4 × 4 or 8 × 8 LFNST with different sets of reconstruction levels and 2) defining
zone are discarded (set to zero). To further reduce the a process for switching states between the use of the
computational complexity and storage size of transform two scalar quantizers. The location of the available recon-
matrices, in the case of 8 × 8 LFNST, only 48 coefficients struction levels is uniquely specified by a quantization
from the primary transform are used as inputs, and only 16 step size . The scalar quantizer used (Q0 or Q1) is not
coefficients are generated as outputs from the secondary explicitly signaled in the bitstream. Instead, the quantizer
transform. Thus, a maximum of 16 coefficients needs to used for a current transform coefficient is determined by
be coded for any TB with LFNST mode enabled. For 4×N , the parities (k & 1) of the transform coefficient levels k that
N ×4, and 8 × 8 blocks, only eight coefficients are output precede the current transform coefficient in the scanning
from the secondary transform. order. As shown in Fig. 10, the switching between the two
In LFNST, a total of four transform sets and two non- scalar quantizers is realized via a state machine with four
separable transform matrices (kernels) per transform set states.
are predefined. The transform set to be used is determined Joint coding of chroma residual (JCCR) is used
based on intra-picture prediction modes. For each trans- to further reduce the redundancy of the two chroma
form set, the selected nonseparable secondary transform components’ residual signals when they are similar to
candidate is further specified by an explicitly signaled each other. Instead of signaling the residual for the two
LFNST index that is signaled for the CU. chroma components separately, one of three JCCR modes
Subblock Transform (SBT) is introduced for inter- with various weighting combinations of a single-coded
picture predicted CUs in VVC. In this transform mode, chroma residual can be selectively applied at the
only a subpart of the residual block is coded. A CU-level CU level.

1480 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

than 0 (significant), greater than 1, or greater than 2 and


Golomb–Rice coding of the remaining absolute values.
In VVC, the truncated unary part was modified by adding
an additional parity flag to facilitate the state transition
for DQ. Compared with HEVC, VVC introduced a more
advanced probability model selection for the syntax ele-
ments related to absolute values of transform coefficient
levels, depending on the values of the absolute levels or
partially reconstructed absolute levels in a local neighbor-
hood template. The template comprises two neighboring
positions to the right, two below, and one below-right
relative to the current scan position.

Fig. 10. State transition and quantizer selection. 6) In-Loop Filtering: In VVC, a remapping operation and
three in-loop filters can be applied sequentially to the
reconstructed picture to modify its representation domain
5) Entropy Coding: As in HEVC, CABAC is used as the and alleviate different types of artifacts. First, a new
single entropy coding method in VVC. The CABAC design sample-based process called LMCS is performed. Then,
in VVC contains various coding efficiency improvements a deblocking filter is used to reduce blocking artifacts.
compared with the design in HEVC. The changes in the two SAO is then applied to the deblocked picture to attenuate
main parts of entropy coding, namely the CABAC engine ringing and banding artifacts. Finally, an ALF reduces other
and transform coefficient coding, are further described in potential distortion introduced by the quantization and
this section. transform processes. The deblocking filter design is based
CABAC engine with multihypothesis probability esti- on the one in HEVC but is extended with longer deblocking
mate: The CABAC engine in AVC and HEVC uses a filters and a luma-adaptive filtering mode designed specif-
table-based probability transition process between 64 dif- ically for HDR video. While SAO is the same as in HEVC,
ferent representative probability states. The range repre- and the deblocking is very similar, LMCS and ALF are new
senting the state of the coding engine is quantized to a set compared with previous standards. The design of ALF in
of four values prior to the calculation of the new interval VVC consists of two operations: 1) ALF with block-based
range. The state transition is implemented using a table filter adaption for both luma and chroma samples and 2) a
containing all the precomputed values to approximate cross-component ALF (CC-ALF) for chroma samples.
the values of the new probability interval range. In VVC, Luma mapping with chroma scaling (LMCS): Unlike
the basic concept is kept, but the binary arithmetic coder is other in-loop filters that, in general, apply filtering
applied with a multihypothesis probability update model, processes for a current sample by using the information
based on two probability estimates P0 and P1 that are of its spatial neighboring samples to reduce the coding
associated with each context model and are updated inde- artifacts, LMCS involves modifying the input signal before
pendently with different adaptation rates. The probability encoding by redistributing the amplitudes across the entire
estimate P that is used for the interval subdivision in the representation dynamic range for improved compression
binary arithmetic coder is the average of the estimates from efficiency. LMCS has two main components: 1) in-loop
the two hypotheses. The adaptation rates of P0 and P1 for mapping of the luma component based on adaptive piece-
each context model are pretrained based on the statistics wise linear models and 2) luma-dependent chroma resid-
of the associated binary events. ual scaling for the chroma components. Luma mapping
Improved transform coefficient coding: In HEVC, makes use of a forward mapping function and a corre-
transform coefficients of a coding block are coded by cate- sponding inverse mapping function. The forward mapping
gorizing them into coefficient groups (CGs or subblocks) function is a piecewise linear function with 16 equally
such that each CG contains the coefficients of a 4 × 4 sized segments that is signaled in the bitstream. The
subblock inside a square, power-of-2 sized TB. VVC also inverse mapping function does not need to be signaled
adopts the concept of CGs for coefficient coding. Besides and is instead derived from the forward mapping function.
the legacy 4 × 4 CG, additional CG sizes (1 × 16, 16 × 1, The luma mapping model is signaled in an adaptation
2 × 8, 8 × 2, 2 × 4, and 4 × 2) are introduced due to parameter set (APS; see Section IV-C2), and up to four
narrow luma TBs resulting from ISP and small chroma TBs. LMCS APSs with different mapping models can be used
The CGs inside a TB and the transform coefficients within in a CVS. When LMCS is enabled for a slice, the inverse
a CG are coded following a single reverse diagonal scan mapping function is applied to all the reconstructed luma
order. Similar to HEVC, the transform coefficient levels blocks to convert the samples back to the original domain
are coded using a combination of different binarizations. for display output and for storage as reference pictures.
This includes truncated unary coding with a cascade of For an inter-picture coded block, the forward mapping
flags that indicate whether the absolute value is greater function needs to be applied to the luma prediction signal

Vol. 109, No. 9, September 2021 | P ROCEEDINGS OF THE IEEE 1481


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

within the decoding process, as the reference pictures are greater than or equal to 8 (in units of chroma samples),
in the original domain. This is not required for intra-picture and three chroma samples from each side are filtered.
prediction because the reconstructed signal before inverse Luma-adaptive deblocking further adjusts tC and β of
mapping is used as a prediction in that case. Chroma resid- the deblocking filter based on the averaged luma level of
ual scaling is designed to compensate for the interaction the reconstructed samples. When luma-adaptive deblock-
between the luma signal and its corresponding chroma ing is enabled, an offset qpOffset, which is derived based
signals. When luma mapping is enabled, an additional flag on the average luma level around the filtering boundary,
is signaled to indicate whether a luma-dependent chroma is added to the average QPs of the two adjacent blocks.
residual scaling is enabled or not. The chroma residual The value of qpOffset as a function of average luma level
scaling factor depends on the average value of top and/or is determined by a table of thresholds signaled in the SPS,
left reconstructed neighboring luma samples of the current which may typically be chosen according to the transfer
CU. Once the scaling factor is determined, the forward characteristics (the electro-optical transfer function and
scaling is applied to both the intra-picture and inter-picture opto-optical transfer function) of the source video content.
predicted residual at the encoding stage, and the inverse Adaptive loop filter (ALF): Two filter shapes are used
scaling is applied to the reconstructed residual. in block-based ALF. A 7 × 7 diamond shape is applied for
Deblocking filter boundary handling modifications: the luma component, and a 5 × 5 diamond shape is applied
The deblocking filter is applied to the samples adjacent to for the chroma components. One among up to 25 filters
a CU, TU, and subblock boundary except for the case when is selected for each 4 × 4 block, based on the direction
the boundary is also a picture boundary, or when deblock- and activity of local gradients. Each 4 × 4 block in the
ing is disabled across slice, tile, or subpicture boundaries picture is classified based on directionality and activity.
(which is an option that can be signaled by the encoder). Before filtering each 4 × 4 block, simple geometric trans-
The deblocking filtering process is applied on a 4 × 4 grid formations, such as rotation or diagonal and vertical flip,
for CU boundaries and transform subblock boundaries and can be applied to the filter coefficients, depending on the
on an 8 × 8 grid for prediction subblock boundaries. The gradient values calculated for that block. This is equivalent
prediction subblock boundaries include the PU boundaries to applying these transformations to the samples in the
introduced by the SBTMVP and affine modes, and the filter support region. The idea is to make different blocks
transform subblock boundaries include the TU bound- to which ALF is applied more similar by aligning their
aries introduced by SBT and ISP modes and transforms directionality. Block-based classification is not applied to
due to implicit splits of large CUs. As done in HEVC, the chroma components.
the processing order of the deblocking filter is defined as ALF filter parameters are signaled in an APS. In one APS,
horizontal filtering for vertical edges for the entire picture up to 25 sets of luma filter coefficients and clipping value
first, followed by vertical filtering for horizontal edges. This indices and up to eight sets of chroma filter coefficients
specific order enables either multiple horizontal filtering or and clipping value indices can be signaled. To reduce
vertical filtering processes to be applied in parallel threads bit overhead, filter coefficients of different classifications
or can still be implemented on a CTB-by-CTB basis with for the luma component can be merged. In the PH or SH,
only a small processing latency. the IDs of up to seven APSs can be signaled to specify the
Deblocking long filters: The deblocking filtering luma filter sets that are used for the current picture or
process is similar to that of HEVC. The boundary filter slice. The filtering process is further controlled at the CTB
strength (bS) of the deblocking filter is controlled by level. For each luma CTB, a filter set can be chosen among
the values of several syntax elements of the two adja- 16 fixed-value filter sets and the filter sets signaled in APSs.
cent blocks, and according to the filter strength and the For the chroma components, an APS ID is signaled in the
average QP of the adjacent blocks, two thresholds, tC PH or SH to indicate the chroma filter sets being used for
and β, are determined from predefined tables. For luma the current picture or slice. At the CTB level, a filter index
samples, one of four cases, no filtering, weak filtering, is signaled for each chroma CTB if there is more than one
short strong filtering, and long strong filtering, is chosen chroma filter set in the APS. When ALF is enabled for a
based on β and block size. There are three cases: no CTB, for each sample within the CTB, the diamond-shaped
filtering, normal filtering, and strong filtering for chroma filter selected for the respective 4 × 4 block is used, with a
samples. Compared with HEVC, long strong filtering for clipping operation applied to limit the difference between
luma samples and strong filtering for chroma samples are each neighboring sample and the current sample. The
newly introduced in VVC. Long luma strong filtering is clipping operation introduces a nonlinearity by reducing
used when the samples on either side of a boundary belong the impact of neighbor sample values that are too different
to a large block. A sample belonging to a large block is from the current sample value.
defined as when the width is larger than or equal to 32 for Cross-component adaptive loop filter (CC-ALF) can
a vertical edge or when the height is larger than or equal to further enhance each chroma component on top of the
32 for a horizontal edge. Up to seven samples at one side of previously described ALF. The goal of CC-ALF is to use luma
a boundary are filtered in the strong filter. Strong chroma sample values to refine each chroma component. This is
filtering is applied when both sides of the chroma edge are achieved by applying a diamond-shaped high-pass linear

1482 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 9, September 2021


Bross et al.: Developments in International Video Coding Standardization After AVC, With an Overview of VVC

Cross-component adaptive loop filter (CC-ALF) can further enhance each chroma component on top of the previously described ALF. The goal of CC-ALF is to use luma sample values to refine each chroma component. This is achieved by applying a diamond-shaped high-pass linear filter and then using the output of this filtering operation for chroma refinement. Fig. 11 provides a system-level diagram of the CC-ALF process with respect to the other loop filters. As shown in Fig. 11, CC-ALF uses the same inputs as the luma ALF in order to avoid an additional sequential processing stage of loop-filter processing.

Fig. 11. ALF and CC-ALF diagrams.
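Conceptually, the refinement is a filtered luma term added to the ALF output of the chroma sample. The following C++ sketch is illustrative only (hypothetical interface and shift value, not the normative CC-ALF process).

    // Refine one ALF-filtered chroma sample from the luma samples in the
    // diamond-shaped support around the corresponding luma position.
    int ccAlfRefine(int alfFilteredChroma, const int* lumaTaps,
                    const int* coeff, int numTaps, int shift) {
        int sum = 0;
        for (int k = 0; k < numTaps; ++k)
            sum += coeff[k] * lumaTaps[k];
        // With a high-pass filter, flat luma regions yield little or no
        // correction; luma edges and texture steer the refinement.
        int delta = (sum + (1 << (shift - 1))) >> shift;
        return alfFilteredChroma + delta;
    }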
7) Screen Content Coding Tools: One of the design goals for VVC is the efficient coding of computer-generated video content, which exhibits different signal characteristics than camera-captured video. The characteristics mainly include a lack of high-frequency sensor noise, large uniformly flat areas with sharp edges, repeated patterns, highly saturated colors, or a limited number of different colors. Tools to efficiently exploit these characteristics had been added to the HEVC RExt and SCC extensions. These tools, with some refinements, have also been used as the basis for the following SCC tools in VVC.

Block-level differential pulse code modulation (BDPCM) targets better decorrelation of the screen content prediction residuals by applying samplewise DPCM to the residual instead of a typical frequency transform. Similar to the RDPCM introduced in HEVC RExt, the DPCM can be applied in the horizontal (along rows) or vertical (along columns) direction. For intra-picture predicted CUs, the direction is explicitly signaled, and the intra-picture prediction mode is derived from it; for example, vertical DPCM implies vertical intra-picture prediction. However, while the RDPCM in HEVC can be applied to inter-picture prediction residuals, the BDPCM in VVC is restricted to only intra-picture predicted CUs.
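As a hedged illustration (hypothetical code, not the VTM implementation), a decoder can undo the samplewise DPCM by accumulating the transmitted differences along the signaled direction:

    #include <vector>

    // Undo vertical BDPCM: each quantized residual was predicted from the
    // sample directly above it, so reconstruction accumulates down columns.
    void bdpcmUndoVertical(std::vector<std::vector<int>>& res) {
        for (size_t i = 1; i < res.size(); ++i)
            for (size_t j = 0; j < res[i].size(); ++j)
                res[i][j] += res[i - 1][j];
    }

    // Undo horizontal BDPCM: accumulate along each row.
    void bdpcmUndoHorizontal(std::vector<std::vector<int>>& res) {
        for (auto& row : res)
            for (size_t j = 1; j < row.size(); ++j)
                row[j] += row[j - 1];
    }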
Transform skip residual coding (TSRC) adapts the CABAC entropy coding of the spatial transform skip residual block to screen-content-specific characteristics. In the HEVC RExt extensions, this statistical difference was already partly considered by using 180° rotation of intra-picture predicted transform skip residuals and a dedicated context model for the flag that indicates an absolute value greater than zero (the significance flag). In VVC, this includes three main aspects: 1) the explicit signaling of the position that indicates the first nonzero value when reverse scanning diagonally from bottom right to top left is omitted and the scanning direction is inverted (from top left to bottom right), as motivated by the higher probability for trailing zeros or insignificant levels at the bottom-right corner of the block (due to the lack of energy compaction by a transform); 2) sign indicators can be coded more efficiently using context models due to nonstationarities in the sequence of sign flags even when the global empirical distribution is still almost uniformly distributed; and 3) the binarization of absolute level values is changed, resulting in a higher cutoff for the unary binarization prefix, that is, more context-coded "greater than X" flags, and a modified Rice parameter derivation for the Golomb–Rice code suffix. This is motivated as well by larger nonstationarities in the empirical distribution of spatial residuals compared with transform coefficients.
Intra-picture block copy (IBC) makes use of repeated patterns inside a picture. It can be seen as a very basic form of motion-compensated prediction with integer MVs (called block vectors) referencing previously coded regions of the same picture instead of previously coded reference pictures. Compared with the HEVC SCC extensions, the IBC in VVC was simplified with regard to the reference sample buffers. In HEVC, IBC relies on the inter-picture design with minor modifications, such as that the RPL only contains the current picture and that a motion or block vector is always in integer precision and has a restriction of the area that it refers to, for example, restricting it to already-decoded samples. However, the IBC in VVC is simplified and decoupled from inter-picture prediction by storing reference samples in a smaller local buffer. This buffer is restricted to contain only the previously coded samples in the current CTU and the CTU to its left. Another difference is having a dedicated IBC merge mode for block vector coding, which is simpler than the VVC inter-picture merge mode. Furthermore, the integer block vector precision from HEVC SCC is extended in VVC to use block-level AMVR as well (see Section IV-B2) but with only full- or four-integer sample precision.
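The buffer restriction can be pictured as a validity test on the block vector. The following C++ fragment is a deliberately simplified, hypothetical model; the normative constraints (e.g., the exact reference memory tracking) are more detailed.

    // Check that an IBC reference block lies inside the simplified local
    // buffer (current CTU plus the CTU to its left) and does not overlap
    // the not-yet-reconstructed current block.
    struct Block { int x, y, w, h; };  // luma position and size

    bool ibcBlockVectorValid(const Block& cu, int bvx, int bvy, int ctuSize) {
        const int refL = cu.x + bvx, refT = cu.y + bvy;
        const int refR = refL + cu.w, refB = refT + cu.h;   // exclusive
        const int ctuTop  = (cu.y / ctuSize) * ctuSize;     // same CTU row
        const int ctuLeft = (cu.x / ctuSize) * ctuSize;
        if (refT < ctuTop || refB > ctuTop + ctuSize) return false;
        if (refL < ctuLeft - ctuSize || refR > ctuLeft + ctuSize) return false;
        const bool overlap = refL < cu.x + cu.w && refR > cu.x &&
                             refT < cu.y + cu.h && refB > cu.y;
        return !overlap;
    }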
Palette mode is used to represent the sample values in a CU by a set of representative color values. This set is referred to as the palette. For a CU coded in the palette mode, a palette is first signaled, and then, for each sample in the CU, a palette index is signaled. In VVC, for the slices with separate luma/chroma coding trees, the palette is applied on luma (the Y component) and chroma (Cb and Cr components) separately, with the luma palette entries containing only Y values and the chroma palette entries containing both Cb and Cr values. For slices with a single coding tree, palette coding is applied on three color components jointly, that is, each entry in the palette contains Y, Cb, and Cr values. It is also possible to specify a sample that is outside the palette by signaling an escape symbol. For samples within the CU that are coded using the escape mechanism, their quantized values are directly signaled. Although it can be applied to all chroma formats, the palette mode can only be enabled in the profiles that support 4:4:4 video (see Section IV-C8).
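A minimal sketch of the reconstruction side may clarify the escape mechanism (hypothetical data layout; the actual syntax additionally codes the index map with run-based mechanisms):

    #include <array>
    #include <vector>

    // Palette reconstruction for a CU with jointly coded Y, Cb, and Cr.
    // An index equal to the palette size acts as the escape symbol, for
    // which the (quantized) sample values are signaled directly.
    struct PaletteCU {
        std::vector<std::array<int, 3>> palette;      // signaled first
        std::vector<int> indices;                     // one per sample
        std::vector<std::array<int, 3>> escapeValues; // consumed in order
    };

    std::vector<std::array<int, 3>> reconstructPaletteCU(const PaletteCU& cu) {
        std::vector<std::array<int, 3>> out;
        out.reserve(cu.indices.size());
        size_t esc = 0;
        for (int idx : cu.indices) {
            if (idx == static_cast<int>(cu.palette.size()))
                out.push_back(cu.escapeValues[esc++]); // escape-coded sample
            else
                out.push_back(cu.palette[idx]);        // palette lookup
        }
        return out;
    }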

Adaptive Color Transform (ACT) can be applied to reduce the correlation between the three color components in the 4:4:4 chroma format, which is especially effective for video sequences represented in RGB color spaces. The ACT in VVC is the same as in the HEVC SCC extension. It performs in-loop color-space conversion in the prediction residual domain by adaptively converting the residuals from the input color space (presumed to be RGB) to the YCgCo-R luma–chroma color representation [32]. A flag at the CU level is used to indicate whether the residuals of the CU are coded with the YCgCo-R transformation or in the original color space. The YCgCo-R transformation is fully reversible, so it can even be applied for lossless coding. In order to reduce cache storage requirements, when ACT is enabled for a CVS, the maximum transform size cannot exceed 32 × 32 samples since ACT requires temporarily storing all three TBs.
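The lifting structure of the YCgCo-R transform [32] makes the exact reversibility easy to verify; the C++ sketch below shows the forward and inverse steps for one sample triple (arithmetic right shifts are assumed).

    struct YCgCoR { int y, cg, co; };

    // Forward lifting steps; each step is exactly undone by the inverse.
    YCgCoR rgbToYCgCoR(int r, int g, int b) {
        int co = r - b;
        int t  = b + (co >> 1);
        int cg = g - t;
        int y  = t + (cg >> 1);
        return {y, cg, co};
    }

    void yCgCoRToRgb(const YCgCoR& c, int& r, int& g, int& b) {
        int t = c.y - (c.cg >> 1);
        g = c.cg + t;
        b = t - (c.co >> 1);
        r = b + c.co;
    }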
8) 360° Video Coding Tools: Another design goal for VVC is the efficient coding of immersive video. This includes 360° video, which is typically coded by representing a 2-D picture that has been generated by a projection mapping from a 3-D sphere. One example of such a mapping is the equirectangular projection format (ERP), in which the sphere is projected onto a rectangular picture with some geometric distortions, especially at the poles. Another mapping is the cube map projection (CMP), where the sphere is mapped onto the six faces of a cube, which are then packed together into one picture. The ability to indicate such formats and the following two techniques have been added to VVC to increase the coding efficiency for video pictures using these projection formats:

MV wrap-around allows for prediction samples to "wrap-around" from the opposite left or right boundary in cases where an MV points outside of the coded area. In ERP pictures, the content tends to be continuous across such a wrap-around due to the 360° nature of the projection mapping, which can result in having a moving object that is partly at the left boundary and partly at the right boundary of a picture.
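A minimal sketch of the sample-position handling is shown below (hypothetical code; in the standard, the wrap-around offset is signaled rather than fixed to the picture width).

    // Horizontal wrap-around of a reference sample position for ERP
    // content: instead of clamping to the picture boundary (padding),
    // the position wraps to the opposite side, exploiting the 360-degree
    // continuity of the projection.
    int wrapAroundX(int x, int picW) {
        int m = x % picW;
        return m < 0 ? m + picW : m;  // proper modulo for negative x
    }

    // Vertical positions are still clamped as in conventional padding.
    int clampY(int y, int picH) {
        return y < 0 ? 0 : (y >= picH ? picH - 1 : y);
    }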
Virtual boundaries for in-loop filtering prevent applying in-loop filtering across certain "virtual" boundaries, for example, not slice or tile boundaries but boundaries corresponding to the CMP face boundaries in CMP pictures. The locations of these boundaries are typically signaled at the CVS level.

C. Systems and Transport Interfaces

VVC inherited many aspects of the systems and transport interfaces from HEVC and the associated header syntax. The bitstream structure is the same as in HEVC except that the concept of an elementary stream is not included. The NAL unit syntax and NAL unit header are both similar as in HEVC, with a small difference in the NAL unit header syntax, where HEVC uses six bits for the NAL unit type field, while VVC uses only five bits, thus allowing half of the maximum number of specified NAL unit types. The VPS, SPS, PPS, and SH followed the same design principle as in HEVC and contain similar types of header parameters. The support of temporal scalability in VVC is also basically the same as in HEVC. Other aspects of the systems and transport interfaces in VVC are summarized in the following, focusing on the differences compared with HEVC.

1) Random Access Support: VVC supports three types of IRAP pictures, two types of IDR pictures (one type with and one type without associated pictures that precede them in display order), and one type of CRA picture. These are basically the same as in HEVC. The BLA picture types in HEVC are not included in VVC, mainly because: 1) the basic functionality of BLA pictures can be realized using CRA pictures and an end of sequence NAL unit, the presence of which indicates that the next picture starts a new CVS in a single-layer bitstream and 2) there was a desire for specifying fewer NAL unit types than in HEVC to simplify the design understanding, as reflected by the use of five instead of six bits for the NAL unit type field in the NAL unit header.

Another key difference in random access support between VVC and HEVC is the support of GDR in a more normative manner in VVC. In GDR, the decoding of a bitstream can start from an inter-picture coded picture, and although, in the beginning, some parts of the picture region cannot be correctly decoded, after decoding a number of additional pictures, the entire picture region would become correct for decoding later pictures in the bitstream. (AVC and HEVC can also support a form of GDR, using a recovery point indication SEI message for signaling the GDR random access points and the recovery points.) In VVC, a new NAL unit type is specified for an indication of GDR pictures, the recovery point is signaled in the picture header (PH) syntax structure, and a bitstream or a CVS within a bitstream is allowed to start with a GDR picture. This means that it is allowed for an entire bitstream to contain only inter-picture coded pictures without a single intra-picture coded picture. The main benefit of specifying GDR support in this way is to provide a conforming behavior for GDR operation. GDR enables encoders to smooth out the bit rate of a bitstream by distributing intra-picture coded slices or blocks across multiple pictures that also contain inter-picture predicted slices or blocks, as opposed to intra-picture coding of entire pictures, thus allowing significant end-to-end delay reduction to improve behavior for ultralow-delay applications, such as wireless display, online gaming, and drone-based applications.

Another GDR-related feature in VVC is the virtual boundary signaling discussed earlier. The boundary between the refreshed region (i.e., the correctly decoded region) and the unrefreshed region at a picture between a GDR picture and its recovery point can be signaled as a virtual boundary, and when signaled, in-loop filtering across the boundary would not be applied; thus, a decoding mismatch for some samples at or near the boundary would not occur. This can also be useful when the application involves displaying the correctly decoded regions during the GDR process.
2) Adaptation Parameter Set: VVC introduced a new type of parameter set called the APS. An APS conveys picture- and/or slice-level information that may be shared by multiple slices of a picture and/or by slices of different pictures but can change frequently from picture to picture, with the total number of variants potentially being high and thus not suitable for inclusion into the PPS. Three types of parameters are included in APSs: ALF parameters, LMCS parameters, and scaling list parameters for frequency-specific inverse quantization scaling. The main purpose of introducing APSs is to save signaling overhead.
3) Picture Header: VVC also uses a PH, which contains header parameters for a particular picture. Each picture must have exactly one PH. The PH basically carries those parameters that would have been in the SH if the PH were not introduced but would have the same value for all slices of a picture. These include IRAP/GDR picture indications, flags indicating whether inter-picture and intra-picture coded slices are allowed, picture ordering position syntax, information on RPLs, deblocking, SAO, ALF, QP selection, weighted prediction control, coding block partitioning information, virtual boundaries, colocated picture information, and so on. It often occurs that each picture in an entire sequence of pictures contains only one slice. To avoid needing to have at least two NAL units for each picture, the PH syntax structure can be included either in the PH NAL unit or in the SH in this case. The main purpose of introducing the PH was to save signaling overhead for cases where pictures are split into multiple slices.
4) Reference Picture Management: Reference picture management is core functionality that is necessary for any video coding scheme that uses multipicture buffering with generalized inter-picture prediction. It manages the storage and removal of reference pictures into and from a decoded picture buffer (DPB) and puts reference pictures in their proper order in the RPLs. Reference picture management in VVC is more similar to HEVC than AVC but is somewhat simpler and more robust. As in those standards, two RPLs, called list 0 and list 1, are derived, but they are not based on the reference picture set concept used in HEVC or the automatic sliding window process used in AVC; instead, they are signaled more directly. Reference pictures are listed for the RPLs as either active or inactive entries, and only the active entries may be used as reference indices for inter-picture prediction of CTUs of the current picture. Inactive entries indicate other pictures to be held in the DPB for potential referencing by other pictures that arrive later in the bitstream.
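A rough C++ sketch of the active/inactive distinction (hypothetical structures; the actual signaling is based on POC deltas and long-term reference indications):

    #include <vector>

    // A directly signaled reference picture list: entries appear in final
    // list order, and only the first numActive entries may be referenced
    // by the current picture; the remaining entries keep pictures in the
    // DPB for use by later pictures.
    struct Rpl {
        std::vector<int> pocs;  // pictures identified here by POC
        int numActive;
    };

    // A DPB picture must be retained if it appears in either list.
    bool mustKeepInDpb(int poc, const Rpl& list0, const Rpl& list1) {
        for (int p : list0.pocs) if (p == poc) return true;
        for (int p : list1.pocs) if (p == poc) return true;
        return false;
    }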
5) High-Level Picture Partitioning: VVC also includes four different high-level picture partitioning schemes but not the same set as in HEVC. VVC inherited the tiles and WPP from HEVC, with some minor-to-moderate differences. The basic concept of slices was kept in VVC but designed in an essentially different form. VVC introduces subpictures that provide the same region extraction functionality as MCTSs but are designed in a different way to have better coding efficiency and to be friendlier for usage in application systems. More detail about these differences is described in the following.

Tiles and WPP: As in HEVC, a picture can be split into tile rows and tile columns in VVC, intra-picture prediction across tile boundaries is disallowed, and so on. However, the syntax for signaling the tile partitioning has been simplified by using a unified syntax design for both the uniform and the nonuniform use cases. The WPP design in VVC has two differences compared with HEVC: 1) the CTU row delay is reduced from two CTUs to one CTU and 2) the signaling of entry point offsets for WPP in the SH is optional in VVC, while it is mandatory in HEVC.

Slices: In VVC, the support of conventional slices based on CTUs (as in HEVC) or macroblocks (as in AVC), that is, such that each slice consists of an arbitrary number of CTUs or macroblocks in raster scan order within a tile or within a picture, has been removed. The main reasoning behind this architectural change is as follows. The advances in video coding since 2003 (the publication year of AVC v1) have been such that slice-based error concealment has become practically impossible due to the ever-increasing number and efficiency of intra-picture and inter-picture prediction mechanisms. An error-concealed picture is the decoding result of a transmitted coded picture for which there has been some data loss (e.g., loss of some slices) of the coded picture or a reference picture so that at least some part of the decoded picture is not error-free (e.g., because one or more reference pictures were lost or were error-concealed pictures). For example, when one of the multiple slices of a picture is lost, it may be error-concealed using interpolation of the neighboring slices. While AVC prediction mechanisms provide significantly higher coding efficiency, they also make it harder for algorithms to estimate the quality of an error-concealed picture, which was already a hard problem with the use of simpler prediction mechanisms. Advanced intra-picture prediction mechanisms also function much less well if a picture is split into multiple slices. Furthermore, network conditions have become significantly better in the meantime. As a result, very few implementations have recently used slices for MTU size matching. Instead, substantially all applications where low-delay error/loss resilience is required (e.g., video telephony and video conferencing) have come to rely on system/transport-level error resilience (e.g., retransmission and forward error correction) and/or picture-based resilience tools (feedback-based resilience, insertion of IRAPs, scalability with uneven protection of the base layer, and so on). With all these, it is very rare that a picture that cannot be correctly decoded is passed to the decoder, and when such a rare case occurs, the system can afford to wait for an error-free picture to be decoded and available for display without frequent and long periods of picture freezing.

Slices in VVC have two modes: rectangular slices and raster-scan slices. As the name implies, rectangular slices always have a rectangular shape, typically consisting of a number of complete tiles that collectively cover a rectangular region of the picture, as shown in Fig. 12. However, it is also possible that a rectangular slice is a subset of a tile and consists of one or more consecutive, complete CTU rows within a tile, as shown in Fig. 13. A raster-scan slice consists of one or more complete tiles in tile raster scan order, and hence, the region covered by a raster-scan slice is typically not a rectangle (e.g., as shown in Fig. 14), although it may also happen to be a rectangle.

Fig. 12. Picture with 18 × 12 luma CTUs that are partitioned into 24 tiles and nine rectangular slices.

Fig. 13. Picture partitioned into four tiles and four rectangular slices (note that the top-right tile is split into two rectangular slices).

Fig. 14. Picture with 18 × 12 luma CTUs that are partitioned into 12 tiles and three raster-scan slices.

The layout of rectangular slices (including the position and the size of each of the slices) is signaled in the PPS based on the layout of tiles. Information on which tiles are included in a raster-scan slice is signaled in the SH.

Subpictures: As mentioned earlier, the subpicture feature was newly introduced during the development of VVC. Each subpicture consists of one or more complete rectangular slices that collectively cover a rectangular region of the picture, as shown in Fig. 15. A subpicture may be either specified to be extractable (i.e., coded independently of other subpictures of the same picture and of other subpictures of earlier pictures in decoding order) or not extractable. Furthermore, the encoder can control whether in-loop filtering (including deblocking, SAO, and ALF) across the subpicture boundaries is enabled individually for each subpicture.

Fig. 15. Picture partitioned into 18 tiles, 24 slices, and 24 subpictures.

Functionally, subpictures are the same as the MCTSs that have been supported with SEI messages in HEVC. They both allow independent coding and extraction of a rectangular subset of a sequence of coded pictures, for use cases such as viewport-dependent 360° video streaming optimization and region-of-interest (ROI) applications. In streaming of 360° video, also known as omnidirectional video, at any particular moment, only a subset (i.e., the current viewport) of the entire omnidirectional video sphere would be rendered to the user, while the user can turn their head at any time to change their viewing orientation and, consequently, the current viewport. While it is desirable to have at least some lower quality representation of the area not covered by the current viewport available at the client and ready to be rendered to the user in case they suddenly change their viewing orientation to somewhere else on the sphere, a high-quality representation of the omnidirectional video is only needed for the current viewport that is actively being rendered to the user. Splitting the high-quality representation of the entire omnidirectional video into subpictures at an appropriate granularity can enable such an optimization.

An example subpicture-based viewport-dependent 360° video delivery scheme is shown in Fig. 16, wherein a higher resolution representation of the full video scene consists of subpictures, while a lower resolution representation of the full video scene does not use subpictures and can be coded with less-frequent random access points than the higher resolution representation. The client receives the full video in the lower resolution, and for the higher resolution video, it only receives and decodes the subpictures that cover the current viewport.

Fig. 16. Subpicture-based viewport-dependent 360° video delivery scheme.

One key difference between VVC subpictures and MCTSs is that the subpicture feature in VVC allows the MVs of a coding block to point outside of the subpicture even when the subpicture is extractable, relying on decoder padding at subpicture boundaries in this case, similarly as at picture boundaries. This allows higher coding efficiency compared with the tight encoder-side motion constraints applied for MCTSs. Another important aspect in the VVC design is that rewriting of the SHs (and PH NAL units, when present) is not needed when extracting one or more VVC subpictures from a sequence of pictures to create a subbitstream that is a conforming bitstream. In subbitstream extraction based on HEVC MCTSs, rewriting of SHs is needed. Although rewriting of SPSs and PPSs is needed in both extraction cases, the number of SPSs and PPSs in a bitstream is low, while each picture has at least one slice and the amount of data in the slices can be very large; therefore, the rewriting of SHs can be a significant burden for application systems. Furthermore, VVC specifies HRD and level definitions for subpicture sequences; thus, the conformance of the subbitstream of each extractable subpicture sequence can be relied upon for system functionalities, such as subpicture-based bitstream extraction and merging.

The layout of subpictures in VVC is signaled in the SPS, and thus, it is constant within a CVS. The trick that enables the extraction of subpicture sequences without rewriting SHs and PHs is through the signaling of subpicture IDs. The subpicture ID of a subpicture can be different from the value of the subpicture index, and the subpicture ID mapping (a list of subpicture IDs, one for each subpicture) is signaled, which may either be constant within a CVS (in which case it is signaled in the SPS) or allowed to change in the pictures within a CVS (in which case it is signaled in the PPS). In the SH, the subpicture ID of the subpicture containing the slice is signaled, and the subpicture-level slice index is also signaled. The subpicture ID and the subpicture-level slice index together tell the decoder where to place the decoded tiles or in-tile CTU rows in the slice into the decoded picture. In an extracted subbitstream containing a subset of the subpictures in each picture of an original bitstream, the same subpicture ID value would still be signaled in the rewritten SPS or PPS, even when the subpicture now has a different subpicture index value. Therefore, even when the raster-scan CTU address of the first CTU in a slice in the subpicture has changed compared with the value in the original bitstream, the unchanged subpicture ID and subpicture-level slice index in the SH can still correctly determine the position of each CTU in the decoded picture of the extracted bitstream.
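The indirection through subpicture IDs can be sketched as a simple lookup (hypothetical structures, not the normative derivation):

    #include <vector>

    // The SPS or PPS carries the subpicture ID mapping (one ID per
    // subpicture index); the SH carries the subpicture ID, which stays
    // valid after extraction even though the index may change.
    struct SubpicLayout { std::vector<int> subpicIds; };

    // Map the ID signaled in the SH back to the subpicture index in the
    // (possibly rewritten) layout of the current bitstream.
    int subpicIndexFromId(const SubpicLayout& layout, int subpicIdInSh) {
        for (size_t i = 0; i < layout.subpicIds.size(); ++i)
            if (layout.subpicIds[i] == subpicIdInSh)
                return static_cast<int>(i);
        return -1;  // ID not present in this bitstream
    }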
6) Picture Resolution Changes With Inter-Picture Prediction: In AVC and HEVC, the spatial resolution of pictures cannot change unless a new CVS is started using a new SPS and an IRAP picture. VVC enables picture resolution changes within a CVS without encoding an IRAP picture, thus allowing inter-picture prediction with references to pictures having a different resolution than the current picture that is being decoded. This feature is often referred to as RPR, as it requires resampling of the reference pictures that are used for inter-picture prediction when they have a different resolution than that of the current picture.

The scaling ratio for RPR is restricted to be larger than or equal to 1/2 (factor-of-2 downsampling from the reference picture to the current picture) and less than or equal to 8 (factor-of-8 upsampling). Three sets of resampling filters with different frequency cutoffs are specified to handle various scaling ratios between a reference picture and the current picture. The three sets of resampling filters are applied for the scaling ratios ranging from 1/2 to 1/1.75, 1/1.75 to 1/1.25, and 1/1.25 to 8, respectively. Each set of resampling filters has 16 phases for luma and 32 phases for chroma, which are the same as the number of phases for the filters used for motion compensation interpolation.

In fact, the conventional motion compensation interpolation process is a special case of the resampling process with the scaling ratio in the range from 1/1.25 to 8. The horizontal and vertical scaling ratios are derived based on picture width and height, and left, right, top, and bottom scaling offsets are specified for the reference picture and the current picture.
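As a hedged illustration of the ratio derivation and filter selection (hypothetical code; a 14-bit fixed-point representation is assumed, and the exact boundary handling of the normative derivation is omitted):

    enum class RprFilterSet { Sharpest, Medium, Softest };

    // Fixed-point ratio between reference and current picture width
    // (14-bit fraction, rounded); values above 1 << 14 mean the reference
    // is larger, that is, downsampling toward the current picture.
    int rprScaleFp(int refWidth, int curWidth) {
        return ((refWidth << 14) + (curWidth >> 1)) / curWidth;
    }

    // Pick one of the three filter sets: the stronger the downsampling,
    // the lower the frequency cutoff of the chosen resampling filters.
    RprFilterSet chooseRprFilterSet(int scaleFp) {
        const int one = 1 << 14;
        if (scaleFp > (7 * one) / 4) return RprFilterSet::Softest;  // beyond 1.75x
        if (scaleFp > (5 * one) / 4) return RprFilterSet::Medium;   // 1.25x-1.75x
        return RprFilterSet::Sharpest;  // up to 1.25x down, and all upsampling
    }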
7) Scalability Support: Due to the support of RPR, in VVC, the support of a bitstream containing multiple spatial scalability layers, for example, two layers with SD and HD resolution, does not require any additional signal processing coding tools, as the upsampling process needed for spatial scalability support can just use the RPR upsampling filter. Nevertheless, some high-level syntax changes (compared with not supporting scalability) are needed for scalability support.

Scalability support is specified in VVC v1, and compared with the scalability support methods in the earlier video coding standards, including in extensions of AVC and HEVC, the design of VVC scalability has been made friendlier to single-layer decoder designs. The decoding capability for multilayer bitstreams is specified in a manner as if there were only a single layer in the bitstream. For example, the decoding capability, such as DPB size, is specified in a manner that is independent of the number of layers in the bitstream to be decoded. Basically, a decoder designed for single-layer bitstreams does not need much modification to be able to decode multilayer bitstreams. Compared with the designs of multilayer extensions of AVC and HEVC, the HLS aspects have been significantly simplified at the sacrifice of some flexibility. For example, an IRAP AU is required to contain a picture for each of the layers present in the CVS.

With the scalability support, not only conventional spatial scalability, quality scalability, and multiview scalability are enabled but also some combinations of scalability and subpictures are enabled in VVC v1. For example, the subpicture-based viewport-dependent 360° video delivery scheme shown in Fig. 16 can be improved by allowing interlayer prediction, as shown in Fig. 17.

Fig. 17. Subpicture-based viewport-dependent 360° video delivery scheme making use of inter-layer prediction.

8) Profile, Tier, and Level (PTL) Aspects: Two new aspects regarding PTL in VVC have been introduced: the general constraints and the subprofile concept. In HEVC, the PTL syntax structure includes a few general constraint fields for indications, such as whether the bitstream may contain interlaced source content. In VVC, almost every substantial tool or feature has a corresponding general constraint flag. The main reason for this is to enable third parties, for example, an application system standards body or even a company, to be able to easily indicate that certain tools are not used in the bitstream, in case these tools are not conveniently usable by them, without the need of going through the time-consuming process and difficult consensus negotiations that are required for specifying a new VVC profile. The subprofile concept was introduced for a similar purpose. It enables a third party to define a subprofile with a subset of the tools/features contained in an existing VVC profile, by just going through a simple identifier registration process (as specified by Rec. ITU-T T.35 [33]).

VVC version 1 defines six profiles: 1) two single-layer video profiles, the Main 10 profile and the Main 10 4:4:4 profile, which basically support all the coding tools but restrict the bitstream to contain only one layer (although there is no restriction on temporal scalability support of sublayers); 2) two multilayer video profiles, the Multilayer Main 10 profile and the Multilayer Main 10 4:4:4 profile, with the only difference compared with the two single-layer video profiles being that the bitstream can contain multiple layers; and 3) two still picture profiles, the Main 10 Still Picture profile and the Main 10 4:4:4 Still Picture profile, with the only difference compared with the two single-layer video profiles being that the bitstream can contain only one picture, which needs to be intra-picture coded.

V. VVC CODING EFFICIENCY

A. Objective

The JVET has specified some common test conditions (CTCs) [34] to conduct experiments in a well-defined manner to allow for a fair comparison of the outcome of experiments. The CTCs were used to evaluate the proposals during VVC development. The CTC definition includes three mandatory test conditions, reflecting all-intra, random access, and low-delay settings, and the random access case is considered more important than the others due to its much broader usage in applications. A set of 18 video sequences, including Classes A1 and A2 (3840 × 2160), Class B (1920 × 1080), Class C (832 × 480), Class D (416 × 240), Class E (1280 × 720), and Class F (variant resolution), is employed in the experiments. Classes A–D represent camera-captured video, Class E has video conferencing sequences, and Class F has screen content sequences.

Class E is not tested for the random access case (since that case has a higher delay than would be acceptable for video conferencing). Class A sequences are not tested in the low-delay case (since such source material would seldom be used in low-delay applications). All of these test materials are progressively scanned and use 4:2:0 color sampling with 8 or 10 bits per sample. For the random access case, the structural delay is set to 16 frames, and the IRAP random access interval is set to be approximately 1 s. Four rate points are tested with constant QP settings, with the base QP set to 22, 27, 32, and 37 and with the QP of higher temporal sublayers derived using fixed offsets from these values. The experiments in this article use the JVET CTC conditions and the Bjøntegaard delta bit rate (BD-rate) measurement method [35], [36] to evaluate the compression performance based on the following weighted average of peak signal-to-noise ratio (PSNR) values per color component:

PSNR_YUV = (6 · PSNR_Y + PSNR_Cb + PSNR_Cr) / 8.

The heavier weighting of PSNR_Y is to somewhat compensate for the fact that most of the bits are used to encode the luma component of the video pictures (and it is the most perceptually important component).
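The weighting is straightforward to compute; the following C++ helpers (illustrative only) implement the formula above together with the usual per-component PSNR definition.

    #include <cmath>

    // Weighted PSNR combination used in the JVET CTC evaluation.
    double psnrYuv(double psnrY, double psnrCb, double psnrCr) {
        return (6.0 * psnrY + psnrCb + psnrCr) / 8.0;
    }

    // Per-component PSNR from the mean squared error; maxVal is 255 for
    // 8-bit and 1023 for 10-bit content.
    double psnrFromMse(double mse, int maxVal) {
        return 10.0 * std::log10(double(maxVal) * maxVal / mse);
    }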
In this article, the coding efficiencies of VVC, HEVC, and AVC are compared. A more detailed comparison of HEVC coding efficiency with its predecessors can be found in [19]. In the experiments, the respective reference software encoders were used, that is, the VVC Test Model (VTM-9.0) [37], the HEVC Test Model (HM-16.20) [38], the HEVC SCC Extension Test Model (SCM-8.8) [39] for Class F, and the AVC Joint Test Model (JM-19.0) [40]. Due to level constraints on the DPB capacity for AVC, a random access configuration with a structural delay of eight frames is employed, which can be found in the "cfg/HM-like" configuration files in the JM software package. Average BD-rate savings of VVC over AVC and HEVC for each class of sequences are tabulated in Table 3. The overall average BD-rate savings are based on class A/B/C/E test sequences, which are considered as representing target user scenarios for VVC. It can be seen that the BD-rate savings of VVC over AVC for random access reach 65% on average with up to 72% for 4k resolutions. Compared with HEVC in various configurations, VVC provides the highest coding gain in the random access case, where an average 36.9% YUV BD-rate saving is achieved, and for test sequences with 4k resolutions, the savings are more than 40%. For low-delay and all-intra configurations, VVC achieves 31.1% and 25.5% average YUV BD-rate savings, respectively. For Class F, representing SCC, VVC achieves even higher BD-rate savings when comparing the VTM to the HM in HEVC Main 10 profile configuration. While Main 10 is the most deployed HEVC profile, higher coding efficiency for screen content can be achieved with the HEVC Screen-Extended Main 10 profile, introduced in version 4 of HEVC (see Section III-B4). Comparing the VTM to the SCM in HEVC Screen-Extended Main 10 profile configuration (the last row of the table), VVC still provides 26.4% YUV BD-rate savings for random access, 32.5% YUV BD-rate savings for low delay, and 17.8% YUV BD-rate savings for all-intra. Fig. 18 shows an example of random access rate-distortion plots of VVC, HEVC, and AVC, for the CatRobot1 UHD video test sequence.

Table 3. YUV BD-Rate Savings of VVC (VTM-9.0) Over AVC and HEVC.

Fig. 18. Rate-distortion plots of VVC, HEVC, and AVC for the CatRobot1 video test sequence (random access configuration).

A so-called tool-off test has been used to investigate the impact of new coding tools within different modules of VVC. In a tool-off test, a specific set of tools is turned off in VTM-9.0, while all other new coding features remain enabled, and the results are compared with those of VTM-9.0 with all tools turned on. Table 4 shows the gain of the inter-picture coding tools (affine, SBTMVP, AMVR, GPM, BDOF, CIIP, MMVD, BCW, DMVR, and SMVD), intra-picture coding tools (MIP, MRL, ISP, and CCLM), transform & quantization tools (DQ, MTS, LFNST, SBT, and JCCR), and loop filtering tools (ALF, CC-ALF, and LMCS) for the random access case.

Table 4. Random Access YUV BD-Rate Savings of VVC (VTM-9.0) Over VVC Without Specific Tool Sets.

The coding gain of the VVC QT+MTT block partitioning scheme can be approximated by comparing the first version of the VTM [41], which is basically adding QT+MTT on top of HEVC, to HM-16.20. VTM-1.0 provides around 10% YUV BD-rate savings for random access over the HEVC HM. It should be noted that some VVC coding features, for example, the improved CABAC engine, transform coefficient coding, intra-picture prediction mode coding, and PDPC, cannot be turned off in the VVC reference software. Hence, their respective gains are not included in this experiment. Table 4 further lists relative encoding and decoding runtimes for the averages, where 100% represents the runtime of the respective anchor. The presented results show that VVC's coding efficiency improvement over HEVC stems from multiple new coding features in each major module. In addition, the combined gains of all four tool sets (inter, intra, transform and quantization, and loop filtering) are just slightly lower than the sum of the individual gains. An additional tool-on test, where each specific tool set is enabled on top of a version of VTM with all tools off, has been performed as well, and the results are not significantly different from those of the tool-off test.

B. Subjective

The compression capability goal of the HEVC and VVC projects has been to reduce the bit rate for a given level of subjective video quality, that is, the quality perceived by human observers. While PSNR is a convenient objective measurement method, it is not an adequate substitute for subjective quality measurement. This motivated the JVET to initiate formal testing activities using rigorous subjective assessment methods in order to verify the coding efficiency of the final standard. The first such verification test was completed in October 2020, covering UHD SDR content in a random access configuration, as may be used in newer streaming and broadcast television applications [42]. Here, five challenging UHD SDR sequences outside the JVET test set were selected and encoded over a range of five quality levels spanning from annoying to almost imperceptible impairments. Although the main focus was on comparing the VVC reference software VTM with the HEVC reference software (HM), an open-source VVC encoder implementation (VVenC) was also included in the tests as well [43]. The tested VVenC version 0.1 in "medium" preset runs significantly faster (110×) than VTM and additionally includes subjective quality enhancement techniques, that is, temporal filtering of the input video and perceptually tuned bit allocation [44]. Table 5 summarizes the subjective mean opinion score (MOS) and objective PSNR-YUV-based BD-rate savings for all five test sequences. This test verifies that the VTM and VVenC encoders for VVC significantly improve compression, with the VTM reducing the bit rate by 43% on average relative to the HM for the same perceived quality and VVenC reducing the bit rate by an additional 12% relative to the VTM. On the other hand, the PSNR-YUV BD-rate savings are much lower and even negative (i.e., a bit rate increase) for VVenC versus the VTM. For both tested VVC encoders, the measured subjective quality benefit relative to the HM somewhat exceeds the benefit measured by PSNR-YUV BD-rate numbers—a phenomenon that was also observed for HEVC relative to its AVC predecessor [19]. Fig. 19 shows pooled results for all five test sequences by plotting the arithmetic average of the MOS values over the geometric average of the corresponding rate points. It can be seen that the quality levels of the VTM and HM are well matched.

Table 5. MOS and PSNR-YUV BD-Rate Savings of VVC (VTM-10.0) Over HEVC (HM-16.22) and of an Optimized VVC Encoder (VVenC-0.1) Over VTM.

Fig. 19. Average (arithmetic) MOS and (geometric mean) bit rates of VVC (VTM and VVenC encoders) and HEVC (HM encoder) pooled over the five UHD SDR sequences used in the verification test.

At the time of writing, testing of HD SDR (random access and low delay), HDR, and 360° video content is ongoing and expected to be completed in April 2021 [45].

VI. CONCLUSION AND OUTLOOK

VVC is a major advance in both video compression capability and the versatility of the application domain, again demonstrating about 50% bit rate reduction for equal subjective quality—a characteristic that it shares with its HEVC and AVC predecessors as a new milestone generation of video coding technology. In terms of applications, it has substantial new features for such uses as the coding of HDR and 360° video content, streaming with adaptive picture resolution, support for compressed-domain bitstream extraction and merging, and, practically, all of the features of the prior international video coding standards and their extensions (e.g., extended chroma formats, scalability, multiview coding, and SCC). Optimized encoder and decoder implementations of VVC have begun to emerge and have clearly demonstrated that the standard is feasible to implement with good compression performance and practical levels of complexity. While the first version of VVC has included only bit depths up to 10 bits per sample, the first extension work for VVC has begun to extend it to support higher bit depths and enhance its performance in the very high (near lossless) fidelity range.

Further research will result in further improvements in video compression, but it may be difficult to significantly surpass the capability of the VVC design for quite a few years to come. Artificial intelligence technologies have shown great promise in that direction, but this work has just begun to emerge, and such techniques are typically difficult to implement at the high speeds and low costs that are necessary for widespread deployment in many video applications. Another promising direction is the development of improved methods of measuring perceptual video quality. Given some improved method of measuring quality, there may be improved compression technologies that can optimize that quality. Yet another interesting direction is the concept of video coding for machines, where the key difference compared with conventional video coding is that the decoded video quality measurement needs to take into account the performance of a nonhuman usage of the decoded video for some particular purposes, for example, by self-driving vehicles.

The breadth of applications of video coding technology also continues to expand, as in recent and emerging work on the coding of point clouds, textures mapped onto moving 3-D meshes, and plenoptic light field coding. Such technologies will bring new requirements to the compression technology, although the VVC standard seems quite flexible to address the stable and well-understood applications that have driven the current demand for a new international standard.

Acknowledgment

The authors would like to thank the experts of ITU-T VCEG, ISO/IEC MPEG, and their ITU-T/ISO/IEC Joint Video Experts Team (JVET) for their contributions. Their work has not only led to the development of the new Versatile Video Coding (VVC) standard but also made a large archive of innovative contributions available for further study. The archive of JVET documents can be found online at https://www.jvet-experts.org/.

REFERENCES

[1] High Efficiency Video Coding, Recommendation ITU-T H.265 and ISO/IEC 23008-2 (HEVC), ITU-T and ISO/IEC JTC 1, Apr. 2013.
[2] Advanced Video Coding for Generic Audio-Visual Services, Recommendation ITU-T H.264 and ISO/IEC 14496-10 (AVC), ITU-T and ISO/IEC JTC 1, May 2003.
[3] G. J. Sullivan and T. Wiegand, "Video compression—From concepts to the H.264/AVC standard," Proc. IEEE, vol. 93, no. 1, pp. 18–39, Jan. 2005.
[4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688–703, Jul. 2003.
[6] Versatile Video Coding, Recommendation ITU-T H.266 and ISO/IEC 23090-3 (VVC), ITU-T and ISO/IEC JTC 1, Jul. 2020.
[7] Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams, Recommendation ITU-T H.274 and ISO/IEC 23002-7 (VSEI), ITU-T and ISO/IEC JTC 1, Jul. 2020.
[8] Cisco Systems, "Cisco visual networking index: Forecast and trends, 2017–2022," Cisco Syst., White Paper, 2019. [Online]. Available: http://web.archive.org/web/20181213105003/https:/www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.pdf
[9] Video Codec for Audiovisual Services at P x 64 kbit/s, Recommendation ITU-T H.261, ITU-T, 1993.
[10] Codecs for Videoconferencing Using Primary Digital Group Transmission, Recommendation ITU-T H.120, ITU-T, 1993.
[11] Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1,5 Mbit/s—Part 2: Video, ISO/IEC 11172-2, ISO/IEC JTC 1, 1993.
[12] Information Technology—Generic Coding of Moving Pictures and Associated Audio Information: Video, Recommendation ITU-T H.262 and ISO/IEC 13818-2, ITU-T and ISO/IEC JTC 1, 1995.
[13] Video Coding for Low Bit Rate Communication, Recommendation ITU-T H.263, ITU-T, Mar. 1996.
[14] Information Technology—Coding of Audio-Visual Objects—Part 2: Visual, document ISO/IEC 14496-2, ISO/IEC JTC 1, 2001.
[15] H. Schwarz, D. Marpe, and T. Wiegand, "Analysis of hierarchical B pictures and MCTF," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Toronto, ON, Canada, Jul. 2006, pp. 1929–1932.
[16] J. Pfaff et al., "Data-driven intra-prediction modes in the development of the versatile video coding standard," ITU J. ICT Discoveries, vol. 3, no. 1, May 2020.
[17] Information Technology—Digital Compression and Coding of Continuous-Tone Still Images—Part 1: Requirements and Guidelines, Recommendation ITU-T T.81 and ISO/IEC 10918-1, ITU-T and ISO/IEC JTC 1, 1992.
[18] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[19] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards—Including High Efficiency Video Coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[20] R. Sjöberg et al., "Overview of HEVC high-level syntax and reference picture management," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1858–1870, Dec. 2012.
[21] Information Technology—Coding of Audio-Visual Objects—Part 12: ISO Base Media File Format, document ISO/IEC 14496-12, ISO/IEC JTC 1, 2004.
[22] Information Technology—Dynamic Adaptive Streaming Over HTTP (DASH)—Part 1: Media Presentation Description and Segment Formats, document ISO/IEC 23009-1, ISO/IEC JTC 1, 2012.
[23] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth, and M. Zhou, "An overview of tiles in HEVC," IEEE J. Sel. Topics Signal Process., vol. 7, no. 6, pp. 969–977, Dec. 2013.
[24] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, "Tile based HEVC video for head mounted displays," in Proc. IEEE Int. Symp. Multimedia (ISM), San Jose, CA, USA, Dec. 2016, pp. 399–400.
[25] C. C. Chi et al., "Parallel scalability and efficiency of HEVC parallelization approaches," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1827–1838, Dec. 2012.
[26] D. Flynn et al., "Overview of the range extensions for the HEVC standard: Tools, profiles, and performance," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 4–19, Jan. 2016.
[27] J. M. Boyce, Y. Ye, J. Chen, and A. K. Ramasubramonian, "Overview of SHVC: Scalable extensions of the high efficiency video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 20–34, Jan. 2016.
[28] G. Tech, Y. Chen, K. Muller, J.-R. Ohm, A. Vetro, and Y.-K. Wang, "Overview of the multiview and 3D extensions of high efficiency video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 35–49, Jan. 2016.
[29] J. Xu, R. Joshi, and R. A. Cohen, "Overview of the emerging HEVC screen content coding extension," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 50–62, Jan. 2016.
[30] J. Chen, M. Karczewicz, Y.-W. Huang, K. Choi, J.-R. Ohm, and G. J. Sullivan, "The joint exploration model (JEM) for video compression with capability beyond HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 5, pp. 1208–1225, May 2020.
[31] B. Bross et al., "General video coding technology in responses to the joint call for proposals on video compression with capability beyond HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 5, pp. 1226–1240, May 2020.
[32] H. S. Malvar, G. J. Sullivan, and S. Srinivasan, "Lifting-based reversible color transformations for image compression," Proc. SPIE, vol. 7073, Aug. 2008, Art. no. 707307, Paper 7073-07.
[33] Procedure for the Allocation of ITU-T Defined Codes for Non-Standard Facilities, Recommendation ITU-T T.35, ITU-T, 1988.
[34] F. Bossen, J. Boyce, X. Li, V. Seregin, and K. Sühring, JVET Common Test Conditions and Software Reference Configurations for SDR Video, document JVET-N1010, 14th Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Mar. 2019.
[35] G. Bjøntegaard, Improvement of BD-PSNR Model, document VCEG-AI11 of ITU-T SG16/Q6, Berlin, Germany, Jul. 2008. [Online]. Available: http://wftp3.itu.int/av-arch/video-site/0807_Ber/
[36] Working Practices Using Objective Metrics for Evaluation of Video Coding Efficiency Experiments, document ITU-T HSTP-VID-WPOM and ISO/IEC DTR 23002-8, ITU-T and ISO/IEC JTC 1, 2020.
[37] VVC Reference Software Version 8.0. Accessed: Feb. 2020. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tags/VTM-8.0
[38] HEVC Reference Software Version 16.20. Accessed: Sep. 2018. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/HM/-/tags/HM-16.20
[39] HEVC Screen Content Coding Extension Reference Software Version 16.21+SCM8.8. Accessed: Mar. 2020. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/HM/-/tags/HM-16.21+SCM-8.8
[40] AVC Reference Software Version 19.0. Accessed: Mar. 2019. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/JM/-/tags/JM-19.0
[41] J. Chen and E. Alshina, Algorithm Description for Versatile Video Coding and Test Model 1 (VTM 1), document JVET-J1002, 10th Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Apr. 2018.
[42] V. Baroncini and M. Wien, VVC Verification Test Report for UHD SDR Video Content, document JVET-T2020, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.
[43] Fraunhofer HHI VVenC Software Repository. Accessed: Sep. 2020. [Online]. Available: https://github.com/fraunhoferhhi/vvenc
[44] A. Wieckowski et al., Open Optimized VVC Encoder (VVenC) and Decoder (VVdeC) Implementations, document JVET-T0099, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.
[45] M. Wien, V. Baroncini, A. Segall, and Y. Ye, VVC Verification Test Plan (Draft 4), document JVET-T2009, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.

ABOUT THE AUTHORS

Benjamin Bross (Member, IEEE) received the Dipl.-Ing. degree in electrical engineering from RWTH Aachen University, Aachen, Germany, in 2008.
In 2009, he joined the Fraunhofer Institute for Telecommunications–Heinrich Hertz Institute, Berlin, Germany, where he is currently the Head of the Video Coding Systems group and a part-time Lecturer with the HTW Berlin University of Applied Sciences. Since 2010, he has been very actively involved in the ITU-T Video Coding Experts Group (VCEG)|ISO/IEC MPEG video coding standardization processes as a Technical Contributor, a Coordinator of core experiments, and the Chief Editor of the High Efficiency Video Coding (HEVC) standard (ITU-T H.265|ISO/IEC 23008-2) and the Versatile Video Coding (VVC) standard (ITU-T H.266|ISO/IEC 23090-3). Besides giving talks about recent video coding technologies, he is the author or a coauthor of several fundamental HEVC-related publications and the author of two book chapters on HEVC and inter-picture prediction techniques in HEVC.
Mr. Bross received the IEEE Best Paper Award of the 2013 IEEE International Conference on Consumer Electronics–Berlin in 2013, the SMPTE Journal Certificate of Merit in 2014, and the Emmy Award at the 69th Engineering Emmy Awards in 2017 as a part of the Joint Collaborative Team on Video Coding for its development of HEVC.

Jianle Chen (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2006, respectively.
He was formerly with Samsung Electronics, Suwon, South Korea, Qualcomm, San Diego, CA, USA, and Huawei, Santa Clara, CA, USA, focusing on the research of video technologies. Since 2006, he has been actively involved in the development of various video coding standards, including the High Efficiency Video Coding (HEVC) standard, its scalable, format range and screen content coding extensions, and, most recently, the Versatile Video Coding (VVC) standard in the Joint Video Experts Team (JVET). He has been the main developer of the recursive partitioning structure with large block size, which is one of the key features of the HEVC standard and its potential successors. He is currently the Director of the Multimedia R&D Group, Qualcomm, Inc. His research interests include video coding and transmission, point cloud coding, AR/VR, and neural network compression.
Dr. Chen was an Editor of the HEVC specification version 2 (the scalable HEVC (SHVC) text specification) and SHVC Test Model. For VVC, he has been the Lead Editor of the Joint Exploration Test Model (JEM) and the VVC Test Model (VTM). He is an Editor of the VVC Text Specification.

Jens-Rainer Ohm (Member, IEEE) has been holding the chair position of the Institute of Communication Engineering, RWTH Aachen University, Aachen, Germany, since 2000. He is currently the Dean of the Faculty of Electrical Engineering and Information Technology, RWTH Aachen University. Since 1998, he has been participating in the work of the Moving Picture Experts Group (MPEG). He has authored textbooks on multimedia signal processing, analysis, and coding on communication engineering and signal transmission and numerous articles in these fields. His research and teaching activities cover the areas of multimedia signal processing, analysis, compression, transmission, and content description, including 3-D and VR video applications, biosignal processing and communication, application of deep learning approaches in the given fields, and fundamental topics of signal processing and digital communication systems.
Dr. Ohm has been chairing/cochairing various standardization activities in video coding, namely the MPEG Video Subgroup 2002–2018, the Joint Video Team (JVT) of MPEG and ITU-T SG 16 Video Coding Experts Group (VCEG) 2005–2009, and the Joint Collaborative Team on Video Coding (JCT-VC) since 2010 and the Joint Video Experts Team (JVET) since 2015. He has served on the editorial boards of several journals and program committees of various conferences.

Gary J. Sullivan (Fellow, IEEE) received the B.S. and M.Eng. degrees from the University of Louisville, Louisville, KY, USA, in 1982 and 1983, respectively, and the Ph.D. degree from the University of California at Los Angeles, Los Angeles, CA, USA, in 1991.
He is currently a Video and Image Technology Architect with Microsoft Research, Redmond, WA, USA. He has been the Longstanding Chairman/Co-Chairman of various video and image coding standardization activities in ITU-T Video Coding Experts Group (VCEG), ISO/IEC MPEG, ISO/IEC JPEG, and in their joint collaborative teams since 1996. He has led the development of the Advanced Video Coding (AVC) standard (ITU-T H.264|ISO/IEC 14496-10), the High Efficiency Video Coding (HEVC) standard (ITU-T H.265|ISO/IEC 23008-2), the Versatile Video Coding (VVC) standard (ITU-T H.266|ISO/IEC 23090-3), and various other projects. At Microsoft, he has been the Originator and the Lead Designer of the DirectX Video Acceleration (DXVA) video decoding feature of the Microsoft Windows operating system.
Dr. Sullivan is a Fellow of SPIE. He received the IEEE Masaru Ibuka Consumer Electronics Award, the IEEE Consumer Electronics Engineering Excellence Award, two IEEE Transactions on Circuits and Systems for Video Technology Best Paper Awards, and the SMPTE Digital Processing Medal. The team efforts that he has led have been recognized by three Emmy Awards.

Ye-Kui Wang received the B.S. degree in industrial automation from Beijing Institute of Technology, Beijing, China, in 1995, and the Ph.D. degree in information and telecommunication engineering from the Graduate School in Beijing, University of Science and Technology of China, Hefei, China, in 2001.
His earlier working experiences and titles include the Chief Scientist of Media Coding and Systems, Huawei Technologies, San Diego, CA, USA, the Director of Technical Standards with Qualcomm, San Diego, a Principal Member of Research Staff with Nokia Corporation, Tampere, Finland, and so on. He is currently a Principal Scientist with Bytedance Inc., San Diego. He has been an active contributor to various multimedia standards, including video codecs, file formats, RTP payload formats, and multimedia streaming and application systems, developed by various standardization organizations, including ITU-T Video Coding Experts Group (VCEG), ISO/IEC Moving Picture Experts Group (MPEG), the Joint Video Team (JVT), the Joint Collaborative Team on Video Coding (JCT-VC), JCT-3V, 3GPP SA4, IETF, AVS, DVB, ATSC, and DECE. He has coauthored about 1000 standardization contributions and over 50 academic articles. He is a listed inventor for more than 300 U.S. patents. His research interests include video coding, storage, transport, and multimedia systems.
He has been chairing the development of OMAF at MPEG and has been an editor for several standards, including Versatile Video Coding (VVC), Versatile Supplemental Enhancement Information (VSEI), OMAF, all versions of High Efficiency Video Coding (HEVC), the VVC file format, the HEVC file format, the layered HEVC file format, ITU-T H.271, the SVC file format, MVC, RFC 6184, RFC 6190, RFC 7798, 3GPP TR 26.906, and 3GPP TR 26.948.