Developments in International Video Coding Standardization After AVC, With An Overview of Versatile Video Coding (VVC)
ABSTRACT | In the last 17 years, since the finalization of the first version of the now-dominant H.264/Moving Picture Experts Group-4 (MPEG-4) Advanced Video Coding (AVC) standard in 2003, two major new generations of video coding standards have been developed. These include the standards known as High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC). HEVC was finalized in 2013, repeating the ten-year cycle time set by its predecessor and providing about 50% bit-rate reduction over AVC. The cycle was shortened by three years for the VVC project, which was finalized in July 2020, yet again achieving about a 50% bit-rate reduction over its predecessor (HEVC). This article summarizes these developments in video coding standardization after AVC. It especially focuses on providing an overview of the first version of VVC, including comparisons against HEVC. Besides further advances in hybrid video compression, as in previous development cycles, the broad versatility of the application domain that is highlighted in the title of VVC is explained. Included in VVC is the support for a wide range of applications beyond the typical standard- and high-definition camera-captured content codings, including features to support computer-generated/screen content, high dynamic range content, multilayer and multiview coding, and support for immersive media such as 360° video.

KEYWORDS | Compression; H.265; H.266; High Efficiency Video Coding (HEVC); Joint Video Experts Team (JVET); Moving Picture Experts Group (MPEG); standards; versatile supplemental enhancement information (VSEI); Versatile Video Coding (VVC); video; video coding; Video Coding Experts Group (VCEG); video compression.

I. INTRODUCTION

In 2013, the first version of the High Efficiency Video Coding (HEVC) standard was finalized [1], providing about a 50% bit-rate reduction compared with its predecessor, the H.264/MPEG-4 Advanced Video Coding (AVC) standard [2]. Both standards were jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). AVC itself had provided about 50% bit-rate reduction compared with the H.262/MPEG-2 Video standard, which had been produced a decade earlier and was also a joint project of the same organizations [3]–[5]. Now, as of July 2020, VCEG and MPEG have also finalized the Versatile Video Coding (VVC) standard [6], aiming at yet another 50% bit-rate reduction and providing a range of additional …
Fig. 2. Block diagram of a hybrid video encoder, including the modeling of the decoder within the encoder.
… signal by decorrelation, quantization decreases the data of the transform coefficient representation by reducing their precision, ideally by removing only imperceptible details; in such a case, it serves to reduce irrelevance in the data. This hybrid video coding design principle is also used in the two most recent standards, HEVC and VVC. For a more detailed review of the previous standards, spanning from H.120 [10] to AVC and also including H.261, MPEG-1 Video [11], H.262/MPEG-2 Video [12], H.263 [13], and MPEG-4 Visual [14], the reader is referred to [3].

Referring to Fig. 2, a modern hybrid video coder can be characterized by the following building blocks.

Block partitioning is used to divide the image into smaller blocks for the operation of the prediction and transform processes. The first hybrid video coding standards used a fixed block size, typically 16 × 16 samples for the luma prediction regions and 8 × 8 for the transforms. Starting with H.263, and especially starting with AVC, partitioning became a major part of the design focus. Over the subsequent generations, block partitioning has evolved to become more flexible by adding more and different block sizes and shapes to enable adaptation to the local region statistics. In the prediction stage, this allows an encoder to trade off high accuracy for the prediction (using small blocks) versus a low data rate for the side or prediction information to be signaled (using large blocks). For the coding of residual differences, small blocks enable the coding of fine detail, whereas the large ones can code smooth regions very efficiently. With increasing possibilities for partitioning a picture into blocks, the complexity of an encoder that needs to test the possible combinations and decide which to select also increases compared with a fixed size or limited partitioning set. However, fast partitioning algorithms and advances in computing power have allowed recent standards to provide a high degree of flexibility. AVC, HEVC, and VVC all employ tree-based partitioning structures with multiple depth levels and the blocks as leaf nodes, and VVC additionally provides the ability to use nonrectangular partitions.
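As an illustration of such a tree-based split decision (a minimal sketch, not taken from any of the standards; the test image, the cost function, and the per-block penalty are invented for the example), the following Python fragment recursively keeps a block whole or splits it into four quadrants, depending on which alternative is cheaper:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(0.0, 2.0, (64, 64))    # mild texture everywhere
img[:32, :32] += 200.0                   # bright, flat quadrant
img[8:16, 40:48] += 120.0                # small detail forces deeper splits

def cost(block):
    # Toy rate-distortion stand-in: residual energy after removing the
    # block mean (a crude "prediction") plus a per-block rate penalty.
    return float(((block - block.mean()) ** 2).sum()) + 500.0

def quadtree(x, y, size, min_size, leaves):
    # Keep the block whole unless coding the four quadrants separately
    # is cheaper, mimicking the encoder's recursive split decision.
    h = size // 2
    quads = [(x, y), (x + h, y), (x, y + h), (x + h, y + h)]
    if size > min_size and sum(
            cost(img[qy:qy + h, qx:qx + h]) for qx, qy in quads) \
            < cost(img[y:y + size, x:x + size]):
        for qx, qy in quads:
            quadtree(qx, qy, h, min_size, leaves)
    else:
        leaves.append((x, y, size))      # leaf block of the coding tree

leaves = []
quadtree(0, 0, 64, min_size=8, leaves=leaves)
print(sorted({s for _, _, s in leaves}))  # mixture of sizes: [8, 16, 32]
```

Flat areas survive as large leaves, while the small bright patch drives the tree down to its minimum block size, which is exactly the adaptation to local statistics described above.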
Motion-compensated or inter-picture prediction takes advantage of the redundancy that exists between (hence "inter") pictures of a coded video sequence (CVS). A key concept is block-based motion compensation, where the picture is divided into blocks, and for each block, a corresponding area from a previously decoded picture, that is, the reference picture, is used as a prediction for the current block. Assuming that the content of a block moves between pictures with translational motion, the displacement between the current block and the corresponding area in the reference picture is commonly referred to by a 2-D translational motion vector (MV). Finding the best correspondence is typically done at the encoder by a block-matching search that is referred to as motion estimation. The encoder then signals the estimated MV data to the decoder. H.261 used only integer-valued MVs, and this concept of translational motion compensation was later generalized by using fractional-sample MV accuracy with interpolation (with half-sample precision in MPEG-1 and MPEG-2 videos and quarter-sample from MPEG-4 Visual onward), averaging two predictions from one temporally preceding and one succeeding picture (bidirectional prediction in MPEG-1 and MPEG-2 videos) or from multiple
reference pictures with arbitrary relative temporal positions (in standards since AVC). Moreover, the usage of multiple reference pictures from different temporal positions enables hierarchical prediction structures inside a group of pictures (GOP), which further improves coding efficiency. However, when succeeding pictures are used, a structural delay is introduced by requiring a different ordering of the pictures for coding and display [15]. The most recent standard, VVC, even goes beyond the translational motion model by approximating affine motion and using another motion estimation process for motion refinement at the decoder side.
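The following sketch shows the block-matching idea in its simplest form (full search with a sum-of-absolute-differences criterion; real encoders use much faster search patterns and fractional-sample refinement):

```python
import numpy as np

def motion_estimate(cur, ref, bx, by, bsize, search):
    # Full-search block matching: scan a +/- `search` window in the
    # reference picture and keep the displacement (the MV) with the
    # smallest sum of absolute differences (SAD).
    blk = cur[by:by + bsize, bx:bx + bsize]
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if 0 <= x and 0 <= y and x + bsize <= ref.shape[1] \
                    and y + bsize <= ref.shape[0]:
                sad = float(np.abs(blk - ref[y:y + bsize, x:x + bsize]).sum())
                if sad < best_sad:
                    best_sad, best = sad, (dx, dy)
    return best, best_sad  # the encoder would signal this MV

ref = np.random.default_rng(1).integers(0, 256, (64, 64)).astype(float)
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))  # content shifted right/down
print(motion_estimate(cur, ref, 16, 16, 16, 8))  # -> ((-3, -2), 0.0)
```

The zero SAD confirms a perfect match at an integer displacement; fractional-sample accuracy would add interpolation of the reference area before the comparison.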
Intra-picture prediction exploits the spatial redundancy that exists within a picture (hence "intra") by deriving the prediction for a block from already coded/decoded, spatially neighboring reference samples. This kind of prediction in the spatial sample domain was introduced with AVC, whereas previous standards used a simplified transform-domain prediction. In AVC, three different types of prediction modes are employed, "DC," planar, and angular, all of them using neighboring samples of previously decoded blocks that are to the left and/or above the block to be predicted. The first, the so-called DC mode, averages the neighboring reference samples and uses this value as a prediction for the entire block, that is, for every sample. The second, that is, the planar mode, models the samples to be predicted as a plane by position-dependent linear combinations of the reference samples. As the third option, the angular modes interpolate the reference samples along a specific direction/angle. For example, the vertical angular mode just copies the above reference samples along each column. HEVC extended these modes, for example, by increasing the number of angles from 8 to 33, whereas the most recent VVC standard not only further extended the number of modes but also incorporates new methods, such as a matrix-based intra-picture prediction (MIP), which was designed using machine learning [16]. Similar to motion information in inter-picture prediction, the encoder signals the estimated prediction information, that is, the intra-picture prediction mode, to the decoder.
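In code, the three mode families reduce to a few lines each. The sketch below is simplified (the standards use an extra reference sample beyond the block edge for planar prediction and integer arithmetic throughout); it fills a block from the reconstructed row above and the column to the left:

```python
import numpy as np

def intra_predict(above, left, size, mode):
    if mode == "dc":        # average of the neighbors fills the block
        return np.full((size, size), (above.sum() + left.sum()) / (2 * size))
    if mode == "vertical":  # angular example: copy the above row downward
        return np.tile(above, (size, 1))
    if mode == "planar":    # position-dependent blend of the references
        pred = np.empty((size, size))
        for y in range(size):
            for x in range(size):
                hor = (size - 1 - x) * left[y] + (x + 1) * above[size - 1]
                ver = (size - 1 - y) * above[x] + (y + 1) * left[size - 1]
                pred[y, x] = (hor + ver) / (2 * size)
        return pred
    raise ValueError(mode)

above = np.arange(100, 108, dtype=float)  # reconstructed row above the block
left = np.full(8, 100.0)                  # reconstructed column to the left
print(intra_predict(above, left, 8, "dc")[0, :3])  # [101.75 101.75 101.75]
```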
Transformation decorrelates a signal by transforming it from the spatial domain to a transformed domain (typically a frequency domain), using a suitable transform basis. Hybrid video coding standards apply a transform to the prediction residual (regardless of whether it comes from inter- or intra-picture prediction), that is, the difference between the prediction and the original input video signal, as shown in Fig. 2. In the transform domain, the essential information typically concentrates into a small number of coefficients. At the decoder, the inverse transform needs to be applied to reconstruct the residual samples. One example of a transform basis is the Karhunen–Loève transform (KLT), which is considered an optimal decorrelator but depends on correlation characteristics of the input signal that are ordinarily not known at the decoder. Another example is the discrete cosine transform (DCT), which has been used since H.261 for hybrid video compression and is also used in the well-known JPEG image compression standard (which was designed around the same time as H.261) [17]. The DCT decorrelates about as well as the KLT for highly-correlated auto-regressive sources and is easier to compute. In later standards starting with H.263 version 3 and AVC, integer-based reduced-complexity transforms are used that are often informally called DCTs although a true DCT uses trigonometric basis functions involving irrational numbers and supports additional factorizations. In order to account for different statistics in the source signal, it can be beneficial to choose between multiple transforms as in HEVC and VVC. Furthermore, applying an additional transform on the transform coefficients as in VVC can further decorrelate the signal.
Quantization aims to reduce the precision of an input value or a set of input values in order to decrease the amount of data needed to represent the values. In hybrid video coding, the quantization is typically applied to individual transformed residual samples, that is, to transform coefficients, resulting in integer coefficient levels. As can be seen in Fig. 2, this process is applied in the encoder. At the decoder, the corresponding process is known as inverse quantization or simply as scaling, which restores the original value range without regaining the precision. The precision loss makes quantization the primary element of the block diagram for hybrid video coding that introduces distortion. Quantization together with scaling can be seen as a rounding operation with a step size controlling the precision. In recent video coding standards, the step size is derived from a so-called quantization parameter (QP) that controls the fidelity and bit rate. A larger step size (larger QP) lowers the bit rate but also deteriorates the quality, which, for example, results in video pictures exhibiting blocking artifacts and blurred details. Typically, each sample is quantized independently, which is referred to as scalar quantization. In contrast to this, vector quantization processes a set of samples jointly, for example, by mapping a block onto a vector from a codebook. At least from the decoder perspective, all recent video coding standards prior to HEVC have employed only scalar quantization. HEVC includes a trick known as sign data hiding that can be viewed as a form of vector quantization, and VVC introduces dependent quantization (DQ), which can be interpreted as a kind of sliding-block vector quantization because the quantization of a sample depends on the states of previous samples. Advanced techniques for optimized encoding with prior standards can also be viewed as vector quantization while appearing to be scalar quantization from the decoder perspective.
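A minimal scalar quantizer illustrates the QP mechanism (this uses the well-known step-size rule of AVC/HEVC/VVC, where the step size doubles every six QP values; the rounding offsets and integer scaling of the real standards are omitted):

```python
def q_step(qp):
    # Step size doubles every 6 QP values; QP 4 corresponds to step 1.
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff, qp):        # encoder side: coefficient -> level
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / q_step(qp) + 0.5)

def scale(level, qp):           # decoder side: restores the value range,
    return level * q_step(qp)   # but not the lost precision

for qp in (22, 32, 42):
    print(qp, scale(quantize(100.0, qp), qp))  # error grows with QP
```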
Entropy coding assigns codewords to a discrete-valued set of source symbols by taking into account their statistical properties, that is, relative frequency. All recent video coding standards use variable-length coding (VLC) tables that assign shorter codewords to symbols with a higher frequency of occurrence in order to approach the entropy. The way to design codeword tables in earlier standards was based on Huffman coding (with minor adjustments). VLC is typically applied to encode and decode the vast majority of the data, including control data, motion data, and coefficient levels. AVC further improved the VLC scheme for coefficient level coding by using a context-adaptive VLC (CAVLC). A context is determined by the value or a combination of values of previous symbols, which can be used to switch to a VLC table designed for that context. Furthermore, AVC was the first video coding standard that introduced context-adaptive binary arithmetic coding (CABAC) as a second, more efficient entropy coding method. CABAC still uses VLC tables to map symbols, such as the coefficient levels, to binary strings (codewords). However, the binary strings are not written directly to the bitstream, but, instead, each bit in the binary string is further coded using binary arithmetic coding with context-adaptive probability models. Due to its high efficiency, CABAC has become the sole entropy coding method in the succeeding HEVC and VVC standards.
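The adaptation principle can be captured in a few lines. The toy model below tracks P(bin = 1) with an exponential decay (an invented update rule, far simpler than CABAC's table-driven state machines) and charges the ideal arithmetic-coding cost of each bin; it shows why coding biased bins costs much less than 1 bit each:

```python
import math

class ContextModel:
    # Toy context-adaptive probability model for one context.
    def __init__(self, p_one=0.5, rate=0.05):
        self.p_one, self.rate = p_one, rate

    def code(self, bit):
        # Ideal arithmetic-coding cost of the bin under the current
        # probability estimate, followed by the adaptation step.
        p = self.p_one if bit else 1.0 - self.p_one
        self.p_one += self.rate * (bit - self.p_one)
        return -math.log2(p)

ctx = ContextModel()
bins = [1] * 90 + [0] * 10                       # heavily biased source
print(round(sum(ctx.code(b) for b in bins), 1))  # well below 100 bits
```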
In-loop filtering is a filtering process (or combination of such processes) that is applied to the reconstructed picture, as illustrated in Fig. 2, where the reconstructed picture is the combination of the reconstructed residual signal (which includes quantization error) and the prediction. The reconstructed picture after in-loop filtering can be stored and used as a reference for inter-picture prediction of subsequent pictures. The name in-loop filtering is motivated by this impact on other pictures inside the hybrid video coding prediction loop. The main purpose of the filtering is to reduce visual artifacts and decrease reconstruction errors. H.263 version 2 is the first standard that used a deblocking in-loop filter, which became a core feature in version 1 of AVC. This filter was designed to be adaptive to the quantization fidelity, so it can attenuate the blocking artifacts introduced by the quantization of block-based prediction residuals while preserving sharp edges in the picture content. HEVC adds a second in-loop filtering stage called sample adaptive offset filtering, which is a nonlinear filter applied after deblocking to attenuate ringing and banding artifacts. In the emerging VVC standard, an adaptive loop filter (ALF) was introduced as a third filter, where, typically, the filter coefficients are determined by minimizing the reconstruction error using a Wiener filter optimization approach. Moreover, in VVC, another process known as luma mapping with chroma scaling (LMCS) can also be applied before the others in the in-loop processing stage.

The next two sections describe the recent developments made over earlier hybrid video coding designs for the HEVC standard and, in more detail, for the most recent VVC standard.

III. HIGH EFFICIENCY VIDEO CODING

The first version of the HEVC standard was finalized in January 2013 and approved as ITU-T H.265 and ISO/IEC 23008-2. At that time, new types of digital video and applications had been emerging. These include picture resolutions beyond HD, such as 4k/UHD, as well as wider color gamut and HDR, both of which require an increased bit depth from 8 to 10 bits per color component sample. At the same time, other formats, such as interlace-scanned video, became less relevant due to advances in display technology (with digital flat panels replacing analog cathode-ray tube displays). While AVC incorporates block-level features optimized for interlaced video, HEVC does not burden decoders with additional complexity for this and, instead, only provides a basic, yet efficient, picture-level method to encode interlaced video using the same set of block-level coding tools as for progressive-scan video.

A. First Version

The first version (v1) of HEVC generalized and improved hybrid video coding beyond the concepts of AVC with a focus on higher resolutions and improved coding efficiency, in general. The following provides an overview of the main features for each part of the hybrid video coding design and a brief description of its high-level picture partitioning and the interfaces to systems and transport layers. For a more detailed description of HEVC and a discussion of its coding efficiency, the reader is referred to [18] and [19].

1) Block Partitioning: As previously mentioned, HEVC introduces a flexible, quadtree-based partitioning scheme that includes larger block sizes. This partitioning scheme is characterized by the following elements.

Coding tree units and quadtree-based block partitioning: In AVC, as well as in previous standards since H.261, a macroblock represents the basic processing unit for further segmentation for the prediction and subsequent transform steps of the hybrid coding scheme. The size of the macroblock, which is the maximum size used in prediction, is fixed to 16 × 16 luma samples. The color video has three color component planes, so, in addition to the luma samples, the macroblock also has two blocks of chroma samples, which typically have half the width and half the height of the luma block, a sampling format known as the 4:2:0 chroma format. Other, less widely used formats are 4:4:4, in which the chroma planes have the same resolution as luma, and 4:2:2, in which the chroma has half the width but the same height as the luma. The monochrome video has only a single component plane and is sometimes called 4:0:0. With increasing picture resolution, homogeneous areas can cover areas larger than this, and the 16 × 16 size prevents such areas from being coded efficiently. Hence, increasing the maximum block size becomes important for coding higher-resolution video. In HEVC, the macroblock is replaced by the coding tree unit (CTU). The picture area that a CTU covers is selected by the encoder for the entire CVS and can be set to 16 × 16, 32 × 32, or 64 × 64 luma samples. The CTU constitutes the root of a coding quadtree that splits each CTU area recursively into four smaller square areas. The recursive splitting is signaled efficiently by sending a series of binary-valued splitting flags until a leaf node indication or a maximum allowed splitting depth is reached. In HEVC (and VVC), a unit contains blocks of …
2) Inter-Picture Prediction: … MV are differentially coded using an MV prediction (MVP) and an MV difference (MVD). In AVC, a single MVP is derived using either median or directional prediction from up to four already coded, spatially neighboring MVs. HEVC replaces the implicit derivation by explicitly signaling one of two potential MVP candidates that are derived from five spatially neighboring and two temporally colocated MVs, where "temporally colocated" refers to MVs used when coding a corresponding location in a particular previously decoded picture. This use of explicit signaling to select among MVP candidates is known as advanced MVP (AMVP). In both AVC and HEVC, MVP-based motion data coding still requires an indication of whether uniprediction or biprediction is applied and, for each MV, an indication of which stored reference picture it refers to. Two reference picture lists (RPLs) are constructed for inter-picture referencing purposes, called list 0 and list 1, where one picture from one list is used for performing uniprediction and one picture from list 0 and one from list 1 are used for biprediction. A reference picture in such a list is selected by an index into the list called a reference picture index. The so-called direct or skip modes in AVC do not signal any motion data; instead, the MVs and reference indices are derived from spatially and temporally neighboring blocks. The skip mode in unipredictive slices derives the list 0 MV from the MVP, and the list 0 reference picture index is 0, referring to the first reference picture in the list. In bipredictive slices, the spatial direct or skip modes derive list 0 and list 1 MVs and reference picture indices from spatially neighboring blocks, whereas the temporal direct or skip modes derive list 0 and list 1 MVs and reference indices from the temporally colocated block. The selection of the skip mode further indicates that the current block does not have a coded residual. HEVC replaces the direct and skip modes by introducing block merging, which derives motion data from one of five merging candidates. The candidates are derived from spatially or temporally neighboring blocks, and only a merge index is signaled to select among the merging candidates. This creates regions of equal motion data, thus enabling the joint coding of regions with equal motion across block boundaries from different quadtree branches. The combination of AMVP and the merge mode is quite effective at establishing a coherent motion representation in the decoded video. The skip mode in HEVC applies block merging without coded residual data.
3) Intra-Picture Prediction: In principle, HEVC intra-picture prediction employs the same types of modes as in AVC, namely DC, planar, and directional angular modes. The more flexible block structures with larger block sizes allow for the following main improvements.

Increased number of angles: From eight angles in AVC to 33 in HEVC for the directional prediction, exploiting the increased number of reference samples available with larger block sizes. The increase comes from adding bottom-left to top-right diagonal directions and using a finer resolution of angles, with a denser coverage around the horizontal and vertical directions. The prediction accuracy is also improved by using bilinear interpolation between the reference sample positions with 1/32 sample precision.

Improved most probable mode (MPM) coding: It is motivated by the increased number of prediction modes. In AVC, the prediction mode can be either signaled using a flag indicating to use the mode inferred from neighbors as the MPM or with a fixed-length code to select among the less probable modes. HEVC extends the MPM concept by constructing a list of three MPMs from the modes of the neighboring blocks to the left and above the current block. An MPM index indicates which MPM is selected, and in case a non-MPM mode is selected, a fixed-length code indicates one of the remaining 32 modes.
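The signaling asymmetry is easy to see in code. The sketch below follows the HEVC-style derivation of the three MPMs from the left and above modes (planar = 0, DC = 1, angular 2–34) and returns either a short MPM index or a 5-bit index among the non-MPM modes; the exact binarization of the real standard is omitted, and the non-MPM re-indexing is a simplification:

```python
def signal_intra_mode(mode, left, above):
    PLANAR, DC, VER = 0, 1, 26
    if left == above:
        mpm = [PLANAR, DC, VER] if left in (PLANAR, DC) else \
              [left, 2 + (left + 29) % 32, 2 + (left - 1) % 32]  # +/-1 angles
    else:
        third = PLANAR if PLANAR not in (left, above) else \
                (DC if DC not in (left, above) else VER)
        mpm = [left, above, third]
    if mode in mpm:
        return ("mpm_index", mpm.index(mode))   # cheap to code
    rest = sorted(m for m in range(35) if m not in mpm)
    return ("fixed_5bit", rest.index(mode))     # one of 32 remaining modes

print(signal_intra_mode(10, 10, 10))  # ('mpm_index', 0)
print(signal_intra_mode(26, 10, 10))  # ('fixed_5bit', 23)
```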
4) Transform and Quantization: As mentioned earlier, the introduction of the coding quadtree with nested RQT allows variable power-of-2 transform sizes from 4 × 4 to 32 × 32. As in AVC, integer transforms are applied to avoid implementation-dependent precision issues. The 2-D core transforms in HEVC are integer approximations of scaled DCT basis functions, realized by applying 1-D transforms sequentially for rows and columns. The basis functions for all four DCT-based integer transforms have been designed such that they can be extracted from those of the 32-point transform by subsampling. Besides these new DCT-based integer transforms, the following additional transform-related features are introduced in HEVC.

Discrete sine transform (DST) replaces the DCT for prediction residuals resulting from directional intra-picture prediction when the block size is 4 × 4. It was found that the DST provides better energy compaction in cases where the prediction error increases with increasing distance from one of the block boundaries, which is typically the case for intra-picture prediction due to increasing distance from the reference boundary. Like the DCT, the DST is also simplified and incorporated as a 2-D separable transform. Its bases are integer approximations of scaled DST basis functions. Due to the limited compression benefit for larger block sizes and associated implementation complexity, the DST is restricted to 4 × 4 luma TBs.

Transform skip is another mode that skips the transform step and, instead, directly quantizes and codes the residual samples in the spatial domain. For certain video signals, such as computer-generated screen content with sharp edges, applying a transform can, sometimes, decrease the coding efficiency. Skipping the transform for such content addresses this issue and can also avoid "ringing" artifacts.

Transform and quantization bypass allows an encoder to skip both the transform and quantization to enable mathematically lossless end-to-end coding. A CU-level flag controls this mode, thereby enabling efficient regionwise lossless coding.

Sign data hiding is used to conditionally skip the signaling of the sign bit of the first nonzero coefficient in a 4 × 4 subblock. The sign bit is inferred from the parity of the sum of the coefficient amplitudes when it is not coded. To implement this, the encoder needs to select one coefficient and alter its amplitude in cases where the parity does not indicate the correct sign of the first coefficient.
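A decoder-side sketch of this parity rule (assuming the common convention that an even sum of amplitudes means a positive first sign):

```python
def decode_signs(levels, coded_signs):
    # `levels` are the decoded magnitudes of a 4x4 subblock in scan
    # order; `coded_signs` carries one bit per nonzero level except the
    # first, whose sign is hidden in the parity of the amplitude sum.
    parity = sum(levels) & 1          # odd parity -> negative first sign
    signs, out, first = iter(coded_signs), [], True
    for mag in levels:
        if mag == 0:
            out.append(0)
        elif first:
            out.append(-mag if parity else mag)   # inferred, not coded
            first = False
        else:
            out.append(-mag if next(signs) else mag)
    return out

print(decode_signs([3, 0, 2, 1], coded_signs=[1, 0]))  # [3, 0, -2, 1]
```

The encoder's job, as described above, is to nudge one amplitude up or down whenever the natural parity would imply the wrong sign.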
5) Entropy Coding: The higher coding efficiency of the AVC CABAC entropy coding method compared with CAVLC motivated the decision to have CABAC as the only entropy coding method in HEVC. The basic CABAC design is the same as in AVC, with the following:

1) increased parsing throughput by reducing intersymbol dependencies, especially for parallel-processing hardware architectures;
2) memory reduction by reducing the number of contexts used to store and adapt probability models;
3) improved transform coefficient coding with coefficient scanning and context modeling designed for larger block sizes to increase the coding efficiency.

6) In-Loop Filtering: The in-loop filtering from AVC was kept in HEVC (with a slightly modified deblocking filter), and a nonlinear in-loop filter was added as an additional filtering stage, as follows.

Parallel processing friendly deblocking is enabled in HEVC by aligning the horizontal and vertical block edges, to which the deblocking filter is applied, on an 8 × 8 grid, in contrast to the 4 × 4 grid used in AVC. Given the maximum filtering extent of four samples on each side of an edge, each 8 × 8 block can be filtered in parallel.
Sample adaptive offset (SAO) is introduced in HEVC and consists of two selectable nonlinear filters that are designed to attenuate different artifacts in the reconstructed picture after deblocking. Both filters involve classifying samples and applying amplitude mapping functions that add or subtract offsets to the samples that belong to the same class. The first one, called edge offset, aims to attenuate ringing artifacts. Edge offset classifies each sample into one of five categories (flat area, local minimum, left or right edge, or local maximum) for four gradients (horizontal, vertical, and two diagonals). The second one is called band offset and is designed to attenuate banding artifacts. It subdivides the range of sample values (e.g., 0–255 for 8-bit video) into 32 equally spaced bands. For four consecutive bands, a band-specific offset value is added to each sample inside each of the four bands. The gradient direction for edge offset, the first of the four consecutive bands for band offset, and the four offset values are estimated at the encoder and signaled on a CTU basis.
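Band offset maps directly to a few lines of code. The sketch below uses the 32-band split implied by the text (band index = sample value right-shifted by bit depth minus 5) and applies the four signaled offsets:

```python
import numpy as np

def sao_band_offset(rec, start_band, offsets, bit_depth=8):
    # 32 equal bands over the sample range; only the four consecutive
    # bands starting at `start_band` receive an offset.
    band = rec >> (bit_depth - 5)          # band index per sample
    out = rec.astype(np.int64)
    for i, off in enumerate(offsets):      # the four signaled offsets
        out[band == start_band + i] += off
    return np.clip(out, 0, (1 << bit_depth) - 1)

rec = np.array([16, 40, 100, 130])  # bands 2, 5, 12, 16 (8 samples wide)
print(sao_band_offset(rec, start_band=4, offsets=[2, -1, 3, 0]))
# -> [ 16  39 100 130]: only the sample in band 5 is adjusted
```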
7) Systems and Transport Interfaces: HEVC inherited the basic systems and transport interface designs from AVC. These include the network abstraction layer (NAL) data unit syntax structuring, the hierarchical syntax relationships, the video usability information (VUI) and supplemental enhancement information (SEI) message mechanisms, and the video buffering model based on a hypothetical reference decoder (HRD). The hierarchical syntax and data unit structures consist of sequence parameter sets (SPSs), multipicture-level picture parameter sets (PPSs), slice-level header syntax, and lower level coded slice data. In the following, the systems and transport interface aspects in HEVC v1 that are essentially different from AVC are briefly summarized. An overview of the AVC designs on these aspects can be found in [3]. More details on the HEVC designs of these aspects can be found in [20]. For simplicity in this description, "HEVC" means HEVC v1, unless otherwise stated.

Random access support: Random access refers to starting the decoding of a bitstream from a picture that is not the first picture in the bitstream in decoding order. To support tuning in and channel switching in broadcast/multicast and multiparty video conferencing, seeking in local playback and streaming, and stream adaptation in streaming, the bitstream needs to include relatively frequent random access points that are typically intra-picture coded pictures but may also be inter-picture coded pictures (e.g., in the case of gradual decoding refresh (GDR) as further discussed in the following).

HEVC includes the signaling of intra random access point (IRAP) pictures in the NAL unit header through NAL unit types. Three types of IRAP pictures are supported, namely instantaneous decoder refresh (IDR), clean random access (CRA), and broken link access (BLA) pictures. IDR pictures constrain the inter-picture prediction structure to not reference any picture before the current GOP and are conventionally referred to as closed-GOP random access points. CRA pictures are less restrictive by allowing certain pictures to reference pictures that precede the current GOP, all of which are discarded in the case of random access. CRA pictures are conventionally referred to as open-GOP random access points. BLA pictures usually originate from splicing together two bitstreams or parts thereof, at a CRA picture, for example, during stream switching. To enable better systems usage of IRAP pictures, altogether six different NAL unit types are defined to signal the properties of the IRAP pictures, which can be used to enable various types of bitstream access points, such as those defined in the ISO base media file format (ISOBMFF) [21], which are used for random access support in dynamic adaptive streaming over HTTP (DASH) [22].

Video parameter set (VPS): A new type of parameter set, called the VPS, was introduced in HEVC. Although introduced in HEVC v1, the VPS is especially useful to provide a "big picture" of the characteristics of a multilayer bitstream, including what types of operation points are provided, the profile, tier, and level (PTL) of the operation points, layer dependence information, and so on.

Temporal scalability support: HEVC supports temporal scalability (e.g., for extracting lower frame-rate video from a high-frame-rate bitstream) by signaling a temporal ID variable in the NAL unit header and imposing a restriction that pictures of a particular temporal sublayer cannot be used for inter-picture prediction referencing by pictures of a lower temporal sublayer. A subbitstream extraction process is also specified, with a requirement that each
subbitstream extraction output must be a conforming bitstream. Media-aware network elements (MANEs) can use the temporal ID in the NAL unit header for stream adaptation purposes based on temporal scalability.
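The extraction process itself is conceptually a one-line filter. In the sketch below, a NAL unit is modeled as a (temporal_id, payload) pair, which is an assumption for illustration; the referencing restriction quoted above is what guarantees that the output remains decodable:

```python
def extract_sublayers(nal_units, max_tid):
    # Drop every NAL unit whose temporal ID exceeds the target
    # sublayer; what remains is a conforming, lower-frame-rate stream.
    return [(tid, data) for tid, data in nal_units if tid <= max_tid]

# Hierarchical GOP: tid 0 = base rate, tid 1 doubles it, tid 2 doubles again
stream = [(0, "I0"), (2, "b1"), (1, "B2"), (2, "b3"), (0, "P4")]
print(extract_sublayers(stream, max_tid=1))  # [(0,'I0'), (1,'B2'), (0,'P4')]
```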
Profile, tier, and level: In order to restrict the feature set to be supported for specific applications, video coding standards define so-called profiles. HEVC v1 defines the following three profiles: 1) the main profile that is restricted to support only the 4:2:0 chroma format and a bit depth of 8 bits per sample; 2) the Main 10 profile that is based on the main profile with the supported bit depth extended to 10 bits per color component; and 3) the main still picture profile that is also based on the main profile but restricted to have only one picture in a bitstream. In addition to profiles, HEVC also defines so-called levels and tiers. A level imposes restrictions on the bitstream based on the values of syntax elements and their arithmetic combinations, for example, as combinations of spatial resolution, bit rate, frame rate, and picture buffering capacity. The AVC and HEVC level specifications are generally similar in spirit, with a couple of notable differences: 1) a smaller number of levels is specified in HEVC than in AVC, particularly for the levels with lower picture resolution limits; and 2) the highest supported frame rate for operation with picture sizes that are relatively small is 172 frames/s for AVC in most levels, while, for HEVC, this is increased to 300 frames/s. Both of these differences are in response to the general trend of video picture resolutions and frame rates becoming higher as time passes. The concept of tiers was newly introduced in HEVC, mainly to establish high bit rate capabilities for video contribution applications that require higher quality than video distribution applications.
Hypothetical reference decoder: AVC specifies a buffer flow model using picture-based HRD operations with a picture being contained in an access unit (AU) with specified timing. In HEVC, for improved support of ultralow-delay applications, an alternative mode of HRD operation was introduced, which operates on smaller units of data. It specifies a conforming behavior for encoders to send only part of a picture as a decoding unit (DU) with accompanying timing information before the encoding of the remaining areas of the same picture, as well as for decoders to be able to use the timing of DUs to start decoding the received areas before receiving the remaining parts of the picture.

8) High-Level Picture Partitioning: In AVC, the coded macroblocks of a picture are grouped together in slices, each of which can be decoded independent of the other slices in the same picture. When introduced, one of the main purposes of slices was for maximum transfer unit (MTU) size matching for improved channel loss resilience although they could be useful for parallel encoding as well. In HEVC, the basic slice concept was kept, with slices that group together consecutive CTUs in raster-scan order. The more complex slice concepts of flexible macroblock ordering and arbitrary slice ordering have not been widely embraced by industry and, thus, were not carried over from AVC. Instead, new concepts have been introduced to HEVC, which mainly facilitate parallel processing (an important feature given that HEVC is designed for higher-resolution videos).

Tiles represent an alternative, rectangular grouping of CTUs to divide a picture into tile rows and tile columns. The tiles in a picture are processed in raster-scan order, and the CTUs in each tile are processed in raster-scan order within the tile before the CTUs in the next tile are processed. A slice can either contain an integer number of complete tiles such that all the tiles share the same slice header (SH) information, or a tile can contain an integer number of slices with each of these slices being a subset of the tile. The original intent of tiles was enabling parallel encoding and decoding for higher-resolution video [23]. However, with emerging 360° immersive videos, tiles turned out to also be useful for omnidirectional video streaming when used in combination with encoder restrictions and metadata [24]. If an encoder restricts the MVs that it uses to avoid referring to any regions of the reference pictures that are outside of a particular set of tiles, the slices containing these tiles can still be decodable if this set of tiles is extracted from each picture in the bitstream. Such a set is known as a motion-constrained tile set (MCTS). Recent system-level functionalities, especially for immersive videos, have made extensive use of MCTSs.

Wavefront parallel processing (WPP) allows multiple CTU rows to be processed in parallel for decoding (or encoding). When WPP is enabled, the internal state of the CABAC context variables is not carried over to the start of a CTU row from the right-most CTU in the previous row, but rather from the second CTU in the previous row. This allows the decoder (or encoder) to start processing the next row with a two-CTU offset [25]. It should be noted that the WPP term does not appear in the HEVC specification since it is a matter of implementation choice whether the decoder (and/or encoder) actually takes advantage of the feature's parallelism opportunity; in the standard, this is called entropy coding synchronization.
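The two-CTU offset turns decoding into a wavefront, which the following toy scheduler makes visible (the dependency rule is the one described above; everything else, such as decoding every ready CTU in one "step," is a simplification):

```python
def ready(done, r, c, cols):
    # A CTU needs its left neighbor and, for the inherited CABAC state
    # and above-neighbor data, the CTU above and to the right
    # (a two-CTU lead over the row above).
    left_ok = c == 0 or done[r][c - 1]
    above_ok = r == 0 or done[r - 1][min(c + 1, cols - 1)]
    return left_ok and above_ok

rows, cols = 4, 8
done = [[False] * cols for _ in range(rows)]
steps = 0
while not all(all(row) for row in done):
    for r, c in [(r, c) for r in range(rows) for c in range(cols)
                 if not done[r][c] and ready(done, r, c, cols)]:
        done[r][c] = True          # all ready CTUs decode in parallel
    steps += 1
print(steps)  # 14 steps instead of the 32 a serial decoder would need
```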
Dependent slice segments have been introduced to provide a separate framing of a coded slice into multiple NAL units. A slice is split into an initial, independent slice segment that contains a full SH and subsequent dependent slice segments that each contain an abbreviated SH [20]. Dependent slice segments are particularly useful for MTU size matching in systems that limit the maximum amount of data in an NAL unit or in combination with WPP, where each CTU row can be packed and transmitted in a dependent slice segment.

B. Extensions

The first version of HEVC was limited to video signals in 4:2:0 chroma format with up to 10 bits per sample and was optimized for consumer-oriented applications with 2-D, single-layer camera-captured content in
the YCbCr color space. In October 2014, the second version (v2) of HEVC was finalized, in which the format range extensions (RExt) add support for more demanding higher quality applications [26], the multilayer extensions for scalability [27], and 3-D multiview video coding [28]. The third version (v3) of HEVC was finalized in February 2015 and added support for combined coding of 3-D video with depth maps [28]. In February 2016, the last major extension, for the coding of screen content material [29], was added in the fourth version (v4). A short summary of these extensions is given in the following.

1) Range Extensions (RExt): The main goal of the HEVC range extensions was to extend the 4:2:0 8–10-bit consumer-oriented scope of HEVC v1 by supporting high-quality distribution broadcast (4:2:0, 12 bit), contribution (4:2:2, 10 and 12 bit), production and high-fidelity content acquisition (4:4:4, 16 bit, RGB, high bit rate), medical imaging (4:0:0 monochrome, 12–16 bit, near-lossless), alpha channels and depth maps (4:0:0 monochrome, 8-bit), high-quality still pictures (4:4:4, 8–16 bit, arbitrarily high picture size), and many other applications. The modifications introduced by RExt can be divided into the following three categories.

Video format modifications to support chroma formats beyond 4:2:0 and bit depths beyond 10 bits per sample have been kept to a minimum. Here, a rather conservative approach to support the 4:2:2 and 4:4:4 chroma formats without diverging unnecessarily from HEVC v1 was chosen. The modifications include the extension of TB partitioning with existing syntax and transform logic, as well as the adjustment of the intra-picture prediction angles to support the nonsquare rectangular blocks occurring in the 4:2:2 chroma format. For higher bit depths, only the SAO and interpolation precision are extended.

Coding efficiency improvements for extended formats, lossless, and near-lossless coding are achieved by means of modified HEVC v1 tools, as well as by introducing new tools. From HEVC v1, mainly, the transform skip mode was extended to larger block sizes and coupled with a modified residual coding (with a separate CABAC context model and residual rotation). Apart from that, RExt includes three new tools to increase coding efficiency: adaptive chroma QP offset allows more flexibility in chroma quantization, cross-component prediction (CCP) exploits remaining statistical redundancies between luma and chroma channels for 4:4:4 video by predicting the chroma spatial residuals from luma using a linear model, and residual differential pulse code modulation (RDPCM) aims to reduce remaining redundancies in the spatial residual signal when the transform is skipped.

Precision and throughput optimizations for very high bit rates and bit depths are achieved mainly by two methods. First, extended precision for the transform coefficients and inverse transform processing enables efficient coding with high bit depths. Second, a modification of CABAC allows decoding multiple coded bits with a single bit-masking and shift operation and can be enabled for increasing the CABAC parsing throughput at very high bit rates.

2) Scalable HEVC Extensions (SHVCs): In HEVC v2, the temporal scalability from v1 is extended by spatial, quality, bit depth, and color gamut scalability, as well as the combinations of these. The scalability is based on a multilayer architecture that relies on multiple single-layer HEVC v1 decoders, that is, it does not modify block-level decoding tools. The reconstruction of a higher enhancement layer from a lower layer, for example, reconstructing UHD from an HD base layer for spatial scalability, is enabled through picture referencing with added interlayer reference picture-processing modules, including texture and motion resampling and color mapping. On the one hand, this allows reusing HEVC v1 decoder cores but, on the other hand, implementing an SHVC-compliant decoder with this architecture increases processing requirements by needing multiple HEVC v1 cores plus the additional modules.

3) Multiview (MV-HEVC) and 3-D Extensions (3-D-HEVC): Based on the same multilayer design introduced in HEVC v2 together with the scalable extension, the multiview and 3-D extensions significantly improve the coding of 3-D video compared with simulcast or frame packing with HEVC v1. Similar to the AVC multiview extension, in MV-HEVC (in v2 of HEVC), each view of a picture is coded in a separate layer with interlayer prediction. 3-D-HEVC (in v3 of HEVC) extends this by coding the view plus its depth map, which allows rendering additional intermediate views. Especially for the depth map coding, statistical dependencies between video texture and depth maps are exploited. This introduces new block-level coding tools, which requires new decoder cores for 3-D-HEVC compared with HEVC v1.

4) Screen Content Coding (SCC) Extensions: Applications such as screen sharing and gaming are mainly based on computer-generated or mixed content. All video coding standards up to HEVC v1 had been mainly designed for camera-captured video, which results in suboptimal exploitation of the different signal characteristics present in screen content. These characteristics are exploited in HEVC SCC (in version 4 of HEVC) by introducing new tools, including intra-picture block copy (IBC), palette mode, adaptive color transform (ACT), and adaptive MV resolution (AMVR). Further detail on these tools is provided in Section IV-B7, as VVC contains a rather similar design for these aspects.

IV. VERSATILE VIDEO CODING

This section describes the most recent standard, VVC, in more detail. It is formally approved as ITU-T H.266 and ISO/IEC 23090-3. The VSEI standard, that is, ITU-T H.274 and ISO/IEC 23002-7, specifying the VUI and some of the SEI messages used with VVC bitstreams,
was developed and approved at the same time [7]. For HEVC and AVC, these aspects are specified directly within the same video coding standard that specifies the coding tools. Apart from achieving major bit-rate savings over its HEVC and AVC predecessors for camera-content video sequences, VVC was designed to provide and improve functionalities and coding efficiency for a broadened range of existing and emerging applications, including:

1) Video beyond the standard and high definitions is greatly improved by using more flexible and larger block structures (see Section IV-B1) for higher resolutions and by a luma adaptive deblocking filter designed for HDR video characteristics (see Section IV-B6). Furthermore, profiles that support chroma formats beyond 4:2:0, such as 4:2:2 and 4:4:4, are defined already in the first version of VVC (see Section IV-C8).
2) Computer-generated or screen content motivated the inclusion of techniques derived from the HEVC SCC extensions, such as IBC, block-level differential pulse code modulation (BDPCM), ACT, palette mode coding, and full-sample adaptive MV precision, as well as an alternative residual coding for transform skip modes (see Section IV-B7).
3) Ultralow-delay streaming is facilitated by built-in GDR handling that can avoid bit rate peaks introduced by intra-picture coded pictures and virtual boundaries for improved support of GDR (see Section IV-C1).
4) Adaptive streaming with resolution changes benefits from reference picture resampling (RPR) (see Section IV-C6), which allows switching resolutions within a CVS by resampling reference pictures to the picture resolution of the current picture for the purpose of inter-picture prediction.
5) 360° video for immersive and augmented reality applications is efficiently coded by the motion-compensated prediction that can wrap around picture boundaries, by disabling in-loop filtering across virtual boundaries (see Section IV-B8), and by subpictures with boundary padding (see Section IV-C5).
6) Multilayer coding is supported already in the first version of VVC using a complexity-constrained, single-layer-friendly approach that enables temporal, spatial, and quality scalabilities, as well as multiview coding (see Section IV-C7).

In the following, the initial steps toward establishing a new standardization project with compression efficiency beyond HEVC, as well as a short review of the VVC standard development, are covered in Section IV-A. Then, the novel coding tools in VVC that contribute to the overall bit-rate savings are described in Section IV-B. Finally, advances and novelties in the systems and transport interfaces are presented in Section IV-C.

A. Standardization and Development

The development of VVC can be split into two phases, which are summarized in the following. The first phase was the exploration phase, which started in 2015, primarily focusing on investigating the potential for increased coding efficiency without as much consideration of practical complexity constraints. The exploration phase provided evidence that technology with sufficient compression capability beyond HEVC existed, justifying the start of the official standardization phase (the second phase) spanning from 2018 to 2020. This phase targeted to maintain and even increase the coding efficiency while taking implementation and complexity aspects into full consideration and fulfilling a broadened range of application scope.

1) Exploration Phase (2015–2017): The need for even more efficient compression than the current HEVC standard motivated ITU-T VCEG and ISO/IEC MPEG to study the potential in 2014 and to join forces again in October 2015 for exploring coding technology beyond HEVC in a new team called the Joint Video Exploration Team (JVET). Based, initially, on VCEG key technical area (KTA) software that began being developed in January 2015, by the end of 2017, the JVET had developed the joint exploration model (JEM) software codebase [30], which demonstrated up to 30% bit-rate reduction compared with HEVC.

The coding efficiency improvements achieved in this exploration effort were considered sufficient evidence to issue a formal Joint Call for Proposals (CfP) for new video coding technology in October 2017, and it was agreed that, once the drafting of a formal standard began, the joint team would be renamed to reflect its change of mission, becoming the Joint Video Experts Team, without changing its JVET abbreviation.

2) Standardization Phase (2018–2020): The CfP attracted the submission of proposals from 32 organizations for the coding of three categories of video content: standard dynamic range (SDR), HDR, and 360° video [31]. An independent subjective evaluation conducted in April 2018 showed that all submissions were superior in terms of subjective quality to HEVC in most test cases and that several submissions were superior to the technology previously explored in the JEM framework in a relevant number of cases. Starting with the analysis of the best-performing proposals among all the submissions, the VVC development started in April 2018 with the first draft of the specification document and test model software. After a large number of coding tools had been on the table from the CfP, it was decided to start with a "clean slate" approach. This first draft only included an advanced quadtree with multitype tree (QT+MTT) block partitioning, which was identified as a common element among almost all proposals, and its implementation would heavily affect the design of all other block-based coding tools. On top of that, more coding tools from the CfP responses and new ones were studied extensively in …

Table 1. Overview of Coding Tools in HEVC and VVC.
B. Coding Tools

VVC applies the classic block-based hybrid video coding architecture known from its predecessors. Although the same framework is applied, novel tools are included in each basic building block to further improve the compression.

Table 1 provides an overview of the coding tools in HEVC version 1 and VVC version 1. In the following, the VVC tools will be explained in more detail.
1) Block Partitioning: In VVC, the QT+MTT scheme using quaternary splits followed by binary and ternary splits for its partitioning structure replaces the quadtree with multiple partition unit types that were used in HEVC, that is, it removes the concept of splitting a CU into PUs and TUs and provides a more flexible CU partitioning. Rectangular PU shapes are replaced by rectangular CU shapes resulting from binary and ternary tree splits. The RQT-based TU partitioning is removed as well, and multiple TUs in a CU can only occur from an implicit split of CUs that have a larger size than the maximum transform length and from CUs with intra sub-partitions (see Section IV-B3). Furthermore, the maximum CTU size is increased to 128 × 128 luma samples, and the maximum supported transform length is increased to 64. This tree-based CU partitioning scheme forms the block partitioning structure for VVC, together with sometimes using a separate tree for the chroma components and easing implementation with the concept of virtual pipeline data units, as will be further described in the following.

Coding quadtree with multitype tree: A CTU is first partitioned by a quadtree structure. Then, the quadtree leaf nodes can be further partitioned by a multitype tree structure. There are four splitting types in the multitype tree structure: vertical binary splitting, horizontal binary splitting, vertical ternary splitting, and horizontal ternary splitting. The multitype tree leaf nodes are called CUs, and unless the CU is too large for the maximum transform length, this segmentation is used for the prediction and transform processing without any further partitioning. This means that, in most cases, the CU, PU, and TU have the same block size in the QT+MTT coding block structure. Other than when the CU is too large for the maximum transform size, exceptions also occur when intra sub-partitions (see Section IV-B3) or subblock transforms (SBTs) (see Section IV-B4) are employed. This also means that VVC supports nonsquare TBs in addition to square ones. Fig. 5 shows a CTU divided into multiple CUs with a QT+MTT coding block structure, where the solid block edges represent quadtree partitioning and the dotted edges represent multitype tree partitioning with either binary or ternary splits. The size of the CU may be as large as the CTU or as small as 4 × 4 in units of luma samples. The QT+MTT partitioning provides a very flexible block structure to adapt to the local characteristics, as can be seen in the example overlay in Fig. 6. Furthermore, at the leaf node of the multitype tree, there
is an option to further split a CU into two nonrectangular prediction block partitions in the case of inter-picture prediction, selecting one of 64 geometric partitioning modes (see Section IV-B2).

Chroma separate tree: In VVC, the coding tree scheme supports the ability for luma and chroma to use separate partitioning tree structures. For inter-picture coded slices, the luma and chroma CTBs in one CTU have to share the same coding tree structure. However, for intra-picture coded slices, the luma and chroma can have separate trees. When the separate tree mode is applied, the luma CTB is partitioned into CUs by one QT+MTT structure, and the chroma CTBs are partitioned into CUs by another QT+MTT structure. This means that, when the video is not monochrome, a CU in an intra-picture coded slice may consist of a coding block of the luma component only, coding blocks of two chroma components only, or coding blocks of all three components, whereas a CU in an inter-picture coded slice always consists of coding blocks of all three color components.

Local dual-tree: In typical video encoder and decoder implementations, the average processing throughput drops when many small blocks (more specifically, small intra-picture coded blocks since these need to be decoded sequentially) are present in the coded picture. In the single-coding tree structure, a CU can be as small as 4 × 4 in units of luma samples, which results in 2 × 2 chroma coding blocks if the video uses 4:2:0 sampling. To avoid such small chroma blocks, chroma intra-picture coded blocks with a size of less than 16 chroma samples or with a 2 × N size are prevented by locally using a separate tree for the chroma when necessary.
Virtual pipeline data units (VPDUs) are block units in a picture that need to be held in memory for processing while decoding. In hardware decoders, successive VPDUs can be processed by operating multiple pipeline stages at the same time. The VPDU size would be roughly proportional to the memory buffering size in most pipeline stages, so it is important to keep the VPDU size reasonably small. In the VVC QT+MTT scheme, ternary tree and binary tree splits for CUs with the size of 128 × 128 luma samples could have led to a VPDU size that was considered too difficult to support. In order to keep the VPDU size at 64 × 64 luma samples, normative partitioning restrictions (with syntax signaling modification) are applied, disallowing certain splits for CUs with width or height equal to 128, as shown by dashed lines in Fig. 7. The VPDU concept was used to establish these implementation-oriented split restrictions but is not explicitly discussed in the standard.
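The effect of these restrictions can be expressed as a geometric check: every block produced by a split must either fit inside a single 64 × 64 VPDU or entirely cover each VPDU it touches. The formulation below is illustrative only (the standard instead enumerates the disallowed split modes directly):

```python
def vpdu_ok(x, y, w, h, vpdu=64):
    # Block lies inside one VPDU, or fully covers every VPDU it touches.
    inside_one = (x // vpdu == (x + w - 1) // vpdu and
                  y // vpdu == (y + h - 1) // vpdu)
    covers_whole = (x % vpdu == 0 and y % vpdu == 0 and
                    w % vpdu == 0 and h % vpdu == 0)
    return inside_one or covers_whole

def children(x, y, w, h, split):
    return {"bt_hor": [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)],
            "bt_ver": [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)],
            "tt_hor": [(x, y, w, h // 4), (x, y + h // 4, w, h // 2),
                       (x, y + 3 * h // 4, w, h // 4)],
            "tt_ver": [(x, y, w // 4, h), (x + w // 4, y, w // 2, h),
                       (x + 3 * w // 4, y, w // 4, h)]}[split]

def split_allowed(x, y, w, h, split):
    return all(vpdu_ok(*c) for c in children(x, y, w, h, split))

print(split_allowed(0, 0, 128, 128, "bt_ver"))  # True: two 64x128 halves
print(split_allowed(0, 0, 128, 128, "tt_ver"))  # False: 32-wide pieces
print(split_allowed(0, 0, 128, 64, "bt_hor"))   # False: 128x32 pieces
```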
interpolation filter (IF) for luma fractional positions and the four-tap IF for chroma fractional positions are also used. On top of these core features, new coding tools are introduced in VVC for increasing the efficiency of inter-picture prediction. VVC introduces subblock-based motion inheritance, in which the current CU is divided into subblocks of equal size (8 × 8 luma samples) and the MV for each subblock is derived based on temporally colocated blocks in a reference picture. Merge mode with additional MVD coding is added to further enhance the efficiency of the merge mode. A local CU-based affine motion model is used to represent higher-order motion, such as scaling and rotation, where only one set of parameters is coded per CU, while the motion compensation is performed individually per 4 × 4 subblock using six-tap IFs. VVC also increases the MV precision to 1/16 luma sample in some modes, such as the affine mode, to improve the prediction efficiency for video content with locally varying and nontranslational motion, while HEVC uses only quarter-luma-sample precision. On top of the higher precision MV representations, a block-level AMVR method is applied to customize the balance between the prediction quality and the bit cost overhead for MV signaling. The geometric partitioning mode splits a CU into two nonrectangular partitions to better match motion at object boundaries. The biprediction with CU-level weights (BCW) mode extends simple averaging to allow weighted averaging of the two prediction signals at the block level. To further improve the prediction quality, decoder-side MV refinement (DMVR) and bidirectional optical flow (BDOF) are introduced, which improve the motion compensation without increasing bit overhead. Finally, VVC provides a mode for combining inter-picture and intra-picture prediction to form the final prediction.

For a CU coded in merge mode, a merge candidate list is constructed, and an index is signaled to specify which candidate MVP is used to form the prediction. In VVC, the merge candidate list consists of five types of candidates in the following order: 1) MVPs from spatial neighboring CUs; 2) temporal MVP (TMVP) from colocated CUs; 3) history-based MVP from an FIFO table; 4) pairwise average MVP; and 5) zero MVs. The length of the merge list is signaled in the SPS, where the maximum allowed length is 6. The way MVs from spatial neighboring CUs and colocated CUs are used is identical to the way that these are handled in the HEVC merge candidate list.

History-based MV prediction (HMVP) provides candidates beyond the local spatial–temporal neighborhood to allow usage of MV information from CUs that are more remote. The HMVP candidates can be used in both the merge and AMVP candidate list construction processes. The motion information of previously coded blocks is stored in a table of MVP candidates for the current CU. The table with multiple HMVP candidates is maintained during the encoding/decoding process and is reset (all candidates removed) when a new CTU row is encountered. Whenever there is an inter-picture coded CU, excluding the CUs coded with affine mode, geometric partitioning, or subblock-based TMVP, the associated motion information is added to the table in a first-in-first-out (FIFO) manner. The HMVP table size is 6.

The pairwise average MVP candidate is generated by averaging the MVs of the first two candidates in the existing merge candidate list. The averaged MVs are calculated separately for each RPL. When the merge list is not full after the pairwise average merge candidate is added, zero MVPs are appended at the end until the maximum merge candidate number is reached.
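As an illustration of how the five candidate types interact, the following sketch assembles a merge list in the order given above. The spatial and temporal derivations are passed in as inputs, motion information is reduced to a single MV per candidate, and all names are illustrative rather than spec syntax; the per-RPL handling and pruning details of the standard are omitted.

from collections import deque

MAX_MERGE = 6                  # maximum merge list length (signaled in the SPS)
hmvp_table = deque(maxlen=6)   # FIFO; cleared at each new CTU row (not shown)

def update_hmvp(mv):
    # called after each regular inter CU (not affine, GPM, or SBTMVP coded)
    hmvp_table.append(mv)

def build_merge_list(spatial_mvps, temporal_mvp):
    cand = list(spatial_mvps)                    # 1) spatial MVPs
    if temporal_mvp is not None:
        cand.append(temporal_mvp)                # 2) TMVP
    for mv in reversed(hmvp_table):              # 3) HMVP, most recent first
        if len(cand) >= MAX_MERGE - 1:
            break
        cand.append(mv)
    if len(cand) >= 2:                           # 4) pairwise average of the
        (x0, y0), (x1, y1) = cand[0], cand[1]    #    first two candidates
        cand.append(((x0 + x1) // 2, (y0 + y1) // 2))
    while len(cand) < MAX_MERGE:                 # 5) zero-MV padding
        cand.append((0, 0))
    return cand[:MAX_MERGE]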
Subblock-based temporal MVP (SBTMVP): TMVP in merge mode inherits one set of motion information from a temporally colocated CU. The SBTMVP method in VVC allows inheriting the motion information from the colocated picture at a finer granularity, that is, in units of 8 × 8 subblocks. This requires storing the MVs of the colocated picture on an 8 × 8 luma sample grid (in contrast to a 16 × 16 grid in HEVC). SBTMVP attains MVPs for the subblocks within the current CU in two steps. In the first step, the motion displacement used to determine the colocated CU is set to the MV of the neighboring CU to the left if that CU uses the colocated picture as its reference picture. Otherwise, it is set to (0, 0). In the second step, the MVP for each subblock is derived from the MV of its corresponding subblock inside the colocated CU from the first step.

Merge with MVD (MMVD): The VVC merge mode is extended by allowing an MVD to be signaled, which only allows a small number of difference values and, therefore, has less bit overhead than AMVP. When one of the first two merge candidates is selected for a CU, an MVD can be signaled to further refine the MV. A set of MVD ranges is predefined, and an index is signaled to indicate how far the final MV can deviate from the predicted MV.

Symmetric MVD (SMVD): When the motion of the current block is on a constant motion trajectory between a temporally past and a temporally future reference picture in display order, the corresponding MVs and reference picture indices tend to be symmetrical. SMVD exploits this to save bits for MVD and reference picture index signaling. When SMVD is applied for a CU, only the MVD for list 0 is signaled. The MVD for list 1 is set to the reverse of the list 0 MVD, and the list 0 and list 1 reference picture indices are implicitly derived at the slice level.

Adaptive MV resolution (AMVR): In inter-picture prediction, MVs with higher resolution, that is, higher fractional sample position accuracy, usually lead to better prediction and, thus, smaller residual energy. However, more bits are required to represent the MVs with higher accuracy. In the HEVC SCC extension, the precision of the MVs is switchable at the slice level between a quarter of a luma sample, as in HEVC v1, and integer luma sample precision. The benefit of being able to select integer luma sample precision is clear for SCC (e.g., for computer desktop screen sharing), where the motion in the computer graphics synthesis often uses only integer sample displacements. In such an instance, the integer-only option avoids wasting bits on sending fractional precision that is not needed. However, to enable a more flexible adaptation for camera-captured video, mixed content, and screen content, a CU-level AMVR scheme is supported in VVC. MVDs of a CU with translational motion in AMVP mode can be coded in units of quarter luma samples, half luma samples, integer luma samples, or four luma samples. For the affine AMVP mode, MVDs can be switched among quarter, integer, or 1/16 luma samples. In the case of IBC (see Section IV-B7), the precision of the block displacement vectors can either be integer or four luma samples. In order to ensure that the final MV (i.e., the sum of the MVP and MVD) uses the same precision as the MVD, the MVP is rounded to the indicated precision. With CU-level switching of MV resolution, a good tradeoff between prediction quality and MV bit overhead can be achieved. The CU-level MV resolution indication is conditionally signaled if the current CU has at least one nonzero MVD component. When half-luma-sample MV accuracy is used in AMVP mode, a six-tap smoothing IF (SIF) is used instead of the eight-tap IF from HEVC.
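The MVP rounding step can be sketched as follows, with MVs kept in 1/16-luma-sample units as in VVC; the rounding convention for negative values is simplified relative to the specification.

# right-shifts from 1/16-luma-sample units for each AMVR precision
SHIFT = {'1/16': 0, 'quarter': 2, 'half': 3, 'integer': 4, 'four_sample': 6}

def round_to_amvr_precision(v: int, prec: str) -> int:
    s = SHIFT[prec]
    if s == 0:
        return v
    return ((v + (1 << (s - 1))) >> s) << s   # round to the nearest multiple

def final_mv(mvp: int, mvd: int, prec: str) -> int:
    # rounding the MVP guarantees that MVP + MVD stays at the MVD precision
    return round_to_amvr_precision(mvp, prec) + mvd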
Geometric Partitioning Mode (GPM) enables motion compensation on nonrectangular partitions of blocks as one variant of the merge mode in VVC. When this mode is used, a CU is split into two partitions by a geometrically located straight line, and two merge indices (one for each partition) are further signaled. In total, 64 different partition layouts are supported by geometric partitioning for each possible CU size from 8 × 8 to 64 × 64, excluding 8 × 64 and 64 × 8. The location of the splitting line is mathematically derived from the angle and offset parameters of a specific partition. Each part of a geometric partition in the CU is inter-picture predicted using its own motion, and only uniprediction is allowed for each partition, that is, each part has one MV and one reference picture index. The uniprediction motion constraint is applied to ensure that, as in conventional biprediction, only two motion-compensated predictions need to be computed for each CU. After predicting each of the parts, the sample values are combined using a blending process with adaptive weights along the geometric partition edge.
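The blending along the partition edge can be pictured with the following sketch, which weights the two uniprediction signals by the signed distance of each sample to the split line. The angle/offset parameterization and the linear ramp width here are simplified stand-ins for the 64 normative partition layouts and their integer blending masks.

import math

def gpm_blend(pred0, pred1, angle_deg, offset, ramp=2.0):
    # pred0/pred1: the two uniprediction blocks as equally sized 2-D lists
    h, w = len(pred0), len(pred0[0])
    nx = math.cos(math.radians(angle_deg))
    ny = math.sin(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = (x - w / 2) * nx + (y - h / 2) * ny - offset  # signed distance
            w1 = min(max(0.5 + d / (2 * ramp), 0.0), 1.0)     # soft transition
            out[y][x] = round((1.0 - w1) * pred0[y][x] + w1 * pred1[y][x])
    return out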
Biprediction with CU-level weights (BCW): In HEVC, the biprediction signal is generated by averaging two prediction signals obtained from two reference pictures and/or using two MVs. Weighted averaging of the two prediction signals is supported in HEVC but with a somewhat cumbersome scheme that required establishing weights at the slice level and using the reference picture index to control the weight selection. In VVC, this legacy explicit-weighted prediction scheme is kept and extended with CU-level syntax control for weighted averaging. Five weights are allowed in this weighted averaging biprediction, w ∈ {−2, 3, 4, 5, 10}/8. For each bipredicted CU, the weight w is determined in one of two ways: 1) for a nonmerge CU, the weight index is signaled after the MVD or 2) for a merge CU, the weight index is inferred from neighboring blocks based on the merge candidate index. BCW is only applied to CUs with 256 or more luma samples (i.e., CU width times CU height is greater than or equal to 256). If all reference pictures are temporally preceding the current picture in display order, for example, for low-delay applications, all five weights are used. Otherwise, only three weights w ∈ {3, 4, 5} are used.
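In integer arithmetic, the CU-level weighted average amounts to the following; the 3-bit shift and rounding offset follow directly from the w/8 weight set quoted above (clipping to the valid sample range is omitted).

BCW_WEIGHTS = (-2, 3, 4, 5, 10)   # numerators of w/8; w = 4 is plain averaging

def bcw_average(p0: int, p1: int, w: int) -> int:
    assert w in BCW_WEIGHTS
    return ((8 - w) * p0 + w * p1 + 4) >> 3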
Combined inter-/intra-picture prediction (CIIP): In VVC, when a CU is coded in merge mode, an additional flag is signaled to indicate whether a CIIP mode is applied to the current CU. The CIIP mode can be applied to a CU containing at least 64 luma samples when both the CU width and CU height are less than 128 luma samples. As its name indicates, the CIIP prediction combines an inter-picture prediction signal with an intra-picture prediction signal. The intra-picture prediction signal is generated using the planar mode. The intra-picture and inter-picture prediction signals are combined using weighted averaging, where the weight value is calculated depending on the coding modes of the top and left neighboring blocks.

Decoder-side MV refinement (DMVR) is used to improve the accuracy of the MVs of the merge mode. It searches candidate MVs around the initial MVs in list 0 and list 1 and, like SMVD, is used only with temporally bidirectional prediction. The DMVR searching process consists of an integer sample MV offset search and a fractional sample MV refinement process. The integer sample MV search calculates the distortion between each pair of candidate reference blocks in list 0 and list 1, and the search range is ±2 integer luma samples from the initial MVs. The fractional sample refinement is derived by using a parametric error surface approximation instead of additional searching with distortion measurement comparisons. When the width or height of a CU is larger than 16 luma samples, the CU is split, and DMVR is processed for each 16 × 16 block separately. The refined MVs are used to generate the inter-picture prediction samples and are also used in TMVP for the coding of subsequent pictures. However, the original MVs are used in the deblocking process and are also used in spatial MVP for subsequent CU coding to ease potential pipelining in hardware implementations.
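The fractional refinement stage can be sketched in closed form: a separable parabolic error surface is fitted through the best integer-search cost and its four neighbors, and the sub-sample offset is read off per axis. This mirrors the parametric error surface idea described above; the fixed-point formulation in the standard differs in detail.

def error_surface_offset(e_c, e_l, e_r, e_t, e_b):
    # costs at the best integer position (e_c) and its left/right/top/bottom
    # neighbors; returns (dx, dy), each limited to the range (-0.5, 0.5)
    def axis(minus, center, plus):
        denom = 2 * (minus + plus - 2 * center)
        if denom == 0:
            return 0.0
        return max(-0.5, min(0.5, (minus - plus) / denom))
    return axis(e_l, e_c, e_r), axis(e_t, e_c, e_b)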
Bidirectional optical flow (BDOF) is another technique for improving temporally bidirectional motion representation and is used to refine the biprediction signal of a CU at the 4 × 4 subblock level. It is applied to CUs coded either in the merge mode or the AMVP mode. Similar to PROF for affine motion, the BDOF refinement is based on the optical flow concept and assumes homogeneous motion of an object within the current CU. For each 4 × 4 subblock, a motion difference relative to the CU MVs is calculated by minimizing the difference between the list 0 and list 1 prediction subblocks using the cross-correlation and autocorrelation of the horizontal and vertical gradients for each prediction sample. The motion difference together with the prediction sample gradients is then used to adjust the bipredicted sample values in the 4 × 4 subblock.

Affine motion: In HEVC, only a translational motion model is applied in motion-compensated prediction, which cannot efficiently represent higher-order motion such as the scaling and rotation addressed by the affine mode introduced above.
Table 2 Mapping of MTS Modes to Transform Kernels

computation, for blocks with size (width or height, or both width and height) equal to 32, only the coefficients within the 16 × 16 lower frequency region are retained, and the high-frequency transform coefficients are zeroed out for these transforms. For the TBs with size (width or height, or both width and height) equal to 64, only DCT type-II is used, where only the coefficients within the 32 × 32 lower frequency region are retained and the high-frequency transform coefficients are zeroed out. In case a low-complexity encoder does not have the resources to test and signal the MTS, an implicit MTS can be used as an alternative. In that case, a combination of DCT type-II and DST type-VII is derived based on the width and the height of the current TB.

Low-frequency non-separable transform (LFNST) can be applied to the low-frequency components of the primary transform to better exploit the directionality characteristics, particularly of intra-picture coded CUs with DCT type-II as the primary transform. It is applied between the forward primary transform and quantization at the encoder side and between the inverse quantization scaling and inverse primary transform at the decoder side. In LFNST, a 4 × 4 or 8 × 8 nonseparable transform is applied according to the TB size. The 4 × 4 LFNST is applied to the low-frequency transform coefficients of the TBs with width or height, or both width and height, equal to 4, and the 8 × 8 LFNST is applied to the low-frequency transform coefficients of the TBs with both width and height greater than 4. All transform coefficients outside the 4 × 4 or 8 × 8 LFNST zone are discarded (set to zero). To further reduce the computational complexity and storage size of transform matrices, in the case of the 8 × 8 LFNST, only 48 coefficients from the primary transform are used as inputs, and only 16 coefficients are generated as outputs from the secondary transform. Thus, a maximum of 16 coefficients needs to be coded for any TB with LFNST mode enabled. For 4 × N, N × 4, and 8 × 8 blocks, only eight coefficients are output from the secondary transform.

In LFNST, a total of four transform sets and two nonseparable transform matrices (kernels) per transform set are predefined. The transform set to be used is determined based on the intra-picture prediction mode. For each transform set, the selected nonseparable secondary transform candidate is further specified by an LFNST index that is explicitly signaled for the CU.

Subblock Transform (SBT) is introduced for inter-picture predicted CUs in VVC. In this transform mode, only a subpart of the residual block is coded. A CU-level flag is signaled to indicate whether the whole residual block or only a subpart of it is coded. In the former case, inter-MTS information is further parsed to determine the transform type of the CU. In the latter case, a part of the residual block is coded with an inferred primary transform type, and the other part of it is zeroed out. The part with coded residual can be one-half or one-quarter the size of the CU and can be located in the left, right, top, or bottom region of the CU, which results in a total of eight SBT modes.

Adaptive chroma QP offset extends the block-based quantization control for luma and is similar in spirit to the mechanism introduced in HEVC version 2 by the range extensions. Block-level QP control is widely used in practical implementations for rate control and perceptually optimized encoding approaches. In addition to signaled luma QP changes for an area of blocks (quantization groups), chroma QPs are derived from the luma QP of the colocated block via lookup tables. To support a wide range of transfer functions and color formats, the lookup tables are defined by piecewise linear mapping functions that are determined by an encoder and coded in the SPS. Furthermore, VVC extends the range of QP values from 0 to 63 + 6 × (BitDepth − 8) in order to achieve low bit rates.

Dependent Quantization (DQ) refers to an approach in which the set of available reconstruction values for a given transform coefficient depends on the reconstruction values that were selected for the transform coefficients that precede it in scanning order. The main effect of this approach, in comparison to the conventional independent scalar quantization used in HEVC, is that the average distortion between an input vector given in an M-dimensional vector space (all transform coefficients in a TB) and the closest reconstruction vector can be globally reduced. The approach of dependent scalar quantization in VVC is realized by: 1) defining two scalar quantizers, denoted by Q0 and Q1, with different sets of reconstruction levels and 2) defining a process for switching states between the use of the two scalar quantizers. The location of the available reconstruction levels is uniquely specified by a quantization step size Δ. The scalar quantizer used (Q0 or Q1) is not explicitly signaled in the bitstream. Instead, the quantizer used for a current transform coefficient is determined by the parities (k & 1) of the transform coefficient levels k that precede the current transform coefficient in the scanning order. As shown in Fig. 10, the switching between the two scalar quantizers is realized via a state machine with four states.

Fig. 10. State transition and quantizer selection.
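The switching process can be captured compactly as below. The four-state transition table is the one commonly used to describe the VVC design; the placement of the two reconstruction grids is an illustrative simplification of the actual dequantization equations.

NEXT_STATE = [[0, 2], [2, 0], [1, 3], [3, 1]]   # indexed by [state][k & 1]

def dequantize_dq(levels, delta):
    state, out = 0, []
    for k in levels:                  # levels in coefficient scanning order
        use_q1 = state >= 2           # states 0/1 -> Q0, states 2/3 -> Q1
        adj = 0
        if use_q1 and k != 0:         # Q1 grid is offset by half a step
            adj = -1 if k > 0 else 1
        out.append((2 * k + adj) * (delta / 2))
        state = NEXT_STATE[state][k & 1]   # parity drives the next state
    return out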
Joint coding of chroma residual (JCCR) is used to further reduce the redundancy of the two chroma components’ residual signals when they are similar to each other. Instead of signaling the residual for the two chroma components separately, one of three JCCR modes with various weighting combinations of a single-coded chroma residual can be selectively applied at the CU level.
5) Entropy Coding: As in HEVC, CABAC is used as the single entropy coding method in VVC. The CABAC design in VVC contains various coding efficiency improvements compared with the design in HEVC. The changes in the two main parts of entropy coding, namely, the CABAC engine and transform coefficient coding, are further described in this section.

CABAC engine with multihypothesis probability estimate: The CABAC engine in AVC and HEVC uses a table-based probability transition process between 64 different representative probability states. The range representing the state of the coding engine is quantized to a set of four values prior to the calculation of the new interval range. The state transition is implemented using a table containing all the precomputed values to approximate the values of the new probability interval range. In VVC, the basic concept is kept, but the binary arithmetic coder is applied with a multihypothesis probability update model, based on two probability estimates P0 and P1 that are associated with each context model and are updated independently with different adaptation rates. The probability estimate P that is used for the interval subdivision in the binary arithmetic coder is the average of the estimates from the two hypotheses. The adaptation rates of P0 and P1 for each context model are pretrained based on the statistics of the associated binary events.
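A fixed-point sketch of the two-hypothesis update: each context keeps a fast- and a slow-adapting estimate, and the coder subdivides the interval using their average. The 15-bit scale and the particular shift pair are illustrative choices; in the standard, the per-context adaptation rates are pretrained constants.

PROB_BITS = 15
ONE = 1 << PROB_BITS          # fixed-point representation of probability 1.0

class Context:
    def __init__(self, shift_fast=4, shift_slow=7):
        self.p0 = ONE >> 1    # fast-adapting estimate of P(bin = 1)
        self.p1 = ONE >> 1    # slow-adapting estimate
        self.sf, self.ss = shift_fast, shift_slow

    def prob(self) -> int:
        return (self.p0 + self.p1 + 1) >> 1   # average of both hypotheses

    def update(self, bin_val: int):
        target = ONE if bin_val else 0
        self.p0 += (target - self.p0) >> self.sf   # adapts quickly
        self.p1 += (target - self.p1) >> self.ss   # adapts slowly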
Improved transform coefficient coding: In HEVC, transform coefficients of a coding block are coded by categorizing them into coefficient groups (CGs or subblocks) such that each CG contains the coefficients of a 4 × 4 subblock inside a square, power-of-two-sized TB. VVC also adopts the concept of CGs for coefficient coding. Besides the legacy 4 × 4 CG, additional CG sizes (1 × 16, 16 × 1, 2 × 8, 8 × 2, 2 × 4, and 4 × 2) are introduced due to narrow luma TBs resulting from ISP and small chroma TBs. The CGs inside a TB and the transform coefficients within a CG are coded following a single reverse diagonal scan order. Similar to HEVC, the transform coefficient levels are coded using a combination of different binarizations. This includes truncated unary coding with a cascade of flags that indicate whether the absolute value is greater than a set of increasing thresholds.

6) In-Loop Filtering: In VVC, a remapping operation and three in-loop filters can be applied sequentially to the reconstructed picture to modify its representation domain and alleviate different types of artifacts. First, a new sample-based process called LMCS is performed. Then, a deblocking filter is used to reduce blocking artifacts. SAO is then applied to the deblocked picture to attenuate ringing and banding artifacts. Finally, an ALF reduces other potential distortion introduced by the quantization and transform processes. The deblocking filter design is based on the one in HEVC but is extended with longer deblocking filters and a luma-adaptive filtering mode designed specifically for HDR video. While SAO is the same as in HEVC, and the deblocking is very similar, LMCS and ALF are new compared with previous standards. The design of ALF in VVC consists of two operations: 1) ALF with block-based filter adaptation for both luma and chroma samples and 2) a cross-component ALF (CC-ALF) for chroma samples.

Luma mapping with chroma scaling (LMCS): Unlike other in-loop filters that, in general, apply filtering processes for a current sample by using the information of its spatial neighboring samples to reduce the coding artifacts, LMCS involves modifying the input signal before encoding by redistributing the amplitudes across the entire representation dynamic range for improved compression efficiency. LMCS has two main components: 1) in-loop mapping of the luma component based on adaptive piecewise linear models and 2) luma-dependent chroma residual scaling for the chroma components. Luma mapping makes use of a forward mapping function and a corresponding inverse mapping function. The forward mapping function is a piecewise linear function with 16 equally sized segments that is signaled in the bitstream. The inverse mapping function does not need to be signaled and is instead derived from the forward mapping function. The luma mapping model is signaled in an adaptation parameter set (APS; see Section IV-C2), and up to four LMCS APSs with different mapping models can be used in a CVS. When LMCS is enabled for a slice, the inverse mapping function is applied to all the reconstructed luma blocks to convert the samples back to the original domain for display output and for storage as reference pictures. For an inter-picture coded block, the forward mapping function needs to be applied to the luma prediction signal
within the decoding process, as the reference pictures are in the original domain. This is not required for intra-picture prediction because the reconstructed signal before inverse mapping is used as a prediction in that case. Chroma residual scaling is designed to compensate for the interaction between the luma signal and its corresponding chroma signals. When luma mapping is enabled, an additional flag is signaled to indicate whether luma-dependent chroma residual scaling is enabled or not. The chroma residual scaling factor depends on the average value of the top and/or left reconstructed neighboring luma samples of the current CU. Once the scaling factor is determined, the forward scaling is applied to both the intra-picture and inter-picture predicted residual at the encoding stage, and the inverse scaling is applied to the reconstructed residual.
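The forward luma mapping reduces to a 16-segment piecewise linear lookup: equal-width input segments are each assigned a signaled number of output codewords, and the inverse mapping follows by inverting each linear segment. The sketch below assumes 10-bit video; the variable names are illustrative.

BIT_DEPTH = 10
SEGS = 16
SEG_IN = (1 << BIT_DEPTH) // SEGS      # equal-sized input segments

def build_pivots(codewords):
    # codewords[i]: output codewords assigned to input segment i (from the APS)
    pivots = [0]
    for c in codewords:
        pivots.append(pivots[-1] + c)
    return pivots

def forward_map(y: int, pivots) -> int:
    i = min(y // SEG_IN, SEGS - 1)
    slope = (pivots[i + 1] - pivots[i]) / SEG_IN
    return int(pivots[i] + slope * (y - i * SEG_IN))

# assigning SEG_IN codewords to every segment yields the identity mapping
assert forward_map(512, build_pivots([SEG_IN] * SEGS)) == 512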
Deblocking filter boundary handling modifications: The deblocking filter is applied to the samples adjacent to a CU, TU, or subblock boundary except for the case when the boundary is also a picture boundary or when deblocking is disabled across slice, tile, or subpicture boundaries (which is an option that can be signaled by the encoder). The deblocking filtering process is applied on a 4 × 4 grid for CU boundaries and transform subblock boundaries and on an 8 × 8 grid for prediction subblock boundaries. The prediction subblock boundaries include the PU boundaries introduced by the SBTMVP and affine modes, and the transform subblock boundaries include the TU boundaries introduced by the SBT and ISP modes and transforms due to implicit splits of large CUs. As done in HEVC, the processing order of the deblocking filter is defined as horizontal filtering for vertical edges for the entire picture first, followed by vertical filtering for horizontal edges. This specific order enables either multiple horizontal filtering or vertical filtering processes to be applied in parallel threads, or it can still be implemented on a CTB-by-CTB basis with only a small processing latency.

Deblocking long filters: The deblocking filtering process is similar to that of HEVC. The boundary filter strength (bS) of the deblocking filter is controlled by the values of several syntax elements of the two adjacent blocks, and according to the filter strength and the average QP of the adjacent blocks, two thresholds, tC and β, are determined from predefined tables. For luma samples, one of four cases (no filtering, weak filtering, short strong filtering, or long strong filtering) is chosen based on β and block size. For chroma samples, there are three cases: no filtering, normal filtering, and strong filtering. Compared with HEVC, long strong filtering for luma samples and strong filtering for chroma samples are newly introduced in VVC. Long luma strong filtering is used when the samples on either side of a boundary belong to a large block, where a block is considered large when its width is larger than or equal to 32 for a vertical edge or its height is larger than or equal to 32 for a horizontal edge. Up to seven samples at one side of a boundary are filtered in the strong filter. Strong chroma filtering is applied when both sides of the chroma edge are greater than or equal to 8 (in units of chroma samples), and three chroma samples from each side are filtered.

Luma-adaptive deblocking further adjusts tC and β of the deblocking filter based on the averaged luma level of the reconstructed samples. When luma-adaptive deblocking is enabled, an offset qpOffset, which is derived based on the average luma level around the filtering boundary, is added to the average QP of the two adjacent blocks. The value of qpOffset as a function of average luma level is determined by a table of thresholds signaled in the SPS, which may typically be chosen according to the transfer characteristics (the electro-optical transfer function and opto-optical transfer function) of the source video content.

Adaptive loop filter (ALF): Two filter shapes are used in block-based ALF. A 7 × 7 diamond shape is applied for the luma component, and a 5 × 5 diamond shape is applied for the chroma components. One among up to 25 filters is selected for each 4 × 4 block, which is classified based on the directionality and activity of its local gradients. Before filtering each 4 × 4 block, simple geometric transformations, such as rotation or diagonal and vertical flipping, can be applied to the filter coefficients, depending on the gradient values calculated for that block. This is equivalent to applying these transformations to the samples in the filter support region. The idea is to make different blocks to which ALF is applied more similar by aligning their directionality. Block-based classification is not applied to the chroma components.

ALF filter parameters are signaled in an APS. In one APS, up to 25 sets of luma filter coefficients and clipping value indices and up to eight sets of chroma filter coefficients and clipping value indices can be signaled. To reduce bit overhead, filter coefficients of different classifications for the luma component can be merged. In the PH or SH, the IDs of up to seven APSs can be signaled to specify the luma filter sets that are used for the current picture or slice. The filtering process is further controlled at the CTB level. For each luma CTB, a filter set can be chosen among 16 fixed-value filter sets and the filter sets signaled in APSs. For the chroma components, an APS ID is signaled in the PH or SH to indicate the chroma filter sets being used for the current picture or slice. At the CTB level, a filter index is signaled for each chroma CTB if there is more than one chroma filter set in the APS. When ALF is enabled for a CTB, for each sample within the CTB, the diamond-shaped filter selected for the respective 4 × 4 block is used, with a clipping operation applied to limit the difference between each neighboring sample and the current sample. The clipping operation introduces a nonlinearity by reducing the impact of neighbor sample values that are too different from the current sample value.
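The nonlinear filtering step amounts to adding a weighted sum of clipped neighbor differences to the current sample, as sketched below. The diamond geometry is abbreviated to an explicit tap list, and the 1/128 fixed-point convention is chosen for illustration; the coefficients and clipping bounds would come from the APS or the fixed filter sets.

def alf_filter_sample(img, x, y, taps):
    # taps: (dx, dy, coeff, clip) tuples for one diamond-shaped filter
    cur = img[y][x]
    acc = 0
    for dx, dy, c, b in taps:
        d = img[y + dy][x + dx] - cur
        d = max(-b, min(b, d))        # clipping limits very different neighbors
        acc += c * d
    return cur + ((acc + 64) >> 7)    # coefficients kept in 1/128 units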
Cross-component adaptive loop filter (CC-ALF) can further enhance each chroma component on top of the previously described ALF. The goal of CC-ALF is to use luma sample values to refine each chroma component. This is achieved by applying a diamond-shaped high-pass linear filter to the luma samples and adding the filter output to the collocated chroma samples as a correction.
in the 4:4:4 chroma format, which is especially effective for video sequences represented in RGB color spaces. The ACT in VVC is the same as in the HEVC SCC extension. It performs in-loop color-space conversion in the prediction residual domain by adaptively converting the residuals from the input color space (presumed to be RGB) to the YCgCo-R luma–chroma color representation [32]. A flag at the CU level is used to indicate whether the residuals of the CU are coded with the YCgCo-R transformation or in the original color space. The YCgCo-R transformation is fully reversible, so it can even be applied for lossless coding. In order to reduce cache storage requirements, when ACT is enabled for a CVS, the maximum transform size cannot exceed 32 × 32 samples since ACT requires temporarily storing all three TBs.
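Because YCgCo-R [32] is a lifting structure, integer values round-trip exactly, which is what makes the lossless use mentioned above possible:

def rgb_to_ycgco_r(r: int, g: int, b: int):
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, cg, co

def ycgco_r_to_rgb(y: int, cg: int, co: int):
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

# perfect reconstruction for arbitrary integer inputs
assert ycgco_r_to_rgb(*rgb_to_ycgco_r(-7, 300, 12)) == (-7, 300, 12)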
8) 360° Video Coding Tools: Another design goal for VVC is the efficient coding of immersive video. This includes 360° video, which is typically coded by representing a 2-D picture that has been generated by a projection mapping from a 3-D sphere. One example of such a mapping is the equirectangular projection format (ERP), in which the sphere is projected onto a rectangular picture with some geometric distortions, especially at the poles. Another mapping is the cube map projection (CMP), where the sphere is mapped onto the six faces of a cube, which are then packed together into one picture. The ability to indicate such formats and the following two techniques have been added to VVC to increase the coding efficiency for video pictures using these projection formats:

MV wrap-around allows for prediction samples to “wrap around” from the opposite left or right boundary in cases where an MV points outside of the coded area. In ERP pictures, the content tends to be continuous across such a wrap-around due to the 360° nature of the projection mapping, which can result in having a moving object that is partly at the left boundary and partly at the right boundary of a picture.
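A sketch of the reference sample addressing this implies: with wrap-around enabled, an out-of-bounds horizontal coordinate continues from the opposite boundary instead of being clamped to the edge as in conventional padding. (In the standard, the wrap-around offset is signaled and can differ from the picture width to account for padding; the function below simplifies this.)

def ref_sample_pos(x, y, pic_w, pic_h, wraparound: bool):
    if wraparound:
        x %= pic_w                       # continue from the opposite boundary
    else:
        x = max(0, min(pic_w - 1, x))    # conventional edge padding
    y = max(0, min(pic_h - 1, y))        # the vertical direction never wraps
    return x, y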
Virtual boundaries for in-loop filtering prevent applying in-loop filtering across certain “virtual” boundaries, for example, boundaries that are not slice or tile boundaries but correspond to the CMP face boundaries in CMP pictures. The locations of these boundaries are typically signaled at the CVS level.

C. Systems and Transport Interfaces

VVC inherited many aspects of the systems and transport interfaces from HEVC and the associated header syntax. The bitstream structure is the same as in HEVC except that the concept of an elementary stream is not included. The NAL unit syntax and NAL unit header are both similar as in HEVC, with a small difference in the NAL unit header syntax, where HEVC uses six bits for the NAL unit type field, while VVC uses only five bits, thus allowing half of the maximum number of specified NAL unit types. The VPS, SPS, PPS, and SH follow the same design principle as in HEVC and contain similar types of header parameters. The support of temporal scalability in VVC is also basically the same as in HEVC. Other aspects of the systems and transport interfaces in VVC are summarized in the following, focusing on the differences compared with HEVC.

1) Random Access Support: VVC supports three types of IRAP pictures, two types of IDR pictures (one type with and one type without associated pictures that precede them in display order), and one type of CRA picture. These are basically the same as in HEVC. The BLA picture types in HEVC are not included in VVC, mainly because: 1) the basic functionality of BLA pictures can be realized using CRA pictures and an end of sequence NAL unit, the presence of which indicates that the next picture starts a new CVS in a single-layer bitstream and 2) there was a desire for specifying fewer NAL unit types than in HEVC to simplify the design, as reflected by the use of five instead of six bits for the NAL unit type field in the NAL unit header.

Another key difference in random access support between VVC and HEVC is the support of GDR in a more normative manner in VVC. In GDR, the decoding of a bitstream can start from an inter-picture coded picture, and although, in the beginning, some parts of the picture region cannot be correctly decoded, after decoding a number of additional pictures, the entire picture region becomes correct for decoding later pictures in the bitstream. (AVC and HEVC can also support a form of GDR, using a recovery point indication SEI message for signaling the GDR random access points and the recovery points.) In VVC, a new NAL unit type is specified for the indication of GDR pictures, the recovery point is signaled in the picture header (PH) syntax structure, and a bitstream or a CVS within a bitstream is allowed to start with a GDR picture. This means that an entire bitstream is allowed to contain only inter-picture coded pictures without a single intra-picture coded picture. The main benefit of specifying GDR support in this way is to provide a conforming behavior for GDR operation. GDR enables encoders to smooth out the bit rate of a bitstream by distributing intra-picture coded slices or blocks across multiple pictures that also contain inter-picture predicted slices or blocks, as opposed to intra-picture coding of entire pictures, thus allowing significant end-to-end delay reduction to improve behavior for ultralow-delay applications, such as wireless display, online gaming, and drone-based applications.

Another GDR-related feature in VVC is the virtual boundary signaling discussed earlier. The boundary between the refreshed region (i.e., the correctly decoded region) and the unrefreshed region at a picture between a GDR picture and its recovery point can be signaled as a virtual boundary, and when signaled, in-loop filtering across the boundary would not be applied; thus, a decoding mismatch for some samples at or near the boundary would
not occur. This can also be useful when the application involves displaying the correctly decoded regions during the GDR process.

2) Adaptation Parameter Set: VVC introduced a new type of parameter set called the APS. An APS conveys picture- and/or slice-level information that may be shared by multiple slices of a picture and/or by slices of different pictures but can change frequently from picture to picture, with the total number of variants potentially being high and thus not suitable for inclusion into the PPS. Three types of parameters are included in APSs: ALF parameters, LMCS parameters, and scaling list parameters for frequency-specific inverse quantization scaling. The main purpose of introducing APSs is to save signaling overhead.

3) Picture Header: VVC also uses a PH, which contains header parameters for a particular picture. Each picture must have exactly one PH. The PH basically carries those parameters that would have been in the SH if the PH were not introduced but would have the same value for all slices of a picture. These include IRAP/GDR picture indications, flags indicating whether inter-picture and intra-picture coded slices are allowed, picture ordering position syntax, information on RPLs, deblocking, SAO, ALF, QP selection, weighted prediction control, coding block partitioning information, virtual boundaries, colocated picture information, and so on. It often occurs that each picture in an entire sequence of pictures contains only one slice. To avoid needing to have at least two NAL units for each picture in this case, the PH syntax structure can be included either in the PH NAL unit or in the SH. The main purpose of introducing the PH was to save signaling overhead for cases where pictures are split into multiple slices.

4) Reference Picture Management: Reference picture management is core functionality that is necessary for any video coding scheme that uses multipicture buffering with generalized inter-picture prediction. It manages the storage and removal of reference pictures into and from a decoded picture buffer (DPB) and puts reference pictures in their proper order in the RPLs. Reference picture management in VVC is more similar to HEVC than AVC but is somewhat simpler and more robust. As in those standards, two RPLs, called list 0 and list 1, are derived, but they are not based on the reference picture set concept used in HEVC or the automatic sliding window process used in AVC; instead, they are signaled more directly. Reference pictures are listed for the RPLs as either active or inactive entries, and only the active entries may be used as reference indices for inter-picture prediction of CTUs of the current picture. Inactive entries indicate other pictures to be held in the DPB for potential referencing by other pictures that arrive later in the bitstream.

5) High-Level Picture Partitioning: VVC also includes four different high-level picture partitioning schemes but not the same set as in HEVC. VVC inherited the tiles and WPP from HEVC, with some minor-to-moderate differences. The basic concept of slices was kept in VVC but designed in an essentially different form. VVC introduces subpictures that provide the same region extraction functionality as MCTSs but are designed in a different way to have better coding efficiency and to be friendlier for usage in application systems. More detail about these differences is described in the following.

Tiles and WPP: As in HEVC, a picture can be split into tile rows and tile columns in VVC, intra-picture prediction across tile boundaries is disallowed, and so on. However, the syntax for signaling the tile partitioning has been simplified by using a unified syntax design for both the uniform and the nonuniform use cases. The WPP design in VVC has two differences compared with HEVC: 1) the CTU row delay is reduced from two CTUs to one CTU and 2) the signaling of entry point offsets for WPP in the SH is optional in VVC, while it is mandatory in HEVC.

Slices: In VVC, the support of conventional slices based on CTUs (as in HEVC) or macroblocks (as in AVC), that is, such that each slice consists of an arbitrary number of CTUs or macroblocks in raster scan order within a tile or within a picture, has been removed. The main reasoning behind this architectural change is as follows. The advances in video coding since 2003 (the publication year of AVC v1) have been such that slice-based error concealment has become practically impossible due to the ever-increasing number and efficiency of intra-picture and inter-picture prediction mechanisms. An error-concealed picture is the decoding result of a transmitted coded picture for which there has been some data loss (e.g., loss of some slices) of the coded picture or a reference picture so that at least some part of the decoded picture is not error-free (e.g., because one or more reference pictures were lost or were error-concealed pictures). For example, when one of the multiple slices of a picture is lost, it may be error-concealed using interpolation of the neighboring slices. While advanced prediction mechanisms provide significantly higher coding efficiency, they also make it harder for algorithms to estimate the quality of an error-concealed picture, which was already a hard problem with the use of simpler prediction mechanisms. Advanced intra-picture prediction mechanisms also function much less well if a picture is split into multiple slices. Furthermore, network conditions have become significantly better in the meantime. As a result, very few implementations have recently used slices for MTU size matching. Instead, substantially all applications where low-delay error/loss resilience is required (e.g., video telephony and video conferencing) have come to rely on system/transport-level error resilience (e.g., retransmission and forward error correction) and/or picture-based resilience tools (feedback-based resilience, insertion of IRAPs, scalability with uneven protection of the base layer, and so on). With all these, it is very rare that a picture that cannot be correctly decoded is passed to the decoder, and when such a rare case occurs, the system can afford to wait for an error-free picture to be decoded and available.
Fig. 12. Picture with 18 × 12 luma CTUs that are partitioned into 24 tiles and nine rectangular slices.

Fig. 13. Picture partitioned into four tiles and four rectangular slices (note that the top-right tile is split into two rectangular slices).

Fig. 14. Picture with 18 × 12 luma CTUs that are partitioned into 12 tiles and three raster-scan slices.

Fig. 15. Picture partitioned into 18 tiles, 24 slices, and 24 subpictures.
camera-captured video, Class E has video conferencing content.

Table 3 YUV BD-Rate Savings of VVC (VTM-9.0) Over AVC and HEVC

PSNR_YUV = (6 · PSNR_Y + PSNR_Cb + PSNR_Cr) / 8.
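The BD-rate figures in Tables 3–5 follow the Bjøntegaard model [35]: each rate-distortion curve is fitted over the quality axis with rates on a logarithmic scale, and the average horizontal gap between the two curves over their overlapping quality range is reported as a percentage bit-rate difference. A compact sketch using cubic polynomial fitting:

import numpy as np

def bd_rate(rates_a, psnr_a, rates_b, psnr_b):
    # average bit-rate difference (%) of codec B against anchor A [35]
    pa = np.polyfit(psnr_a, np.log(rates_a), 3)   # log-rate as cubic in PSNR
    pb = np.polyfit(psnr_b, np.log(rates_b), 3)
    lo = max(min(psnr_a), min(psnr_b))
    hi = min(max(psnr_a), max(psnr_b))
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_b = (np.polyval(ib, hi) - np.polyval(ib, lo)) / (hi - lo)
    return (np.exp(avg_b - avg_a) - 1.0) * 100.0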
Table 4 Random Access YUV BD-Rate Savings of VVC (VTM-9.0) Over VVC Without Specific Tool Sets

Table 5 MOS and PSNR-YUV BD-Rate Savings of VVC (VTM-10.0) Over HEVC (HM-16.22) and of an Optimized VVC Encoder (VVenC-0.1) Over VTM

JCCR), and loop filtering tools (ALF, CC-ALF, and LMCS) for the random access case. The coding gain of the VVC QT+MTT block partitioning scheme can be approximated by comparing the first version of the VTM [41], which is basically adding QT+MTT on top of HEVC, to HM-16.20. VTM-1.0 provides around 10% YUV BD-rate savings for random access over the HEVC HM. It should be noted that some VVC coding features, for example, the improved CABAC engine, transform coefficient coding, intra-picture prediction mode coding, and PDPC, cannot be turned off in the VVC reference software. Hence, their respective gains are not included in this experiment. Table 4 further lists the relative encoding and decoding runtimes for the averages, where 100% represents the runtime of the respective anchor. The presented results show that VVC’s coding efficiency improvement over HEVC stems from multiple new coding features in each major module. In addition, the combined gains of all four tool sets (inter, intra, transform and quantization, and loop filtering) are just slightly lower than the sum of the individual gains. An additional tool-on test, where each specific tool set is enabled on top of a version of VTM with all tools off, has been performed as well, and the results are not significantly different from those of the tool-off test.

B. Subjective

The compression capability goal of the HEVC and VVC projects has been to reduce the bit rate for a given level of subjective video quality, that is, the quality perceived by human observers. While PSNR is a convenient objective measurement method, it is not an adequate substitute for subjective quality measurement. This motivated the JVET to initiate formal testing activities using rigorous subjective assessment methods in order to verify the coding efficiency of the final standard. The first such verification test was completed in October 2020, covering UHD SDR content in a random access configuration, as may be used in newer streaming and broadcast television applications [42]. Here, five challenging UHD SDR sequences outside the JVET test set were selected and encoded over a range of five quality levels spanning from annoying to almost imperceptible impairments. Although the main focus was on comparing the VVC reference software VTM with the HEVC reference software (HM), an open-source encoder implementation (VVenC) was also included in the tests [43]. The tested VVenC version 0.1 in “medium” preset runs significantly faster (110×) than VTM and additionally includes subjective quality enhancement techniques, that is, temporal filtering of the input video and perceptually tuned bit allocation [44]. Table 5 summarizes the subjective mean opinion score (MOS) and objective PSNR-YUV-based BD-rate savings for all five test sequences. This test verifies that the VTM and VVenC encoders for VVC significantly improve compression, with the VTM reducing the bit rate by 43% on average relative to the HM for the same perceived quality and VVenC reducing the bit rate by an additional 12% relative to the VTM. On the other hand, the PSNR-YUV BD-rate savings are much lower and even negative (i.e., a bit rate increase) for VVenC versus the VTM. For both tested VVC encoders, the measured subjective quality benefit relative to the HM somewhat exceeds the benefit measured by PSNR-YUV BD-rate numbers—a phenomenon that was also observed for HEVC relative to its AVC predecessor [19]. Fig. 19 shows pooled results for all five test sequences by plotting
the arithmetic average of the MOS values over the geometric average of the corresponding rate points. It can be seen that the quality levels of the VTM and HM are well matched. At the time of writing, testing of HD SDR (random access and low delay), HDR, and 360° video content is ongoing and expected to be completed in April 2021 [45].

Fig. 19. Average (arithmetic) MOS and (geometric mean) bit rates of VVC (VTM and VVenC encoders) and HEVC (HM encoder) pooled over the five UHD SDR sequences used in the verification test.

VI. CONCLUSION AND OUTLOOK

VVC is a major advance in both video compression capability and the versatility of the application domain, again demonstrating about 50% bit-rate reduction for equal subjective quality—a characteristic that it shares with its HEVC and AVC predecessors as a new milestone generation of video coding technology. In terms of applications, it has substantial new features for such uses as the coding of HDR and 360° video content, streaming with adaptive picture resolution, support for compressed-domain bitstream extraction and merging, and, practically, all of the features of the prior international video coding standards and their extensions (e.g., extended chroma formats, scalability, multiview coding, and SCC). Optimized encoder and decoder implementations of VVC have begun to emerge and have clearly demonstrated that the standard is feasible to implement with good compression performance and practical levels of complexity. While the first version of VVC has included only bit depths up to 10 bits per sample, the first extension work for VVC has begun to extend it to support higher bit depths and enhance its performance in the very high (near-lossless) fidelity range.

Further research will result in further improvements in video compression, but it may be difficult to significantly surpass the capability of the VVC design for quite a few years to come. Artificial intelligence technologies have shown great promise in that direction, but this work has just begun to emerge, and such techniques are typically difficult to implement at the high speeds and low costs that are necessary for widespread deployment in many video applications. Another promising direction is the development of improved methods of measuring perceptual video quality. Given some improved method of measuring quality, there may be improved compression technologies that can optimize that quality. Yet another interesting direction is the concept of video coding for machines, where the key difference compared with conventional video coding is that the decoded video quality measurement needs to take into account the performance of a nonhuman usage of the decoded video for some particular purposes, for example, by self-driving vehicles.

The breadth of applications of video coding technology also continues to expand, as in recent and emerging work on the coding of point clouds, textures mapped onto moving 3-D meshes, and plenoptic light field coding. Such technologies will bring new requirements to the compression technology, although the VVC standard seems quite flexible to address the stable and well-understood applications that have driven the current demand for a new international standard.

Acknowledgment

The authors would like to thank the experts of ITU-T VCEG, ISO/IEC MPEG, and their ITU-T/ISO/IEC Joint Video Experts Team (JVET) for their contributions. Their work has not only led to the development of the new Versatile Video Coding (VVC) standard but also made a large archive of innovative contributions available for further study. The archive of JVET documents can be found online at https://www.jvet-experts.org/.
REFERENCES

[1] High Efficiency Video Coding, Recommendation ITU-T H.265 and ISO/IEC 23008-2 (HEVC), ITU-T and ISO/IEC JTC 1, Apr. 2013.
[2] Advanced Video Coding for Generic Audio-Visual Services, Recommendation ITU-T H.264 and ISO/IEC 14496-10 (AVC), ITU-T and ISO/IEC JTC 1, May 2003.
[3] G. J. Sullivan and T. Wiegand, “Video compression—From concepts to the H.264/AVC standard,” Proc. IEEE, vol. 93, no. 1, pp. 18–39, Jan. 2005.
[4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688–703, Jul. 2003.
[6] Versatile Video Coding, Recommendation ITU-T H.266 and ISO/IEC 23090-3 (VVC), ITU-T and ISO/IEC JTC 1, Jul. 2020.
[7] Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams, Recommendation ITU-T H.274 and ISO/IEC 23002-7 (VSEI), ITU-T and ISO/IEC JTC 1, Jul. 2020.
[8] Cisco Systems, “Cisco visual networking index: Forecast and trends, 2017–2022,” Cisco Syst., White Paper, 2019. [Online]. Available: http://web.archive.org/web/20181213105003/https:/www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.pdf
[9] Video Codec for Audiovisual Services at P x 64 kbit/s, Recommendation ITU-T H.261, ITU-T, 1993.
[10] Codecs for Videoconferencing Using Primary Digital Group Transmission, Recommendation ITU-T H.120, ITU-T, 1993.
[11] Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1,5 Mbit/s—Part 2: Video, ISO/IEC 11172-2, ISO/IEC JTC 1, 1993.
[12] Information Technology—Generic Coding of Moving Pictures and Associated Audio Information: Video, Recommendation ITU-T H.262 and ISO/IEC 13818-2, ITU-T and ISO/IEC JTC 1, 1995.
[13] Video Coding for Low Bit Rate Communication, Recommendation ITU-T H.263, ITU-T, Mar. 1996.
[14] Information Technology—Coding of Audio-Visual Objects—Part 2: Visual, document ISO/IEC 14496-2, ISO/IEC JTC 1, 2001.
[15] H. Schwarz, D. Marpe, and T. Wiegand, “Analysis of hierarchical B pictures and MCTF,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Toronto, ON, Canada, Jul. 2006, pp. 1929–1932.
[16] J. Pfaff et al., “Data-driven intra-prediction modes in the development of the versatile video coding standard,” ITU J. ICT Discoveries, vol. 3, no. 1, May 2020.
[17] Information Technology—Digital Compression and Coding of Continuous-Tone Still Images—Part 1: Requirements and Guidelines, Recommendation ITU-T T.81 and ISO/IEC 10918-1, ITU-T and ISO/IEC JTC 1, 1992.
[18] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[19] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards—Including High Efficiency Video Coding (HEVC),” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[20] R. Sjöberg et al., “Overview of HEVC high-level syntax and reference picture management,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1858–1870, Dec. 2012.
[21] Information Technology—Coding of Audio-Visual Objects—Part 12: ISO Base Media File Format, document ISO/IEC 14496-12, ISO/IEC JTC 1, 2004.
[22] Information Technology—Dynamic Adaptive Streaming Over HTTP (DASH)—Part 1: Media Presentation Description and Segment Formats, document ISO/IEC 23009-1, ISO/IEC JTC 1, 2012.
[23] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth, and M. Zhou, “An overview of tiles in HEVC,” IEEE J. Sel. Topics Signal Process., vol. 7, no. 6, pp. 969–977, Dec. 2013.
[24] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, “Tile based HEVC video for head mounted displays,” in Proc. IEEE Int. Symp. Multimedia (ISM), San Jose, CA, USA, Dec. 2016, pp. 399–400.
[25] C. C. Chi et al., “Parallel scalability and efficiency of HEVC parallelization approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1827–1838, Dec. 2012.
[26] D. Flynn et al., “Overview of the range extensions for the HEVC standard: Tools, profiles, and performance,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 4–19, Jan. 2016.
[27] J. M. Boyce, Y. Ye, J. Chen, and A. K. Ramasubramonian, “Overview of SHVC: Scalable extensions of the high efficiency video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 20–34, Jan. 2016.
[28] G. Tech, Y. Chen, K. Muller, J.-R. Ohm, A. Vetro, and Y.-K. Wang, “Overview of the multiview and 3D extensions of high efficiency video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 35–49, Jan. 2016.
[29] J. Xu, R. Joshi, and R. A. Cohen, “Overview of the emerging HEVC screen content coding extension,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 50–62, Jan. 2016.
[30] J. Chen, M. Karczewicz, Y.-W. Huang, K. Choi, J.-R. Ohm, and G. J. Sullivan, “The joint exploration model (JEM) for video compression with capability beyond HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 5, pp. 1208–1225, May 2020.
[31] B. Bross et al., “General video coding technology in responses to the joint call for proposals on video compression with capability beyond HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 5, pp. 1226–1240, May 2020.
[32] H. S. Malvar, G. J. Sullivan, and S. Srinivasan, “Lifting-based reversible color transformations for image compression,” Proc. SPIE, vol. 7073, Aug. 2008, Art. no. 707307, Paper 7073-07.
[33] Procedure for the Allocation of ITU-T Defined Codes for Non-Standard Facilities, Recommendation ITU-T T.35, ITU-T, 1988.
[34] F. Bossen, J. Boyce, X. Li, V. Seregin, and K. Sühring, JVET Common Test Conditions and Software Reference Configurations for SDR Video, document JVET-N1010, 14th Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Mar. 2019.
[35] G. Bjøntegaard, Improvement of BD-PSNR Model, document VCEG-AI11 of ITU-T SG16/Q6, Berlin, Germany, Jul. 2008. [Online]. Available: http://wftp3.itu.int/av-arch/video-site/0807_Ber/
[36] Working Practices Using Objective Metrics for Evaluation of Video Coding Efficiency Experiments, document ITU-T HSTP-VID-WPOM and ISO/IEC DTR 23002-8, ITU-T and ISO/IEC JTC 1, 2020.
[37] VVC Reference Software Version 8.0. Accessed: Feb. 2020. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tags/VTM-8.0
[38] HEVC Reference Software Version 16.20. Accessed: Sep. 2018. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/HM/-/tags/HM-16.20
[39] HEVC Screen Content Coding Extension Reference Software Version 16.21+SCM8.8. Accessed: Mar. 2020. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/HM/-/tags/HM-16.21+SCM-8.8
[40] AVC Reference Software Version 19.0. Accessed: Mar. 2019. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jct-vc/JM/-/tags/JM-19.0
[41] J. Chen and E. Alshina, Algorithm Description for Versatile Video Coding and Test Model 1 (VTM 1), document JVET-J1002, 10th Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Apr. 2018.
[42] V. Baroncini and M. Wien, VVC Verification Test Report for UHD SDR Video Content, document JVET-T2020, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.
[43] Fraunhofer HHI VVenC Software Repository. Accessed: Sep. 2020. [Online]. Available: https://github.com/fraunhoferhhi/vvenc
[44] A. Wieckowski et al., Open Optimized VVC Encoder (VVenC) and Decoder (VVdeC) Implementations, document JVET-T0099, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.
[45] M. Wien, V. Baroncini, A. Segall, and Y. Ye, VVC Verification Test Plan (Draft 4), document JVET-T2009, 21st Meeting of ITU-T/ISO/IEC Joint Video Experts Team (JVET), Oct. 2020.
Benjamin Bross (Member, IEEE) received the Dipl.-Ing. degree in electrical engineering from RWTH Aachen University, Aachen, Germany, in 2008.

In 2009, he joined the Fraunhofer Institute for Telecommunications–Heinrich Hertz Institute, Berlin, Germany, where he is currently the Head of the Video Coding Systems group and a part-time Lecturer with the HTW Berlin University of Applied Sciences. Since 2010, he has been actively involved in the ITU-T Video Coding Experts Group (VCEG) | ISO/IEC MPEG video coding standardization processes as a Technical Contributor, a Coordinator of core experiments, and the Chief Editor of the High Efficiency Video Coding (HEVC) standard (ITU-T H.265 | ISO/IEC 23008-2) and the Versatile Video Coding (VVC) standard (ITU-T H.266 | ISO/IEC 23090-3). Besides giving talks on recent video coding technologies, he has authored or coauthored several fundamental HEVC-related publications and has authored two book chapters on HEVC and on inter-picture prediction techniques in HEVC.

Mr. Bross received the IEEE Best Paper Award of the 2013 IEEE International Conference on Consumer Electronics–Berlin, the SMPTE Journal Certificate of Merit in 2014, and the Emmy Award at the 69th Engineering Emmy Awards in 2017 as a part of the Joint Collaborative Team on Video Coding for its development of HEVC.

Jianle Chen (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2006, respectively.

He was formerly with Samsung Electronics, Suwon, South Korea; Qualcomm, San Diego, CA, USA; and Huawei, Santa Clara, CA, USA, focusing on research in video technologies. Since 2006, he has been actively involved in the development of various video coding standards, including the High Efficiency Video Coding (HEVC) standard; its scalable, format range, and screen content coding extensions; and, most recently, the Versatile Video Coding (VVC) standard in the Joint Video Experts Team (JVET). He has been the main developer of the recursive partitioning structure with large block size, which is one of the key features of the HEVC standard and its potential successors. He is currently the Director of the Multimedia R&D Group, Qualcomm, Inc. His research interests include video coding and transmission, point cloud coding, AR/VR, and neural network compression.

Dr. Chen was an Editor of the HEVC specification version 2 [the scalable HEVC (SHVC) text specification] and the SHVC Test Model. For VVC, he has been the Lead Editor of the Joint Exploration Test Model (JEM) and the VVC Test Model (VTM). He is an Editor of the VVC text specification.
Jens-Rainer Ohm (Member, IEEE) has held the Chair of the Institute of Communication Engineering, RWTH Aachen University, Aachen, Germany, since 2000. He is currently the Dean of the Faculty of Electrical Engineering and Information Technology, RWTH Aachen University. Since 1998, he has been participating in the work of the Moving Picture Experts Group (MPEG). He has authored textbooks on multimedia signal processing, analysis, and coding, as well as on communication engineering and signal transmission, and numerous articles in these fields. His research and teaching activities cover the areas of multimedia signal processing, analysis, compression, transmission, and content description, including 3-D and VR video applications, biosignal processing and communication, the application of deep learning approaches in these fields, and fundamental topics of signal processing and digital communication systems.

Dr. Ohm has chaired or cochaired various standardization activities in video coding, namely the MPEG Video Subgroup (2002–2018), the Joint Video Team (JVT) of MPEG and ITU-T SG 16 Video Coding Experts Group (VCEG) (2005–2009), the Joint Collaborative Team on Video Coding (JCT-VC) (since 2010), and the Joint Video Experts Team (JVET) (since 2015). He has served on the editorial boards of several journals and on the program committees of various conferences.

Gary J. Sullivan (Fellow, IEEE) received the B.S. and M.Eng. degrees from the University of Louisville, Louisville, KY, USA, in 1982 and 1983, respectively, and the Ph.D. degree from the University of California at Los Angeles, Los Angeles, CA, USA, in 1991.

He is currently a Video and Image Technology Architect with Microsoft Research, Redmond, WA, USA. He has been the long-standing Chairman or Co-Chairman of various video and image coding standardization activities in ITU-T Video Coding Experts Group (VCEG), ISO/IEC MPEG, ISO/IEC JPEG, and their joint collaborative teams since 1996. He has led the development of the Advanced Video Coding (AVC) standard (ITU-T H.264 | ISO/IEC 14496-10), the High Efficiency Video Coding (HEVC) standard (ITU-T H.265 | ISO/IEC 23008-2), the Versatile Video Coding (VVC) standard (ITU-T H.266 | ISO/IEC 23090-3), and various other projects. At Microsoft, he has been the Originator and the Lead Designer of the DirectX Video Acceleration (DXVA) video decoding feature of the Microsoft Windows operating system.

Dr. Sullivan is a Fellow of SPIE. He received the IEEE Masaru Ibuka Consumer Electronics Award, the IEEE Consumer Electronics Engineering Excellence Award, two IEEE Transactions on Circuits and Systems for Video Technology Best Paper Awards, and the SMPTE Digital Processing Medal. The team efforts that he has led have been recognized by three Emmy Awards.

Ye-Kui Wang received the B.S. degree in industrial automation from the Beijing Institute of Technology, Beijing, China, in 1995, and the Ph.D. degree in information and telecommunication engineering from the Graduate School in Beijing, University of Science and Technology of China, Hefei, China, in 2001.

His earlier positions include the Chief Scientist of Media Coding and Systems with Huawei Technologies, San Diego, CA, USA; the Director of Technical Standards with Qualcomm, San Diego; and a Principal Member of Research Staff with Nokia Corporation, Tampere, Finland. He is currently a Principal Scientist with Bytedance Inc., San Diego. He has been an active contributor to various multimedia standards, including video codecs, file formats, RTP payload formats, and multimedia streaming and application systems, developed by various standardization organizations, including the ITU-T Video Coding Experts Group (VCEG), ISO/IEC Moving Picture Experts Group (MPEG), the Joint Video Team (JVT), the Joint Collaborative Team on Video Coding (JCT-VC), JCT-3V, 3GPP SA4, IETF, AVS, DVB, ATSC, and DECE. He has coauthored about 1000 standardization contributions and over 50 academic articles, and he is a listed inventor on more than 300 U.S. patents. His research interests include video coding, storage, transport, and multimedia systems.

He has chaired the development of OMAF at MPEG and has been an editor for several standards, including Versatile Video Coding (VVC), Versatile Supplemental Enhancement Information (VSEI), OMAF, all versions of High Efficiency Video Coding (HEVC), the VVC file format, the HEVC file format, the layered HEVC file format, ITU-T H.271, the SVC file format, MVC, RFC 6184, RFC 6190, RFC 7798, 3GPP TR 26.906, and 3GPP TR 26.948.