Metadata For Object-Based Interactive Audio
This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed
by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention
paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the
Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request
and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see
www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct
permission from the Journal of the Audio Engineering Society.
ABSTRACT
For object-based audio, an appropriate definition of metadata is needed to ensure flexible playback in any
reproduction scenario and to allow for interactivity. Important use-cases for object-based audio and audio
interactivity are described and requirements for metadata are derived. A metadata scheme is defined that
allows for enhanced audio rendering techniques such as content-dependent processing, automatic scene scaling
and enhanced level control. Also, a metadata preprocessing logic is proposed that prepares rendering and
playout and allows for user interaction with the audio content of an object-based scene. In addition, the
paper points out how the metadata can be transported efficiently in a bitstream. The proposed metadata
scheme has been adopted and integrated into the MPEG-H 3D Audio standard, which is currently being finalized.
tion to or as a replacement of the main dialog audio track. An example is a movie with an additional director’s commentary, or a car race during which the user can select a team radio as an additional audio source. Another example is the presence of e.g. spoken subtitles or audio description elements, which can be enabled or disabled.

Choosing between content versions: Different versions of the content are offered, e.g. a sports event or a match between two teams with different stadium atmospheres or commentaries, one in favor of the home team and one in favor of the guest team.

Enhanced level control: With object-based audio, the level of single sounds and the balance between different sounds can be changed in a convenient way. Examples of the need for enhanced level control are spoken subtitles, voice over translation, audio description and simultaneous translation. The level of the main speech content / main audio track should be reduced automatically or by the user when an additional track is active (ducking, audio description receiver mix). For better speech intelligibility, the balance between speech content and ambient sound may be changed depending on personal user preference, the listening environment or the hearing abilities of the consumer [6]. If the overall volume level is low, the volume of low-level content (e.g. dialog) might be increased for easier understanding while at the same time the intensity of higher-level audio content is reduced (late night mode).

Automatic audio scene scaling: If sounds are played back with accompanying pictures, the audio-visual coherence should be, and remain, consistent in different playback scenarios and for different screen-sizes. The audio scene should therefore be automatically scalable according to the reproduction screen-size, such that the positions of visual elements and the corresponding origins of sounds are in agreement. The need for a screen-dependent position correction is also addressed in [5].

3. THE MPEG-H 3D AUDIO DECODER

In order to facilitate high-quality, bitrate-efficient distribution and flexible reproduction of 3D sound, the MPEG standardization group is currently finalizing MPEG-H 3D Audio, which allows for the universal carriage of encoded 3D sound from channel-based, object-based and scene-based (HOA) sound formats. All technical requirements and open aspects were addressed with the publication of the ISO/IEC Committee Draft for MPEG-H 3D Audio (CD, April 2014) and the Draft International Standard (DIS, July 2014 [7]), which constitutes the technically complete specification.

MPEG-H 3D Audio has been designed to meet the requirements for the delivery of next-generation audio content to the user. It supports delivery ranging from highest-quality cable and satellite TV to streaming to mobile devices, and reproduction on arbitrary output setups ranging from 22.2 and beyond down to 5.1, stereo and binaural reproduction. A brief overview of the main features that make MPEG-H 3D Audio applicable to the different associated playback scenarios is given here. A system overview is shown in Figure 2.

MPEG-H 3D Audio offers the possibility of coding channel-based content, object-based content and Higher Order Ambisonics (HOA) as a sound-field representation. As a first step, all transmitted audio signals are decoded by an extended Unified Speech and Audio Coding (USAC [8]) stage (USAC-3D). Channel-based signals are mapped to the target reproduction loudspeaker layout using a “format converter” module. The format converter generates high-quality downmixes to convert the decoded channel signals to numerous output formats, i.e. for playback on different loudspeaker layouts (including non-ideal loudspeaker placement).

Object-based signals are rendered to the target reproduction loudspeaker layout by the object renderer, which maps the signals to loudspeaker feeds based on the metadata and the locations of the loudspeakers in the reproduction room. The object renderer applies Vector Base Amplitude Panning (VBAP [9]) and provides an automatic triangulation algorithm of the 3D surface surrounding the listener for arbitrary target configurations.

Alternatively, signals coded via an extended version of Spatial Audio Object Coding (SAOC-3D), i.e. parametrically coded channel-based and object-based signals, are rendered by the SAOC-3D decoder to the target reproduction loudspeaker layout exploiting the associated metadata. The original Spatial Audio Object Coding (SAOC) codec [10, 11] has therefore been enhanced with multiple extensions.
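As an illustration of the panning step performed by the object renderer, the following Python sketch computes VBAP gains for a single loudspeaker triplet. It is a minimal example of the published VBAP method [9], not the MPEG-H renderer implementation, and it assumes that the enclosing triplet has already been selected by the triangulation step; all function names and the example loudspeaker positions are illustrative.

    import numpy as np

    def sph_to_cart(azimuth_deg, elevation_deg):
        """Convert a spherical direction (degrees) to a unit vector."""
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        return np.array([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)])

    def vbap_gains(source_dir, speaker_dirs):
        """Basic VBAP for one triplet: solve L^T g = p and normalize the
        gains to constant energy. speaker_dirs is a 3x3 matrix whose rows
        are the unit vectors of the three loudspeakers."""
        L = np.asarray(speaker_dirs)
        g = np.linalg.solve(L.T, source_dir)   # gains such that g1*l1+g2*l2+g3*l3 = p
        if np.any(g < 0):
            raise ValueError("source direction lies outside this triplet")
        return g / np.linalg.norm(g)           # energy normalization

    # Example: object at 20 deg azimuth / 10 deg elevation rendered to a
    # (hypothetical) triplet at +30/0, -30/0 and 0/45 degrees.
    triplet = np.stack([sph_to_cart(30, 0), sph_to_cart(-30, 0), sph_to_cart(0, 45)])
    print(vbap_gains(sph_to_cart(20, 10), triplet))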
HOA content is rendered to the target reproduction loudspeaker layout using the associated HOA metadata by a HOA renderer that uses simple matrix operations for manipulation and rendering.

For a more detailed system description of MPEG-H 3D Audio see [12, 13].

4. OBJECT-BASED AUDIO IN MPEG-H

During the development of the object-based metadata in MPEG-H 3D Audio, the focus was on supporting the following relevant features.

Signaling of the position of elements: To allow rendering to any target layout and to allow for moving objects (predefined trajectories or interactive movements), the position of an element in space is fully defined by metadata descriptors as spherical coordinates (azimuth, elevation and radius/distance).

Signaling of screen-relation: To enable automatic audio scene scaling, it is possible to signal that the position of an element is related to the screen. This information, together with the screen-size, can be used in the decoding process to preserve the relation between image and sound.

Signaling of “groups”: The concept of an element group is defined for arranging related elements, e.g. for common interactivity and simultaneous rendering. A use-case for groups of elements is the definition of channel-based recordings (stems, sub-mixes) as audio elements (e.g. a stereo recording where the two signals should only be manipulated as a pair).

Signaling of “switch groups”: The concept of a switch group describes a grouping of elements which are mutually exclusive. The switch group can be used to ensure that exactly one of the switch group members is enabled at a time. This allows for switching between e.g. different language tracks, where it is not sensible to enable multiple ones simultaneously. A special case is a “0/1” switch group with a minimum of zero enabled members (e.g. for voice over elements, to ensure that either none (if no voice over is needed) or just one of the available voice over elements is enabled).

Signaling of content characteristics: By specifying the content type and language of the content, separate processing of different types of content (e.g. for dynamic compression and ducking) and the choice between different languages can be ensured.
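To make the signaled structures more tangible, the following Python sketch models elements, groups and switch groups as plain data classes. All class and field names (AudioElement, Group, SwitchGroup, allow_none, etc.) are illustrative stand-ins chosen for this example; they are not the syntax element names of the MPEG-H 3D Audio bitstream.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AudioElement:
        """One audio track / sound event with its position metadata."""
        element_id: int
        azimuth: float        # degrees
        elevation: float      # degrees
        radius: float         # distance
        screen_related: bool = False

    @dataclass
    class Group:
        """Related elements that are interacted with and rendered together."""
        group_id: int
        kind: str                        # "channels", "objects", "SAOC" or "HOA"
        content_type: str = ""           # e.g. "dialog", "ambience"
        language: str = ""
        default_on: bool = True
        elements: List[AudioElement] = field(default_factory=list)

    @dataclass
    class SwitchGroup:
        """Mutually exclusive groups, e.g. alternative language tracks."""
        switch_group_id: int
        members: List[Group] = field(default_factory=list)
        allow_none: bool = False         # True models the "0/1" special case

        def enabled_members(self):
            return [g for g in self.members if g.default_on]

        def is_valid(self):
            n = len(self.enabled_members())
            return n <= 1 if self.allow_none else n == 1

A “0/1” switch group for voice over elements would be modeled with allow_none=True, so that a state in which all members are disabled is still valid.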
Interactivity control metadata: The metadata allows for the definition of different categories of user interactivity, e.g. to reflect the content creator’s opinion on the extent to which the artistic intent may be modified. The definition of ranges of interactivity allows the content creator to limit the interaction and adaptations (e.g. the position may only be changed within an offset range between -30° and +30° azimuth).

Signaling of special playback options: The metadata definition includes the possibility to signal that elements should be played back directly by a specified loudspeaker without any rendering. This can be used for traditional channel-based content that is treated as audio elements. It is also possible to route elements to the geometrically closest loudspeaker for maximum discrete playback. The elements could, for example, contain the different participants of a teleconference meeting, which could be routed to discrete speakers instead of being rendered.

On/Off-status signaling: The metadata contains the default on/off-status of elements. This allows for embedding additional audio elements (e.g. additional commentaries or additional speech tracks) that are switched off by default.

Priority: The metadata model includes descriptors for the priority or importance of an element or a group of elements. This can be used in a renderer or a coding engine that can only handle a certain number of elements due to complexity reasons, e.g. for real-time playback. The signaling of the priority then allows determining which elements can be discarded.

4.1. Metadata Structure and Fields

The described features are reflected in the defined object-based metadata structure and metadata fields. The metadata definition contains all information needed for reproduction and rendering in flexible reproduction layouts and allows for extensibility. For a simple and efficient metadata definition, the following structure of the audio elements in an audio scene is defined:

• All elements (audio tracks / sound events) in an audio scene have to be members of a group.

• Groups can only contain elements, but not other groups.

• Switch groups can only contain groups.

• A group can contain elements of type “channels” (signals that should be played back by a specific loudspeaker configuration), “objects” or “SAOC” (signals that should be rendered to the reproduction loudspeaker layout and could be still or moving) or “HOA” (Higher Order Ambisonics signals). One group can only contain elements of one type.

A possible audio scene is depicted in Figure 3.

Fig. 3: Groups and switch groups in an exemplary audio scene. Different dialog elements and voice over elements are combined in two switch groups to signal their mutually exclusive relationship.

The metadata fields are described in Table 1. They are sorted into categories that reflect their functionality and are directly related to some of the mentioned use-cases and features. [Table 1 category headers (excerpt): “Switch Group Definition and Description”, defined once per switch group; “Dynamic Element Characteristics”, defined once per element, for members of “object” or “SAOC” groups only.]

4.2. Transport of Object-Based Metadata in the Bitstream

The defined object-based metadata is included in the MPEG-H 3D Audio bitstream to be transported in an encoder-decoder chain. For the encoding and transport of the metadata, a distinction is made between static metadata (constant over time) and dynamic metadata (changing over time).

The static object-based metadata is transmitted only once at the beginning of an audio file or on a regular basis, e.g. at random-access points. The bitstream syntax is designed in a bit-efficient manner and no coding scheme is applied here. For the audio scene from Figure 3 without any description of the groups, the static metadata takes up to 1.7 kbit per second (assuming a transmission every 0.5 seconds). If a description of 128 bytes length is added for each group and switch group, the size of the static metadata would be 20.5 kbit per second.

4.2.1. Transport of Dynamic Metadata

Only the metadata fields in the category “Dynamic Element Characteristics” in Table 1 may change dynamically over time. They describe the position of an element in 3D space, namely azimuth, elevation and radius/distance, its level (the energy of the element source), its spread (i.e. the energy distribution of an element in azimuth and elevation direction), and its dynamic priority. Because of the dynamic change, this metadata should be repeated at a high rate within the bitstream, e.g. every 2048 audio samples. As this would result in a relatively high data rate if
a high number of elements are present, a data compression method for the dynamic element metadata is utilized.

For random-access support, a full transmission of the complete set of dynamic element metadata happens on a regular basis, i.e. intracoded metadata. In between full transmissions, only differential metadata is transmitted. To this end, the time-variant data is quantized and downsampled, and a coarse approximation of the change is determined. The difference between the original dynamic metadata and the linearly interpolated and upsampled version of the coarse approximation is then analyzed.

A number N of consecutive difference values is used to approximate a polygon course that is formed by a variable number of quantized polygon points. The number of needed polygon points is on average significantly smaller than N. The polygon points are coded as small integer numbers with a low number of bits. The processing for decoding the dynamic metadata is illustrated in Figure 4.

Fig. 4: Decoding of the dynamic metadata by an interpolation in between the given polygon points. The dashed line depicts the first interpolation step, i.e. interpolation of intracoded metadata. The dotted line shows the decoded metadata.

4.2.2. Low Delay Coding of Dynamic Metadata

In addition to the described coding, a dynamic metadata encoding scheme with low latency is defined as a modified DPCM (Differential Pulse Code Modulation) procedure. In between the full transmissions of the intracoded metadata, either an absolute value or a quantized difference value is transmitted at a regular time interval. In case the data is transmitted differentially, the value of the metadata trajectory y at the time n can be calculated with the help of the differential
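As a rough illustration of the decoding shown in Figure 4 (Section 4.2.1), the following Python sketch first interpolates linearly between two intracoded values (simplifying the coarse approximation to a straight line) and then adds a correction trajectory interpolated between the transmitted polygon points. The function name, the frame grid and the zero anchoring of the correction at the intra frames are assumptions made for this sketch; the normative decoding process is defined in the MPEG-H 3D Audio specification [7].

    import numpy as np

    def decode_dynamic_metadata(intra_start, intra_end, num_frames, polygon_points):
        """Reconstruct one metadata trajectory (e.g. azimuth) between two
        intracoded transmissions.

        intra_start, intra_end : intracoded values at frame 0 and num_frames-1
        polygon_points         : list of (frame_index, correction) pairs,
                                 i.e. the decoded polygon points
        """
        frames = np.arange(num_frames)

        # First interpolation step: straight line between the intracoded
        # values (dashed line in Figure 4).
        base = np.interp(frames, [0, num_frames - 1], [intra_start, intra_end])

        # Second step: piecewise-linear correction through the polygon
        # points, assumed to be zero at both intra frames.
        xp = [0] + [p[0] for p in polygon_points] + [num_frames - 1]
        fp = [0.0] + [p[1] for p in polygon_points] + [0.0]
        correction = np.interp(frames, xp, fp)

        # Decoded metadata (dotted line in Figure 4).
        return base + correction

    # Example: azimuth moving from 0 to 30 degrees over 16 frames, with two
    # polygon points describing the deviation from the straight line.
    print(decode_dynamic_metadata(0.0, 30.0, 16, [(5, 4.0), (10, -2.0)]))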
groups should always be off. The on/off-statuses of these groups should not be changeable by the user. For this case, the preset definition would contain the following conditions:

OnOff(GroupID = 2) = 1;
OnOff(GroupID = 3) = 1;
OnOff(GroupID = 4) = 1;
OnOff(GroupID = 6) = 0;
OnOff(GroupID = 7) = 0;

The described preset and two more exemplary preset definitions are depicted in Figure 5.

Fig. 5: Possible preset definitions for the exemplary audio scene.

Groups that are not referenced in the preset are marked as “free to choose by user”. The metadata for the presets in Figure 5 takes another 6.64 kbit per second when transmitted every 0.5 seconds (a preset description of 128 bytes length is assumed). Presets can be used as references in the loudness metadata, as described above, to define a content context for a loudness information set. In addition to the basic interaction mode, an advanced interactivity mode can be defined, where the user is given full control of all present groups within the limits of the “Interactivity Data and Playout” metadata.

6.2. Processing of Metadata and User Interaction Data in the Decoding Process

For processing any user interaction, the desired modifications have to be taken into account in the decoder. The dedicated metadata fields have to be evaluated, and interaction data coming from a user interface or an application has to be processed. This processing can happen in a separate decoder module. An “Element Metadata Preprocessor” prepares the elements for rendering, such that the rendering is agnostic of the interaction. An overview of the processing steps in the Element Metadata Preprocessor is given in Figure 6.

6.2.1. Initialization

First, the preprocessor determines which groups should be played back and which ones can be discarded. This is defined by their on/off-status (e.g. defined by a chosen preset) and the implicit logic of the switch group definitions. Then, the relevant groups are further processed.

6.2.2. Screen-Related Processing

The positions of all screen-related elements are remapped according to the reproduction screen-size
as an adaptation to the reproduction room. If no reproduction screen-size information is given or no screen-related element exists, no remapping is applied.

The remapping is defined by linear mapping functions that take into account the reproduction screen-size information in the playback room and the screen-size information of a reference screen, e.g. the one used in the mixing and monitoring process. The azimuth mapping function is depicted in Figure 7. It is defined such that the azimuth values between the left edge and the right edge of the reference screen are mapped (compressed or expanded) to the interval between the left edge and the right edge of the reproduction screen. Other azimuth values are compressed or expanded such that the whole range of values is covered. The elevation mapping function is defined accordingly. The screen-related processing can also take into account a zooming area for zooming into high-resolution video content. Screen-related processing is only defined for elements that are accompanied by dynamic position data and that are labeled as screen-related.

6.2.3. Gain and Position Interaction

The gain interaction and position interaction are applied next. Output gain values are calculated for all elements, taking into account the interactivity gain modification, the DefaultGain of the groups and the dynamic gains if available. Position interactivity is only defined for groups with associated dynamic position metadata. Output azimuth, elevation and distance values are determined for each element of these groups, taking into account the interactivity modification and the values from the dynamic position metadata.
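The screen-related azimuth remapping of Section 6.2.2 can be sketched as a piecewise-linear function: on-screen azimuths are mapped from the reference screen interval to the reproduction screen interval, and off-screen azimuths are compressed or expanded so that ±180° remain fixed. The Python helper below is an illustrative sketch under these assumptions (positive azimuth to the left); function and parameter names as well as the example screen sizes are hypothetical, and the normative mapping (including the elevation counterpart and the zooming variant) is defined by the standard.

    def remap_azimuth(phi, ref_left, ref_right, rep_left, rep_right):
        """Piecewise-linear screen-related azimuth remapping.

        phi                : element azimuth in degrees (assumed -180..180,
                             positive to the left)
        ref_left/ref_right : azimuths of the reference screen edges
        rep_left/rep_right : azimuths of the reproduction screen edges
        """
        def lin(x, x0, x1, y0, y1):
            # straight line through (x0, y0) and (x1, y1)
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

        if phi >= ref_left:        # left of the reference screen's left edge
            return lin(phi, ref_left, 180.0, rep_left, 180.0)
        if phi <= ref_right:       # right of the reference screen's right edge
            return lin(phi, -180.0, ref_right, -180.0, rep_right)
        return lin(phi, ref_right, ref_left, rep_right, rep_left)  # on-screen

    # Example: reference screen spanning +/-29 degrees, reproduction screen
    # spanning +/-45 degrees (hypothetical values).
    for phi in (0.0, 29.0, 90.0, -29.0, -180.0):
        print(phi, "->", round(remap_azimuth(phi, 29.0, -29.0, 45.0, -45.0), 1))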