
Audio Engineering Society

Convention Paper 9097


Presented at the 137th Convention
2014 October 9–12 Los Angeles, USA

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed
by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention
paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the
Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request
and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see
www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct
permission from the Journal of the Audio Engineering Society.

Design, Coding and Processing of Metadata


for Object-Based Interactive Audio
Simone Füg¹, Andreas Hölzer¹, Christian Borß¹, Christian Ertel¹, Michael Kratschmer¹, and Jan Plogsties¹

¹ Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany

Correspondence should be addressed to Simone Füg ([email protected])

ABSTRACT
For object-based audio, an appropriate definition of metadata is needed to ensure flexible playback in any
reproduction scenario and to allow for interactivity. Important use-cases for object-based audio and audio
interactivity are described and requirements for metadata are derived. A metadata scheme is defined that
allows for enhanced audio rendering techniques such as content-dependent processing, automatic scene scaling
and enhanced level control. Also, a metadata preprocessing logic is proposed that prepares rendering and
playout and allows for user interaction with the audio content of an object-based scene. In addition, the
paper points out how the metadata can be transported efficiently in a bitstream. The proposed metadata
scheme has been adopted and integrated into the currently finalized MPEG-H 3D Audio standard.

1. INTRODUCTION AND BACKGROUND

Over recent years, interest in “3D audio” has grown with new cinema audio formats evolving. The more descriptive term “immersive audio” refers to an auditory experience of sound that appears to come from any direction around the listener, including above and below.

Secondly, enabling users to interact with audio content is subject to experiments and development of new services in broadcasting and IP-based media delivery. Interactive and immersive user experience of audio content is based on three fundamental representations: channel-based, scene-based and object-based audio.

With channel-based audio, each channel is delivered to a loudspeaker in a precisely defined and fixed target location relative to the listener, e.g. 2.0, 5.1 and 22.2 setups. Scene-based audio is based on a soundfield representation, e.g. Higher Order Ambisonics (HOA). There, the soundfield has to be decoded to a defined loudspeaker layout for reproduction.

In object-based audio the sound is represented by a number of separate audio objects that consist of audio tracks or sound events (e.g. a talker, an airplane, a guitar) and associated side information as metadata.

This metadata for object-based audio must exactly define ‘when’, ‘where’ and ‘how loud’ sounds should occur, because it is used by the playback system to render the audio to the reproduction loudspeaker layout. Other types of content-descriptive metadata, e.g. information about artist, year of production or genre, are not considered here.

The usage of the object-based metadata for the rendering process ensures that a predefined location or movement of a sound event is maintained, regardless of the reproduction system. Consequently, object-based audio allows for a correct reproduction of spatial aspects of recorded sound, which has been an ongoing topic for many years [1, 2, 3, 4]. It also allows for interactivity, because audio objects can be processed separately and may be influenced by the user or application prior to the rendering process.

The two other representations of audio (channel-based and scene-based) can also be treated as object-based, with channels or soundfield as separate audio objects. With this expansion, object-based audio can be seen as a superset with extended processing capabilities to meet the needs of future audio systems. Each audio track with accompanying metadata in object-based audio is hence called “audio element” in this paper, as it can be a channel, object or soundfield representation. The concept of rendering audio elements to the reproduction loudspeaker layout is illustrated in Figure 1.

Fig. 1: Concept of rendering audio elements to a reproduction loudspeaker layout, taking into account the metadata and possible user interactions.

The object-based metadata specifies the characteristics of the raw audio data and the relation to other audio elements. Only with an appropriate metadata definition can the listening experience be optimized depending on local factors relating to the device, the environment and the audience.

This paper describes the definition process of a metadata scheme for object-based interactive audio, including the analysis of use-cases and requirements as well as the definition of metadata fields. Additionally, it is described how the defined metadata can be coded for transmission in an encoder-decoder chain and how the metadata can be handled in an audio decoder to prepare the rendering process and reproduction. The described metadata definition and coding has been adopted and integrated into the currently finalized MPEG-H 3D Audio standard.

2. USE-CASES FOR OBJECT-BASED AUDIO AND AUDIO INTERACTIVITY

Object-based audio allows for personalized playback options ranging from simple adjustments, such as increasing or decreasing the level of dialog relative to the other audio elements, to broadcasts where several audio elements may be adjusted in level or position to adapt the audio playback experience to the user’s liking. The following use-cases for audio interactivity and object-based audio served as a basis for the definition of needed metadata fields in the process of designing the metadata structure for MPEG-H 3D Audio.

Changing the position of sound events: Imagine the recording of a rock band in which each instrument and vocalist can actively be manipulated in terms of location. Thus, a listener can create his own mix and/or change his virtual listening position, i.e. put himself acoustically in the middle of the band instead of sitting in front. The need for user control of the scene to adjust the spatial distribution of sound sources is addressed by previous publications, e.g. in [5].

Changing the language of a program: Multiple language tracks are offered of which one can be selected for playback. Instead of transmitting a complete mix for each language version (original language and several dub versions), the different languages and the language-agnostic content (e.g. ambience) can be transmitted as separate audio elements.

Enabling of additional dialog tracks: Dialog tracks are provided that can be selected in

addition to or as a replacement of the main dialog track. An example is a movie with an additional director’s commentary, or a car race during which the user can select a team radio as an additional audio source. Another example is the presence of e.g. spoken subtitles or audio description elements, which can be enabled or disabled.

Choosing between content versions: Different versions of content are offered, e.g. a sports event or a match between two teams with different stadium atmospheres or commentaries, one in favor of the home team and one in favor of the guest team.

Enhanced level control: With object-based audio, the level of single sounds and the balance between different sounds are changeable in a convenient way. Examples for the need for an enhanced level control are spoken subtitles, voice-over translation, audio description or simultaneous translation. The level of the main speech content / main audio track should be reduced automatically or by the user when an additional track is active (ducking, audio description receiver mix). For better speech intelligibility, the balance between speech content and ambient sound may be changed depending on personal user preference, the listening environment or the hearing abilities of the consumer [6]. If the overall volume level is low, the volume of low-level content (e.g. dialog) might be increased for easier understanding while at the same time the intensity of higher-level audio content is reduced (late night mode).

Automatic audio scene scaling: If sounds with accompanying pictures are played back, the audio-visual coherence should be, and remain, consistent in different playback scenarios and screen-sizes. The audio scene should therefore be automatically scalable according to the reproduction screen-size, such that the positions of visual elements and the corresponding origins of sounds are in agreement. The need for a screen-dependent position correction is also addressed in [5].

3. THE MPEG-H 3D AUDIO DECODER

In order to facilitate high-quality bitrate-efficient distribution and flexible reproduction of 3D sound, the MPEG standardization group is currently finalizing MPEG-H 3D Audio, which allows for the universal carriage of encoded 3D sound from channel-based, object-based and scene-based (HOA) sound formats. All technical requirements and open aspects were addressed with the publication of the ISO/IEC Committee Draft for MPEG-H 3D Audio (CD, April 2014) and the Draft International Standard (DIS, July 2014 [7]), which constitutes the technically complete specification.

MPEG-H 3D Audio has been designed to suit the requirements for delivery of next generation audio content to the user. It supports delivery ranging from highest-quality cable and satellite TV to streaming to mobile devices, and reproduction for arbitrary output setups ranging from 22.2 and beyond down to 5.1, stereo and binaural reproduction. A brief overview of the main features that make MPEG-H 3D Audio applicable for the different associated playback scenarios is outlined here. A system overview is given in Figure 2.

MPEG-H 3D Audio offers the possibility for coding of channel-based content, object-based content and Higher Order Ambisonics (HOA) as a soundfield representation. As a first step, all transmitted audio signals are decoded by an extended Unified Speech and Audio Coding (USAC [8]) stage (USAC-3D). Channel-based signals are mapped to the target reproduction loudspeaker layout using a “format converter” module. The format converter generates high-quality downmixes to convert the decoded channel signals to numerous output formats, i.e. for playback on different loudspeaker layouts (including non-ideal loudspeaker placement).

Object-based signals are rendered to the target reproduction loudspeaker layout by the object renderer, which maps the signals to loudspeaker feeds based on the metadata and the locations of the loudspeakers in the reproduction room. The object renderer applies Vector Base Amplitude Panning (VBAP, [9]) and provides an automatic triangulation algorithm of the 3D surface surrounding the listener for arbitrary target configurations.

Alternatively, signals coded via an extended version of Spatial Audio Object Coding (SAOC-3D), i.e. parametrically coded channel-based and object-based signals, are rendered by the SAOC-3D decoder to the target reproduction loudspeaker layout exploiting the associated metadata. The original Spatial Audio Object Coding (SAOC) codec [10, 11] has therefore been enhanced with multiple extensions.
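
As an illustration of the amplitude-panning principle used by such an object renderer, the following sketch computes pairwise (2D) VBAP gains for one object between two adjacent loudspeakers. It is a deliberately simplified, hypothetical example (the MPEG-H renderer works with loudspeaker triplets on a triangulated 3D surface), and the function and variable names are our own.

import numpy as np

def vbap_pair_gains(source_az_deg, spk1_az_deg, spk2_az_deg):
    """Simplified 2D VBAP: gains for one source between two loudspeakers.

    The source direction is expressed as a linear combination of the two
    loudspeaker unit vectors; the resulting weights are power-normalized.
    """
    def unit(az_deg):
        az = np.radians(az_deg)
        return np.array([np.cos(az), np.sin(az)])

    base = np.column_stack([unit(spk1_az_deg), unit(spk2_az_deg)])  # 2x2 loudspeaker base
    g = np.linalg.solve(base, unit(source_az_deg))                  # raw weights
    g = np.clip(g, 0.0, None)                                       # no negative gains
    return g / np.linalg.norm(g)                                    # constant-power normalization

# Example: object at 10 degrees azimuth, loudspeakers at +30 and -30 degrees.
print(vbap_pair_gains(10.0, 30.0, -30.0))
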


Fig. 2: Top level block diagram of the MPEG-H 3D Audio decoder.

HOA content is rendered to the target reproduction loudspeaker layout using the associated HOA metadata by a HOA renderer that uses simple matrix operations for manipulation and rendering.

For a more detailed system description of MPEG-H 3D Audio see [12, 13].

4. OBJECT-BASED AUDIO IN MPEG-H

During the development of the object-based metadata in MPEG-H 3D Audio, the focus was to support the following relevant features.

Signaling of the position of elements: For the possibility of rendering to any target layout and for allowing for moving objects (predefined trajectories or interactive movements), the position of an element in space is fully defined by metadata descriptors as spherical coordinates (azimuth, elevation and radius/distance).

Signaling of screen-relation: For enabling automatic audio scene scaling, it is possible to signal that the position of an element is related to the screen. This information together with the screen-size can be used in the decoding process to preserve the relation between image and sound.

Signaling of “groups”: The concept of an element group is defined for arranging related elements, e.g. for common interactivity and simultaneous rendering. A use-case for groups of elements is the definition of channel-based recordings (stems, sub-mixes) as audio elements (e.g. a stereo recording where the two signals should only be manipulated as a pair).

Signaling of “switch groups”: The concept of a switch group describes a grouping of elements which are mutually exclusive. The switch group can be used to ensure that exactly one of the switch group members is enabled at a time. This allows for switching between e.g. different language tracks, where it is not sensible to simultaneously enable multiple ones. A special case is a “0/1”-switch group with a minimum of zero enabled members (e.g. for voice over elements, to ensure that either none (if no voice over is needed) or just one of the available voice over elements is enabled).

Signaling of content characteristics: By specifying content type and language of content, the possibility for separate processing of different types of content (e.g. for dynamic compression and ducking) and choosing of different languages can be assured.
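
To make the switch-group semantics concrete, the sketch below resolves which member of a switch group stays enabled, honoring a default member and the “0/1” case. It is an illustrative reading of the rules above, not MPEG-H reference code; the class, function and the hypothetical GroupIDs are our own.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SwitchGroup:
    members: List[int]                 # GroupIDs of the mutually exclusive member groups
    default: Optional[int] = None      # default GroupID, or None ("no default") for a "0/1" group
    allow_none: bool = False           # True for the "0/1"-switch group special case

def resolve_switch_group(sg: SwitchGroup, requested: Optional[int] = None) -> Optional[int]:
    """Return the single member that stays enabled (None only for a '0/1' group)."""
    if requested is not None and requested in sg.members:
        return requested               # a valid user/preset selection wins
    if sg.default is not None:
        return sg.default              # otherwise fall back to the signaled default
    if sg.allow_none:
        return None                    # "0/1": it is allowed that no member is enabled
    return sg.members[0]               # else enable one member so that exactly one is active

# Example: a language switch group (hypothetical GroupIDs), English as default.
languages = SwitchGroup(members=[2, 8, 9], default=2)
print(resolve_switch_group(languages, requested=9))   # 9: user picked another language track
print(resolve_switch_group(languages))                # 2: default member
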


Interactivity control metadata: The metadata allows for the definition of different categories of user interactivity, e.g. to reflect the content creator’s view of the extent to which the artistic intent may be modified. The definition of ranges of interactivity allows the content creators to limit the interaction and adaptations (e.g. the position could only be changed in a range between an offset of −30° and 30° azimuth).

Signaling of special playback options: The metadata definition includes a possibility to signal that elements should be directly played back by a specified loudspeaker without any rendering. This can be used for traditional channel-based content that is treated as audio elements. It is also possible to route elements to the geometrically closest loudspeaker for a maximally discrete playback. The elements could for example contain the different participants of a teleconference meeting, which could be routed to discrete speakers instead of being rendered.

On/Off-status signaling: The metadata contains the default on/off-status of elements. This allows for embedding additional audio elements (e.g. additional commentaries or additional speech tracks) that are switched off by default.

Priority: The metadata model includes descriptors for the priority or importance of an element or a group of elements. This can be used in a renderer or a coding engine that can only handle a certain number of elements due to complexity reasons, e.g. for real-time playback. The signaling of the priority then allows determining which elements could be discarded.

4.1. Metadata Structure and Fields

The described features are reflected in the defined object-based metadata structure and metadata fields. The metadata definition contains all needed information for reproduction and rendering in flexible reproduction layouts and allows for extensibility. For a simple and efficient metadata definition, the following structure of the audio elements in an audio scene is defined:

• All elements (audio tracks / sound events) in an audio scene have to be members of a group.
• Groups can only contain elements, but not other groups.
• Switch groups can only contain groups.
• A group can contain elements of type “channels” (signals that should be played back by a specific loudspeaker configuration), “objects” or “SAOC” (signals that should be rendered to the reproduction loudspeaker layout and could be still or moving), or “HOA” (Higher Order Ambisonics signals). One group can only contain elements of one type.

A possible audio scene is depicted in Figure 3.

The metadata fields are described in Table 1. They are sorted into categories that reflect their functionality and are directly related to some of the mentioned use-cases and features.

4.2. Transport of Object-Based Metadata in the Bitstream

The defined object-based metadata is included in the MPEG-H 3D Audio bitstream to be transported in an encoder-decoder chain. For the encoding and transport of the metadata, it is distinguished between static metadata (constant over time) and dynamic metadata (changes over time).

The static object-based metadata is only transmitted once at the beginning of an audio file or on a regular basis, e.g. at random-access points. The bitstream syntax is designed in a bit-efficient manner and no coding scheme is applied here. For the audio scene from Figure 3 without any description of the groups, the static metadata takes up to 1.7 kBit per second (assuming a transmission every 0.5 seconds). If a description of 128 bytes length is added for each group and switch group, the size of the static metadata would be 20.5 kBit per second.

4.2.1. Transport of Dynamic Metadata

Only the metadata fields in the category “Dynamic Element Characteristics” in Table 1 may change dynamically over time. They describe the position of an element in 3D space, namely azimuth, elevation and radius/distance, its level (the energy of the element source), its spread (i.e. the energy distribution of an element in azimuth and elevation direction), and its dynamic priority. Because of the dynamic change, this metadata should be repeated at a high rate within the bitstream, e.g. every 2048 audio samples.


Audio Scene Information (defined once per audio scene)

NumGroups, NumSwitchGroups: number of groups and switch groups in an audio scene
RefScreenSizeAz: azimuth of a production/monitoring reference screen (half the screen width), optional
RefScreenSizeTopEl: top elevation of a reference screen, optional
RefScreenSizeBottomEl: bottom elevation of a reference screen, optional

Group Definition and Description (defined once per group)

GroupID, GroupNumMembers: unique identifier of an element group, number of members
GroupDescription: textual description of a group, optional
GroupMembers: list of ElementIDs of group members
GroupType: type of group elements (“channels”, “objects”, “SAOC” or “HOA”)
GroupPriority: static priority of a group of elements

Switch Group Definition and Description (defined once per switch group)

SwitchGroupID: unique identifier of a switch group
SwitchGroupDescription: textual description of a switch group, optional
SwitchGroupDefault: GroupID of the default member (including a reserved value for “no default” for “0/1”-switch groups)
SwitchGroupNumMembers: number of members of a switch group
SwitchGroupMembers: list of GroupIDs of switch group members

Content Data (defined once per group, optional)

ContentKind, ContentLanguage: kind (dialog, music, etc.) and language of content

Dynamic Element Characteristics (defined once per element, for members of “object” or “SAOC” groups only)

Azimuth, Elevation: azimuth and elevation angle of the element position
Radius: distance of the element from the reference point (sweet spot)
Gain: gain of the element, controls the energy of the element source
Spread: spread of the element in azimuth and elevation direction
DynamicPriority: dynamic priority of an audio element, optional

Interactivity Data and Playout (defined once per group)

AllowOnOff: flag, on/off-interactivity is allowed/not allowed
AllowPositionInteractivity: flag, change of position is allowed/not allowed
AllowGainInteractivity: flag, change of gain is allowed/not allowed
DefaultOnOff, DefaultGain: default on/off-status and default gain in dB
MinGain, MaxGain: minimum and maximum additive gain values in dB for interactivity
MinAzOffset, MaxAzOffset: minimum and maximum azimuth offsets for interactivity
MinElOffset, MaxElOffset: minimum and maximum elevation offsets for interactivity
MinDistFact, MaxDistFact: minimum and maximum distance factors for interactivity
AudioChannelLayout: loudspeaker layout for groups whose members should be directly routed to speakers, for “channels” groups only
ClosestSpeakerPlayout: flag, each group member should be played back by its nearest loudspeaker

Element Definition (defined once per element)

ElementID: unique identifier of an element
ScreenRelative: flag that defines if an element is screen-related

Table 1: Defined object-based metadata fields.
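
As a reading aid for Table 1, the following sketch models a small subset of the fields as plain data structures. It is only an illustration of how the fields relate to each other; the field names follow Table 1, but the classes, defaults and types are our own and are not the MPEG-H bitstream syntax.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DynamicElementCharacteristics:
    # Per element, for members of "object"/"SAOC" groups only (cf. Table 1).
    azimuth: float = 0.0             # degrees
    elevation: float = 0.0           # degrees
    radius: float = 1.0              # distance from the sweet spot
    gain: float = 1.0
    spread: float = 0.0
    dynamic_priority: Optional[int] = None

@dataclass
class Group:
    group_id: int
    group_type: str                                   # "channels", "objects", "SAOC" or "HOA"
    members: List[int] = field(default_factory=list)  # ElementIDs of group members
    priority: int = 0
    # Interactivity Data and Playout (per group), a subset of Table 1:
    allow_on_off: bool = False
    allow_gain_interactivity: bool = False
    allow_position_interactivity: bool = False
    default_on_off: bool = True
    default_gain_db: float = 0.0
    min_gain_db: float = 0.0
    max_gain_db: float = 0.0
    min_az_offset: float = 0.0
    max_az_offset: float = 0.0

@dataclass
class SwitchGroupDef:
    switch_group_id: int
    members: List[int] = field(default_factory=list)  # GroupIDs of member groups
    default_group_id: Optional[int] = None            # None models the "no default" value

@dataclass
class AudioScene:
    groups: List[Group] = field(default_factory=list)
    switch_groups: List[SwitchGroupDef] = field(default_factory=list)
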


Fig. 3: Groups and switch groups in an exemplary audio scene. Different dialog elements and voice over
elements are combined in two switch groups to signal their mutually exclusive relationship.

As this would result in a relatively high data rate if a high number of elements is present, a data compression method for the dynamic element metadata is utilized.

For random-access support, a full transmission of the complete set of dynamic element metadata happens on a regular basis, i.e. intracoded metadata. In-between full transmissions, only differential metadata is transmitted. Therefore, the time-variant data is quantized and downsampled and a coarse approximation of the change is determined. The difference between the original dynamic metadata and the linearly interpolated and upsampled version of the coarse approximation is analyzed.

A number of N consecutive difference values is used to approximate a polygon course that is formed by a variable number of quantized polygon points. The number of needed polygon points is on average significantly smaller than N. The polygon points are coded as small integer numbers with a low number of bits. The processing for decoding the dynamic metadata is illustrated in Figure 4.

Fig. 4: Decoding of the dynamic metadata by an interpolation in between the given polygon points. The dashed line depicts the first interpolation step, i.e. interpolation of intracoded metadata. The dotted line shows the decoded metadata.

4.2.2. Low Delay Coding of Dynamic Metadata

In addition to the described coding, a dynamic metadata encoding scheme with low latency is defined as a modified DPCM (Differential Pulse Code Modulation) procedure. In-between the full transmissions of the intracoded metadata, either an absolute value or a quantized difference value is transmitted at a regular time interval.

In case the data is transmitted differentially, the value of the metadata trajectory y at the time n can be calculated with the help of the differential data d at the time n:

    y[n] = y[n − 1] + d[n]   for n ∈ [1, N]

The intracoded metadata or the transmission of an absolute value resets the error that is implied by the quantization of the differential metadata.
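
A minimal sketch of this decoding rule, assuming a stream of (is_intra, value) tuples in which intracoded entries carry an absolute value and all other entries carry a quantized difference; the interface and names are illustrative, not the MPEG-H bitstream syntax.

from typing import Iterable, List, Tuple

def decode_dpcm_trajectory(packets: Iterable[Tuple[bool, float]]) -> List[float]:
    """Decode a metadata trajectory y from DPCM-style packets.

    Each packet is (is_intra, value): an intracoded packet resets the state to
    the absolute value, a differential packet applies y[n] = y[n-1] + d[n].
    """
    trajectory: List[float] = []
    y = 0.0
    for is_intra, value in packets:
        y = value if is_intra else y + value
        trajectory.append(y)
    return trajectory

# Example: intra frame at azimuth 30.0, two small quantized steps, then another
# intra frame that resets the accumulated quantization error.
stream = [(True, 30.0), (False, 1.5), (False, 1.5), (True, 33.2)]
print(decode_dpcm_trajectory(stream))  # [30.0, 31.5, 33.0, 33.2]
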


4.3. Loudness Metadata

Besides the object-based audio metadata, loudness metadata can also be transmitted in MPEG-H 3D Audio. A so-called “Loudness Info Set” can contain loudness information for different content contexts: the whole audio scene (default), a group of elements or a defined combination of groups (so-called “Presets”, see below). The loudness metadata in MPEG-H 3D Audio is directly related to the object-based audio metadata because a GroupID or a PresetID is used to identify the context in which a Loudness Info Set is applicable.

A choice of optionally transmittable loudness and peak information is described in Table 2.

5. RELATIONSHIP TO OTHER OBJECT-BASED AUDIO FORMATS

So far, no common definition exists for how the object-based metadata has to be structured or what exact information has to be included. Several approaches are known to describe object-based content in audio production.

The “Audio Definition Model” (ADM, [14]) defines a formalized description of audio content. ADM uses XML to represent the object-based metadata. The metadata can be embedded in Broadcast Wave Format (BWF) files. ADM also describes content (language, loudness, etc.) and format of the audio material, but it does not include any control information, e.g. switch group logic for mutually exclusive relationships [14]. In ADM no bitstream syntax or coding is defined, as it is intended for file-based production.

The object-based metadata definition of MPEG-H 3D Audio is defined in a way that it supports the main features of ADM. A scene description in ADM can therefore be transferred to MPEG-H object-based metadata. The same is expected for other production formats currently developed or standardized.

6. USER INTERACTION WITH OBJECT-BASED AUDIO IN MPEG-H

The object-based metadata is defined in a way such that content creators can configure the user interactivity. Therefore, different user interaction categories on the level of element groups are defined:

On-Off Interactivity: If the AllowOnOff flag of an element group is equal to one, the group can interactively be switched on or off. The content of the group is then either played back or discarded, dependent on the on/off-status of the group.

Gain Interactivity: If the AllowGainInteractivity flag of an element group is equal to one, the level/gain of the group of elements can interactively be changed. The amount of possible gain change is restricted by the metadata fields MinGain and MaxGain.

Position Interactivity: If the AllowPositionInteractivity flag of an element group is equal to one, the position of the members of the group of objects can interactively be changed. The change of position is defined as an elevation offset, an azimuth offset and a distance factor. The ranges for the two offset values are restricted by the corresponding metadata fields MinElOffset, MaxElOffset, MinAzOffset and MaxAzOffset; the distance change is restricted by MinDistFact and MaxDistFact.

6.1. User Interaction by Presets

If several interactive element groups are available, not all combinations are meaningful. Also, a simplified user interface would need a sensible preselection of interactive elements. Therefore a basic interaction mode is defined, in which one preset out of a number of defined presets can be chosen. Presets define a combination of groups in an audio scene. A preset has a specified unique PresetID and contains a list of GroupIDs and an associated on/off-status for each of these groups (the preset’s conditions). With presets, a content creator or an application can provide a restricted number of meaningful options to the user.


Sample Peak Level: level of the maximum sample magnitude.
True Peak Level: true peak level as defined in [15].
Program Loudness Level: overall loudness of an audio program as defined in [16].
Anchor Loudness Level: loudness of an anchor element in an audio program (usually the dialog) as defined in [16].
Loudness Range: loudness range according to [17].
Maximum of the Loudness Range: 95th percentile of the loudness range [17].
Maximum Momentary Loudness: maximum of the loudness measured in a 0.4 s window [18].
Maximum Short Term Loudness: maximum of the loudness measured in a 3 s window [18].

Table 2: Loudness metadata in MPEG-H 3D Audio.

For example, the producer could define a preset for “English Language without Voice Over”. In the exemplary audio scene (see Figure 3) the desired behavior would be the following: The “Effects” group and the “Atmosphere” group should always be on. The “Dialog English” group should always be on, and the “Voice Over French” and “Voice Over Spanish” groups should always be off. The on/off-statuses of these groups should not be changeable by the user. For this case, the preset definition would contain the following conditions:

    OnOff(GroupID = 2) = 1
    OnOff(GroupID = 3) = 1
    OnOff(GroupID = 4) = 1
    OnOff(GroupID = 6) = 0
    OnOff(GroupID = 7) = 0

The described preset and two more exemplary preset definitions are depicted in Figure 5.

Fig. 5: Possible preset definitions for the exemplary audio scene.

Groups that are not referenced in the preset are marked as “free to choose by user”. The metadata for the presets in Figure 5 takes another 6.64 kBit per second when transmitted every 0.5 seconds (a preset description of 128 bytes length assumed). Presets can be used as references in the loudness metadata, as described above, to define a content context for a loudness information set. In addition to the basic interaction mode, an advanced interactivity mode can be defined, where the user is given full control of all present groups within the limits of the “Interactivity Data and Playout” metadata.

6.2. Processing of Metadata and User Interaction Data in the Decoding Process

For processing any user interaction, the desired modifications have to be taken into account in the decoder. The dedicated metadata fields have to be evaluated and interaction data coming from a user interface or an application has to be processed. This processing can happen in a separate decoder module. An “Element Metadata Preprocessor” prepares the elements for rendering, such that the rendering is agnostic of the interaction. An overview of the processing steps in the Element Metadata Preprocessor is given in Figure 6.

6.2.1. Initialization

First, the preprocessor determines which groups should be played back and which ones can be discarded. This is defined by their on/off-status (e.g. defined by a chosen preset) and the implicit logic of the switch group definitions. Then, the relevant groups are further processed.
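
A compact sketch of this initialization step: a preset's conditions override the default on/off-statuses, and the implicit switch-group logic then keeps at most one member per switch group enabled. The function name, the dict-based interface and the precedence rules are our illustrative reading of the text, not normative MPEG-H behavior, and the GroupIDs in the example are hypothetical.

from typing import Dict, List

def initialize_active_groups(
    default_on_off: Dict[int, bool],       # GroupID -> DefaultOnOff
    preset_conditions: Dict[int, bool],    # GroupID -> required on/off-status from the preset
    switch_groups: List[List[int]],        # each entry: member GroupIDs of one switch group
) -> List[int]:
    """Return the GroupIDs that are played back after initialization."""
    # Preset conditions override the default on/off-status of a group.
    status = dict(default_on_off)
    status.update(preset_conditions)

    # Implicit switch-group logic: keep at most one enabled member per switch group.
    for members in switch_groups:
        enabled = [g for g in members if status.get(g, False)]
        for g in enabled[1:]:              # disable all but the first enabled member
            status[g] = False

    return [g for g, on in status.items() if on]

# Example for the "English Language without Voice Over" preset (hypothetical IDs):
defaults = {2: True, 3: True, 4: True, 6: False, 7: False, 8: False}
preset = {2: True, 3: True, 4: True, 6: False, 7: False}
switch_groups = [[2, 8], [6, 7]]           # dialog languages and voice over, as in Figure 3
print(initialize_active_groups(defaults, preset, switch_groups))  # [2, 3, 4]
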


Fig. 6: Preprocessing of metadata and user interaction data.

6.2.2. Screen-Related Processing

The positions of all screen-related elements are remapped according to the reproduction screen-size as an adaptation to the reproduction room. If no reproduction screen-size information is given or no screen-related element exists, no remapping is applied.

The remapping is defined by linear mapping functions that take into account the reproduction screen-size information in the playback room and the screen-size information of a reference screen, e.g. used in the mixing and monitoring process. The azimuth mapping function is depicted in Figure 7. It is defined such that the azimuth values between the left edge and the right edge of the reference screen are mapped (compressed or expanded) to the interval between the left edge and the right edge of the reproduction screen. Other azimuth values are compressed or expanded, such that the whole range of values is covered.

Fig. 7: Mapping function of azimuth angles.

The elevation mapping function is defined accordingly. The screen-related processing can also take into account a zooming area for zooming into high-resolution video content. Screen-related processing is only defined for elements that are accompanied by dynamic position data and that are labeled as screen-related.
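
The following sketch shows one way to realize such a piecewise linear azimuth remapping, assuming symmetric screens described only by their half-width azimuth and a working range of ±180°. It is a hedged illustration of the mapping principle shown in Figure 7, not the normative MPEG-H mapping function, and the parameter values in the example are chosen only for demonstration.

def remap_azimuth(az: float, ref_half_width: float, repro_half_width: float) -> float:
    """Piecewise linear remapping of an azimuth angle (degrees).

    Azimuths inside the reference screen (|az| <= ref_half_width) are scaled to the
    reproduction screen; azimuths outside it are compressed or expanded so that
    +/-180 degrees still maps to +/-180 degrees.
    """
    sign = 1.0 if az >= 0 else -1.0
    a = abs(az)
    if a <= ref_half_width:
        # On-screen region: reference screen edge -> reproduction screen edge.
        mapped = a * repro_half_width / ref_half_width
    else:
        # Off-screen region: stretch the remainder so that the whole range is covered.
        mapped = repro_half_width + (a - ref_half_width) * (180.0 - repro_half_width) / (180.0 - ref_half_width)
    return sign * mapped

# Example: a reference screen of +/-29 degrees, playback on a smaller +/-15 degree screen.
print(remap_azimuth(29.0, ref_half_width=29.0, repro_half_width=15.0))  # 15.0 (screen edge)
print(remap_azimuth(90.0, ref_half_width=29.0, repro_half_width=15.0))  # ~81.7 (off-screen, compressed)
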


6.2.3. Gain and Position Interaction

The gain interaction and the position interaction are applied next. Output gain values are calculated for all elements, taking into account the interactivity gain modification, the DefaultGain of the groups and the dynamic gains if available. Position interactivity is only defined for groups with associated dynamic position metadata. Output azimuth, elevation and distance values are determined for each element of these groups, taking into account the interactivity modification and the values from the dynamic position metadata.

6.2.4. Closest Speaker Playout Processing

As a next step, the closest speaker playout processing is conducted. If the ClosestSpeakerPlayout flag is equal to one, the position of the closest existing loudspeaker is determined for each member of the group. Therefore, a distance measure for two positions P1 and P2 in a spherical coordinate system is defined as the sum of the absolute differences of their azimuth angles ϕ and elevation angles θ:

    ∆(P1, P2) = |θ1 − θ2| + |ϕ1 − ϕ2|

The closest loudspeaker Pnearest is the one of the N speakers in the reproduction loudspeaker layout for which the distance to the desired position Pd of the audio element is minimal:

    Pnearest = arg min(∆(Pd, P1), . . . , ∆(Pd, PN))

This processing is also only meaningful and defined for groups of elements with dynamic position data.

6.2.5. Routing

After this processing step, the groups and the needed metadata are routed to the renderer modules (format converter, object renderer, SAOC-3D decoder, HOA renderer) depending on their GroupType and the ClosestSpeakerPlayout flag.

7. SUMMARY AND CONCLUSIONS

A metadata scheme for object-based audio content was developed for an optimized listening experience, taking into account several use-cases for object-based interactive audio and for consumer applications like TV.

The object-based metadata scheme allows describing channel-, object- and scene-based audio and their combination as audio elements. General characteristics like the position in 3D space, screen-relation, content characteristics (type of content, language), interaction possibilities and the relation with other audio elements (groups and switch groups) can be described.

The metadata scheme is defined in such a way that audio scene descriptions from object-based production formats can be transferred to it. It uses mechanisms and fields that are adapted to the ones in existing object-based formats to allow for interoperability.

The defined dynamic metadata can be transmitted efficiently in a bitstream by application of a dedicated coding scheme. Movements of objects can therefore be reconstructed with a high accuracy in time and space without the consumption of a high data rate.

The described metadata definition and coding has been adopted and integrated into the MPEG-H 3D Audio standard [12], which is designed to transport object-based interactive audio in combination with UHDTV/4k video.

8. REFERENCES

[1] ITU-R, “Recommendation ITU-R BS.775-2: Multichannel stereophonic sound system with and without accompanying picture”, 2006.

[2] Rumsey, F., “Spatial Audio”, Focal Press, Oxford, 2001.

[3] Gerzon, M. A., “Periphony: With-Height Sound Reproduction”, Journal of the Audio Engineering Society, Vol. 21, Issue 1, pp. 3-10, 1973.

[4] Ahrens, J., “Analytic Methods of Sound Field Synthesis”, T-Labs Series in Telecommunication Services, Springer, Berlin, Heidelberg, 2012.

[5] Trivi, J.-M. and Lemordant, J., “Use of a 3-D Positional Interface for the Implementation of a Versatile Graphical Mixing Console”, 107th AES Convention, New York, USA, 1999.

[6] Fuchs, H. et al., “Dialog Enhancement - Enabling User Interactivity With Audio”, NAB, Las Vegas, USA, 2012.

[7] ISO/IEC JTC1/SC29/WG11 N14747, “Text of ISO/MPEG 23008-3/DIS 3D Audio”, Sapporo, Japan, July 2014.

[8] Neuendorf, M. et al., “The ISO/MPEG Unified Speech and Audio Coding Standard - Consistent High Quality for All Content Types and at All Bit Rates”, Journal of the Audio Engineering Society, Vol. 61, Issue 12, pp. 956-977, 2013.

[9] Pulkki, V., “Virtual sound source positioning using vector base amplitude panning”, Journal of the Audio Engineering Society, Vol. 45, Issue 6, pp. 456-466, 1997.

[10] Herre, J. et al., “MPEG Spatial Audio Object Coding - The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes”, Journal of the Audio Engineering Society, Vol. 60, Issue 9, pp. 655-673, 2012.

[11] ISO/IEC, “MPEG-D (MPEG audio technologies), Part 2: Spatial Audio Object Coding”, International Standard ISO/IEC 23003-2:2010.


[12] Herre, J. et al., “MPEG-H Audio - The New Standard for Universal Spatial / 3D Audio Coding”, 137th AES Convention, Los Angeles, USA, 2014.

[13] Herre, J. et al., “MPEG-H Audio - The Upcoming Standard for Universal Spatial / 3D Audio Coding”, International Conference on Spatial Audio (ICSA), Erlangen, Germany, 2014.

[14] European Broadcasting Union, “Audio Definition Model - Metadata Specification”, EBU-Tech 3364 (Version 1.0), January 2014.

[15] ITU-R, “Recommendation ITU-R BS.1770-3: Algorithm to measure audio programme loudness and true-peak audio level”, August 2012.

[16] ISO/IEC, “Information technology - MPEG systems technologies - Part 8: Coding-independent code points”, International Standard ISO/IEC 23001-8.

[17] European Broadcasting Union, “Loudness Range: A measure to supplement loudness normalisation in accordance with EBU R 128”, EBU-Tech 3342, Geneva, Switzerland, 2011.

[18] ITU-R, “Recommendation ITU-R BS.1771-1: Requirements for loudness and true-peak indicating meters”, January 2012.
