
Audio Engineering Society

Convention Paper 9095


Presented at the 137th Convention
2014 October 9–12 Los Angeles, USA

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least
two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been
reproduced from the author's advance manuscript without editing, corrections, or consideration by the Review Board. The AES
takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio
Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved.
Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio
Engineering Society.

MPEG-H Audio - The New Standard for Universal Spatial / 3D Audio Coding

Jürgen Herre¹, Johannes Hilpert², Achim Kuntz¹, and Jan Plogsties²

¹ International Audio Laboratories Erlangen, Erlangen, Germany
  A Joint Institution of Universität Erlangen-Nürnberg and Fraunhofer IIS

² Fraunhofer IIS, Erlangen, Germany

ABSTRACT

Recently, a new generation of spatial audio formats was introduced that includes elevated loudspeakers and surpasses
traditional surround sound formats, such as 5.1, in terms of spatial realism. To facilitate high-quality, bitrate-efficient
distribution and flexible reproduction of 3D sound, the MPEG standardization group recently started the MPEG-H
Audio Coding development for the universal carriage of encoded 3D sound from channel-based, object-based and
HOA-based input. High quality reproduction is supported for many output formats from 22.2 and beyond down to
5.1, stereo and binaural reproduction - independently of the original encoding format, thus overcoming
incompatibility between various 3D formats. The paper describes the current status of the standardization project
and provides an overview of the system architecture, its capabilities and performance.

1. INTRODUCTION

The faithful reproduction of the spatial aspects of recorded sound has been an ongoing topic for a very long time, starting with two-channel stereophony [1,2], continuing with multi-channel ('surround') sound reproduction [3,4,5], Ambisonics [6] and wavefield synthesis (WFS) [7]. While the vast majority of proposed technologies have been using a number of loudspeakers that surround the listener(s) within a horizontal plane, there recently has been a significant move towards adding 'height' or 'lower' loudspeakers above or below the listener's head in order to create an even more enveloping and realistic spatial sound experience. Typical examples of such '3D' loudspeaker setups include 7.1 with two height channels [8], 9.1 [9] and 22.2 [10]. While such loudspeaker setups clearly can deliver higher spatial fidelity than the established 5.1 setup [11,12,13,14], the adoption of 3D setups poses a number of challenges for production, distribution and rendering:
Herre et al. MPEG-H Universal Spatial / 3D Audio Coding

• How can the sound engineer/Tonmeister make best possible use of 3D loudspeaker setups? The answer to this may very well require a learning process similar to that at the transition from stereo to 5.1.

• In contrast to the traditional 2D surround world, where 5.1 is an established standard for content production, distribution and rendering, there is a plethora of concurrent proposals for 3D loudspeaker setups competing in the market. It currently seems quite unclear whether one predominant format will evolve which eventually can serve - similar to 5.1 for 2D - as a common denominator for content production, digital media and consumer electronics to create a thriving new market.

• How can 3D audio content be distributed efficiently and with the highest quality, such that existing distribution channels (possibly including wireless links) and media can carry the new content?

• How can consumers and consumer electronics manufacturers accept these new formats, given that many consumers may be willing to install just a single 3D loudspeaker setup with a limited number of speakers in their living room? Can they, nonetheless, enjoy content that was produced for, say, 22.2 channels?

Based on such considerations, the ISO/MPEG standardization group has initiated a new work item to address aspects of bitrate-efficient distribution, interoperability and optimal rendering by the new ISO/MPEG-H 3D Audio standard.

This paper describes a snapshot of MPEG-H 3D Audio Reference Model technology [15] as of the writing of this paper, i.e. after the 109th MPEG meeting in July 2014. It is structured as follows: given that coding of multi-channel/surround sound has been present in MPEG Audio for a considerable time, the existing MPEG Audio technology in this field is briefly introduced. Then, the MPEG-H 3D Audio work item is explained and the MPEG-H Reference Model architecture and technology are outlined. Finally, we show results of some recent performance evaluations of the new technology, followed by a number of expected or possible further developments of the Reference Model.

2. PREVIOUS MPEG AUDIO MULTI-CHANNEL CODING TECHNOLOGY

The first commercially used multi-channel audio coder standardized by MPEG, in 1997, is MPEG-2 Advanced Audio Coding (AAC) [16,17], delivering EBU broadcast quality at a bitrate of 320 kbit/s for a 5.1 signal. A significant step forward was the definition of MPEG-4 High Efficiency AAC (HE-AAC) [18] in 2002/2004, which combines AAC technology with bandwidth extension and parametric stereo coding, and thus allows for full audio bandwidth also at lower data rates. For carriage of 5.1 content, HE-AAC delivers quality comparable to that of AAC at a bitrate of 160 kbit/s [19]. Later MPEG standardizations provided generalized means for parametric coding of multi-channel spatial sound: MPEG-D MPEG Surround (MPS, 2006) [20,21] and MPEG-D Spatial Audio Object Coding (SAOC, 2010) [22,23] allow for the highly efficient carriage of multi-channel sound and object signals, respectively. Both codecs can be operated at lower rates (e.g. 48 kbit/s for a 5.1 signal). Finally, MPEG-D Unified Speech and Audio Coding (USAC, 2012) [24,25] combined enhanced AAC coding with state-of-the-art full-band speech coding into an extremely efficient system, allowing carriage of e.g. good-quality mono signals at bitrates as low as 8 kbit/s. Incorporating advances in joint stereo coding, USAC is capable of delivering further enhanced performance compared to HE-AAC also for multi-channel signals.

For the definition of MPEG-H 3D Audio, it was strongly encouraged to re-use these existing MPEG technology components to address the coding (and, partially, rendering) aspects of the envisioned system. In this way, it was possible to focus the MPEG-H 3D Audio development effort primarily on delivering the missing functionalities rather than on addressing basic coding/compression issues.

3. THE MPEG-H 3D AUDIO WORK ITEM

Dating back to early 2011, initial discussions on 3D Audio at MPEG were triggered by the investigation of video coding for devices whose capabilities are beyond those of current HD displays, i.e. Ultra-HD (UHD) displays with 4K or 8K horizontal resolution. With such displays a much closer viewing distance is feasible and the display may fill 55 to 100 degrees of the user's field of view, such that there is a greatly enhanced sense of visual envelopment. To complement this technology vision with an appropriate audio component, the notion of 3D audio, including elevated (and possibly lower)

AES 137th Convention, Los Angeles, USA, 2014 October 9–12


Page 2 of 12

speakers was explored, eventually leading to a 'Call for Proposals' (CfP) for such 3D Audio technologies in January 2013 [26]. The CfP document specified requirements and application scenarios for the new technology, together with a development timeline and a number of operating points at which the submitted technologies should demonstrate their performance, ranging from 1.2 Mbit/s down to 256 kbit/s for a 22.2 input. The output was to be rendered on various loudspeaker setups from 22.2 down to 5.1, plus binauralized rendering for virtualized headphone playback. The CfP also specified that evaluation of submissions would be conducted independently for two accepted input content types, i.e. 'channel and object (CO) based input' and 'Higher Order Ambisonics (HOA)'. At the 105th MPEG meeting in July/August 2013, Reference Model technology was selected from the received submissions (4 for CO and 3 for HOA) based on their technical merits, to serve as the baseline for further collaborative technical refinement of the specification. Specifically, the winning technology came from Fraunhofer IIS (CO part) and Technicolor/Orange Labs (HOA part). Both parts were subsequently merged into a single harmonized system. Further improvements are on the way, e.g. for binaural rendering. The final stage of the specification, i.e. the International Standard, is anticipated to be issued at the 111th MPEG meeting in February 2015.

4. THE MPEG-H REFERENCE MODEL

4.1. General Features and Architecture

MPEG-H 3D Audio has been designed to meet requirements for delivery of next-generation audio content to the user, ranging from highest-quality cable and satellite TV down to streaming to mobile devices. The main features that make MPEG-H 3D Audio applicable to this wide range of applications and the different associated playback scenarios are outlined in the following sections.

4.1.1. Flexibility with regard to input formats

A future-proof audio system has to accept multiple formats that are or will become established in music, movie production and broadcast. Generally, multi-channel and 3D audio content falls into the following categories:

• Channel-based: Traditionally, spatial audio content (starting from simple two-channel stereo) has been delivered as a set of channel signals which are designated to be reproduced by loudspeakers in a precisely defined, fixed target location relative to the listener.

• Object-based: More recently, the merits of object-based representation of a sound scene have been embraced by sound producers, e.g. to convey sound effects like the fly-over of a plane or spaceship. Audio objects are signals that are to be reproduced so as to originate from a specific target location that is specified by associated side information. In contrast to channel signals, the actual placement of audio objects can vary over time and is not necessarily pre-defined during the sound production process, but is determined by rendering the objects to the target loudspeaker setup at the time of reproduction. This may also include user interactivity.

• Higher Order Ambisonics (HOA) is an alternative approach to capturing a 3D sound field by transmitting a number of 'coefficient signals' that have no direct relationship to channels or objects.

The following text discusses the role of these format types in the context of MPEG-H 3D Audio.

Channel-based 3D Audio formats:

The improvement offered by 3D sound over traditional 5.1 or 7.1 systems is substantial, since the spatial realism is significantly enhanced by the reproduction of sound from above. Also, 3D formats offer the ability to localize on-screen sounds vertically, which will become more important as viewing angles increase with the transition to 4K and 8K video. Figure 1 shows the results of a subjective listening test comparing the overall sound quality obtained from 3D systems in comparison to today's stereo and 5.1 formats.

In MPEG-H 3D Audio, the most popular channel-based formats are listed directly in the MPEG specification. Beyond this, other alternative production formats are addressed by including more advanced, flexible signalling mechanisms, thus ensuring future-proofness.
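As an illustration, the three content categories could be modeled with container types like the following. This is only a sketch with invented names and fields; the actual MPEG-H bitstream syntax is defined in the specification and differs from this.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative containers for the three MPEG-H input categories described
# above; all names and fields are assumptions made for this sketch.

@dataclass
class ChannelBed:
    layout: str                      # e.g. "5.1", "9.1", "22.2"
    signals: List[List[float]]       # one signal per fixed speaker position

@dataclass
class AudioObject:
    signal: List[float]              # mono object signal
    azimuth: List[float]             # time-varying position metadata (deg)
    elevation: List[float]
    gain: List[float]                # time-varying object gain

@dataclass
class HOAContent:
    order: int                       # Ambisonics order N
    coefficients: List[List[float]]  # (N+1)^2 coefficient signals

@dataclass
class AudioScene:
    beds: List[ChannelBed] = field(default_factory=list)
    objects: List[AudioObject] = field(default_factory=list)
    hoa: List[HOAContent] = field(default_factory=list)
```

A single scene may mix all three categories, which is exactly the "universal input" property the standard targets.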


Figure 1 - Overall sound quality impression on a MUSHRA scale from 0 to 100, relative to a 22.2 reference, with increasing number of reproduction channels from stereo to surround and immersive / 3D formats; for further details see [11].

Audio Objects:

Using audio objects, or embedding objects as additional audio tracks inside channel-based audio productions and broadcasts, opens up a range of new applications. Inside an MPEG-H 3D Audio bitstream, objects can be embedded that can be selected by the user during playback. Objects allow consumers to have personalized playback options, ranging from simple adjustments (such as increasing or decreasing the level of an announcer's commentary or an actor's dialogue relative to the other audio elements) to conceivable future broadcasts where several audio elements may be adjusted in level or position to tailor the audio playback experience to the user's liking, as illustrated in the following Figure.

Figure 2 - Thought example of a future interactive American football broadcast

Moreover, audio objects such as dialogue can be controlled individually in terms of their dynamic range, which ensures best intelligibility and supports dedicated reproduction for hearing-impaired listeners.

The notion of objects also allows accurate spatial reproduction of sounds in different playback scenarios. To this end, object metadata that describes the geometric position of the sound sources contained in the objects can be embedded in the bitstream. The MPEG-H decoder contains an object renderer that maps the object signals to loudspeaker feeds based on the metadata and the locations of the loudspeakers in the user's home. As a result, controlled positioning of sounds can be achieved for regular or unconventional loudspeaker setups, e.g. to align sounds with visual objects on the screen.

HOA:

The concept of Higher Order Ambisonics (HOA) provides a way to capture a sound field with a multi-capsule microphone. Manipulating and rendering such signals requires a simple matrix operation, which will not be discussed in detail in this publication. In addition to channels and objects, HOA content can also be carried in MPEG-H 3D Audio.

4.1.2. Flexibility with regard to reproduction

For audio production and monitoring, the setup of loudspeakers is well defined and established in practice for stereo and 5.1. However, in consumer homes, loudspeaker setups are typically "unconventional" in terms of non-ideal placement and differ regarding the number of speakers. Within MPEG-H 3D Audio, flexible rendering to different speaker layouts is implemented by a format converter that adapts the content format to the actual real-world speaker setup available on the playback side, to provide an optimum user experience under the given conditions. For well-defined formats, specific downmix metadata can be set at the encoder to ensure downmix quality, e.g. when playing back 9.1 content on a 5.1 or stereo playback system.

It is foreseeable that media consumption is moving further towards mobile devices, with headphones being the primary way to play back audio. Therefore, a binaural rendering component was included in the


MPEG-H 3D Audio decoder for dedicated rendering on headphones, with the aim of conveying the spatial impression of immersive audio productions also on headphones.

Figure 3 shows an overview of an MPEG-H 3D Audio decoder, illustrating all major building blocks of the system:

• As a first step, all transmitted audio signals, be they channels, objects or HOA components, are decoded by an extended USAC stage (USAC-3D).

• Channel signals are mapped to the target reproduction loudspeaker setup using a format converter.

• Object signals are rendered to the target reproduction loudspeaker setup by the object renderer, using the associated object metadata.

• Alternatively, signals coded via an extended Spatial Audio Object Coding (SAOC-3D), i.e. parametrically coded channel signals and audio objects, are rendered to the target reproduction loudspeaker setup using the associated metadata.

• Higher Order Ambisonics content is rendered to the target reproduction loudspeaker setup using the associated HOA metadata.

In the following, the main technical components of the MPEG-H 3D Audio decoder/renderer are described.

4.2. USAC-3D Core Coder and Extensions

The MPEG-H 3D Audio codec architecture is built upon a perceptual codec for compression of the different input signal classes, based on MPEG Unified Speech and Audio Coding (USAC) [24]. USAC is the state-of-the-art MPEG codec for compression of mono to multi-channel audio signals at rates of 8 kbit/s per channel and higher. For the new requirements that arose in the context of 3D audio, this technology has been extended by tools that especially exploit the perceptual effects of 3D reproduction and thereby further enhance the coding efficiency.

The most prominent enhancements are:

• A Quad Channel Element that jointly codes a quadruple of input channels. In a 3D context, inter-channel redundancies and irrelevancies can be exploited in both horizontal and vertical directions. Parametric coding of vertically aligned channel pairs can be carried out, while binaural unmasking effects [27] can be avoided in the horizontal plane.

• An enhanced noise filling provided through Intelligent Gap Filling (IGF). IGF is a tool that parametrically restores portions of the transmitted spectrum using suitable information from spectral tiles that are adjacent in frequency and time. The assignment and the processing of these spectral tiles is controlled by the encoder based on an input signal analysis. Hereby, spectral gaps can be filled with spectral coefficients that are perceptually a better match than the pseudo-random noise sequences of conventional noise filling would provide.

Apart from these enhancements in coding efficiency, the USAC-3D core is equipped with new signaling mechanisms for 3D content/loudspeaker layouts and for the type of signals in the compressed stream (audio channel vs. audio object vs. HOA signal).

Another new aspect in the design of the compressed audio payload is an improved behavior for instantaneous rate switching or fast cue-in, as it appears in the context of MPEG Dynamic Adaptive Streaming (DASH) [28]. For this purpose, so-called 'immediate playout frames' have been added to the syntax that enable gapless transitions from one stream to another. This is particularly advantageous for adaptive streaming over IP networks.
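The IGF principle described above can be illustrated with a toy sketch: coefficients quantized to zero in a high-frequency band are replaced by an energy-adjusted copy of a lower-frequency source tile instead of pseudo-random noise. Tile selection and the transmitted band energy are encoder-controlled in MPEG-H; both are simplified assumptions here, and the function name is invented.

```python
import numpy as np

def igf_fill(spectrum, target, source, band_energy):
    """Toy gap filling: spectrum is a 1-D array of transform coefficients,
    target and source are equal-length slices, band_energy is the desired
    energy of the filled bins (as if transmitted in the bitstream)."""
    out = spectrum.copy()
    tile = out[source].copy()
    gap = out[target] == 0.0                # fill only zero-quantized bins
    tile_energy = float(np.sum(tile[gap] ** 2))
    scale = np.sqrt(band_energy / tile_energy) if tile_energy > 0 else 0.0
    # copy the scaled tile into the gaps, leave surviving bins untouched
    out[target] = np.where(gap, scale * tile, out[target])
    return out
```

Because the fill material is drawn from the signal's own spectrum, tonal and noise-like structure is preserved in a way that plain noise substitution cannot achieve.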


Figure 3. Top level block diagram of MPEG-H 3D Audio decoder
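The decoder flow of Figure 3 can be sketched structurally as follows: each decoded signal class goes through its dedicated rendering stage, and the resulting loudspeaker feeds are summed. The dict layout and renderer callables are assumptions of this sketch, not the MPEG-H API.

```python
import numpy as np

def render_scene(decoded_signals, renderers, target_layout):
    """decoded_signals: dict mapping signal kind -> decoded signal array(s);
    renderers: dict mapping kind -> callable(signals, layout) that returns
    speaker feeds of shape (n_out, n_samples). Feeds are mixed by summation."""
    mix = None
    for kind, signals in decoded_signals.items():
        feeds = renderers[kind](signals, target_layout)
        mix = feeds if mix is None else mix + feeds   # sum speaker feeds
    return mix
```

In the real decoder the "renderers" correspond to the format converter, the object renderer, the SAOC-3D renderer and the HOA renderer described below.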

4.3. CO Decoding and Rendering

4.3.1. Format converter

The MPEG-H decoder comprises a so-called "format converter" module that converts the decoded raw channel signals to numerous output formats, i.e. for rendering on different loudspeaker setups. This processing block renders high-quality downmixes, for example, when playing back a 22.2-channel program on a 5.1 surround reproduction loudspeaker setup. To produce high output signal quality, the format converter in MPEG-H 3D Audio provides the following features:

• Automatic generation of optimized downmix matrices, taking into account non-standard loudspeaker positions.

• Support for optionally transmitted downmix matrices to preserve the artistic intent of a producer or broadcaster.

• Application of equalizer filters for timbre preservation.

• An advanced active downmix algorithm to avoid downmixing artefacts.

Within the format converter module, there are two major building blocks, i.e. a rules-based initialization block and the active downmix algorithm. Both are described in the following.

Format converter initialization

The first sub-module derives optimized downmix coefficients mapping the channel configuration of the format converter input to the output loudspeaker layout. During the initialization, the system iterates through a set of tuned mapping rules for each input channel. Each rule defines the rendering of one input channel to one or more output channels, potentially complemented by an equalizer curve that is to be applied if the particular mapping rule has been selected. The iteration is terminated at the first rule for which the required output channels are available in the reproduction setup, thus selecting the particular mapping rule. Since the mapping rules have been ordered according to the anticipated mapping quality during the definition of the rules, this process results in selection of the highest-quality


mapping to the loudspeaker channels that are available in the reproduction setup.

The rules have been designed individually for each potential input channel, incorporating expert knowledge, e.g. to avoid excessive use of phantom sources when rendering to the available target loudspeakers. Thus the rules-based generation of downmix coefficients allows for a flexible system that can adapt to different input/output configurations, while at the same time ensuring a high output signal quality by making use of the expert knowledge contained in the mapping rules. Note that the initialization algorithm compensates for non-standard loudspeaker positions of the reproduction setup, aiming at the best reproduction quality even for asymmetric loudspeaker setups.

Active downmix algorithm

Once the downmix coefficients have been derived, they are applied to the input signals in the actual downmix process. MPEG-H 3D Audio uses an advanced active downmix algorithm to avoid downmix artefacts like signal cancellations or comb-filtering that can occur when combining (partially) correlated input signals in a passive downmix, i.e. when linearly combining the input signals weighted with static gains. Note that high signal correlations between 3D audio signals are quite common in practice, since a large portion of 3D content is typically derived from 2D legacy content (or 3D content with smaller loudspeaker setups), e.g. by filling the additional 3D channels with delayed and filtered copies of the original signals.

The active downmix in the MPEG-H 3D Audio decoder adapts to the input signals in two ways to avoid the issues outlined above for passive downmix algorithms: Firstly, it measures the correlation properties between input channels that are subsequently combined in the downmix process and aligns the phases of individual input channels if necessary. Secondly, it applies a frequency-dependent energy normalization to the downmix gains that preserves the energy of the input signals that have been weighted by the downmix coefficients. The active downmix algorithm is designed such that it leaves uncorrelated input signals untouched, thus eliminating the artefacts that occur in passive downmixes with only minimal signal adjustments.

4.3.2. Object renderer

In MPEG-H 3D Audio, transmitted metadata allows for rendering audio objects into predefined spatial positions. Time-varying position data enables the rendering of objects on arbitrary trajectories. Additionally, time-varying gains can be signaled individually for each audio object. An overview of MPEG-H audio metadata is provided in [32].

The object renderer applies Vector Base Amplitude Panning (VBAP, [29]) to render the transmitted audio objects to the given output channel configuration. As input the renderer expects:

- Geometry data of the target rendering setup.

- One decoded audio stream per transmitted audio object.

- Decoded object metadata associated with the transmitted objects, e.g. time-varying position data and gains.

As presented in the following, VBAP relies on a triangulation of the 3D surface surrounding the listener. The MPEG-H 3D Audio object renderer thus provides an automatic triangulation algorithm for arbitrary target configurations. Since not all target loudspeaker setups are complete 3D setups, e.g. most setups lack loudspeakers below the horizontal plane, the triangulation introduces imaginary loudspeakers to provide complete 3D triangle meshes for any setup to the VBAP algorithm.

The MPEG-H 3D Audio object rendering algorithm performs the following steps to render the transmitted audio objects to the selected target setup:

• Search for the triangle that the current object position falls into.

• Build a vector base L = [l1, l2, l3] out of the three unit vectors pointing towards the vertices of the selected loudspeaker triangle.

• Compute the panning gain vector G = [g1, g2, g3]^T for the transmitted object position P according to P = L·G, i.e. G = L^-1·P.

• Normalize G to preserve energy and apply the transmitted object gain g:


  Gnorm = g·G / sqrt(g1² + g2² + g3²)

• Linearly interpolate between the current panning gains and the gains computed from the object metadata received for the previous time stamp.

• Compute the output signals by mixing the input signals into the output channels through application of the interpolated gains.

• Add the output signal contributions of all rendered objects.

4.3.3. SAOC-3D decoding and rendering

In order to serve as a technology component for 3D audio coding, the original Spatial Audio Object Coding (SAOC) codec [22,23] has been enhanced into SAOC-3D with the following extensions:

• While SAOC supports only up to two downmix channels, SAOC-3D supports more (in principle an arbitrary number of) downmix channels.

• While rendering to multi-channel output has been possible with SAOC only by using MPEG Surround (MPS) as a rendering engine, SAOC-3D performs direct decoding/rendering to multichannel/3D output with arbitrary output speaker setups. This includes a revised approach towards decorrelation of output signals.

• Some SAOC tools that have been found unnecessary within the MPEG-H 3D Audio system have been excluded. As an example, residual coding has not been retained, since carriage of signals with very high quality can already be achieved through encoding them as discrete channel or object signals.

4.4. HOA Decoding and Rendering

Higher Order Ambisonics (HOA) builds on the idea of a field-based representation of an audio scene. More mathematically stated, it is based on a truncated expansion of the wave field into spherical harmonics, which determines the acoustic wave field quantities within a certain source-free region around the listener's position, up to an upper frequency limit beyond which spatial aliasing occurs. The time-varying coefficients of the spherical harmonics expansion are called HOA coefficients and carry the information of the wave field that is to be transmitted or reproduced.

Instead of transmitting the HOA coefficients directly in a bitstream representation, MPEG-H 3D Audio applies a two-stage coding process to the HOA data to improve the coding performance of the system, namely spatial coding of the HOA components and multichannel perceptual coding. These two stages have to be reverted in the MPEG-H 3D Audio decoder in reverse order, as shown below in Section 4.4.3.

The spatial coding block for the HOA representation applies two basic principles: decomposition of the input field and decorrelation of the signals prior to transmission in the core coder, both of which are described in the following.

4.4.1. Decomposition of the sound field in the encoder

In the HOA encoder, the sound field determined by the HOA coefficients is decomposed into predominant and ambient sound components. At the same time, parametric side-information is generated that signals the time-varying activity of the different sound-field components to the decoder.

Predominant components mainly contain directional sounds and are coded as plane wave contributions that travel through the wave field of interest in a certain direction. The number of predominant components can vary over time, as can their directions. They are transmitted as audio streams together with the associated time-variant parametric information (direction of the directional components, activity of the directional components in the field).

The remaining part of the HOA input, which has not been captured by the predominant components, is the ambient component of the sound field to code. It mostly contains non-directional sound components. Details of the spatial properties of this part of the field are considered less important. Therefore the spatial resolution of the ambient component is typically reduced by limiting the HOA order, to improve the coding efficiency.
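Returning briefly to the object renderer of Section 4.3.2, the per-triangle VBAP gain computation given there (P = L·G, G = L⁻¹·P, then energy normalization) can be sketched numerically as follows. This is a minimal sketch under the assumption that the enclosing loudspeaker triangle has already been found; triangle search and gain interpolation are omitted, and the function name is invented.

```python
import numpy as np

def vbap_gains(p, l1, l2, l3, g=1.0):
    """p: unit direction vector of the object; l1, l2, l3: unit vectors of
    the selected loudspeaker triangle; g: transmitted object gain.
    Returns the normalized panning gains [g1, g2, g3]."""
    L = np.column_stack([l1, l2, l3])        # vector base L = [l1, l2, l3]
    gains = np.linalg.solve(L, p)            # G = L^-1 * P
    # Gnorm = g*G / sqrt(g1^2 + g2^2 + g3^2)  (energy preservation)
    return g * gains / np.linalg.norm(gains)
```

If the object direction coincides with one loudspeaker, all signal energy is assigned to that speaker; directions inside the triangle receive a weighted share on all three vertices.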


4.4.2. Encoder signal component decorrelation

The predominant sound components are represented as plane wave signals with associated directions. Thus, sound events emanating from uncorrelated sound sources in different directions lead to uncorrelated audio streams to transmit.

However, the HOA representation of the ambient component may exhibit high correlations between the HOA coefficients. This can lead to undesired spatial unmasking of the coding noise, since the quantization noise introduced by the perceptual coder is uncorrelated between the coder channels, thus resulting in different spatial properties of the desired signal and the quantization noise during reproduction. The HOA representation is therefore decorrelated by transforming it into a different spatial domain to avoid the spatial unmasking of the coding noise. Note that this spatial decorrelation step and its inverse operation in the decoder are equivalent to the mid-side coding principle applied to stereo coding of correlated signals, e.g. when coding a phantom source using a stereo audio coder.

4.4.3. MPEG-H 3D Audio decoder HOA rendering

In the MPEG-H 3D Audio decoder, the transmitted HOA content is first decoded into a HOA representation by the following processing steps:

• Multichannel USAC 3D core decoding.

• Inverse decorrelation of the ambient sound, i.e. transformation from the decorrelated representation to a HOA coefficients representation.

• Synthesis of a HOA coefficients representation of the predominant sound components.

• HOA composition (superposition of the HOA representations of the predominant and ambient components).

In a subsequent processing step the composed HOA representation is rendered to the target loudspeaker configuration using a generic HOA renderer. The HOA rendering itself consists of a simple matrix multiplication of the multichannel HOA representation and a rendering matrix.

The HOA rendering matrix has to be generated at initialization time or when the HOA order or the reproduction setup changes. It is a matrix that mixes the contribution of each HOA component to the available loudspeakers, using mixing gains that result in the best field approximation of that HOA component in a region around the listener. One main design characteristic of the HOA rendering matrix is energy preservation: the HOA signal's loudness is preserved independent of the speaker setup, and constant-amplitude spatial sweeps are perceived as equally loud after rendering.

4.5. Loudness and Dynamic Range Processing

4.5.1. Loudness normalization

One of the essential features of next-generation audio delivery is proper loudness signaling and normalization. Within MPEG-H 3D Audio, comprehensive loudness-related measures according to ITU-R BS.1770-3 [30] or EBU R128 [31] are embedded into the stream for loudness normalization. The decoder normalizes the audio signal to map the program loudness to the desired target loudness for playback. Downmixing and dynamic range control may change the loudness of the signal; dedicated program loudness metadata can be included in the MPEG-H bitstream to ensure correct loudness normalization in these cases.

4.5.2. Dynamic range control

Across different target playback devices and listening environments, control of the dynamic range is vital. In the framework of dynamic range control (DRC) in MPEG, different DRC gain sequences can be signaled that allow encoder-controlled dynamic range processing in the playback device. Multiple individual DRC gain sequences can be signaled with high resolution for a variety of playback devices and listening conditions, including home and mobile use cases. The MPEG DRC concept also provides improved clipping prevention and peak limiting.

5. PERFORMANCE EVALUATION

For MPEG-H 3D Audio, several candidate technologies have undergone rigorous testing to select the best coding and rendering system for immersive audio. 24


test items were chosen to represent typical and critical audio material. During the performance evaluation, more than 40000 answers from a total of 10 test labs were collected. Four main test cases were defined to characterize the system at different operating points:

• Test 1: Rendering to 9.0 - 22.2 loudspeakers
Objective: Demonstrate very high quality for reproduction on large reproduction setups. Three bit rates: 1.2 Mbit/s, 512 kbit/s, 256 kbit/s

• Test 2: Listening at four “off sweet spot” positions
Objective: Verify the results from Test 1 for non-optimum listener positions. Bit rate: 512 kbit/s

• Test 3: Binaural rendering to headphones
Objective: Demonstrate the ability to deliver convincing headphone rendering. Bit rate: 512 kbit/s

• Test 4: Rendering to alternative speaker configurations
Objective: Demonstrate the ability to perform high-quality rendering to smaller and non-standard reproduction setups: 5.1, 8.1 and two loudspeaker setups that were randomly selected subsets of the 22.2 setup, one with 5 and one with 10 loudspeakers (‘Random 5’, ‘Random 10’). Bit rate: 512 kbit/s

All tests were carried out with the MUSHRA test methodology in high-quality listening rooms and used the original format, i.e. the 9.0, 11.1, 14.0 or 22.2 signal, as their reference. There were 12 test items for each of the two input categories, CO and HOA.

After the evaluation of all test results, the system submitted by Fraunhofer IIS was selected as the reference model for MPEG-H 3D Audio CO, and the Technicolor/Orange Labs submission for HOA processing, since these systems performed better than or equal to the competing submissions. The pooled results for the selected RM system for ‘channels and objects’ are shown in Figure 4.

Figure 4: Summary of Reference Model listening test results for channel and objects content; the total mean MUSHRA score of each test is shown. Confidence intervals were smaller than 2.5 points in every case. Note that the results were obtained in separate tests. Same shading indicates same bitrates.

As can be seen from the above test results:

• Bitrate was the major factor that determined the achieved subjective audio quality. At 1.2 Mbit/s and 512 kbit/s, the reference model technology delivered, on average, excellent quality, and it produced good sound quality at a bitrate of 256 kbit/s.

• The ‘off sweet spot’ listening test did not reveal any additional problematic effects which would degrade sound quality.

• Test 3 showed adequate binaural quality at 512 kbit/s, without undue degradation due to coding/decoding or the simplified/optimized binaural processing.

6. FURTHER EVOLUTION AND OUTLOOK

The MPEG-H 3D Audio standardization timeline is designed to consolidate technology by July 2014. Until then, further developments were under discussion in the MPEG audio group.

One activity was the merge of the CO codec and the HOA codec (both systems had originally been developed as separate architectures in response to the Call for Proposals). The architecture of the merged system, as depicted in Figure 3, has been defined and implemented between the beginning of February and July 2014.
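The score pooling behind Figure 4, a per-condition mean MUSHRA score with a confidence interval, can be sketched as follows. This is a minimal illustration on made-up scores; the function name and data are assumptions, not the actual evaluation scripts or test data.

```python
import math

def pool_mushra(scores):
    """Return the mean and the half-width of a 95% confidence
    interval (normal approximation) for MUSHRA scores (0-100)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)  # 95% CI half-width
    return mean, half_width

# Hypothetical listener scores for one coded condition:
mean, ci = pool_mushra([88, 92, 85, 90, 95, 87, 91, 89])
print(f"{mean:.1f} +/- {ci:.1f}")
```

In the actual evaluation, scores from ten labs were pooled per test, which is why the reported confidence intervals stay below 2.5 points despite listener variability.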


An independent standardization timeline has been defined for a so-called “Phase 2” of MPEG-H 3D Audio. The associated part of the Call for Proposals asked for technology proposals to extend the operating range of the 3D Audio codec to even lower rates. Specifically, proponents were asked to submit coded material at bitrates of 48, 64, 96 and 128 kbit/s for 22.2 channels (or a full HOA encoding) by May 2014. A selection of technology for a Phase 2 reference model was made at the 109th MPEG meeting in July 2014. For the CO input category, the winning technology was provided by Fraunhofer IIS and was based on Phase 1 technology together with an MPEG Surround extension for the lowest bitrates. For the HOA input category, a merge of the systems of Technicolor and Qualcomm was performed. Finally, there is an opportunity for further collaborative improvement.

Due to the increasing interest in MPEG-H 3D Audio in broadcast application standards like ATSC and DVB, the timeline of Version 1 is designed such that the specification is expected to become an International Standard by February 2015.

7. CONCLUSIONS

In order to facilitate high-quality, bitrate-efficient distribution and flexible reproduction of 3D sound, the MPEG standardization group recently started the development effort of MPEG-H Audio Coding, which allows for the universal carriage of encoded 3D sound from channel-based, object-based and HOA-based sound formats. Reproduction is supported for many output setups, ranging from 22.2 and beyond down to 5.1, stereo and binaural reproduction. Depending on the available output setup, the encoded material is rendered to yield the highest spatial audio quality, thus overcoming the incompatibility between various 3D (re)production formats. Moreover, MPEG-H Audio is a unified system for the carriage of channel-oriented, object-oriented and Higher Order Ambisonics based high-quality content. This paper described the current status of the standardization project and provided an overview of the system architecture, its technology, capabilities and current performance. Further improvements and extensions, such as the ability to operate at very low data rates, or the integration into transport systems, are on the way.

8. REFERENCES

[1] Alexander, R.: The Inventor of Stereo: The Life and Works of Alan Dower Blumlein. Focal Press, 2000, ISBN 978-0240516288.

[2] Blumlein, A.D.: Improvements in and relating to sound-transmission, sound-recording and sound-reproducing systems. 1931, British Patent 394 325.

[3] ITU-R, Recommendation BS.775-2: Multichannel stereophonic sound system with and without accompanying picture. 2006, International Telecommunication Union, Geneva, Switzerland.

[4] Rumsey, F.: Spatial Audio. 2001, Focal Press, Oxford. ISBN 0-240-51623-0.

[5] Silzle, A. and Bachmann, T.: How to Find Future Audio Formats? VDT-Symposium, 2009, Hohenkammer, Germany.

[6] Gerzon, M.A.: Periphony: With-Height Sound Reproduction. J. Audio Eng. Soc., 1973, Vol. 21, No. 1, pp. 3-10.

[7] Ahrens, J.: Analytic Methods of Sound Field Synthesis. T-Labs Series in Telecommunication Services. 2012, Springer, Berlin, Heidelberg. ISBN 978-3-642-25742-1.

[8] Chabanne, C., McCallus, M., Robinson, C., Tsingos, N.: Surround Sound with Height in Games Using Dolby Pro Logic IIz. 129th AES Convention, Paper 8248, San Francisco, CA, USA, November 2010.

[9] Daele, B. V.: The Immersive Sound Format: Requirements and Challenges for Tools and Workflow. International Conference on Spatial Audio (ICSA), 2014, Erlangen, Germany.

[10] Hamasaki, K., Matsui, K., Sawaya, I., and Okubo, H.: The 22.2 Multichannel Sounds and its Reproduction at Home and Personal Environment. AES 43rd International Conference on Audio for Wirelessly Networked Personal Devices, Pohang, Korea, September 2011.

[11] Silzle, A., et al.: Investigation on the Quality of 3D Sound Reproduction. International Conference on Spatial Audio (ICSA), 2011, Detmold, Germany.


[12] Hiyama, K., Komiyama, S., and Hamasaki, K.: The minimum number of loudspeakers and its arrangement for reproducing the spatial impression of diffuse sound field. 113th AES Convention, 2002, Los Angeles, USA.

[13] Hamasaki, K., et al.: Effectiveness of Height Information for Reproducing Presence and Reality in Multichannel Audio System. 120th AES Convention, 2006, Paris, France.

[14] Kim, S., Lee, Y.W., and Pulkki, V.: New 10.2-channel Vertical Surround System (10.2-VSS); Comparison study of perceived audio quality in various multichannel sound systems with height loudspeakers. 129th AES Convention, 2010, San Francisco, USA.

[15] ISO/IEC JTC1/SC29/WG11 N14747: Text of ISO/MPEG 23008-3/DIS 3D Audio, Sapporo, July 2014.

[16] Bosi, M., Brandenburg, K., Quackenbush, S.: ISO/IEC MPEG-2 Advanced Audio Coding. Journal of the AES, Vol. 45, No. 10, October 1997, pp. 789-814.

[17] ISO/IEC JTC1/SC29/WG11 MPEG, International Standard ISO/IEC 13818-7: Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding, 1997.

[18] Herre, J., Dietz, M.: Standards in a Nutshell: MPEG-4 High-Efficiency AAC Coding. IEEE Signal Processing Magazine, Vol. 25, No. 3, 2008, pp. 137-142.

[19] EBU Evaluations of Multichannel Audio Codecs. EBU-Tech 3324, Geneva, September 2007, available at https://tech.ebu.ch/docs/tech/tech3324.pdf.

[20] Hilpert, J., Disch, S.: Standards in a Nutshell: The MPEG Surround Audio Coding Standard. IEEE Signal Processing Magazine, Vol. 26, No. 1, 2009, pp. 148-152.

[21] ISO/IEC 23003-1:2007, MPEG-D (MPEG audio technologies), Part 1: MPEG Surround, 2007.

[22] Herre, J., Purnhagen, H., Koppens, J., Hellmuth, O., Engdegård, J., Hilpert, J., Villemoes, L., Terentiv, L., Falch, C., Hölzer, A., Valero, M.L., Resch, B., Mundt, H., and Oh, H.: MPEG Spatial Audio Object Coding – The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes. Journal of the AES, Vol. 60, No. 9, September 2012, pp. 655-673.

[23] ISO/IEC 23003-2:2010, MPEG-D (MPEG audio technologies), Part 2: Spatial Audio Object Coding, 2010.

[24] Neuendorf, M., Multrus, M., Rettelbach, N., et al.: The ISO/MPEG Unified Speech and Audio Coding Standard – Consistent High Quality for All Content Types and at All Bit Rates. Journal of the AES, Vol. 61, No. 12, December 2013, pp. 956-977.

[25] ISO/IEC 23003-3:2012, MPEG-D (MPEG audio technologies), Part 3: Unified Speech and Audio Coding, 2012.

[26] ISO/IEC JTC1/SC29/WG11 N13411: Call for Proposals for 3D Audio, Geneva, January 2013.

[27] Blauert, J.: Spatial Hearing: The Psychophysics of Human Sound Localization, revised edition. MIT Press, 1997.

[28] ISO/IEC 23009-1:2012(E), Information technology – Dynamic adaptive streaming over HTTP (DASH) – Part 1: Media presentation description and segment formats, 2012.

[29] Pulkki, V.: Virtual Sound Source Positioning Using Vector Base Amplitude Panning. Journal of the Audio Engineering Society, Vol. 45, No. 6, June 1997, pp. 456-466.

[30] ITU-R, Recommendation BS.1770-3: Algorithms to measure audio programme loudness and true-peak audio level. 2012, International Telecommunication Union, Geneva, Switzerland.

[31] European Broadcasting Union (EBU), Recommendation R128: Loudness normalisation and permitted maximum level of audio signals. 2011, Geneva, Switzerland.

[32] Füg, S., et al.: Design, Coding and Processing of Metadata for Object-Based Interactive Audio. 137th AES Convention, 2014, Los Angeles, USA.

