An Object-Based Audio System For Interactive Broadcasting
This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least
two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced
from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no
responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society,
60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper,
or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
ABSTRACT
This paper describes audio recording, delivery and rendering for an end-to-end broadcast system allowing
users free navigation of panoramic video content with matching interactive audio. The system is based on
one developed as part of the EU FP7 funded project, FascinatE. The premise of the system was to allow users
free navigation of an ultra-high definition 180 degree video panorama for a customisable viewing experience.
From an audio perspective the complete audio scene is recorded and broadcast so the rendered sound scene
at the user end may be customised to match the viewpoint. The approach described here uses an object-
based audio paradigm. This paper presents an overview of the system and describes how such an approach
is useful for facilitating an interactive broadcast.
end rather than the production end as has been the traditional method. In essence, this provides the user with all the necessary information about the recorded scene such that they can customise their experience in their specific environment.

From an audio perspective, this presents some new challenges with respect to the recording, broadcast and reproduction of the scene. Standard channel-based recording techniques can give way to an object-based approach where much more information about the original scene is retained right through to the user end. This provides the user with all the necessary audio components to recompile the sound scene based on their viewing perspective or preferences.

This paper describes the approach developed as part of the FascinatE Project to facilitate such an audio system that allows both an interactive and immersive experience. The paper begins with an overview of object-based audio and then proceeds with a description of the FascinatE Project and of the audio system developed as part of the project. The paper then discusses some specifics of the scene capture, delivery and reproduction techniques used and finishes with some concluding remarks and further work.

2. OBJECT-BASED AUDIO
There are many different ways of representing a sound scene, with varying degrees of accuracy depending on the required spatial resolution and whether the aim is a perceptual approximation or a mathematically exact representation. Channel-based systems such as two-channel stereophony, 5.1, 7.1 etc. essentially sample the sound scene at a discrete location or attempt to synthesise an impression of that sound scene by delivering specific audio content to the available loudspeaker channels/target format. Other techniques, which can be considered transformation-based systems [3], aim at utilising a mathematical representation of the sound scene such as a spatially orthogonal basis function, e.g. [4]. In this case transformation coefficients are transmitted rather than loudspeaker signals and these are then decoded at the user end to the specific rendering system.

A more transparent method of representing a sound scene is to utilise an object-based approach where each sound source in the scene is recorded separately along with its position in space and some associated metadata describing other source characteristics. Such an object-based approach records and transmits all of the audio ingredients needed to reassemble the audio scene at the user end. As all of the positions of the sources are known, it is possible to rotate the scene and even change the position and level of the sound sources such that a complete customisation of the audio can be facilitated at the user end. It is also possible to include different processing algorithms that may be applied to individual audio objects, such as compression, filtering and effects etc.

Object-based audio (OBA) has therefore been considered by many to be the future of spatial audio and there have been some suggestions on how to represent object-based audio scenes, such as MPEG-4 AudioBIFS [5], Spatial Audio Object Coding (SAOC) [6] and the Audio Scene Description Format (ASDF) [7]. OBA enables reproduction on any loudspeaker system providing the relevant decoders are available on the user end computer to render the sound scene on the user's audio setup. OBA is beginning to gather momentum on a commercial level too, with the implementation of Dolby Atmos [8] and also the DTS Multi-dimensional Audio (MDA) open source format being two current examples.

Some non-spatial applications of OBA have also been proposed; BBC research has implemented object-based audio in a test radio broadcast which used audio objects to tailor the radio programme depending on the listener's geographical location [9]. OBA allowed specific audio events that made up the programme, such as sound effects, actors' voices and music, to be customised based on geographical location and the date and time of access to the radio programme. The programme was delivered over IP and used the HTML5 standard to carry out all audio processing and mixing at the user end. Another use of audio objects that has been proposed by the BBC was for users to be able to change the duration of a programme by adjusting the spaces between audio events, without any need for time stretching or other processes that may be detrimental to audio quality or intelligibility. Additionally, OBA also enables remixing of the relative levels between objects in the scene, such as between commentary and different areas of the crowd in a football match [10]. Other possibilities exist to utilise this approach to facilitate improvements to speech intelligibility for people with hearing impairments [11].

2.1. FascinatE Project
The recently completed FascinatE Project [2] was a European research project aiming at the development of a complete end-to-end future broadcast system designed to be format agnostic and interactive, based on user navigation of an ultra-high definition panorama with accompanying 3D audio.
FascinatE stands for Format-Agnostic SCript-based INterAcTive Experience and brought together 11 institutions from 8 different European countries. Fundamental to the system developed was a format agnostic approach such that only one set of capture and broadcast infrastructure was needed for all different potential use-cases of the system, i.e. ranging from an individual watching on a mobile device and listening on headphones right through to someone watching the broadcast in a public setting with a large scale wrap-around screen and a large multi-channel immersive audio system.

In the FascinatE Project, object-based audio was utilised as a means to provide dynamically matching audio for interactive navigation through an AV scene. The project captured a very high definition video panorama of 7K resolution (approx. 7K x 2K pixels) utilising Fraunhofer HHI's OMNICAM [12] and allowed pan, tilt and zoom navigation of the panorama by the user. In order to provide matching audio for the user-defined scene it was necessary to move away from a conventional channel-based audio paradigm. Instead a hybrid object-based and transformation-based approach was adopted to capture the audio scene without reference to any specific target loudspeaker configuration. Instead of defining captured audio events as emanating from a given loudspeaker or from between two loudspeakers of a target reproduction system, sound events were captured complete with 3D coordinate information specifying where in the audio scene the event had taken place. Additionally, the spatial sound field was also captured using Higher Order Ambisonics to provide the ambience/background. This is analogous to a gaming audio scenario: in the same way as a first person game allows navigation around and between audio objects, the object-based audio capture enabled users to pan around and zoom into the AV scene with audio events remaining in their correct locations; similarly, the sound field can also be rotated and manipulated corresponding to user navigation. It was possible in the FascinatE system to zoom across a scene and past audio objects, which would then move behind the user's viewpoint, thus realising a realistic audio scene to match the chosen visual viewpoint.

3. SYSTEM OVERVIEW
The basis of the FascinatE audio system is an object-based paradigm with a Higher Order Ambisonics [13, 14] base. The advantage of adopting such an approach is that the audio objects can be moved according to the customised visuals and the sound field component can also be rotated to match the correct viewing perspective, so the final output is spatially coherent.

As shown in Figure 1, the FascinatE system comprised three components: audio scene extractor/capture, audio composer and audio presenter modules. The scene extractor was responsible for the capture of the sound field component and the extraction of the audio objects from the scene. Information was gathered at this stage and metadata generated as described later. The Audio Composer (AC) takes this information and composes an audio scene based on global preferences, scripting information and individual scene navigation decisions etc., such that the audio perspective matches as closely as possible with the visual perspective. Once the scene has been composed, the Audio Presenter (AP) module receives information on the user's audio system and decodes the sound scene for reproduction accordingly; thus the system is format agnostic, allowing replay on any output system providing the necessary decoders are installed on the user's system.
4. SCENE CAPTURE
To capture the entire audio scene, it is important that each discrete sound source (audio object) is separated and positioned accurately in space and also that the sound field component (providing a spatially accurate background/ambience) is recorded correctly and to an appropriate degree of accuracy. In some cases it is also desirable to capture so-called diffuse audio objects, which can be useful for representing features such as the late reflection energy in a room's impulse response; this is subject to current research activity and not covered in any more detail in this contribution.

4.1. Sound Field Component
Microphone arrays like the SoundField® or Eigenmike® are used to capture the entire three-dimensional sound field at the microphone array position. The Ambisonics or Higher Order Ambisonics (HOA) representation is used to preserve the 3D acoustic/audio scene.
[Figure: audio system block diagram — sound field signals pass through an Ambisonics decomposer to WFS and Ambisonics encoders/decoders for the reproduction device, driven by user requests and script parameters.]
In principle, Ambisonics uses spherical harmonic functions on the surface of a sphere to model the superposition of single audio sources distributing acoustic waves from different directions as a linear combination of orthogonal basis directions. The components in each basis direction are the spherical harmonics of one single direction. Normally, more than one source is encoded, therefore coefficients from different directions/sources are combined in the form of matrices. For each sample time, an encoder matrix maps all the sound source signals from different directions into a single vector whose components represent the Ambisonics coefficients of one audio source/direction at a specific time. So at each sample time a vector is derived containing all recorded source information encoded in its vector components. This is a format-agnostic approach because the Ambisonics representation is space invariant; thus it can be decoded such that the sound field can be reproduced on arbitrary loudspeaker configurations. An important parameter of the HOA description is the order of the spherical harmonic functions because it controls the number of coefficients, i.e., the accuracy/resolution of the sound field description: the more coefficients (the higher the order), the better the accuracy.
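As a concrete illustration of the encoding step, here is a minimal numpy sketch of a first-order encoder using the classic B-format (W, X, Y, Z) convention; the project's actual encoder, channel ordering and normalisation may differ.

```python
import numpy as np

def encode_first_order(source_signals, azimuths, elevations):
    """Encode mono sources into first-order B-format (W, X, Y, Z).

    source_signals: (num_sources, num_samples) array; angles in radians.
    The returned (4, num_samples) array holds, for each sample time, the
    coefficient vector produced by applying the encoder matrix to the
    vector of source samples, as described in the text.
    """
    az = np.asarray(azimuths, dtype=float)
    el = np.asarray(elevations, dtype=float)
    # Encoder matrix: one row per Ambisonics channel, one column per source.
    enc = np.stack([
        np.full_like(az, 1.0 / np.sqrt(2.0)),  # W (classic B-format weighting)
        np.cos(az) * np.cos(el),               # X
        np.sin(az) * np.cos(el),               # Y
        np.sin(el),                            # Z
    ])
    return enc @ np.asarray(source_signals)

# Two sources: one straight ahead, one 90 degrees to the left.
signals = np.random.randn(2, 48000)
bformat = encode_first_order(signals, [0.0, np.pi / 2], [0.0, 0.0])
# A 3D order-N description needs (N + 1)**2 channels: order 1 -> 4,
# order 4 (as delivered by the Eigenmike) -> 25.
```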
For example, an order of one can be achieved by the SoundField microphone and the TetraMic, which both utilise four microphone capsules. The first order representation provides the basic spatial information that compares to a traditional mid-side recording. The Eigenmike contains 32 microphone capsules on a rigid sphere and delivers HOA coefficients up to order four. As a consequence, the reproduction yields a much better spatialization of the content. The HOA representation is obtained in two steps. In a first step the capsule signals are matrixed to an intermediate representation describing the pressure distribution exactly on the surface of the sphere. The second step removes the impact of the capsules as well as the impact of the array arrangement to obtain the HOA coefficients of the free field. These impacts are described by the term microphone array response, which is basically a filter term [15].

4.2. Audio Object Extractor
In order to have a more complete description of the sound scene it is desirable to also capture the audio objects in the scene. Once captured, these can be individually manipulated to provide a fully customisable sound scene at the user end (as described later). The principal task of the Audio Object Extractor (AOE) is to use the available microphones to pick out and separate the discrete sound sources in the scene. The AOE aims to find the audio content that is deemed salient for the given recording scenario and to locate the audio objects in space.

4.2.1. Audio Objects
These key, discrete audio events in the scene are described by audio objects. An audio object contains the audio content of a source with a specific location in time and space. An audio object therefore has audio data, a position, an onset/offset time and potentially some additional metadata such as source directivity, reverberation time etc. Generally speaking, in terms of recording, audio objects can be split into two categories depending on the nature of the audio capture techniques involved: so-called explicit audio objects and implicit audio objects.
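This description maps naturally onto a small data structure; the following dataclass is a hypothetical illustration, not the FascinatE format's actual field names.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class AudioObject:
    """Illustrative container for the fields listed above."""
    audio: np.ndarray      # mono sample data for the source
    position: tuple        # (x, y, z) in scene coordinates
    onset: float           # start time in seconds
    offset: float          # end time in seconds
    metadata: dict = field(default_factory=dict)  # e.g. directivity, RT60, static flag

kick = AudioObject(audio=np.zeros(4800), position=(12.0, 3.5, 0.0),
                   onset=63.2, offset=63.3, metadata={"type": "ball_kick"})
```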
Explicit Audio Objects
If a sound source can be closely miked with a tracked or stationary location, it can be classed as an explicit audio object. An example of an explicit audio object would be the violinist in an orchestra. In this case the sound source can be recorded with very little bleed from other audio sources and the position is generally stationary, or at least can be tracked easily by local GPS-type tracking systems. Another example of an explicit audio object would be the feed from an interview or commentary; in this case the microphone feed is generally clean and the position of the object is not important.

Implicit Audio Objects
Implicit audio objects, by contrast, cannot be closely miked and must be extracted from more distant microphones or arrays, relying on a time-difference-of-arrival method to position the audio object as depicted in Figure 3; more information on this process can be found in [16].
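A basic version of such a time-difference-of-arrival estimate can be written as a cross-correlation between two microphone signals. This is a simplified sketch; practical localisation (as in [16]) uses several microphone pairs and more robust correlation measures.

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time difference of arrival between two microphone
    signals via the peak of their cross-correlation. A positive result
    means sig_a arrives later than sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(np.abs(corr))) - (len(sig_b) - 1)  # lag in samples
    return lag / fs                                        # lag in seconds

# Demo: mic A hears the same event 40 samples after mic B.
fs = 48000
b = np.random.randn(1024)
a = np.concatenate([np.zeros(40), b])[:1024]
print(round(estimate_tdoa(a, b, fs) * fs))  # -> 40
```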
are all part of the frame header. This frame header also contains the sample rate, number of tracks, number of samples per segment and potentially some additional information such as room environment etc. Each track header will also contain a unique number and time stamp offset and will include information about the track type (i.e. sound field description or audio object). Also in the track header is positional information, orientation, directivity etc., allowing a dynamically varying scene with moving objects. For a sound field track, parameters like 2D/3D, the Ambisonics order, the orientation of the sound field, the rotation, the bit depths, the coefficient order and the normalization method used (e.g. “Furse-Malham weights”, “Schmidt Semi-Normalized” or 4π Normalized etc.) should also be part of this header information.
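Gathering those fields into a structure might look like the following; the names, values and the "ACN" coefficient ordering are illustrative assumptions, not the actual FascinatE wire format.

```python
# Illustrative frame/track header layout based on the fields listed above.
frame_header = {
    "sample_rate": 48000,
    "num_tracks": 2,
    "samples_per_segment": 1024,
    "room_environment": "stadium",          # optional extra information
}

track_headers = [
    {   # sound field description track
        "track_id": 1, "timestamp_offset": 0, "type": "sound_field",
        "dimensions": "3D", "hoa_order": 4, "bit_depth": 24,
        "orientation": (0.0, 0.0, 0.0), "rotation": 0.0,
        "coefficient_order": "ACN",
        "normalization": "Schmidt Semi-Normalized",
    },
    {   # audio object track
        "track_id": 2, "timestamp_offset": 480, "type": "audio_object",
        "position": (12.0, 3.5, 0.0), "orientation": (0.0, 0.0, 0.0),
        "directivity": "cardioid",
    },
]
```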
5.2. Scene compression
Whilst the FascinatE Project explicitly did not focus on compression techniques, there has been some recent standardisation activity in this area by MPEG [17], which is currently looking at 3D audio scene compression. To reduce the amount of audio data to be transmitted, MPEG is working on MPEG-H 3D Audio, “High efficiency coding and media delivery in heterogeneous environments”. The first phase of this activity focused on bitrates of 256 kbit/s and above, while the second phase, for bitrates below 256 kbit/s, is just beginning.

MPEG-H 3D Audio can compress channel-based, object-based as well as scene-based (HOA) input formats, so it is ideally suited to compression within the format mentioned here, as any mixture of audio objects, sound field descriptions or channel-based formats can be accommodated.

6. SCENE REPRODUCTION
The reproduction end of the FascinatE audio system (see Figure 1) consists of an Audio Composer (AC) and an Audio Presenter (AP) module. The Audio Composer receives Audio Objects (AO) and Sound Field Descriptions (SFD) and processes them, like a mixing desk would do in a studio, based on information from the User Control Node and the Video Rendering Node. The updated sound scene is then communicated to the Audio Presenter for rendering.

6.1. Audio Composer
The Audio Composer positions the audio objects in the sound scene along with the recorded sound field descriptions, which are positioned in the scene at the location in which they were recorded. The Audio Composer is connected via a TCP/IP connection to the Video Rendering Node (VRN). The VRN communicates an updated pan, tilt and zoom with each video frame. The AC uses this data to update the positions of the audio objects in the sound scene and applies a rotation to the recorded SFDs.

The input and output formats for the Audio Composer are designed to be identical such that several ACs can be concatenated from the production to the user side. This allows the separate distribution of operations on AOs and SFDs throughout the whole chain. Furthermore, if non-linear effects in the audio processing are required, e.g., translational movement by a zooming operation, such operations can be differentiated for AOs and SFDs. A first AC could react to script inputs and could preselect AOs, as defined by a service provider; a second AC could be placed at the end-user terminal, controlled by the user inputs, to do the final positioning of the objects.
6.1.1. Adjusting the sound scene
The positions of the audio objects need to be updated by the Audio Composer to match the current viewing perspective. The origin of the sound scene is taken as the OMNICAM position and the relative positions of the audio objects are changed based on the pan, tilt and zoom information from the VRN. When a pan or tilt command is parsed by the AC, the received angle is added to the current azimuth or elevation angle of the audio objects respectively, so they move appropriately in the reproduced sound field.

For a given pan angle it is also required that the ambient sound field be rotated in order to simulate the change in the viewing direction. This is done in the Audio Composer by multiplying the sound field by a corresponding rotation matrix. This is important for spatial coherence, as the sound field recording will contain information from some of the separately recorded audio objects and these need to be rendered to the same position as the corresponding audio objects in the reproduced sound field at the user end.
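A sketch of both operations follows. The paper does not give the XML schema, so the message format and function names here are hypothetical; the sound field rotation is shown for a first-order description.

```python
import numpy as np
import xml.etree.ElementTree as ET

# Hypothetical update message from the VRN.
msg = '<update pan="30.0" tilt="0.0" zoom="1.0"/>'
pan = np.radians(float(ET.fromstring(msg).get("pan")))

def pan_objects(azimuths, pan):
    """Add the received pan angle to every non-static object azimuth."""
    return [(az + pan) % (2 * np.pi) for az in azimuths]

def rotate_bformat_yaw(bformat, pan):
    """Rotate a first-order (W, X, Y, Z) sound field about the vertical
    axis: W and Z are unchanged, X and Y mix through a 2D rotation.
    (Sign conventions depend on the coordinate system; higher orders
    need correspondingly larger rotation matrices.)"""
    c, s = np.cos(pan), np.sin(pan)
    rot = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0,   c,  -s, 0.0],
                    [0.0,   s,   c, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
    return rot @ bformat
```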
For zoom, the AC parses the zoom angle from the update XML message and changes the relative levels and positions of the audio objects accordingly. A level difference is applied between the sound field signals and the audio objects to match what would be expected in a real listening scenario (i.e. a level factor of 1/distance is applied based on the calculated listener distance from the given audio object or sound field). As a user zooms into the scene, the angles of the audio objects will also change (they will increase up to the point that the object appears behind the listener). The degree to which this angle increases is a function of the screen size and the listener distance from the screen, thus some approximations have to be made to accommodate multiple listeners in the space. The AC therefore has a series of variables that allow the user to control the extent to which they would like to experience audio zoom, or to disable the feature if they wish to do so.

It is also possible to apply zoom to a sound field, although this is a non-trivial task. However, various translations can be applied to the sound field as described in [19] and [20] which approximate zoom. Practically, the application of zoom to the sound field component may depend both on user preferences and the recording scenario, as described below.
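The 1/distance level rule might be realised as follows; this is a simple sketch assuming object and listener coordinates share the scene's coordinate system.

```python
import numpy as np

def object_gain(object_pos, listener_pos):
    """Level factor of 1/distance, as described above, relative to the
    virtual listener position implied by the current zoom."""
    dist = np.linalg.norm(np.subtract(object_pos, listener_pos))
    return 1.0 / max(dist, 1e-3)   # guard against division by zero

# Zooming in moves the virtual listener towards the object, raising
# its level relative to the ambient sound field component.
print(object_gain((10.0, 0.0, 0.0), (0.0, 0.0, 0.0)))  # 0.1
print(object_gain((10.0, 0.0, 0.0), (5.0, 0.0, 0.0)))  # 0.2
```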
Regardless of pan, tilt and zoom information, some audio objects should remain stationary, such as the commentary/interview feeds, and are therefore left static in the scene based on flags in the metadata. The individual levels of some of the audio objects can be manually altered as well, providing the listener with a basic mixing desk GUI in their rendering software that allows the control of some of the audio objects and the balance between background and foreground levels, which can be useful for increasing speech intelligibility for hearing impaired users [11].

After the specific sound fields have been appropriately rotated and the audio objects manipulated, we combine the component sound fields and audio objects into a resultant sound field and deliver this to the Audio Presenter for rendering. This is possible due to the linear character of the SFD. The resulting HOA description can be transmitted to the end user, who can render the sound field onto a specific loudspeaker setup.
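Because the sound field description is linear, the final combination reduces to a sum of HOA coefficient arrays, as in this minimal sketch (the order-4, 25-channel shapes are assumptions for illustration):

```python
import numpy as np

def combine_scene(rotated_sfds, encoded_objects):
    """Sum component sound fields and HOA-encoded objects into one
    resultant description; valid because the SFD is linear."""
    return sum(rotated_sfds) + sum(encoded_objects)

# e.g. order-4 (25-channel) HOA blocks of 1024 samples each
sfds = [np.zeros((25, 1024)), np.zeros((25, 1024))]
objs = [np.zeros((25, 1024))]
resultant = combine_scene(sfds, objs)   # delivered to the Audio Presenter
```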
6.2. User Preferences
The way the audio is processed based on the navigation of the visual content depends very much on user preferences. With modern first person shooter games, people are used to the audio changing based on their every move. The relative directions of audio sources alter as they rotate their head, and sources get louder as they move towards them etc. However, this is a completely different paradigm to that of current television broadcasts, where the audio content is traditionally very static with discrete
MPEG-4 Standard. In 116th Conv. Audio Eng. Soc., Berlin, Germany, 2004.

[6] Jürgen Herre, Heiko Purnhagen, Jeroen Koppens, Oliver Hellmuth, Jonas Engdegård, Johannes Hilpert, Lars Villemoes, Leon Terentiv, Cornelia Falch, Andreas Hölzer, María Luis Valero, Barbara Resch, Harald Mundt, and Hyen-O Oh. MPEG Spatial Audio Object Coding – The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes. J. Audio Eng. Soc., 60(9):655–673, 2012.

[7] Matthias Geier, Jens Ahrens, and Sascha Spors. Object-based Audio Reproduction and the Audio Scene Description Format. Organised Sound, 15(3):219–227, 2010.

[8] C. Q. Robinson, S. Mehta, and N. Tsingos. Scalable Format and Tools to Extend the Possibilities of Cinema Audio. SMPTE Motion Imaging Journal, 121(8):63–69, 2012.

[9] I. Forrester and A. Churnside. The creation of a perceptive audio drama. NEM Summit, 2012.

[10] Mark Mann, Anthony W. P. Churnside, Andrew Bonney, and Frank Melchior. Object-based audio applied to football broadcasts. In Proceedings of the 2013 ACM International Workshop on Immersive Media Experiences, pages 13–16. ACM, 2013.

[11] Benjamin Guy Shirley. Improving Television Sound for People with Hearing Impairments. PhD thesis, University of Salford, 2013.

[12] Oliver Schreer, Ingo Feldmann, Christian Weissig, Peter Kauff, and Ralf Schäfer. Ultrahigh-Resolution Panoramic Imaging for Format-Agnostic Video Production. Proceedings of the IEEE, 101(1):99–114, 2013.

[13] Jérôme Daniel. Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. PhD thesis, Université Paris 6, 2001.

[14] D. Malham. Space in Music – Music in Space. PhD thesis, University of York, UK, 2003.

[15] Sven Kordon, Alexander Krüger, Johann-Markus Batke, and Holger Kropp. Optimization of Spherical Microphone Array Recordings. In International Conference on Spatial Audio, Detmold, Germany, November 2011.

[16] Robert Oldfield, Ben Shirley, and Jens Spille. Object-based audio for interactive football broadcast. Multimedia Tools and Applications, pages 1–25, 2013. doi: 10.1007/s11042-013-1472-2.

[17] MPEG-H. State of the Art in compression and transmission of 3D Video: Part 3 – 3D Audio. ISO/IEC 23008, draft/open. http://mpeg.chiariglione.org/standards/mpeg-h/3d-audio.

[18] MPEG. ISO 14496-3 (MPEG-4 Audio) Final Committee Draft. MPEG Document W2203, 1998.

[19] Jens Ahrens. Analytic Methods of Sound Field Synthesis. Springer, Heidelberg, Germany, 2012.

[20] Nail A. Gumerov and Ramani Duraiswami. Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Elsevier, first edition, 2004.