
Lecture Notes in Computer Science 6684

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Sølvi Ystad, Mitsuko Aramaki,
Richard Kronland-Martinet, and Kristoffer Jensen (Eds.)

Exploring
Music Contents
7th International Symposium, CMMR 2010
Málaga, Spain, June 21-24, 2010
Revised Papers

Volume Editors

Sølvi Ystad
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]

Mitsuko Aramaki
CNRS-INCM, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]

Richard Kronland-Martinet
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]

Kristoffer Jensen
Aalborg University Esbjerg, Niels Bohr Vej 8, 6700 Esbjerg, Denmark
E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-23125-4 e-ISBN 978-3-642-23126-1
DOI 10.1007/978-3-642-23126-1
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011936382

CR Subject Classification (1998): J.5, H.5, C.3, H.5.5, G.3, I.5

LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI

© Springer-Verlag Berlin Heidelberg 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

Computer Music Modeling and Retrieval (CMMR) 2010 was the seventh event
of this international conference series that was initiated in 2003. Since its start,
the conference has been co-organized by the University of Aalborg, Esbjerg, Den-
mark (http://www.aaue.dk) and the Laboratoire de Mécanique et d’Acoustique
in Marseille, France (http://www.lma.cnrs-mrs.fr) and has taken place in France,
Italy and Denmark. The six previous editions of CMMR offered a varied overview
of recent music information retrieval (MIR) and sound modeling activities in ad-
dition to alternative fields related to human interaction, perception and cognition.
This year’s CMMR took place in Málaga, Spain, June 21–24, 2010. The
conference was organized by the Application of Information and Communica-
tions Technologies Group (ATIC) of the University of Málaga (Spain), together
with LMA and INCM (CNRS, France) and AAUE (Denmark). The conference
featured three prominent keynote speakers working in the MIR area, and the
program of CMMR 2010 included in addition paper sessions, panel discussions,
posters and demos.
The proceedings of the previous CMMR conferences were published in the
Lecture Notes in Computer Science series (LNCS 2771, LNCS 3310, LNCS 3902,
LNCS 4969, LNCS 5493 and LNCS 5954), and the present edition follows the
lineage of the previous ones, including a collection of 22 papers within the topics
of CMMR. These articles were specially reviewed and corrected for this proceed-
ings volume.
The current book is divided into five main chapters that reflect the present
challenges within the field of computer music modeling and retrieval. The chap-
ters span topics from music interaction, composition tools and sound source
separation to data mining and music libraries. One chapter is also dedicated
to perceptual and cognitive aspects that are currently the subject of increased
interest in the MIR community. We are confident that CMMR 2010 brought
forward the research in these important areas.
We would like to thank Isabel Barbancho and her team at the Application of
Information and Communications Technologies Group (ATIC) of the University
of Málaga (Spain) for hosting the 7th CMMR conference and for ensuring a
successful organization of both scientific and social matters. We would also like
to thank the Program Committee members for their valuable paper reports and
thank all the participants who made CMMR 2010 a fruitful and convivial event.
Finally, we would like to thank Springer for accepting to publish the CMMR
2010 proceedings in their LNCS series.

March 2011 Sølvi Ystad


Mitsuko Aramaki
Richard Kronland-Martinet
Kristoffer Jensen
Organization

The 7th International Symposium on Computer Music Modeling and Retrieval
(CMMR 2010) was co-organized by the University of Málaga (Spain), Aalborg
University (Esbjerg, Denmark), and LMA/INCM-CNRS (Marseille, France).

Symposium Chair
Isabel Barbancho University of Málaga, Spain

Symposium Co-chairs
Kristoffer Jensen AAUE, Denmark
Sølvi Ystad CNRS-LMA, France

Demonstration and Panel Chairs


Ana M. Barbancho University of Málaga, Spain
Lorenzo J. Tardón University of Málaga, Spain

Program Committee
Paper and Program Chairs
Mitsuko Aramaki CNRS-INCM, France
Richard Kronland-Martinet CNRS-LMA, France

CMMR 2010 Referees

Mitsuko Aramaki, Federico Avanzini, Rolf Bader, Isabel Barbancho,
Ana M. Barbancho, Mathieu Barthet, Antonio Camurri, Laurent Daudet,
Olivier Derrien, Simon Dixon, Barry Eaglestone, Gianpaolo Evangelista,
Cédric Févotte, Bruno Giordano, Emilia Gómez, Brian Gygi, Goffredo Haus,
Kristoffer Jensen, Anssi Klapuri, Richard Kronland-Martinet, Marc Leman,
Sylvain Marchand, Grégory Pallone, Andreas Rauber, David Sharp,
Bob L. Sturm, Lorenzo J. Tardón, Vesa Välimäki, Sølvi Ystad
Table of Contents

Part I: Music Production, Interaction and Composition Tools

Probabilistic and Logic-Based Modelling of Harmony . . . . . . . . . . . . 1
Simon Dixon, Matthias Mauch, and Amélie Anglade

Interactive Music Applications and Standards . . . . . . . . . . . . 20
Rebecca Stewart, Panos Kudumakis, and Mark Sandler

Interactive Music with Active Audio CDs . . . . . . . . . . . . 31
Sylvain Marchand, Boris Mansencal, and Laurent Girin

Pitch Gestures in Generative Modeling of Music . . . . . . . . . . . . 51
Kristoffer Jensen

Part II: Music Structure Analysis - Sound Source Separation

An Entropy Based Method for Local Time-Adaptation of the
Spectrogram . . . . . . . . . . . . 60
Marco Liuni, Axel Röbel, Marco Romito, and Xavier Rodet

Transcription of Musical Audio Using Poisson Point Processes and
Sequential MCMC . . . . . . . . . . . . 76
Pete Bunch and Simon Godsill

Single Channel Music Sound Separation Based on Spectrogram
Decomposition and Note Classification . . . . . . . . . . . . 84
Wenwu Wang and Hafiz Mustafa

Notes on Nonnegative Tensor Factorization of the Spectrogram
for Audio Source Separation: Statistical Insights and towards
Self-Clustering of the Spatial Cues . . . . . . . . . . . . 102
Cédric Févotte and Alexey Ozerov

Part III: Auditory Perception, Artificial Intelligence and Cognition

What Signal Processing Can Do for the Music . . . . . . . . . . . . 116
Isabel Barbancho, Lorenzo J. Tardón, Ana M. Barbancho,
Andrés Ortiz, Simone Sammartino, and Cristina de la Bandera

Speech/Music Discrimination in Audio Podcast Using Structural
Segmentation and Timbre Recognition . . . . . . . . . . . . 138
Mathieu Barthet, Steven Hargreaves, and Mark Sandler

Computer Music Cloud . . . . . . . . . . . . 163
Jesús L. Alvaro and Beatriz Barros

Abstract Sounds and Their Applications in Audio and Perception
Research . . . . . . . . . . . . 176
Adrien Merer, Sølvi Ystad, Richard Kronland-Martinet, and
Mitsuko Aramaki

Part IV: Analysis and Data Mining

Pattern Induction and Matching in Music Signals . . . . . . . . . . . . 188
Anssi Klapuri

Unsupervised Analysis and Generation of Audio Percussion
Sequences . . . . . . . . . . . . 205
Marco Marchini and Hendrik Purwins

Identifying Attack Articulations in Classical Guitar . . . . . . . . . . . . 219
Tan Hakan Özaslan, Enric Guaus, Eric Palacios, and
Josep Lluis Arcos

Comparing Approaches to the Similarity of Musical Chord Sequences . . . 242
W. Bas de Haas, Matthias Robine, Pierre Hanna,
Remco C. Veltkamp, and Frans Wiering

Part V: MIR - Music Libraries

Songs2See and GlobalMusic2One: Two Applied Research Projects in
Music Information Retrieval at Fraunhofer IDMT . . . . . . . . . . . . 259
Christian Dittmar, Holger Großmann, Estefanía Cano,
Sascha Grollmisch, Hanna Lukashevich, and Jakob Abeßer

MusicGalaxy: A Multi-focus Zoomable Interface for Multi-facet
Exploration of Music Collections . . . . . . . . . . . . 273
Sebastian Stober and Andreas Nürnberger

A Database Approach to Symbolic Music Content Management . . . . . . . 303
Philippe Rigaux and Zoe Faget

Error-Tolerant Content-Based Music-Retrieval with Mathematical
Morphology . . . . . . . . . . . . 321
Mikko Karvonen, Mika Laitinen, Kjell Lemström, and Juho Vikman

Melodic Similarity through Shape Similarity . . . . . . . . . . . . 338
Julián Urbano, Juan Lloréns, Jorge Morato, and
Sonia Sánchez-Cuadrado

Content-Based Music Discovery . . . . . . . . . . . . 356
Dirk Schönfuß

Author Index . . . . . . . . . . . . 361


Probabilistic and Logic-Based
Modelling of Harmony

Simon Dixon, Matthias Mauch, and Amélie Anglade

Centre for Digital Music,


Queen Mary University of London,
Mile End Rd, London E1 4NS, UK
[email protected]
http://www.eecs.qmul.ac.uk/~simond

Abstract. Many computational models of music fail to capture essential
aspects of the high-level musical structure and context, and this limits
their usefulness, particularly for musically informed users. We describe
two recent approaches to modelling musical harmony, using a probabilis-
tic and a logic-based framework respectively, which attempt to reduce the
gap between computational models and human understanding of music.
The first is a chord transcription system which uses a high-level model of
musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian frame-
work, achieving state-of-the-art performance. The second approach uses
inductive logic programming to learn logical descriptions of harmonic
sequences which characterise particular styles or genres. Each approach
brings us one step closer to modelling music in the way it is conceptu-
alised by musicians.

Keywords: Chord transcription, inductive logic programming, musical harmony.

1 Introduction

Music is a complex phenomenon. Although music is described as a “universal
language”, when viewed as a paradigm for communication it is difficult to find
agreement on what constitutes a musical message (is it the composition or the
performance?), let alone the meaning of such a message. Human understand-
ing of music is at best incomplete, yet there is a vast body of knowledge and
practice regarding how music is composed, performed, recorded, reproduced and
analysed in ways that are appreciated in particular cultures and settings. It is
the computational modelling of this “common practice” (rather than philosoph-
ical questions regarding the nature of music) which we address in this paper. In
particular, we investigate harmony, which exists alongside melody, rhythm and
timbre as one of the fundamental attributes of Western tonal music.
Our starting point in this paper is the observation that many of the com-
putational models used in the music information retrieval and computer music
research communities fail to capture much of what is understood about music.

S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 1–19, 2011.
© Springer-Verlag Berlin Heidelberg 2011

Two examples are the bag-of-frames approach to music similarity [5], and the pe-
riodicity pattern approach to rhythm analysis [13], which are both independent
of the order of musical notes, whereas temporal order is an essential feature of
melody, rhythm and harmonic progression. Perhaps surprisingly, much progress
has been made in music informatics in recent years¹, despite the naïveté of the
musical models used and the claims that some tasks have reached a “glass
ceiling” [6].
The continuing progress can be explained in terms of a combination of factors:
the high level of redundancy in music, the simplicity of many of the tasks which
are attempted, and the limited scope of the algorithms which are developed. In
this regard we agree with [14], who review the first 10 years of ISMIR confer-
ences and list some challenges which the community “has not fully engaged with
before”. One of these challenges is to “dig deeper into the music itself”, which
would enable researchers to address more musically complex tasks; another is to
“expand ... musical horizons”, that is, broaden the scope of MIR systems.
In this paper we present two approaches to modelling musical harmony, aiming
at capturing the type of musical knowledge and reasoning a musician might use
in performing similar tasks. The first task we address is that of chord transcrip-
tion from audio recordings. We present a system which uses a high-level model
of musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian framework, and
generates the content of a “lead-sheet” containing the sequence of chord sym-
bols, including their bass notes and metrical positions, and the key signature and
any modulations over time. This system achieves state-of-the-art performance,
being rated first in its category in the 2009 and 2010 MIREX evaluations. The
second task to which we direct our attention is the machine learning of logical
descriptions of harmonic sequences in order to characterise particular styles or
genres. For this work we use inductive logic programming to obtain represen-
tations such as decision trees which can be used to classify unseen examples or
provide insight into the characteristics of a data corpus.
Computational models of harmony are important for many application areas
of music informatics, as well as for music psychology and musicology itself. For
example, a harmony model is a necessary component of intelligent music no-
tation software, for determining the correct key signature and pitch spelling of
accidentals where music is obtained from digital keyboards or MIDI files. Like-
wise, processes such as automatic transcription benefit from tracking the
harmonic context at each point in the music [24]. It has been shown that har-
monic modelling improves search and retrieval in music databases, for example
in order to find variations of an example query [36], which is useful for musi-
cological research. Theories of music cognition, if expressed unambiguously, can
be implemented and tested on large data corpora and compared with human
annotations, in order to verify or refine concepts in the theory.

¹ Progress is evident for example in the annual MIREX series of evaluations of
music information retrieval systems (http://www.music-ir.org/mirex/wiki/2010:Main_Page)

The remainder of the paper is structured as follows. The next section provides
an overview of research in harmony modelling. This is followed by a section
describing our probabilistic model of chord transcription. In section 4, we present
our logic-based approach to modelling of harmony, and show how this can be
used to characterise and classify music. The final section is a brief conclusion
and outline of future work.

2 Background
Research into computational analysis of harmony has a history of over four
decades since [44] proposed a grammar-based analysis that required the user
to manually remove any non-harmonic notes (e.g. passing notes, suspensions
and ornaments) before the algorithm processed the remaining chord sequence.
A grammar-based approach was also taken by [40], who developed a set of chord
substitution rules, in the form of a context-free grammar, for generating 12-bar
Blues sequences. [31] addressed the problem of extracting patterns and substitu-
tion rules automatically from jazz standard chord sequences, and discussed how
the notions of expectation and surprise are related to the use of these patterns
and rules.
Closely related to grammar-based approaches are rule-based approaches, which
were used widely in early artificial intelligence systems. [21] used an elimination
process combined with heuristic rules in order to infer the tonality given a fugue
melody from Bach’s Well-Tempered Clavier. [15] presents an expert system con-
sisting of about 350 rules for generating 4-part harmonisations of melodies in the
style of Bach Chorales. The rules cover the chord sequences, including cadences
and modulations, as well as the melodic lines of individual parts, including voice
leading. [28] developed an expert system with a complex set of rules for recognis-
ing consonances and dissonances in order to infer the chord sequence. Maxwell’s
approach was not able to infer harmony from a melodic sequence, as it considered
the harmony at any point in time to be defined by a subset of the simultaneously
sounding notes.
[41] addressed some of the weaknesses of earlier systems with a combined
rhythmic and harmonic analysis system based on preference rules [20]. The
system assigns a numerical score to each possible interpretation based on the
preference rules which the interpretation satisfies, and searches the space of all
solutions using dynamic programming restricted with a beam search. The sys-
tem benefits from the implementation of rules relating harmony and metre, such
as the preference rule which favours non-harmonic notes occurring on weak met-
rical positions. One claimed strength of the approach is the transparency of the
preference rules, but this is offset by the opacity of the system parameters such
as the numeric scores which are assigned to each rule.
[33] proposed a counting scheme for matching performed notes to chord tem-
plates for variable-length segments of music. The system is intentionally simplis-
tic, in order that the framework might easily be extended or modified. The main
contributions of the work are the graph search algorithms, inspired by Temper-
ley’s dynamic programming approach, which determine the segmentation to be

used in the analysis. The proposed graph search algorithm is shown to be much
more efficient than standard algorithms without differing greatly in the quality
of analyses it produces.
As an alternative to the rule-based approaches, which suffer from the cu-
mulative effects of errors, [38] proposed a probabilistic approach to functional
harmonic analysis, using a hidden Markov model. For each time unit (measure
or half-measure), their system outputs the current key and the scale degree of
the current chord. In order to make the computation tractable, a number of
simplifying assumptions were made, such as the symmetry of all musical keys.
Although this reduced the number of parameters by at least two orders of mag-
nitude, the training algorithm was only successful on a subset of the parameters,
and the remaining parameters were set by hand.
An alternative stream of research has been concerned with multidimensional
representations of polyphonic music [10,11,42] based on the Viewpoints approach
of [12]. This representation scheme is for example able to preserve information
about voice leading which is otherwise lost by approaches that treat harmony as
a sequence of chord symbols.
Although most research has focussed on analysing musical works, some work
investigates the properties of entire corpora. [25] compared two corpora of chord
sequences, belonging to jazz standards and popular (Beatles) songs respectively,
and found key- and context-independent patterns of chords which occurred fre-
quently in each corpus. [26] examined the statistics of the chord sequences of sev-
eral thousand songs, and compared the results to those from a standard natural
language corpus in an attempt to find lexical units in harmony that correspond
to words in language. [34,35] investigated whether stochastic language models in-
cluding naive Bayes classifiers and 2-, 3- and 4-grams could be used for automatic
genre classification. The models were tested on both symbolic and audio data,
where an off-the-shelf chord transcription algorithm was used to convert the audio
data to a symbolic representation. [39] analysed the Beatles corpus using proba-
bilistic N-grams in order to show that the dependency of a chord on its context
extends beyond the immediately preceding chord (the first-order Markov assump-
tion). [9] studied differences in the use of harmony across various periods of classi-
cal music history, using root progressions (i.e. the sequence of root notes of chords
in a progression) reduced to 2 categories (dominant and subdominant) to give a
representation called harmonic vectors. The use of root progressions is one of the
representations we use in our own work in section 4 [2].
All of the above systems process symbolic input, such as that found in a score,
although most of the systems do not require the level of detail provided by the
score (e.g. key signature, pitch spelling), which they are able to reconstruct from
the pitch and timing data. In recent years, the focus of research has shifted to the
analysis of audio files, starting with the work of [16], who computed a chroma
representation (salience of frequencies representing the 12 Western pitch classes,
independent of octave) which was matched to a set of chord templates using the
inner product. Alternatively, [7] modelled chords with a 12-dimensional Gaussian
distribution, where chord notes had a mean of 1, non-chord notes had a mean of 0,

and the covariance matrix had high values between pairs of chord notes. A hidden
Markov model was used to infer the most likely sequence of chords, where state
transition probabilities were initialised based on the distance between chords on
a special circle of fifths which included minor chords near to their relative major
chord. Further work on audio-based harmony analysis is reviewed thoroughly in
three recent doctoral theses, to which the interested reader is referred [22,18,32].

3 A Probabilistic Model for Chord Transcription

Music theory, perceptual studies, and musicians themselves generally agree that
no musical quality can be treated individually. When a musician transcribes
the chords of a piece of music, the chord labels are not assigned solely on the
basis of local pitch content of the signal. Musical context such as the key, met-
rical position and even the large-scale structure of the music play an important
role in the interpretation of harmony. [17, Chapter 4] conducted a survey among
human music transcription experts, and found that they use several musical con-
text elements to guide the transcription process: not only is a prior rough chord
detection the basis for accurate note transcription, but the chord transcription
itself depends on the tonal context and other parameters such as beats, instru-
mentation and structure.
The goal of our recent work on chord transcription [24,22,23] is to propose
computational models that integrate musical context into the automatic chord
estimation process. We employ a dynamic Bayesian network (DBN) to combine
models of metrical position, key, chord, bass note and beat-synchronous bass and
treble chroma into a single high-level musical context model. The most probable
sequence of metrical positions, keys, chords and bass notes is estimated via
Viterbi inference.
A DBN is a graphical model representing a succession of simple Bayesian
networks in time. These are assumed to be Markovian and time-invariant, so
the model can be expressed recursively in two time slices: the initial slice and
the recursive slice. Our DBN is shown in Figure 1. Each node in the network
represents a random variable, which might be an observed node (in our case
the bass and treble chroma) or a hidden node (the key, metrical position, chord
and bass pitch class nodes). Edges in the graph denote dependencies between
variables. In our DBN the musically interesting behaviour is modelled in the
recursive slice, which represents the progress of all variables from one beat to
the next. In the following paragraphs we explain the function of each node.
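Before turning to the individual nodes, it may help to see the inference step in outline. The following is a minimal, generic sketch of Viterbi decoding over the joint hidden state (metrical position, key, chord, bass); it is not the authors' implementation, and the prior, transition and emission functions are placeholders for the conditional distributions described in the following paragraphs.

def viterbi(observations, states, log_prior, log_trans, log_emit):
    """Generic Viterbi decoding over a joint hidden state.

    states    : list of joint states, e.g. (metrical_pos, key, chord, bass) tuples
    log_prior : function state -> log P(state at i = 0)
    log_trans : function (prev_state, state) -> log P(state | prev_state)
    log_emit  : function (observation, state) -> log p(chroma | state)
    """
    delta = {s: log_prior(s) + log_emit(observations[0], s) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointer = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] + log_trans(p, s))
            new_delta[s] = delta[best_prev] + log_trans(best_prev, s) + log_emit(obs, s)
            pointer[s] = best_prev
        delta = new_delta
        backpointers.append(pointer)
    # trace the best path backwards from the most probable final state
    path = [max(delta, key=delta.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return list(reversed(path))

In practice the joint state space (4 metrical positions × 24 keys × 121 chords × 13 bass states) is large, which is one reason the factored DBN formulation, with its sparse dependency structure, is preferable to a flat hidden Markov model over the joint state.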
Chord. Technically, the dependencies of the random variables are described in the
conditional probability distribution of the dependent variable. Since the highest
number of dependencies join at the chord variable, it takes a central position
in the network. Its conditional probability distribution is also the most com-
plex: it depends not only on the key and the metrical position, but also on the
chord variable in the previous slice. The chord variable has 121 different chord
states (see below), and its dependency on the previous chord variable enables

[Figure 1: six layers, top to bottom — metrical position M, key K, chord C, bass B,
bass chroma X^bs and treble chroma X^tr, each with nodes for slices i−1 and i.]

Fig. 1. Our network model topology, represented as a DBN with two slices and six
layers. The clear nodes represent random variables, while the observed ones are shaded
grey. The directed edges represent the dependency structure. Intra-slice dependency
edges are drawn solid, inter-slice dependency edges are dashed.

the reinforcement of smooth sequences of these states. The probability distribution
of chords conditional on the previous chord strongly favours the chord that
was active in the previous slice, similar to a high self-transition probability in
a hidden Markov model. While leading to a chord transcription that is stable
over time, dependence on the previous chord alone is not sufficient to model ad-
herence to the key. Instead, it is modelled conditionally on the key variable: the
probability distribution depends on the chord’s fit with the current key, based on
an expert function motivated by Krumhansl’s chord-key ratings [19, page 171].
Finally, the chord variable’s dependency on the metrical position node allows us
to favour chord changes at strong metrical positions to achieve a transcription
that resembles more closely that of a human transcriber.
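A minimal sketch of how such a conditional distribution can be assembled from the three influences just described (previous chord, key fit and metrical position). The specific numbers, the key_fit function and the treatment of position 1 as the downbeat are illustrative assumptions, not the published parameters.

def chord_distribution(prev_chord, key, metrical_pos, chords, key_fit,
                       p_self=0.8, strong_beat_bonus=2.0):
    """Unnormalised P(chord_i | chord_{i-1}, key_i, metrical position_i).

    key_fit(chord, key): score of the chord's fit with the key, e.g. derived
    from Krumhansl-style chord-key ratings.
    """
    scores = {}
    for c in chords:
        # strong preference for keeping the previous chord (smooth sequences)
        persistence = p_self if c == prev_chord else (1.0 - p_self) / (len(chords) - 1)
        # chord changes are more plausible on strong metrical positions
        # (position 1 is taken as the downbeat here, an assumption)
        metre = strong_beat_bonus if (c != prev_chord and metrical_pos == 1) else 1.0
        scores[c] = persistence * key_fit(c, key) * metre
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}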

Fig. 2. (a) The metrical position model; (b) the model of a single chroma pitch class
(Gaussian density as a function of note salience)

Key and metrical position. The dependency structure of the key and metrical
position variables are comparatively simpler, since they depend only on the re-
spective predecessor. The emphasis on smooth, stable key sequences is handled
in the same way as it is in chords, but the 24 states representing major and minor
keys have even higher self-transition probability, and hence they will persist for
longer stretches of time. The metrical position model represents a 4/4 meter and
hence has four states. The conditional probability distribution strongly favours
“normal” beat transitions, i.e. from one beat to the next, but it also allows for
irregular transitions in order to accommodate temporary deviations from 4/4
meter and occasional beat tracking errors. In Figure 2a, black arrows represent a
transition probability of 1−ε (where ε = 0.05) to the following beat. Grey arrows
represent a probability of ε/2 to jump to different beats through self-transition
or omission of the expected beat.
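The metrical position transitions can be written down directly; the short sketch below reproduces the 1−ε / ε/2 scheme (with ε = 0.05 as in the text), purely as an illustration.

import numpy as np

def metrical_transition_matrix(beats_per_bar=4, eps=0.05):
    """Row m gives P(next metrical position | current position m)."""
    T = np.zeros((beats_per_bar, beats_per_bar))
    for m in range(beats_per_bar):
        nxt = (m + 1) % beats_per_bar
        T[m, nxt] = 1.0 - eps                        # regular move to the next beat
        T[m, m] += eps / 2.0                         # self-transition (extra beat)
        T[m, (m + 2) % beats_per_bar] += eps / 2.0   # omission of the expected beat
    return T

# each row sums to one; with beats_per_bar = 4 the model has four states, as in the text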
Bass. The random variable that models the bass has 13 states, one for each of
the pitch classes, and one “no bass” state. It depends on both the current chord
and the previous chord. The current chord is the basis of the most probable bass
notes that can be chosen. The highest probability is assigned to the “nominal”
chord bass pitch class², lower probabilities to the remaining chord pitch classes,
and the rest of the probability mass is distributed between the remaining pitch
classes. The additional use of the dependency on the previous chord allows us
to model the behaviour of the bass note on the first beat of the chord differently
from its behaviour on later beats. We can thus model the tendency for the played
bass note to coincide with the “nominal” bass note of the chord (e.g. the note B
in the B7 chord), while there is more variation in the bass notes played during
the rest of the duration of the chord.
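The bass distribution can be sketched along the same lines. The probability values below are illustrative assumptions; the only structural points taken from the text are the 13 states (12 pitch classes plus “no bass”) and the different treatment of the first beat of a chord.

def bass_distribution(chord_pitch_classes, nominal_bass, is_first_beat,
                      p_nominal_first=0.8, p_nominal_later=0.5,
                      p_chord_tone=0.15, p_other=0.01, p_no_bass=0.02):
    """Unnormalised P(bass | current chord, whether the chord has just changed).

    chord_pitch_classes : set of pitch classes in the chord
    nominal_bass        : pitch class implied by the chord symbol (e.g. B for B7)
    """
    p_nominal = p_nominal_first if is_first_beat else p_nominal_later
    dist = {}
    for pc in range(12):
        if pc == nominal_bass:
            dist[pc] = p_nominal            # the written bass of the chord symbol
        elif pc in chord_pitch_classes:
            dist[pc] = p_chord_tone         # other chord tones (e.g. walking bass)
        else:
            dist[pc] = p_other              # remaining pitch classes
    dist["N"] = p_no_bass                   # the "no bass" state
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}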
Chroma. The chroma nodes provide models of the bass and treble chroma au-
dio features. Unlike the discrete nodes previously discussed, they are continuous
because the 12 elements of the chroma vector represent relative salience, which
² The chord symbol itself always implies a bass note, but the bass line might include
other notes not specified by the chord symbol, as in the case of walking bass.

can assume any value between zero and unity. We represent both bass and treble
chroma as multidimensional Gaussian random variables. The bass chroma vari-
able has 13 different Gaussians, one for every bass state, and the treble chroma
node has 121 Gaussians, one for every chord state. The means of the Gaussians
are set to reflect the nature of the chords: to unity for pitch classes that are
part of the chord, and to zero for the rest. A single variate in the 12-dimensional
Gaussian treble chroma distribution models one pitch class, as illustrated in Fig-
ure 2b. Since the chroma values are normalised to the unit interval, the Gaussian
model functions similar to a regression model: for a given chord the Gaussian
density increases with increasing salience of the chord notes (solid line), and
decreases with increasing salience of non-chord notes (dashed line). For more
details see [22].
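A sketch of the treble chroma observation model as described: the mean vector is a 0/1 chord template and the density is evaluated on the normalised chroma vector. The diagonal covariance and its width are assumptions made here for illustration.

import numpy as np
from scipy.stats import multivariate_normal

def chord_template(chord_pitch_classes, n_bins=12):
    """Mean vector: 1 for pitch classes belonging to the chord, 0 otherwise."""
    mu = np.zeros(n_bins)
    mu[list(chord_pitch_classes)] = 1.0
    return mu

def treble_chroma_loglik(chroma, chord_pitch_classes, sigma=0.2):
    """log p(chroma | chord) for a 12-bin chroma vector normalised to [0, 1]."""
    mu = chord_template(chord_pitch_classes)
    return multivariate_normal.logpdf(chroma, mean=mu, cov=(sigma ** 2) * np.eye(12))

# e.g. a C major triad (pitch classes C, E, G):
# treble_chroma_loglik(observed_chroma, {0, 4, 7})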
One important aspect of the model is the wide variety of chords it uses.
It models ten different chord types (maj, min, maj/3, maj/5, maj6, 7, maj7,
min7, dim, aug) and the “no chord” class N. The chord labels with slashes
denote chords whose bass note differs from the chord root, for example D/3
represents a D major chord in first inversion (sometimes written D/F♯). The
recognition of these chords is a novel feature of our chord recognition algorithm.
Figure 3 shows a score rendered using exclusively the information in our model.
In the last four bars, marked with a box, the second chord is correctly annotated
as D/F♯. The position of the bar lines is obtained from the metrical position
variable, the key signature from the key variable, and the bass notes from the
bass variable. The chord labels are obtained from the chord variable, replicated
as notes in the treble staff for better visualisation. The crotchet rest on the first
beat of the piece indicates that here, the Viterbi algorithm inferred that the “no
chord” model fits best.
Using a standard test set of 210 songs used in the MIREX chord detection
task, our basic model achieved an accuracy of 73%, with each component of the
model contributing significantly to the result. This improves on the best result at

[Score excerpt omitted; the transcribed chord symbols in the excerpt are G, B7, Em,
G7, C, F, C, G, D/F♯, Em, Bm and G7.]
Fig. 3. Excerpt of automatic output of our algorithm (top) and song book version
(bottom) of the pop song “Friends Will Be Friends” (Deacon/Mercury). The song
book excerpt corresponds to the four bars marked with a box.

[Figure 4 rows, top to bottom, for “It Won’t Be Long”: ground-truth segmentation
(chorus, verse, chorus, bridge, verse, chorus, bridge, verse, chorus, outro); automatic
segmentation (parts n1, A, B, A, B, A, n2); chords correct using automatic
segmentation; chords correct with the baseline method; time axis 0–120 s.]

Fig. 4. Segmentation and its effect on chord transcription for the Beatles’ song “It
Won’t Be Long” (Lennon/McCartney). The top 2 rows show the human and automatic
segmentation respectively. Although the structure is different, the main repetitions are
correctly identified. The bottom 2 rows show (in black) where the chord was transcribed
correctly by our algorithm using (respectively not using) the segmentation information.

MIREX 2009 for pre-trained systems. Further improvements have been made via
two extensions of this model: taking advantage of repeated structural segments
(e.g. verses or choruses), and refining the front-end audio processing.
Most musical pieces have segments which occur more than once in the piece,
and there are two reasons for wishing to identify these repetitions. First, multiple
sets of data provide us with extra information which can be shared between the
repeated segments to improve detection performance. Second, in the interest of
consistency, we can ensure that the repeated sections are labelled with the same
set of chord symbols. We developed an algorithm that automatically extracts the
repetition structure from a beat-synchronous chroma representation [27], which
ranked first in the 2009 MIREX Structural Segmentation task.
After building a similarity matrix based on the correlation between beat-
synchronous chroma vectors, the method finds sets of repetitions whose ele-
ments have the same length in beats. A repetition set composed of n elements
with length d receives a score of (n − 1)d, reflecting how much space a hypothet-
ical music editor could save by typesetting a repeated segment only once. The
repetition set with the maximum score (“part A” in Figure 4) is added to the
final list of structural elements, and the process is repeated on the remainder of
the song until no valid repetition sets are left.
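The scoring and greedy selection of repetition sets can be sketched as follows. The candidate repetition sets, which in the real system come from the beat-synchronous similarity matrix, are assumed to be given; the only parts taken from the text are the (n − 1)d score and the repeat-until-exhausted selection loop.

def select_repetition_sets(candidates):
    """Greedy structure extraction from candidate repetition sets.

    candidates : list of repetition sets, each a list of (start_beat, length)
                 segments of equal length d.
    """
    chosen, remaining, covered = [], list(candidates), set()
    while remaining:
        # score (n - 1) * d: the beats a hypothetical music editor would save
        # by typesetting the repeated segment only once
        best = max(remaining, key=lambda s: (len(s) - 1) * s[0][1])
        if (len(best) - 1) * best[0][1] == 0:
            break                       # no valid repetition sets are left
        chosen.append(best)
        for start, length in best:
            covered.update(range(start, start + length))
        # discard candidates overlapping what has already been assigned
        remaining = [s for s in remaining if s is not best and
                     all(covered.isdisjoint(range(st, st + ln)) for st, ln in s)]
    return chosen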
The resulting structural segmentation is then used to merge the chroma repre-
sentations of matching segments. Despite the inevitable errors propagated from
incorrect segmentation, we found a significant performance increase (to 75% on
the MIREX score) by using the segmentation. In Figure 4 the beneficial effect
of using the structural segmentation can clearly be observed: many of the white
stripes representing chord recognition errors are eliminated by the structural
segmentation method, compared to the baseline method.

A further improvement was achieved by modifying the front end audio pro-
cessing. We found that by learning chord profiles as Gaussian mixtures, the
recognition rate of some chords can be improved. However this did not result
in an overall improvement, as the performance on the most common chords de-
creased. Instead, an approximate pitch transcription method using non-negative
least squares was employed to reduce the effect of upper harmonics in the chroma
representations [23]. This results in both a qualitative (reduction of specific er-
rors) and quantitative (a substantial overall increase in accuracy) improvement
in results, with a MIREX score of 79% (without using segmentation), which
again is significantly better than the state of the art. By combining both of the
above enhancements we reach an accuracy of 81%, a statistically significant im-
provement over the best result (74%) in the 2009 MIREX Chord Detection tasks
and over our own previously mentioned results.
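The non-negative least squares front end mentioned above can be sketched as follows: a dictionary of idealised note spectra (here built with a few geometrically decaying harmonics, a simplifying assumption) is fitted to each magnitude spectrum, and the resulting note activations are folded into a 12-bin chroma vector, attenuating the upper partials that contaminate a standard chroma feature.

import numpy as np
from scipy.optimize import nnls

def note_dictionary(freqs, midi_notes, n_harmonics=4, decay=0.6, rel_width=0.03):
    """Columns are idealised magnitude spectra of single notes."""
    D = np.zeros((len(freqs), len(midi_notes)))
    for j, m in enumerate(midi_notes):
        f0 = 440.0 * 2.0 ** ((m - 69) / 12.0)
        for h in range(1, n_harmonics + 1):
            width = rel_width * h * f0
            D[:, j] += decay ** (h - 1) * np.exp(-((freqs - h * f0) ** 2) / (2 * width ** 2))
    return D

def nnls_chroma(spectrum, freqs, midi_notes=tuple(range(36, 96))):
    """Approximate pitch transcription by NNLS, folded down to a chroma vector."""
    D = note_dictionary(freqs, midi_notes)
    activations, _ = nnls(D, spectrum)        # non-negative note saliences
    chroma = np.zeros(12)
    for m, a in zip(midi_notes, activations):
        chroma[m % 12] += a
    return chroma / max(chroma.max(), 1e-9)   # normalise to [0, 1]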

4 Logic-Based Modelling of Harmony


First order logic (FOL) is a natural formalism for representing harmony, as it is
sufficiently general for describing combinations and sequences of notes of arbi-
trary complexity, and there are well-studied approaches for performing inference,
pattern matching and pattern discovery using subsets of FOL. A further advan-
tage of logic-based representations is that a system’s output can be presented in
an intuitive way to non-expert users. For example, a decision tree generated by
our learning approach provides much more intuition about what was learnt than
would a matrix of state transition probabilities. In this work we focus in particu-
lar on inductive logic programming (ILP), which is a machine learning approach
using logic programming (a subset of FOL) to uniformly represent examples,
background knowledge and hypotheses. An ILP system takes as input a set of
positive and negative examples of a concept, plus some background knowledge,
and outputs a logic program which “explains” the concept, in the sense that
all of the positive examples but (ideally) none of the negative examples can be
derived from the logic program and background knowledge.
ILP has been used for various musical tasks, including inference of harmony
[37] and counterpoint [30] rules from musical examples, as well as rules for ex-
pressive performance [43]. In our work, we use ILP to learn sequences of chords
that might be characteristic of a musical style [2], and test the models on classi-
fication tasks [3,4,1]. In each case we represent the harmony of a piece of music
by a list of chords, and learn models which characterise the various classes of
training data in terms of features derived from subsequences of these chord lists.

4.1 Style Characterisation


In our first experiments [2], we analysed two chord corpora, consisting of the
Beatles studio albums (180 songs, 14132 chords) and a set of jazz standards
from the Real Book (244 songs, 24409 chords) to find harmonic patterns that
differentiate the two corpora. Chord sequences were represented in terms of the
interval between successive root notes or successive bass notes (to make the

sequences key-independent), plus the category of each chord (reduced to a triad
except in the case of the dominant seventh chord). For the Beatles data, where
the key had been annotated for each piece, we were also able to express the
chord symbols in terms of the scale degree relative to the key, rather than its
pitch class, giving a more musically satisfying representation. Chord sequences of
length 4 were used, which we had previously found [25] to be a good compromise
of sufficient length to capture the context (and thus the function) of the chords,
without the sequences being overspecific, in which case few or no patterns would
be found.
Two models were built, one using the Beatles corpus as positive examples
and the other using the Real Book corpus as positive examples. The ILP system
Aleph was employed, which finds a minimal set of rules which cover (i.e. describe)
all positive examples (and a minimum number of negative examples). The models
built by Aleph consisted of 250 rules for the Beatles corpus and 596 rules for the
Real Book. Note that these rules cover every 4-chord sequence in every song,
so it is only the rules which cover many examples that are relevant in terms
of characterising the corpus. Also, once a sequence has been covered, it is not
considered again by the system, so the output is dependent on the order of
presentation of the examples.
We briefly discuss some examples of rules with the highest coverage. For the
Beatles corpus, the highest coverage (35%) was the 4-chord sequence of major
triads (regardless of roots). Other highly-ranked patterns of chord categories
(5% coverage) had 3 major triads and one minor triad in the sequence. This is
not surprising, in that popular music generally has a less rich harmonic vocab-
ulary than jazz. Patterns of root intervals were also found, including a [perfect
4th, perfect 5th, perfect 4th] pattern (4%), which could for example be inter-
preted as a I - IV - I - IV progression or as V - I - V - I. Since the root
interval does not encode the key, it is not possible to distinguish between these
interpretations (and it is likely that the data contains instances of both). At
2% coverage, the interval sequence [perfect 4th, major 2nd, perfect 4th] (e.g.
I - IV - V - I) is another well-known chord sequence.
No single rule covered as many Real Book sequences as the top rule for the
Beatles, but some typical jazz patterns were found, such as [perfect 4th, perfect
4th, perfect 4th] (e.g. ii - V - I - IV, coverage 8%), a cycle of descending
fifths, and [major 6th, perfect 4th, perfect 4th] (e.g. I - vi - ii - V, coverage
3%), a typical turnaround pattern.
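For concreteness, the key-independent root-interval representation behind these patterns can be computed as in the sketch below; chords are given as (root pitch class, category) pairs, and the interval names above correspond to semitone counts (perfect 4th = 5, perfect 5th = 7, major 2nd = 2, major 6th = 9). This is an illustration of the representation, not the Aleph encoding itself.

def root_intervals(chords):
    """Ascending interval in semitones between successive chord roots.

    chords : list of (root_pitch_class, category) pairs.
    """
    return [(chords[i + 1][0] - chords[i][0]) % 12 for i in range(len(chords) - 1)]

def four_chord_patterns(chords):
    """All length-4 subsequences as (root-interval triple, category 4-tuple)."""
    return [(tuple(root_intervals(chords[i:i + 4])),
             tuple(cat for _, cat in chords[i:i + 4]))
            for i in range(len(chords) - 3)]

# e.g. C - F - C - F, all major triads:
# four_chord_patterns([(0, 'maj'), (5, 'maj'), (0, 'maj'), (5, 'maj')])
# -> [((5, 7, 5), ('maj', 'maj', 'maj', 'maj'))]
# i.e. the [perfect 4th, perfect 5th, perfect 4th] pattern discussed above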
One weakness with this first experiment, in terms of its goal as a pattern
discovery method, is that the concept to learn and the vocabulary to describe
it (defined in the background knowledge) need to be given in advance. Differ-
ent vocabularies result in different concept descriptions, and a typical process
of concept characterisation is interactive, involving several refinements of the
vocabulary in order to obtain an interesting theory. Thus, as we refine the vo-
cabulary we inevitably reduce the problem to a pattern matching task rather
than pattern discovery. A second issue is that since musical styles have no for-
mal definition, it is not possible to quantify the success of style characterisation

directly, but only indirectly, by using the learnt models to classify unseen exam-
ples. Thus the following harmony modelling experiments are evaluated via the
task of genre classification.

4.2 Genre Classification


For the subsequent experiments we extended the representation to allow variable
length patterns and used TILDE, a first-order logic decision tree induction algo-
rithm for modelling harmony [3,4]. As test data we used a collection of 856 pieces
(120510 chords) covering 3 genres, each of which was divided into a further 3
subgenres: academic music (Baroque, Classical, Romantic), popular music (Pop,
Blues, Celtic) and jazz (Pre-bop, Bop, Bossa Nova). The data is represented in
the Band in a Box format, containing a symbolic encoding of the chords, which
were extracted and encoded using a definite clause grammar (DCG) formalism.
The software Band in a Box is designed to produce an accompaniment based on
the chord symbols, using a MIDI synthesiser. In further experiments we tested
the classification method using automatic chord transcription (see section 3)
from the synthesised audio data, in order to test the robustness of the system
to errors in the chord symbols.
The DCG representation was developed for natural language processing to
express syntax or grammar rules in a format which is both human-readable
and machine-executable. Each predicate has two arguments (possibly among
other arguments), an input list and an output list, where the output list is
always a suffix of the input list. The difference between the two lists corre-
sponds to the subsequence described by the predicate. For example, the pred-
icate gap(In,Out) states that the input list of chords (In) commences with a
subsequence corresponding to a “gap”, and the remainder of the input list is
equal to the output list (Out). In our representation, a gap is an arbitrary se-
quence of chords, which allows the representation to skip any number of chords
at the beginning of the input list without matching them to any harmony con-
cept. Extra arguments can encode parameters and/or context, so that the term
degreeAndCategory(Deg,Cat,In,Out,Key) states that the list In begins with
a chord of scale degree Deg and chord category Cat in the context of the key
Key. Thus the sequence:
gap(S,T),
degreeAndCategory(2,min7,T,U,gMajor),
degreeAndCategory(5,7,U,V,gMajor),
degreeAndCategory(1,maj7,V,[],gMajor)
states that the list S starts with any chord subsequence (gap), followed by a
minor 7th chord on the 2nd degree of G major (i.e. Amin7), followed by a (dom-
inant) 7th chord on the 5th degree (D7) and ending with a major 7th chord on
the tonic (Gmaj7).
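Read procedurally, each predicate consumes a prefix of the input chord list and hands the remainder on through its output argument. The following sketch is an illustrative re-reading of the two predicates in a general-purpose language (it is not the Prolog/TILDE machinery itself); chords are assumed to be given as key-relative (degree, category) pairs.

def degree_and_category(deg, cat, chords):
    """Match a single chord; return the remaining list, or None on failure."""
    if chords and chords[0] == (deg, cat):
        return chords[1:]
    return None

def gap_then(pattern, chords):
    """gap/2 followed by a fixed pattern: skip any number of chords, then try
    to match pattern = [(deg, cat), ...] starting at each position."""
    for i in range(len(chords) + 1):
        rest = chords[i:]
        for deg, cat in pattern:
            rest = degree_and_category(deg, cat, rest)
            if rest is None:
                break
        else:
            return True
    return False

# the example above, a ii7 - V7 - Imaj7 cadence anywhere in a G major piece:
# gap_then([(2, 'min7'), (5, '7'), (1, 'maj7')], song_chords)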
TILDE learns a classification model based on a vocabulary of predicates sup-
plied by the user. In our case, we described the chords in terms of their root note,

[Figure 5 shows a fragment of the induced tree for genre(g,A,B,Key). Each internal
node tests a conjunction of predicates such as gap(A,C), degAndCat(5,maj,C,D,Key),
degAndCat(1,min,D,E,Key); the Y/N branches lead either to further tests or to
leaves labelled jazz or academic.]

Fig. 5. Part of the decision tree for a binary classifier for the classes Jazz and Academic

Table 1. Results compared with the baseline for 2-class, 3-class and 9-class classifica-
tion tasks
Classification Task Baseline Symbolic Audio
Academic – Jazz 0.55 0.947 0.912
Academic – Popular 0.55 0.826 0.728
Jazz – Popular 0.61 0.891 0.807
Academic – Popular – Jazz 0.40 0.805 0.696
All 9 subgenres 0.21 0.525 0.415

scale degree, chord category, and intervals between successive root notes, and we
constrained the learning algorithm to generate rules containing subsequences of
length at least two chords. The model can be expressed as a decision tree, as
shown in figure 5, where the choice of branch taken is based on whether or not
the chord sequence matches the predicates at the current node, and the class
to which the sequence belongs is given by the leaf of the decision tree reached
by following these choices. The decision tree is equivalent to an ordered set of
rules or a Prolog program. Note that a rule at a single node of a tree cannot
necessarily be understood outside of its context in the tree. In particular, a rule
by itself cannot be used as a classifier.
The results for various classification tasks are shown in Table 1. All results are
significantly above the baseline, but performance clearly decreases for more dif-
ficult tasks. Perfect classification is not to be expected from harmony data, since
other aspects of music such as instrumentation (timbre), rhythm and melody
are also involved in defining and recognising musical styles.
Analysis of the most common rules extracted from the decision tree models
built during these experiments reveals some interesting and well-known jazz,
academic and popular music harmony patterns. For each rule shown below, the
coverage expresses the fraction of songs in each class that match the rule. For
example, while a perfect cadence is common to both academic and jazz styles,
the chord categories distinguish the styles very well, with academic music using
triads and jazz using seventh chords:

genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(5,maj,C,D,Key),
degreeAndCategory(1,maj,D,E,Key),
gap(E,B).

[Coverage: academic=133/235; jazz=10/338]

genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(5,7,C,D,Key),
degreeAndCategory(1,maj7,D,E,Key),
gap(E,B).

[Coverage: jazz=146/338; academic=0/235]

A good indicator of blues is the sequence: ... - I7 - IV7 - ...

genre(blues,A,B,Key) :- gap(A,C),
degreeAndCategory(1,7,C,D,Key),
degreeAndCategory(4,7,D,E,Key),
gap(E,B).

[Coverage: blues=42/84; celtic=0/99; pop=2/100]

On the other hand, jazz is characterised (but not exclusively) by the sequence:
... - ii7 - V7 - ...

genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(2,min7,C,D,Key),
degreeAndCategory(5,7,D,E,Key),
gap(E,B).

[Coverage: jazz=273/338; academic=42/235; popular=52/283]

The representation also allows for longer rules to be expressed, such as the
following rule describing a modulation to the dominant key and back again in
academic music: ... - II7 - V - ... - I - V7 - ...

genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(2,7,C,D,Key),
degreeAndCategory(5,maj,D,E,Key),
gap(E,F),
degreeAndCategory(1,maj,F,G,Key),
degreeAndCategory(5,7,G,H,Key),
gap(H,B).

[Coverage: academic=75/235; jazz=0/338; popular=1/283]



Although none of the rules are particularly surprising, these examples illus-
trate some meaningful musicological concepts that are captured by the rules. In
general, we observed that Academic music is characterised by rules establish-
ing the tonality, e.g. via cadences, while Jazz is less about tonality, and more
about harmonic colour, e.g. the use of 7th, 6th, augmented and more complex
chords, and Popular music harmony tends to have simpler harmonic rules as
melody is predominant in this style. The system is also able to find longer rules
that a human might not spot easily. Working from audio data, even though the
transcriptions are not fully accurate, the classification and rules still capture the
same general trends as for symbolic data.
For genre classification we are not advocating a harmony-based approach
alone. It is clear that other musical features are better predictors of genre.
Nevertheless, the positive results encouraged a further experiment in which we
integrated the current classification approach with a state-of-the-art genre classi-
fication system, to test whether the addition of a harmony feature could improve
its performance.

4.3 Genre Classification Using Harmony and Low-Level Features


In recent work [1] we developed a genre classification framework combining both
low-level signal-based features and high-level harmony features. A state-of-the-
art statistical genre classifier [8] using 206 features, covering spectral, temporal,
energy, and pitch characteristics of the audio signal, was extended using a ran-
dom forest classifier containing rules for each genre (classical, jazz and pop)
derived from chord sequences. We extended our previous work using the first-
order logic induction algorithm TILDE, to learn a random forest instead of a
single decision tree from the chord sequence corpus described in the previous
genre classification experiments. The random forest model achieved better clas-
sification rates (88% on the symbolic data and 76% on the audio data) for the
three-class classification problem (previous results 81% and 70% respectively).
Having trained the harmony classifier, its output was added as an extra feature
to the low-level classifier and the combined classifier was tested on three-genre
subsets of two standard genre classification data sets (GTZAN and ISMIR04)
containing 300 and 448 recordings respectively. Multilayer perceptrons and sup-
port vector machines were employed to classify the test data using 5×5-fold
cross-validation and feature selection. Results are shown in table 2 for the sup-
port vector machine classifier, which outperformed the multilayer perceptrons.
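The late-fusion setup can be sketched as follows; the array names are placeholders and scikit-learn is used purely for illustration (the actual system uses its own feature selection and a 5×5-fold protocol).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def combined_features(low_level, harmony_output):
    """Append the harmony classifier's output (e.g. class probabilities from
    the random forest of harmony rules) to the low-level feature vectors."""
    return np.hstack([low_level, harmony_output])

def evaluate(low_level, harmony_output, labels):
    X = combined_features(low_level, harmony_output)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(clf, X, labels, cv=5).mean()   # mean 5-fold accuracy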
Results indicate that the combination of low-level features with the harmony-
based classifier produces improved genre classification results despite the fact

Table 2. Best mean classification results (and number of features used) for the two
data sets using 5×5-fold cross-validation and feature selection

Classifier GTZAN data set ISMIR04 data set


SVM without harmony feature 0.887 (60 features) 0.938 (70 features)
SVM with harmony feature 0.911 (50 features) 0.953 (80 features)

that the classification rate of the harmony-based classifier alone is poor. For
both datasets the improvements over the standard classifier (as shown in table
2) were found to be statistically significant.

5 Conclusion
We have looked at two approaches to the modelling of harmony which aim to “dig
deeper into the music”. In our probabilistic approach to chord transcription, we
demonstrated the advantage of modelling musical context such as key, metrical
structure and bass line, and simultaneously estimating all of these variables
along with the chord. We also developed an audio feature using non-negative
least squares that reflects the notes played better than the standard chroma
feature, and therefore reduces interference from harmonically irrelevant partials
and noise. A further improvement of the system was obtained by modelling the
global structure of the music, identifying repeated sections and averaging features
over these segments. One promising avenue of further work is the separation of
the audio (low-level) and symbolic (high-level) models which are conceptually
distinct but modelled together in current systems. A low-level model would be
concerned only with the production or analysis of audio — the mapping from
notes to features; while a high-level model would be a musical model handling
the mapping from chord symbols to notes.
Using a logic-based approach, we showed that it is possible to automatically
discover patterns in chord sequences which characterise a corpus of data, and
to use such models as classifiers. The advantage with a logic-based approach is
that models learnt by the system are transparent: the decision tree models can
be presented to users as sets of human readable rules. This explanatory power is
particularly relevant for applications such as music recommendation. The DCG
representation allows chord sequences of any length to coexist in the same model,
as well as context information such as key. Our experiments found that the more
musically meaningful Degree-and-Category representation gave better classifica-
tion results than using root intervals. The results using transcription from audio
data were encouraging in that although some information was lost in the tran-
scription process, the classification results remained well above the baseline, and
thus this approach is still viable when symbolic representations of the music are
not available. Finally, we showed that the combination of high-level harmony
features with low-level features can lead to genre classification accuracy im-
provements in a state-of-the-art system, and believe that such high-level models
provide a promising direction for genre classification research.
While these methods have advanced the state of the art in music informatics,
it is clear that in several respects they are not yet close to an expert musician’s
understanding of harmony. Limiting the representation of harmony to a list of
chord symbols is inadequate for many applications. Such a representation may
be sufficient as a memory aid for jazz and pop musicians, but it allows only a very
limited specification of chord voicing (via the bass note), and does not permit
analysis of polyphonic texture such as voice leading, an important concept in
many harmonic styles, unlike the recent work of [11] and [29]. Finally, we note
Probabilistic and Logic-Based Modelling of Harmony 17

that the current work provides little insight into harmonic function, for example
the ability to distinguish harmony notes from ornamental and passing notes and
to recognise chord substitutions, both of which are essential characteristics of a
system that models a musician’s understanding of harmony. We hope to address
these issues in future work.

Acknowledgements. This work was performed under the OMRAS2 project,


supported by the Engineering and Physical Sciences Research Council, grant
EP/E017614/1. We would like to thank Chris Harte, Matthew Davies and others
at C4DM who contributed to the annotation of the audio data, and the Pattern
Recognition and Artificial Intelligence Group at the University of Alicante, who
provided the Band in a Box data.

References
1. Anglade, A., Benetos, E., Mauch, M., Dixon, S.: Improving music genre classi-
fication using automatically induced harmony rules. Journal of New Music Re-
search 39(4), 349–361 (2010)
2. Anglade, A., Dixon, S.: Characterisation of harmony with inductive logic program-
ming. In: 9th International Conference on Music Information Retrieval, pp. 63–68
(2008)
3. Anglade, A., Ramirez, R., Dixon, S.: First-order logic classification models of mu-
sical genres based on harmony. In: 6th Sound and Music Computing Conference,
pp. 309–314 (2009)
4. Anglade, A., Ramirez, R., Dixon, S.: Genre classification using harmony rules in-
duced from automatic chord transcriptions. In: 10th International Society for Music
Information Retrieval Conference, pp. 669–674 (2009)
5. Aucouturier, J.J., Defréville, B., Pachet, F.: The bag-of-frames approach to audio
pattern recognition: A sufficient model for urban soundscapes but not for poly-
phonic music. Journal of the Acoustical Society of America 122(2), 881–891 (2007)
6. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky?
Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
7. Bello, J.P., Pickens, J.: A robust mid-level representation for harmonic content in
music signals. In: 6th International Conference on Music Information Retrieval,
pp. 304–311 (2005)
8. Benetos, E., Kotropoulos, C.: Non-negative tensor factorization applied to music
genre classification. IEEE Transactions on Audio, Speech, and Language Process-
ing 18(8), 1955–1967 (2010)
9. Cathé, P.: Harmonic vectors and stylistic analysis: A computer-aided analysis of
the first movement of Brahms’ String Quartet Op. 51-1. Journal of Mathematics
and Music 4(2), 107–119 (2010)
10. Conklin, D.: Representation and discovery of vertical patterns in music. In:
Anagnostopoulou, C., Ferrand, M., Smaill, A. (eds.) ICMAI 2002. LNCS (LNAI),
vol. 2445, pp. 32–42. Springer, Heidelberg (2002)
11. Conklin, D., Bergeron, M.: Discovery of contrapuntal patterns. In: 11th Interna-
tional Society for Music Information Retrieval Conference, pp. 201–206 (2010)
12. Conklin, D., Witten, I.: Multiple viewpoint systems for music prediction. Journal
of New Music Research 24(1), 51–73 (1995)
18 S. Dixon, M. Mauch, and A. Anglade

13. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: 4th International Conference on Music Information Retrieval, pp.
159–165 (2003)
14. Downie, J., Byrd, D., Crawford, T.: Ten years of ISMIR: Reflections on challenges
and opportunities. In: 10th International Society for Music Information Retrieval
Conference, pp. 13–18 (2009)
15. Ebcioğlu, K.: An expert system for harmonizing chorales in the style of J. S. Bach.
In: Balaban, M., Ebcioiğlu, K., Laske, O. (eds.) Understanding Music with AI, pp.
294–333. MIT Press, Cambridge (1992)
16. Fujishima, T.: Realtime chord recognition of musical sound: A system using Com-
mon Lisp Music. In: Proceedings of the International Computer Music Conference,
pp. 464–467 (1999)
17. Hainsworth, S.W.: Techniques for the Automated Analysis of Musical Audio. Ph.D.
thesis, University of Cambridge, Cambridge, UK (2003)
18. Harte, C.: Towards Automatic Extraction of Harmony Information from Music
Signals. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music
(2010)
19. Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press,
Oxford (1990)
20. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
21. Longuet-Higgins, H., Steedman, M.: On interpreting Bach. Machine Intelligence 6,
221–241 (1971)
22. Mauch, M.: Automatic Chord Transcription from Audio Using Computational
Models of Musical Context. Ph.D. thesis, Queen Mary University of London, Cen-
tre for Digital Music (2010)
23. Mauch, M., Dixon, S.: Approximate note transcription for the improved identi-
fication of difficult chords. In: 11th International Society for Music Information
Retrieval Conference, pp. 135–140 (2010)
24. Mauch, M., Dixon, S.: Simultaneous estimation of chords and musical context from
audio. IEEE Transactions on Audio, Speech and Language Processing 18(6), 1280–
1289 (2010)
25. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering chord idioms
through Beatles and Real Book songs. In: 8th International Conference on Music
Information Retrieval, pp. 111–114 (2007)
26. Mauch, M., Müllensiefen, D., Dixon, S., Wiggins, G.: Can statistical language mod-
els be used for the analysis of harmonic progressions? In: International Conference
on Music Perception and Cognition (2008)
27. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic
chord transcription. In: 10th International Society for Music Information Retrieval
Conference, pp. 231–236 (2009)
28. Maxwell, H.: An expert system for harmonizing analysis of tonal music. In:
Balaban, M., Ebcioiğlu, K., Laske, O. (eds.) Understanding Music with AI, pp.
334–353. MIT Press, Cambridge (1992)
29. Mearns, L., Tidhar, D., Dixon, S.: Characterisation of composer style using high-
level musical features. In: 3rd ACM Workshop on Machine Learning and Music
(2010)
30. Morales, E.: PAL: A pattern-based first-order inductive system. Machine Learn-
ing 26(2-3), 227–252 (1997)
31. Pachet, F.: Surprising harmonies. International Journal of Computing Anticipatory
Systems 4 (February 1999)
Probabilistic and Logic-Based Modelling of Harmony 19

32. Papadopoulos, H.: Joint Estimation of Musical Content Information from an Audio
Signal. Ph.D. thesis, Université Pierre et Marie Curie – Paris 6 (2010)
33. Pardo, B., Birmingham, W.: Algorithms for chordal analysis. Computer Music
Journal 26(2), 27–49 (2002)
34. Pérez-Sancho, C., Rizo, D., Iñesta, J.M.: Genre classification using chords and
stochastic language models. Connection Science 21(2-3), 145–159 (2009)
35. Pérez-Sancho, C., Rizo, D., Iñesta, J.M., de León, P.J.P., Kersten, S., Ramirez,
R.: Genre classification of music by tonal harmony. Intelligent Data Analysis 14,
533–545 (2010)
36. Pickens, J., Bello, J., Monti, G., Sandler, M., Crawford, T., Dovey, M., Byrd, D.:
Polyphonic score retrieval using polyphonic audio queries: A harmonic modelling
approach. Journal of New Music Research 32(2), 223–236 (2003)
37. Ramirez, R.: Inducing musical rules with ILP. In: Proceedings of the International
Conference on Logic Programming, pp. 502–504 (2003)
38. Raphael, C., Stoddard, J.: Functional harmonic analysis using probabilistic models.
Computer Music Journal 28(3), 45–52 (2004)
39. Scholz, R., Vincent, E., Bimbot, F.: Robust modeling of musical chord sequences us-
ing probabilistic N-grams. In: IEEE International Conference on Acoustics, Speech
and Signal Processing, pp. 53–56 (2009)
40. Steedman, M.: A generative grammar for jazz chord sequences. Music Percep-
tion 2(1), 52–77 (1984)
41. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference rule ap-
proach. Computer Music Journal 23(1), 10–27 (1999)
42. Whorley, R., Wiggins, G., Rhodes, C., Pearce, M.: Development of techniques for
the computational modelling of harmony. In: First International Conference on
Computational Creativity, pp. 11–15 (2010)
43. Widmer, G.: Discovering simple rules in complex data: A meta-learning algorithm
and some surprising musical discoveries. Artificial Intelligence 146(2), 129–148
(2003)
44. Winograd, T.: Linguistics and the computer analysis of tonal harmony. Journal of
Music Theory 12(1), 2–49 (1968)
Interactive Music Applications and Standards

Rebecca Stewart, Panos Kudumakis, and Mark Sandler

Queen Mary, University of London,


London, UK
{rebecca.stewart,panos.kudumakis,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic

Abstract. Music is now consumed in interactive applications that allow


for the user to directly influence the musical performance. These appli-
cations are distributed as games for gaming consoles and applications for
mobile devices that currently use proprietary file formats, but standard-
ization orgranizations have been working to develop an interchangeable
format. This paper surveys the applications and their requirements. It
then reviews the current standards that address these requirements fo-
cusing on the MPEG Interactive Music Application Format. The paper
closes by looking at additional standards that address similar applica-
tions and outlining the further requirements that need to be met.

Keywords: interactive music, standards.

1 Introduction

The advent of the Internet and the exploding popularity of file sharing web sites
have challenged the music industry’s traditional supply model that relied on the
physical distribution of music recordings such as vinyl records, cassettes, CDs,
etc [5], [3]. In this direction, new interactive music services have emerged [1],
[6], [7]. However, a standardized file format is inevitably required to provide the
interoperability between various interactive music players and interactive music
applications.
Video games and music consumption, once discrete markets, are now merging.
Games for dedicated gaming consoles such as the Microsoft XBox, Nintendo Wii
and Sony Playstation and applications for smart phones using the Apple iPhone
and Google Android platforms are incorporating music creation and manipulation
into applications which encourage users to purchase music. These games can even
be centered around specific performers such as the Beatles [11] or T-Pain [14].
Many of these games follow a format inspired by karaoke. In its simplest case,
audio processing for karaoke applications involves removing the lead vocals so
that a live singer can perform with the backing tracks. This arrangement grew in
complexity by including automatic lyric following as well. Karaoke performance
used to be relegated to a setup involving a sound system with microphone and
playback capabilities within a dedicated space such as a karaoke bar or living
room, but it has found a revitalized market with mobile devices such as smart

S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 20–30, 2011.

c Springer-Verlag Berlin Heidelberg 2011
Interactive Music Applications and Standards 21

phones. Karaoke is now no longer limited to a certain place or equipment, but can
performed with a group of friends with a gaming console in a home or performed
with a smart phone, recorded and uploaded online to share with others.
A standard format is needed to allow for the same musical content to be pro-
duced once and used with multiple applications. We will look at the current com-
mercial applications for interactive music and discuss what requirements need to
be met. We will then look at three standards that address these requirements:
the MPEG-A Interactive Music Application Format (IM AF), IEEE 1599 and
interaction eXtensible Music Format (iXMF). We conclude by discussing what
improvements still need to be made for these standards to meet the requirements
of currently commercially-available applications.

2 Applications

Karaoke-influenced video games have become popular as titles such as Guitar


Hero and Rock Band have brought interactive music to a broad market [11].
The video games are centered around games controllers that emulate musical
instruments such as the guitar and drum set. The players follow the music as
they would in karaoke, but instead of following lyrics and singing, they follow
colored symbols which indicate when to press the corresponding button. With
Rock Band karaoke singing is available – backing tracks and lyrics are provided
so that a player can sing along. However, real-time pitch-tracking has enhanced
the gameplay as the player’s intonation and timing are scored.
The company Smule produces applications for the Apple iPhone, iPod Touch
and iPad. One of their most popular applications for the platform is called I Am
T-Pain [14]. The application allows users to sing into their device and automat-
ically processes their voice with the auto-tune effects that characterize the artist
T-Pain’s vocals. The user can do this in a karaoke-style performance by purchas-
ing and downloading files containing the backing music to a selection of T-Pain’s
released tracks. The song’s lyrics then appear on the screen synchronized with
the music as for karaoke, and the user’s voice is automatically processed with
an auto-tune effect. The user can change the auto-tune settings to change the
key and mode or use a preset. The freestyle mode allows the user to record their
voice without music and with the auto-tuner. All of the user’s performances can
be recorded and uploaded online and easily shared on social networks.
Smule has built on the karaoke concept with the release of Glee Karaoke [13].
The application is branded by the US TV show Glee and features the music
performed on the show. Like the I Am T-Pain application, Glee Karaoke lets
users purchase and download music bundled with lyrics so that they can perform
the vocal portion of the song themselves. Real-time pitch correction and auto-
matic three-part harmony generation are available to enhance the performance.
Users can also upload performances to share online, but unlike I Am T-Pain,
Glee Karaoke users can participate in a competitive game. Similar to the Gui-
tar Hero and Rock Band games, users get points for completing songs and for
correctly singing on-pitch.
22 R. Stewart, P. Kudumakis, and M. Sandler

3 Requirements
If the music industry continues to produce content for interactive music applica-
tions, a standard distribution format is needed. Content then will not need to be
individually authored for each application. At the most basic level, a standard
needs to allow:
– Separate tracks or groups of tracks
– Apply signal processing to those tracks or groups
– Markup those tracks or stems to include time-based symbolic information
Once tracks or groups of tracks are separated from the full mix of the song,
additional processing or information can be included to enhance the interactivity
with those tracks.

3.1 Symbolic Information


Karaoke-style applications involving singing require lyrical information as a bare
minimum, though it is expected that that information is time-aligned with the
audio content. As seen in Rock Band, I Am T-Pain and Glee Karaoke, additional
information regarding the correct pitch and timing is also needed.
A standard for interactive music applications also needs to accommodate mul-
tiple parallel sequences of notes. This is especially important for multiple player
games like Rock Band where each player has a different instrument and stream
of timings and pitches.

3.2 Audio Signal Processing


The most simplistic interactive model of multiple tracks requires basic mixing
capabilities so that those tracks can be combined to create a single mix of the
song. A traditional karaoke mix could easily be created within this model by
muting the vocal track, but this model could also be extended. Including au-
dio effects as in I Am T-Pain and Glee Karaoke allows users to add musical
content (such as their singing voice) to the mix and better emulate the original
performance.
Spatial audio signal processing is also required for more advanced applica-
tions. This could be as simple as panning a track between the left and right
channels of a stereo song, but could grow in complexity when considering ap-
plications for gaming consoles. Many games allow for surround sound playback,
usually over a 5.1 loudspeaker setup, so the optimal standard would allow for
flexible loudspeaker configurations. Mobile applications could take advantage of
headphone playback and use binaural audio to create an immersive 3D space.

4 MPEG-A IM AF
The MPEG-A Interactive Music Application Format (IM AF) standard struc-
tures the playback of songs that have multiple, unmixed audio tracks [8], [9], [10].
Interactive Music Applications and Standards 23

IM AF creates a container for the tracks, the associated metadata and symbolic
data while also managing how the audio tracks are played. Creating an IM AF
file involves formatting different types of media data, especially multiple audio
tracks with interactivity data and storing them into an ISO-Base Media File
Format. An IM AF file is composed of:
Multiple audio tracks representing the music (e.g. instruments and/or voices).
Groups of audio tracks – a hierarchical structure of audio tracks (e.g. all
guitars of a song can be gathered in the same group).
Preset data – pre-defined mixing information on multiple audio tracks (e.g.
karaoke and rhythmic version).
User mixing data and interactivity rules, information related to user in-
teraction (e.g. track/group selection, volume control).
Metadata used to describe a song, music album, artist, etc.
Additional media data that can be used to enrich the users interaction space
(e.g. timed text synchronized with audio tracks which can represent the lyrics
of a song, images related to the song, music album, artist, etc).

4.1 Mixes
The multiple audio tracks are combined to produce a mix. The mix is defined
by the playback level of tracks and may be determined by the music content
creator or by the end-user.
An interactive music player utilizing IM AF could allow users to re-mix music
tracks by enabling them to select the number of instruments to be listened to
and adjust the volume of individual tracks to their particular taste. Thus, IM
AF enables users to publish and exchange this re-mixing data, enabling other
users with IM AF players to experience their particular music taste creations.
Preset mixes of tracks could also be available. In particular IM AF supports two
possible mix modes for interaction and playback: preset-mix mode and user-mix
mode.
In the preset-mix mode, the user selects one preset among the presets stored
in IM AF, and then the audio tracks are mixed using the preset parameters
associated with the selected preset. Some preset examples are:
General preset – composed of multiple audio tracks by music producer.
Karaoke preset – composed of multiple audio tracks except vocal tracks.
A cappella preset – composed of vocal and chorus tracks.
Figure 1 shows an MPEG-A IM AF player. In user-mix mode, the user selects/de-
selects the audio tracks/groups and controls the volume of each of them. Thus,
in user-mix mode, audio tracks are mixed according to the user’s control and
taste; however, they should comply with the interactivity rules stored in the
IM AF. User interaction should conform to certain rules defined by the music
composers with the aim to fit their artistic creation. However, the rules defini-
tion is optional and up to the music composer, they are not imposed by the IM
AF format. In general there are two categories of rules in IM AF: selection and
24 R. Stewart, P. Kudumakis, and M. Sandler

Fig. 1. An interactive music application. The player on the left shows the song being
played in a preset mix mode and the player on the right shows the user mix mode.




  
 


 
  



 !

 ! 
"



Fig. 2. Logic for interactivity rules and mixes within IM AF

mixing rules. The selection rules relate to the selection of the audio tracks and
groups at rendering time whereas the mixing rules relate to the audio mixing.
Note that the interactivity rules allow the music producer to define the amount
of freedom available in IM AF users mixes. The interactivity rules analyser in the
player verifies whether the user interaction conforms to music producers rules.
Figure 2 depicts in a block diagram the logic for both the preset-mix and the
user-mix usage modes.
IM AF supports four types of selection rules, as follows:

Min/max rule specifying both minimum and maximum number of track/


groups of the group that might be in active state.
Exclusion rule specifying that several track/groups of a song will never be in
the active state at the same time.
Interactive Music Applications and Standards 25

Not mute rule defining a track/group always in the active state.


Implication rule specifying that the activation of a track/group implies the
activation of another track/group.
IM AF also supports four types of mixing rules, as follows:
Limits rule specifying the minimum and maximum limits of the relative vol-
ume of each track/group.
Equivalence rule specifying an equivalence volume relationship between
tracks/groups.
Upper rule specifying a superiority volume relationship between tracks/groups.
Lower rule specifying an inferiority volume relationship between tracks/groups.
Backwards compatibility with legacy non-interactive players is also supported by
IM AF. For legacy music players or devices that are not capable of simultaneous
decoding the multiple audio tracks, a special audio track stored in IM AF file
can still be played.

4.2 File Structure


The file formats accepted within an IM AF file are described in Table 1. IM AF
holds files describing images associated with the audio such as an album cover,
timed text for lyrics, other metadata allowed in MPEG-7 and the audio content.
IM AF also supports a number of brands according to application domain. These
depend on the device processing power capabilities (e.g. mobile phone, laptop
computer and high fidelity devices) which consequently define the maximum
number of audio tracks that can be decoded simultaneously in an IM AF player
running on a particular device. IM AF brands are summarized in Table 2. In all
IM AF brands, the associated data and metadata are supported.
The IM AF file format structure is derived from the ISO-Base Media File
Format standard. As such it facilitates interchange, management, editing and
presentation of different type media data and their associated metadata in a
flexible and extensible way. The object-oriented nature of ISO-Base Media File

Table 1. The file formats accepted within an IM AF file

Type Component Name Specification

File Format ISO Base Media File Format (ISO-BMFF) ISO/IEC 14496-12:2008
MPEG-4 Audio AAC Profile ISO/IEC 14496-3:2005
MPEG-D SAOC ISO/IEC 23003-2:2010
Audio
MPEG-1 Audio Layer III (MP3) ISO/IEC 11172-3:1993
PCM -
Image JPEG ISO/IEC 10918-1:1994
Text 3GPP Timed Text 3GPP TS 26.245:2004
Metadata MPEG-7 MDS ISO/IEC 15938-5:2003
26 R. Stewart, P. Kudumakis, and M. Sandler

Table 2. Brands supported by IM AF. For im04 and im12, simultaneously decoded
audio tracks consist of tracks related to SAOC, which are a downmix signal and SAOC
bitstream. The downmix signal should be encoded using AAC or MP3. For all brands,
the maximum channel number of each track is restricted to 2 (stereo).

Audio Max No Max Freq. Profile/


Application
Brands AAC MP3 SAOC PCM Tracks /bits Level

im01 X X 4
AAC/Level 2
im02 X X 6 Mobile
im03 X X 8 48 kHz/16 bits

AAC/Level 2
im04 X X X 2
SAOC Baseline/2

im11 X X X 16 AAC/Level 2
Normal
AAC/Level 2
im12 X X X 2
SAOC Baseline/3

im21 X X 32 96 kHz/24 bits AAC/Level 5 High-end

Format, inherited in IM AF, enables simplicity in the file structure in terms of


objects that have their own names, sizes and defined specifications according to
their purpose.
Figure 3 illustrates the IM AF file format structure. It mainly consists of ftyp,
moov and mdat type information objects/boxes. The ftyp box contains informa-
tion on file type and compatibility. The moov box describes the presentation of
the scene and usually includes more than one trak boxes. A trak box contains
the presentation description for a specific media type. A media type in each trak
box could be audio, image or text. The trak box supports time information for
synchronization with media described in other trak boxes. The mdat box con-
tains the media data described in the trak boxes. Instead of a native system file
path, a trak box may include an URL to locate the media data. In this way the
mdat box maintains a compact representation enabling consequently efficient
exchange and sharing of IM AF files.
Furthermore, in the moov box some specific information is also included: the
group container box grco; the preset container box prco; and the rules container
box ruco for storing group, preset and rules information, respectively. The grco
box contains zero or more group boxes designated as grup describing the group
hierarchy structure of audio tracks and/or groups. The prco box contains one
or more prst boxes which describe the predefined mixing information in the
absence of user interaction. The ruco box contains zero or more selection rules
boxes rusc and/or mixing rules boxes rumx describing the interactivity rules
related to selection and/or mixing of audio tracks.
Interactive Music Applications and Standards 27



  
       
 
+  

     

           
 
  
 
  
  '(%)*
  

  
 
    
  

 
 
  '(%)*
'

   '

 
  '



 
  
 
 
 !"#$%&

 

 
  

 
    

 
 
    
  
  
       
  
 



 
  
       
  


  

 




 !"#$%&

Fig. 3. IM AF file format

5 Related Formats

While the IM AF packages together the relevant metadata and content that an
interactive music application would require, other file formats have also been
developed as a means to organize and describe synchronized streams of infor-
mation for different applications. The two that will be briefly reviewed here are
IEEE 1599 [12] and iXMF [4].

5.1 IEEE 1599

IEEE 1599 is an XML-based format for synchronizing multiple streams of sym-


bolic and non-symbolic data validated against a document type definition (DTD).
It was proposed to IEEE Standards in 2001 and was previously referred to as
MX (Musical Application Using XML). The standard emphasizes the readability
28 R. Stewart, P. Kudumakis, and M. Sandler



   

  


     

 
 
    
*
 
 



   
   

  


 
    


#
 
  

$   %
  
' 
  
&

    !  
 



        
 
 


    



 
  
 
 
 '  

" " 
"  ! 
" "( ")* 
"

Fig. 4. The layers in IEEE 1599

of symbols by both humans and machines, hence the decision to represent all
information that is not audio or video sample data within XML.
The standard is developed primarily for applications that provide additional
information surrounding a piece of music. Example applications include being
able to easily navigate between a score, multiple recordings of performances of
that score and images of the performers in the recordings [2].
The format consists of six layers that communicate with each other, but there
can be multiple instances of the same layer type. Figure 4 illustrates how the
layers interact. The layers are referred to as:

General – holds metadata relevant to entire document.


Logic – logical description of score symbols.
Structural – description of musical objects and their relationships.
Notational – graphical representation of the score.
Performance – computer-based descriptions of a musical performance.
Audio – digital audio recording.

5.2 iXMF
Another file format that perform a similar task with a particular focus on video
games is iXMF (interaction eXtensible Music Format) [4]. The iXMF standard
is targeted for interactive audio within games development. XMF is a meta file
format that bundles multiple files together and iXMF uses this same meta file
format as its structure.
Interactive Music Applications and Standards 29

iXMF uses a structure in which a moment in time can trigger an event. The
triggered event can encompass a wide array of activities such as the playing of
an audio file or the execution of specific code. The overall structure is described
in [4] as:

– An iXMF file is a collection of Cues.


– A Cue is a collection of Media Chunks and Scripts.
– A Media Chunk is a contiguous region in a playable media file.
– A Script is rules describing how a Media Chunk is played.

The format allows for both audio and symbolic information information such
as MIDI to be included. The Scripts then allow for real-time adaptive audio
effects. iXMF has been developed to create interactive soundtracks for video
games environments, so the audio can be generated in real-time based on a user’s
actions and other external factors. There are a number of standard Scripts that
perform basic tasks such as starting or stopping a Cue, but this set of Scripts
can also be extended.

6 Discussion
Current commercial applications built around interactive music require real-time
playback and interaction with multiple audio tracks. Additionally, symbolic in-
formation, including text, is needed to accommodate the new karaoke-like games
such as Guitar Hero. The IM AF standard fulfils most of the requirements, but
not all. In particular it lacks the ability to include symbolic information like MIDI
note and instrument data. IEEE 1599 and iXMF both can accommodate MIDI
data, though lack some of the advantages of IM AF such as direct integration
with a number of MPEG formats.
One of the strengths of iXMF is its Scripts which can define time-varying
audio effects. These kind of effects are needed for applications such as I Am
T-Pain and Glee Karaoke. IM AF is beginning to consider integrating these
effects such as equalization, but greater flexibility will be needed so that the
content creators can create and manipulate their own audio signal processing
algorithms. The consumer will also need to be able to manually adjust the audio
effects applied to the audio in order to build applications like the MXP4 Studio
[7] with IM AF.
As interactive music applications may be used in a variety of settings, from
dedicated gaming consoles to smart phones, any spatialization of the audio needs
to be flexible and automatically adjust to the most appropriate format. This
could range from stereo speakers to surround sound systems or binaural audio
over headphones. IM AF is beginning to support SAOC (Spatial Audio Object
Coding) which addresses this very problem and differentiates it from similar
standards.
While there are a number of standard file formats that have been developed in
parallel to address slightly differing application areas within interactive music,
IM AF is increasingly the best choice for karaoke-style games. There are still
30 R. Stewart, P. Kudumakis, and M. Sandler

underdeveloped or missing features, but by determining the best practice put


forth in similar standards, IM AF can become an interchangeable file format for
creators to distribute their music to multiple applications. The question then
remains: will the music industry embrace IM AF – enabling interoperability of
interactive music services and applications for the benefit of end users – or will it
try to lock them down in proprietary standards for the benefit of few oligopolies?

Acknowledgments. This work was supported by UK EPSRC Grants: Platform


Grant (EP/E045235/1) and Follow On Fund (EP/H008160/1).

References
1. Audizen, http://www.audizen.com (last viewed, February 2011)
2. Ludovico, L.A.: The new standard IEEE 1599, introduction and examples. J. Mul-
timedia 4(1), 3–8 (2009)
3. Goel, S., Miesing, P., Chandra, U.: The Impact of Illegal Peer-to-Peer File Sharing
on the Media Industry. California Management Review 52(3) (Spring 2010)
4. IASIG Interactive XMF Workgroup: Interactive XMF specification: file for-
mat specification. Draft 0.9.1a (2008), http://www.iasig.org/pubs/ixmf_
draft-v091a.pdf
5. IFPI Digital Musi Report 2009: New Business Models for a Changing Environment.
International Federation of the Phonographic Industry (January 2009)
6. iKlax Media, http://www.iklaxmusic.com (last viewed February 2011)
7. Interactive Music Studio by MXP4, Inc., http://www.mxp4.com/
interactive-music-studio (last viewed February 2011)
8. ISO/IEC 23000-12, Information technology – Multimedia application for-
mat (MPEG-A) – Part 12: Interactive music application format (2010),
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.
htm?csnumber=53644
9. ISO/IEC 23000-12/FDAM 1 IM AF Conformance and Reference Software, N11746,
95th MPEG Meeting, Daegu, S. Korea (2011)
10. Kudumakis, P., Jang, I., Sandler, M.: A new interactive MPEG format for the
music industry. In: 7th Int. Symposium on Computer Music Modeling and Retrieval
(CMMR 2010), Málaga, Spain (2010)
11. Kushner, D.: The Making of the Beatles: Rock Band. IEEE Spectrum 46(9), 30–35
(2009)
12. Ludovico, L.A.: IEEE 1599: a multi-layer approach to music description. J. Multi-
media 4(1), 9–14 (2009)
13. Smule, Inc.: Glee Karaoke iPhone Application, http://glee.smule.com/ (last
viewed February 2011)
14. Smule, Inc.: I Am T-Pain iPhone Application, http://iamtpain.smule.com/ (last
viewed February 2011)
Interactive Music with Active Audio CDs

Sylvain Marchand, Boris Mansencal, and Laurent Girin

LaBRI – CNRS, University of Bordeaux, France


{sylvain.marchand,boris.mansencal}@labri.fr
GIPSA-lab – CNRS, Grenoble Institute of Technology, France
[email protected]

Abstract. With a standard compact disc (CD) audio player, the only
possibility for the user is to listen to the recorded track, passively: the
interaction is limited to changing the global volume or the track. Imagine
now that the listener can turn into a musician, playing with the sound
sources present in the stereo mix, changing their respective volumes and
locations in space. For example, a given instrument or voice can be either
muted, amplified, or more generally moved in the acoustic space. This
will be a kind of generalized karaoke, useful for disc jockeys and also for
music pedagogy (when practicing an instrument). Our system shows that
this dream has come true, with active CDs fully backward compatible
while enabling interactive music. The magic is that “the music is in the
sound”: the structure of the mix is embedded in the sound signal itself,
using audio watermarking techniques, and the embedded information is
exploited by the player to perform the separation of the sources (patent
pending) used in turn by a spatializer.

Keywords: interactive music, compact disc, audio watermarking, source


separation, sound spatialization.

1 Introduction
Composers of acousmatic music conduct different stages through the composi-
tion process, from sound recording (generally stereophonic) to diffusion (mul-
tiphonic). During live interpretation, they interfere decisively on spatialization
and coloration of pre-recorded sonorities. For this purpose, the musicians gen-
erally use a(n un)mixing console. With two hands, this requires some skill and
becomes hardly tractable with many sources or speakers.
Nowadays, the public is also eager to interact with the musical sound. In-
deed, more and more commercial CDs come with several versions of the same
musical piece. Some are instrumental versions (for karaoke), other are remixes.
The karaoke phenomenon gets generalized from voice to instruments, in musical
video games such as Rock Band 1 . But in this case, to get the interaction the
user has to buy the video game, which includes the multitrack recording.
Yet, the music industry is still reluctant to release the multitrack version of
musical hits. The only thing the user can get is a standard CD, thus a stereo
1
See URL: http://www.rockband.com

S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 31–50, 2011.

c Springer-Verlag Berlin Heidelberg 2011
32 S. Marchand, B. Mansencal, and L. Girin

mix, or its dematerialized version available for download. The CD is not dead:
imagine a CD fully backward compatible while permitting musical interaction. . .
We present the proof of concept of the active audio CD, as a player that
can read any active disc – in fact any 16-bit PCM stereo sound file, decode the
musical structure present in the sound signal, and use it to perform high-quality
source separation. Then, the listener can see and manipulate the sound sources
in the acoustic space. Our system is composed of two parts.
First, a CD reader extracts the audio data of the stereo track and decodes
the musical structure embedded in the audio signal (Section 2). This additional
information consists of the combination of active sources for each time-frequency
atom. As shown in [16], this permits an informed source separation of high quality
(patent pending). In our current system, we get up to 5 individual tracks out of
the stereo mix.
Second, a sound spatializer is able to map in real time all the sound sources
to any position in the acoustic space (Section 3). Our system supports either
binaural (headphones) or multi-loudspeaker configurations. As shown in [14],
the spatialization is done in the spectral domain, is based on acoustics and
interaural cues, and the listener can control the distance and the azimuth of
each source.
Finally, the corresponding software implementation is described in Section 4.

2 Source Separation
In this section, we present a general overview of the informed source separation
technique which is at the heart of the active CD player. This technique is based
on a two-step coder-decoder configuration [16][17], as illustrated on Fig. 1. The
decoder is the active CD player, that can process separation only on mix signals
that have been generated by the coder. At the coder, the mix signal is generated
as a linear instantaneous stationary stereo (LISS) mixture, i.e. summation of
source signals with constant-gain panning coefficients. Then, the system looks
for the two sources that better “explain” the mixture (i.e. the two source signals
that are predominant in the mix signal) at different time intervals and frequency
channels, and the corresponding source indexes are embedded into the mixture
signal as side-information using watermarking. The watermarked mix signal is
then quantized to 16-bits PCM. At the decoder, the only available signal is the
watermarked and quantized mix signal. The side-information is extracted from
the mix signal and used to separate the source signals by a local time / frequency
mixture inversion process.

2.1 Time-Frequency Decomposition


The voice / instrument source signals are non-stationary, with possibly large
temporal and spectral variability, and they generally strongly overlap in the
time domain. Decomposing the signals in the time-frequency (TF) domain leads
to a sparse representation, i.e. few TF coefficients have a high energy and the
overlapping of signals is much lower in the TF domain than in the time domain
Interactive Music with Active Audio CDs 33

S1[f,t]
s1[n] MDCT

s2[n] MDCT Ift


Ora cle
e s tima tor

sI[n] MDCT Coding


SI[f,t]
xL[n] XL[f,t] XLW[ f ,t] 16-bits ~
MDCT IMDCT P CM xLW [n]
Mixing Wa te r-
(LIS S ) xR[n] XR[f,t] ma rking XRW[ f ,t] ~
MDCT IMDCT 16-bits x RW [ n ]
P CM
Mixing
ma trix A

Sˆ1[ f ,t]
~ IMDCT sˆ1 [ n ]
~ W XLW[ f ,t] Ift
x [ n]
L MDCT Wa te r- 2×2
~ De coding IMDCT sˆ2 [ n]
ma rk inve r-
~ XRW[ f , t]
xRW [n] MDCT e xtra ction s ion
A
IMDCT sˆI [ n]
SˆI [ f , t]

Fig. 1. Informed source separation coder and decoder

[29][7][15][20]. Therefore, the separation of source signals can be carried out more
efficiently in the TF domain. The Modified Discrete Cosine Transform (MDCT)
[21] is used as the TF decomposition since it presents several properties very
suitable for the present problem: good energy concentration (hence emphasizing
audio signals sparsity), very good robustness to quantization (hence robustness
to quantization-based watermarking), orthogonality and perfect reconstruction.
Detailed description of the MDCT equations are not provided in the present
paper, since it can be found in many papers, e.g. [21]. The MDCT is applied on
the source signals and on the mixture signal at the input of the coder to enable
the selection of predominant sources in the TF domain. Watermarking of the
resulting side-information is applied on the MDCT coefficients of the mix signal
and the time samples of the watermarked mix signal are provided by inverse
MDCT (IMDCT). At the decoder, the (PCM-quantized) mix signal is MDCT-
transformed and the side-information is extracted from the resulting coefficients.
Source separation is also carried out in the MDCT domain, and the resulting
separated MDCT coefficients are used to reconstruct the corresponding time-
domain separated source signals by IMDCT. Technically, the MDCT / IMDCT
is applied on signal time frames of W = 2048 samples (46.5ms for a sampling
frequency fs = 44.1kHz), with a 50%-overlap between consecutive frames (of
1024 frequency bins). The frame length W is chosen to follow the dynamics of
music signals while providing a frequency resolution suitable for the separation.
Appropriate windowing is applied at both analysis and synthesis to ensure the
“perfect reconstruction” property [21].
34 S. Marchand, B. Mansencal, and L. Girin

2.2 Informed Source Separation

Since the MDCT is a linear transform, the LISS source separation problem
remains LISS in the transformed domain. For each frequency bin f and time bin
t, we thus have:
X(f, t) = A · S(f, t) (1)
where X(f, t) = [X1 (f, t), X2 (f, t)]T denotes the stereo mixture coefficients vec-
tor and S(f, t) = [S1 (f, t), · · · , SN (f, t)]T denotes the N -source coefficients vec-
tor. Because of audio signal sparsity in the TF domain, only at most 2 sources are
assumed to be relevant, i.e. of significant energy, at each TF bin (f, t). Therefore,
the mixture is locally given by:

X(f, t) ≈ AIf t SIf t (f, t) (2)

where If t denotes the set of 2 relevant sources at TF bin (f, t). AIf t represents
the 2 × 2 mixing sub-matrix made with the Ai columns of A, i ∈ If t . If I f t
denotes the complementary set of non-active (or at least poorly active) sources
at TF bin (f, t), the source signals at bin (f, t) are estimated by [7]:

ŜIf t (f, t) = A−1If t X(f, t)
(3)
ŜI f t (f, t) = 0

where A−1 If t denotes the inverse of AIf t . Note that such a separation technique
exploits the 2-channel spatial information of the mixture signal and relaxes the
restrictive assumption of a single active source at each TF bin, as made in
[29][2][3].
The side-information that is transmitted between coder and decoder (in ad-
dition to the mix signal) mainly consists of the coefficients of the mixing matrix
A and the combination of indexes If t that identifies the predominant sources in
each TF bin. This contrasts with classic blind and semi-blind separation meth-
ods where those both types of information have to be estimated from the mix
signal only, generally in two steps which can both be a very challenging task and
source of significant errors.
As for the mixing matrix, the number of coefficients to be transmitted is quite
low in the present LISS configuration2 . Therefore, the transmission cost of A is
negligible compared to the transmission cost of If t , and it occupies a very small
portion of the watermarking capacity.
As for the source indexes, If t is determined at the coder for each TF bin
using the source signals, the mixture signal, and the mixture matrix A, as the
combination that provides the lower mean squared error (MSE) between the
original source signals and the estimated source signals obtained with Equation
(3) (see [17] for details). This process follows the line of oracle estimators, as in-
troduced in [26] for the general purpose of evaluating the performances of source
2
For 5-source signals, if A is made of normalized column vectors depending on source
azimuths, then we have only 5 coefficients.
Interactive Music with Active Audio CDs 35

separation algorithms, especially in the case of underdetermined mixtures and


sparse separation techniques. Note that because of the orthogonality / perfect
reconstruction property of the MDCT, the selection of the optimal source com-
bination can be processed separately at each TF bin, in spite of the overlap-add
operation at source signal reconstruction by IMDCT [26]. When the number of
sources is reasonable (typically about 5 for a standard western popular music
song), Ĩf t can be found by exhaustive search, since in contrast to the decod-
ing process, the encoding process is done offline and is therefore not subdue to
real-time constraints.
It is important to note that at the coder, the optimal combination is de-
termined from the “original” (unwatermarked) mix signal. In contrast, at the
decoder, only the watermarked mix signal is available, and the source separation
is obtained by applying Equation (3) using the MDCT coefficients of the water-
W
marked (and 16-bit PCM quantized) mix signal X̃ (f, t) instead of the MDCT
coefficients of the original mix signal X(f, t). However, it has been shown in [17]
that the influence of the watermarking (and PCM quantization) process on sep-
aration performance is negligible. This is because the optimal combination for
each TF bin can be coded with a very limited number of bits before being em-
bedded into the mixture signal. For example, for a 5-source mixture, the number
of combinations of two sources among five is 10 and a 4-bit fixed-size code is
appropriate (although non optimal) for encoding If t . In practice, the source sep-
aration process can be limited to the [0; 16]kHz bandwidth, because the energy of
audio signals is generally very low beyond 16kHz. Since the MDCT decomposi-
tion provides as many coefficients as time samples, the side-information bit-rate
is 4×Fs ×16, 000/(Fs/2) = 128kbits/s (Fs = 44, 1kHz is the sampling frequency),
which can be split in two 64kbits/s streams, one for each of the stereo channels.
This is about 1/4 of the maximum capacity of the watermarking process (see
below), and for such capacity, the distortion of the MDCT coefficients by the
watermarking process is sufficiently low to not corrupt the separation process of
Equation (3). In fact, the main source of degradation in the separation process
relies in the sparsity assumption, i.e. the fact that “residual” non-predominant,
but non-null, sources may interfere as noise in the local inversion process.
Separation performances are described in details in [17] for “real-world” 5-
source LISS music mixtures of different musical styles. Basically, source en-
hancement from input (mix) to output (separated) ranges from 17dB to 25dB
depending on sources and mixture, which is remarkable given the difficulty of
such underdetermined mixtures. The rejection of competing sources is very effi-
cient and the source signals are clearly isolated, as confirmed by listening tests.
Artefacts (musical noise) are present but are quite limited. The quality of the
isolated source signals makes them usable for individual manipulation by the
spatializer.

2.3 Watermarking Process


The side-information embedding process is derived from the Quantization Index
Modulation (QIM) technique of [8], applied to the MDCT coefficients of the
36 S. Marchand, B. Mansencal, and L. Girin

10
w 11
X (t, f )
01
ΔQIM
X(t, f ) 00
10
11
01
00
10
11
01
00
10
Δ(t, f ) 11
01
00
10
11
01
00
00 01 11 10

Fig. 2. Example of QIM using a set of quantizers for C(t, f ) = 2 and the resulting
global grid. We have Δ(t, f ) = 2C(t,f ) · ΔQIM . The binary code 01 is embedded into
the MDCT coefficient X(t, f ) by quantizing it to X w (t, f ) using the quantizer indexed
by 01.

mixture signal in combination with the use of a psycho-acoustic model (PAM) for
the control of inaudibility. It has been presented in details in [19][18]. Therefore,
we just present the general lines of the watermarking process in this section, and
we refer the reader to these papers for technical details.
The embedding principle is the following. Let us denote by C(t, f ) the capac-
ity at TF bin (t, f ), i.e. the maximum size of the binary code to be embedded
in the MDCT coefficient at that TF bin (under inaudibility constraint). We will
see below how C(t, f ) is determined for each TF bin. For each TF bin (t, f ), a
set of 2C(t,f ) uniform quantizers is defined, which quantization levels are inter-
twined, and each quantizer represents a C(t, f )-bit binary code. Embedding a
given binary code on a given MDCT coefficient is done by quantizing this coef-
ficient with the corresponding quantizer (i.e. the quantizer indexed by the code
to transmit; see Fig. 2). At the decoder, recovering the code is done by compar-
ing the transmitted MDCT coefficient (potentially corrupted by transmission
noise) with the 2C(t,f ) quantizers, and selecting the quantizer with the quan-
tization level closest to the transmitted MDCT coefficient. Note that because
the capacity values depend on (f, t), those values must be transmitted to the
decoder to select the right set of quantizers. For this, a fixed-capacity embedding
“reservoir” is allocated in the higher frequency region of the spectrum, and the
Interactive Music with Active Audio CDs 37

capacity values are actually defined within subbands (see [18] for details). Note
also that the complete binary message to transmit (here the set of If t codes) is
split and spread across the different MDCT coefficients according to the local
capacity values, so that each MDCT coefficient carries a small part of the com-
plete message. Conversely, the decoded elementary messages have to be concate-
nated to recover the complete message. The embedding rate R is given by the
average total number of embedded bits per second of signal. It is obtained by
summing the capacity C(t, f ) over the embedded region of the TF plane and
dividing the result by the signal duration.
The performance of the embedding process is determined by two related con-
straints: the watermark decoding must be robust to the 16-bit PCM conversion
of the mix signal (which is the only source of noise because the “perfect recon-
struction” property of MDCT ensures transparency of IMDCT/MDCT chained
processes), and the watermark must be inaudible. The time-domain PCM quan-
tization leads to additive white Gaussian noise on MDCT coefficients, which
induces a lower bound for ΔQIM the minimum distance between two different
levels of all QIM quantizers (see Fig. 2). Given that lower bound, the inaudibility
constraint induces an upper bound on the number of quantizers, hence a cor-
responding upper bound on the capacity C(t, f ) [19][18]. More specifically, the
constraint is that the power of the embedding error in the worst case remains
under the masking threshold M (t, f ) provided by a psychoacoustic model. The
PAM is inspired from the MPEG-AAC model [11] and adapted to the present
data hiding problem. It is shown in [18] that the optimal capacity is given by:
  α
 
1 M (t, f ) · 10 10
C α (t, f ) = log +1 (4)
2 2 Δ2QIM

where . denotes the floor function, and α is a scaling factor (in dB) that enables
users to control the trade-off between signal degradation and embedding rate
by translating the masking threshold. Signal quality is expected to decrease as
embedding rate increases, and vice-versa. When α > 0dB, the masking threshold
is raised. Larger values of the quantization error allow for larger capacities (and
thus higher embedding rate), at the price of potentially lower quality. At the
opposite, when α < 0dB, the masking threshold is lowered, leading to a “safety
margin” for the inaudibility of the embedding process, at the price of lower
embedding rate. It can be shown that the embedding rate Rα corresponding to
C α and the basic rate R = R0 are related by [18]:
log2 (10)
Rα  R + α · · Fu (5)
10
(Fu being the bandwidth of the embedded frequency region). This linear relation
enables to easily control the embedding rate by the setting of α.
The inaudibility of the watermarking process has been assessed by subjective
and objective tests. In [19][18], Objective Difference Grade (ODG) scores [24][12]
were calculated for a large range of embedding rates and different musical styles.
ODG remained very close to zero (hence imperceptibility of the watermark)
38 S. Marchand, B. Mansencal, and L. Girin

for rates up to about 260kbps for musical styles such as pop, rock, jazz, funk,
bossa, fusion, etc. (and “only” up to about 175kbps for classical music). Such
rates generally correspond to the basic level of the masking curve allowed by
the PAM (i.e. α = 0dB). More “comfortable” rates can be set between 150
and 200kbits/s to guarantee transparent quality for the embedded signal. This
flexibility is used in our informed source separation system to fit the embedding
capacity with the bit-rate of the side-information, which is at the very reasonable
value of 64kbits/s/channel. Here, the watermarking is guaranteed to be “highly
inaudible”, since the masking curve is significantly lowered to fit the required
capacity.

3 Sound Spatialization

Now that we have recovered the different sound sources present in the original
mix, we can allow the user to manipulate them in space. We consider each punc-
tual and omni-directional sound source in the horizontal plane, located by its (ρ, θ)
coordinates, where ρ is the distance of the source to the head center and θ is the
azimuth angle. Indeed, as a first approximation in most musical situations, both
the listeners and instrumentalists are standing on the (same) ground, with no rel-
ative elevation. Moreover, we consider that the distance ρ is large enough for the
acoustic wave to be regarded as planar when reaching the ears.

3.1 Acoustic Cues

In this section, we intend to perform real-time high-quality (convolutive) mixing.


The source s will reach the left (L) and right (R) ears through different acoustic
paths, characterizable with a pair of filters, which spectral versions are called
Head-Related Transfer Functions (HRTFs). HRTFs are frequency- and subject-
dependent. The CIPIC database [1] samples different listeners and directions of
arrival.
A sound source positioned to the left will reach the left ear sooner than the
right one, in the same manner the right level should be lower due to wave prop-
agation and head shadowing. Thus, the difference in amplitude or Interaural
Level Difference (ILD, expressed in decibels – dB) [23] and difference in arrival
time or Interaural Time Difference (ITD, expressed in seconds) [22] are the main
spatial cues for the human auditory system [6].
Interaural Level Differences. After Viste [27], the ILDs can be expressed as
functions of sin(θ), thus leading to a sinusoidal model:

ILD(θ, f ) = α(f ) sin(θ) (6)

where α(f ) is the average scaling factor that best suits our model, in the least-
square sense, for each listener of the CIPIC database (see Fig. 3). The overall
error of this model over the CIPIC database for all subjects, azimuths, and
frequencies is of 4.29dB.
Interactive Music with Active Audio CDs 39

40
α

30
level scaling factor

20

10

0
0 2 4 6 8 10 12 14 16 18 20
β

4
time scaling factor

0
0 2 4 6 8 10 12 14 16 18 20
Frequency [kHz]

Fig. 3. Frequency-dependent scaling factors: α (top) and β (bottom)

Interaural Time Differences. Because of the head shadowing, Viste uses for
the ITDs a model based on sin(θ) + θ, after Woodworth [28]. However, from the
theory of the diffraction of an harmonic plane wave by a sphere (the head), the
ITDs should be proportional to sin(θ). Contrary to the model by Kuhn [13], our
model takes into account the inter-subject variation and the full-frequency band.
The ITD model is then expressed as:

ITD(θ, f ) = β(f )r sin(θ)/c (7)

where β is the average scaling factor that best suits our model, in the least-
square sense, for each listener of the CIPIC database (see Fig. 3), r denotes the
head radius, and c is the sound celerity. The overall error of this model over the
CIPIC database is 0.052ms (thus comparable to the 0.045ms error of the model
by Viste).

Distance Cues. In ideal conditions, the intensity of a source is divided by four
(a 6 dB decrease) each time the distance is doubled, according to the well-known
Inverse Square Law [5]. Applying only this frequency-independent rule to a sig-
nal has no effect on the sound timbre. But when a source moves far from the
listener, the high frequencies are more attenuated than the low frequencies. Thus

the sound spectrum changes with the distance. More precisely, the spectral cen-
troid moves towards the low frequencies as the distance increases. In [4], the
authors show that the frequency-dependent attenuation due to atmospheric at-
tenuation is roughly proportional to f 2 , similarly to the ISO 9613-1 norm [10].
Here, we manipulate the magnitude spectrum to simulate the distance between
the source and the listener. Conversely, one could measure the spectral centroid
(related to brightness) to estimate the source's distance to the listener.
In a concert room, the distance is often simulated by placing the speaker nearer
to or farther from the auditorium, which is sometimes physically restricted in small
rooms. In fact, the architecture of the room plays an important role and can
lead to severe modifications in the interpretation of the piece.
Here, simulating the distance is a matter of changing the magnitude of each
short-term spectrum X. More precisely, the ISO 9613-1 norm [10] gives the
frequency-dependent attenuation factor in dB for given air temperature, humid-
ity, and pressure conditions. At distance ρ, the magnitudes of X(f ) should be
attenuated by D(f, ρ) decibels:

D(f, ρ) = ρ · a(f ) (8)

where a(f) is the frequency-dependent attenuation, which will have an impact
on the brightness of the sound (higher frequencies being more attenuated than
lower ones).
More precisely, the total absorption in decibels per meter a(f ) is given by a
rather complicated formula:
a(f)/P ≈ 8.68 · F² [ 1.84 · 10⁻¹¹ (T/T0)^(1/2) P0
        + (T/T0)^(−5/2) ( 0.01275 · e^(−2239.1/T) / [Fr,O + (F²/Fr,O)]
        + 0.1068 · e^(−3352/T) / [Fr,N + (F²/Fr,N)] ) ]   (9)

where F = f/P, Fr,O = fr,O/P, Fr,N = fr,N/P are frequencies scaled by the
atmospheric pressure P, and P0 is the reference atmospheric pressure (1 atm), f
is the frequency in Hz, T is the atmospheric temperature in Kelvin (K), T0 is the
reference atmospheric temperature (293.15K), fr,O is the relaxation frequency
of molecular oxygen, and fr,N is the relaxation frequency of molecular nitrogen
(see [4] for details).
The spectrum thus becomes:

X(ρ, f) = 10^((XdB(t,f) − D(f,ρ))/20)   (10)

where XdB is the spectrum X in dB scale.
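For illustration, the following minimal Python sketch (ours, not the authors' C++ implementation) applies this distance attenuation to a short-term spectrum given in dB; the absorption curve a_f is assumed to be precomputed per frequency bin (e.g. from the ISO 9613-1 formula above), and all names are assumptions of the sketch.

import numpy as np

def attenuate_for_distance(X_dB, a_f, rho):
    """Apply Eqs. (8)-(10): attenuate a short-term spectrum, given in dB,
    for a source at distance rho (meters).
    a_f: frequency-dependent atmospheric absorption in dB/m, one value per bin."""
    D = rho * a_f                        # Eq. (8): attenuation in dB at distance rho
    return 10.0 ** ((X_dB - D) / 20.0)   # Eq. (10): back to linear magnitude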

3.2 Binaural Spatialization


In binaural listening conditions using headphones, the sound from each earphone
speaker is heard only by one ear. Thus the encoded spatial cues are not affected
by any cross-talk signals between earphone speakers.

To spatialize a sound source to an expected azimuth θ, for each short-term
spectrum X, we compute the pair of left (XL) and right (XR) spectra from the
spatial cues corresponding to θ, using Equations (6) and (7), and:
XL(t, f) = HL(t, f)X(t, f) with HL(t, f) = 10^(+Δa(f)/2) e^(+jΔφ(f)/2) ,   (11)
XR(t, f) = HR(t, f)X(t, f) with HR(t, f) = 10^(−Δa(f)/2) e^(−jΔφ(f)/2)   (12)
(because of the symmetry between the left and right ears), where Δa and Δφ are
given by:
Δa (f ) = ILD(θ, f )/20, (13)
Δφ (f ) = ITD(θ, f ) · 2πf. (14)
This is indeed a convolutive model, the convolution turning into a multiplica-
tion in the spectral domain. Moreover, the spatialization coefficients are complex.
The control of both amplitude and phase should provide better audio quality [25]
than amplitude-only spatialization. In informal listening tests with AKG K240
Studio headphones, we reached a remarkably realistic spatialization.
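A minimal Python sketch of this binaural spatialization (Eqs. (6), (7) and (11)-(14)) is given below; it is illustrative only, the α(f) and β(f) curves are assumed to be available (e.g. fitted on the CIPIC database as in Fig. 3), and the default head radius and speed of sound are indicative values.

import numpy as np

def binaural_filters(theta, freqs, alpha, beta, r=0.0875, c=340.0):
    """Build the left/right spectral filters HL, HR for azimuth theta (radians).
    alpha, beta: frequency-dependent scaling factors sampled on freqs (Hz)."""
    ild = alpha * np.sin(theta)                   # Eq. (6), in dB
    itd = beta * r * np.sin(theta) / c            # Eq. (7), in seconds
    delta_a = ild / 20.0                          # Eq. (13)
    delta_phi = 2.0 * np.pi * freqs * itd         # Eq. (14)
    HL = 10.0 ** (+delta_a / 2.0) * np.exp(+1j * delta_phi / 2.0)   # Eq. (11)
    HR = 10.0 ** (-delta_a / 2.0) * np.exp(-1j * delta_phi / 2.0)   # Eq. (12)
    return HL, HR

# usage: for each short-term spectrum X, the ear spectra are XL = HL * X, XR = HR * X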

3.3 Multi-loudspeaker Spatialization


In a stereophonic display, the sound from each loudspeaker is heard by both
ears. Thus, as in the transaural case, the stereo sound reaches the ears through
four acoustic paths, corresponding to transfer functions (Cij , i representing the
speaker and j the ear), see Fig. 4. Here, we generate these paths artificially using
the binaural model (using the distance and azimuth of the source to the ears for
H, and of the speakers to the ears for C). Since we have:
XL = HL X = CLL KL X + CLR KR X = CLL YL + CLR YR   (15)
XR = HR X = CRL KL X + CRR KR X = CRL YL + CRR YR   (16)
(where YL = KL X and YR = KR X are the left and right speaker signals),

the best panning coefficients under CIPIC conditions for the pair of speakers to
match the binaural signals at the ears (see Equations (11) and (12)) are then
given by:
KL (t, f ) = C · (CRR HL − CLR HR ) , (17)
KR (t, f ) = C · (−CRL HL + CLL HR ) (18)
where C is the inverse of the determinant:
C = 1/ (CLL CRR − CRL CLR ) . (19)
During diffusion, the left and right signals (YL , YR ) to feed left and right
speakers are obtained by multiplying the short-term spectra X with KL and
KR , respectively:
YL (t, f ) = KL (t, f )X(t, f ) = C · (CRR XL − CLR XR ) , (20)
YR (t, f ) = KR (t, f )X(t, f ) = C · (−CRL XL + CLL XR ) . (21)
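The panning stage of Eqs. (15)-(21) can be sketched as follows (Python, illustrative only; all transfer functions are given as arrays over frequency, and the Cij are assumed to be generated with the binaural model as described in the text).

import numpy as np

def pairwise_panning(X, HL, HR, CLL, CLR, CRL, CRR):
    """Compute the speaker feeds YL, YR so that the signals reaching the ears
    match the target binaural spectra HL*X and HR*X (Eqs. (15)-(21))."""
    C = 1.0 / (CLL * CRR - CRL * CLR)     # Eq. (19): inverse of the determinant
    KL = C * (CRR * HL - CLR * HR)        # Eq. (17)
    KR = C * (-CRL * HL + CLL * HR)       # Eq. (18)
    YL = KL * X                           # Eq. (20)
    YR = KR * X                           # Eq. (21)
    return YL, YR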

Fig. 4. Stereophonic loudspeaker display: the sound source X reaches the ears L, R
through four acoustic paths (CLL , CLR , CRL , CRR )


Fig. 5. Pairwise paradigm: for a given sound source, signals are dispatched only to the
two speakers closest to it (in azimuth)

In a setup with many speakers, we use the classic pair-wise paradigm [9],
which consists in choosing for a given source only the two speakers closest to it (in

[Figure: the active CD / stereo file (2 channels) feeds the player (file reader and source separator, N output ports), which feeds the spatializer (N input ports, M output ports) driving the M speakers.]
Fig. 6. Overview of the software system architecture

azimuth): one at the left of the source, the other at its right (see Fig. 5). The
left and right signals computed for the source are then dispatched accordingly.

4 Software System
Our methods for source separation and sound spatialization have been imple-
mented as a real-time software system, programmed in C++ and using the
Qt4³, JACK⁴, and FFTW⁵ libraries. These libraries were chosen to ensure portability and
performance on multiple platforms. The current implementation has been tested
on the Linux and Mac OS X operating systems, but should work with very minor
changes on other platforms, e.g. Windows.
Fig. 6 shows an overview of the architecture of our software system. Source
separation and sound spatialization are implemented as two different modules.
We rely on the JACK audio port system to route audio streams between these two
modules in real time.
This separation into two modules was mainly dictated by different choices of
distribution license: the source separation part of the active player should be patented
and released without source code, while the spatializer will be freely available under
the GNU General Public License.

4.1 Usage
Player. The active player is presented as a simple audio player, based on JACK.
The graphical user interface (GUI) is a very common player interface. It allows
the user to play or pause the reading / decoding. The player reads “activated” stereo files,
from an audio CD or file, and then decodes the stereo mix in order to extract
the N (mono) sources. Then these sources are transferred to N JACK output
ports, currently named QJackPlayerSeparator:outputi, with i in [1; N ].

Spatializer. The spatializer is also a standalone real-time application based
on JACK. It has N input ports that correspond to the N sources to spatialize.
These ports are to be connected, with the JACK port connection system, to the
N output ports of the active player. The spatializer can be configured to work
with headphones (binaural configuration) or with M loudspeakers.
3 See URL: http://trolltech.com/products/qt
4 See URL: http://jackaudio.org
5 See URL: http://www.fftw.org

Fig. 7. From the stereo mix stored on the CD, our player allows the listener
(center) to manipulate 5 sources in the acoustic space, using here an octophonic display
(top) or headphones (bottom)

Fig. 7 shows the current interface of the spatializer, which displays a bird’s
eye view of the audio scene. The user’s avatar is in the middle, represented by
a head viewed from above. He is surrounded by various sources, represented as

Fig. 8. Example of configuration files: 5-source configuration (top), binaural output
configuration (middle), and 8-speaker configuration (bottom)

notes in colored discs. When used in a multi-speaker configuration, speakers may
be represented in the scene. If used in a binaural configuration, the user's avatar
is represented wearing headphones.
With this graphical user interface, the user can interactively move each source
individually. He picks one of the source representations and drags it around. The
corresponding audio stream is then spatialized, in real time, according to the
new source position (distance and azimuth). The user can also move his avatar
among the sources, as if the listener was moving on the stage, between the
instrumentalists. In this situation, the spatialization changes for all the sources
simultaneously, according to their new relative positions to the moving user
avatar.
Inputs and outputs are set via two configuration files (see Fig. 8). A source
configuration file defines the number of sources. For each source, this file gives
the name of the output port to which a spatializer input port will be connected,
and also its original azimuth and distance. Fig. 8 shows the source configuration
file to connect to the active player with 5 ports. A speaker configuration file
defines the number of speakers. For each speaker, this file gives the name of the
physical (soundcard) port to which a spatializer output port will be connected,
and the azimuth and distance of the speaker. The binaural case is distinguished

Fig. 9. Processing pipeline for the spatialization of N sources on M speakers

by the fact that it has only two speakers with neither azimuth nor distance
specified. Fig. 8 shows the speaker configuration files for the binaural and octophonic
(8-speaker) configurations.

4.2 Implementation

Player. The current implementation is divided into three threads. The main
thread is the Qt GUI. A second thread reads and buffers data from the stereo
file, to be able to compensate for any physical CD reader latency. The third
thread is the JACK process function. It separates the data for the N sources and
feeds the output ports accordingly. In the current implementation, the number
of output sources is fixed to N = 5.
Our source separation implementation is rather efficient: for a Modified
Discrete Cosine Transform (MDCT) of W samples, we only compute a Fast Fourier
Transform (FFT) of size W/4. Indeed, an MDCT of length W is almost equivalent
to a type-IV DCT of length W/2, which can be computed with an FFT of length
W/4. Thus, as we use MDCT and IMDCT of size W = 2048, we only compute FFTs
and IFFTs of 512 samples.

Spatializer. The spatializer is currently composed of two threads: a main
thread, the Qt GUI, and the JACK process function.
Fig. 9 shows the processing pipeline for the spatialization. For each source, xi
is first transformed into the spectral domain with a FFT to obtain its spectrum
Xi . This spectrum is attenuated for distance ρi (see Equation (10)). Then, for
an azimuth θi , we obtain the left (XiL ) and right (XiR ) spectra (see Equations

(11) and (12)). The dispatcher then chooses the pair (j, j + 1) of speakers sur-
rounding the azimuth θi , transforms the spectra XiL and XiR by the coefficients
corresponding to this speaker pair (see Equations (20) and (21)), and adds the
resulting spectra Yj and Yj+1 in the spectra of these speakers. Finally, for each
speaker, its spectrum is transformed with an IFFT to obtain back in the time
domain the mono signal yj for the corresponding output.
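The following Python sketch summarizes this serial pipeline; it is only an illustration of Fig. 9, not the actual C++ code. The fields of each source dictionary and the per-pair Cij table are assumptions of the sketch, and it reuses the helper functions sketched earlier (attenuate_for_distance, binaural_filters, pairwise_panning); closest_pair implements the pair-wise choice of the two speakers nearest in azimuth.

import numpy as np

def closest_pair(theta, speaker_azimuths):
    """Indices of the two speakers closest in azimuth to theta (Fig. 5)."""
    d = [abs((az - theta + np.pi) % (2 * np.pi) - np.pi) for az in speaker_azimuths]
    order = np.argsort(d)
    return int(order[0]), int(order[1])

def spatialize_frame(sources, speaker_azimuths, fs=44100, W=2048):
    """One processing frame of the spatialization pipeline (serial version)."""
    freqs = np.fft.rfftfreq(W, 1.0 / fs)
    Y = [np.zeros(len(freqs), dtype=complex) for _ in speaker_azimuths]  # speaker spectra
    for s in sources:
        X = np.fft.rfft(s["x"], W)                                  # spectrum X_i
        mag_dB = 20.0 * np.log10(np.abs(X) + 1e-12)
        X = attenuate_for_distance(mag_dB, s["a_f"], s["rho"]) * np.exp(1j * np.angle(X))
        HL, HR = binaural_filters(s["theta"], freqs, s["alpha"], s["beta"])
        j, k = closest_pair(s["theta"], speaker_azimuths)           # surrounding pair
        YL, YR = pairwise_panning(X, HL, HR, *s["C"][(j, k)])       # Eqs. (20)-(21)
        Y[j] += YL                                                  # accumulate per speaker
        Y[k] += YR
    return [np.fft.irfft(Yj, W) for Yj in Y]                        # one mono output per speaker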
Source spatialization is more computation-intensive than source separation,
mainly because it requires more transforms (N FFTs and M IFFTs) of larger
size W = 2048. For now, source spatialization is implemented as a serial pro-
cess. However, we can see that this pipeline is highly parallel. Indeed, almost
everything operates on separate data. Only the spectra of the speakers may be
accessed concurrently, to accumulate the spectra of sources that would be spa-
tialized to the same or neighbouring speaker pairs. These spectra should then
be protected with mutual exclusion mechanisms. A future version will take ad-
vantage of multi-core processor architectures.

4.3 Experiments

Our current prototype has been tested on an Apple MacBook Pro, with an Intel
Core 2 Duo 2.53GHz processor, connected to headphones or to a 8-speaker sys-
tem, via a MOTU 828 MKII soundcard. For such a configuration, the processing
power is well contained. In order to run in real time, given a signal sampling

Fig. 10. Enhanced graphical interface with pictures of instruments for sources and
propagating sound waves represented as colored circles

frequency of 44.1kHz and windows of 2048 samples, the overall processing time
should be less than 23ms. With our current implementation, 5-source separation
and 8-speaker spatialization, this processing time is in fact less than 3ms on the
laptop mentioned previously. Therefore, the margin to increase the number of
sources to separate and/or the number of loudspeakers is quite comfortable. To
confirm this, we exploited the split of the source separation and spatialization
modules to test the spatializer without the active player, since the latter is cur-
rently limited to 5 sources. We connected to the spatializer a multi-track player
that reads several files simultaneously and exposes these tracks as JACK output
ports. Tests showed that the spatialization can be applied to roughly 48 sources
on 8 speakers, or 40 sources on 40 speakers on this computer.
These performances allow us to have some processing power for other com-
putations, to improve user experience for example. Fig. 10 shows an example of
an enhanced graphical interface where the sources are represented with pictures
of the instruments, and the propagation of the sound waves is represented for
each source by time-evolving colored circles. The color of each circle is computed
from the color (spectral envelope) of the spectrum of each source and updated
in real time as the sound changes.

5 Conclusion and Future Work

We have presented a real-time system for musical interaction from stereo files,
fully backward-compatible with standard audio CDs. This system consists of a
source separator and a spatializer.
The source separation is based on the sparsity of the source signals in the
spectral domain and the exploitation of the stereophony. This system is char-
acterized by a quite simple separation process and by the fact that some side-
information is inaudibly embedded in the signal itself to guide the separation
process. Compared to (semi-)blind approaches also based on sparsity and lo-
cal mixture inversion, the informed aspect of separation guarantees the optimal
combination of the sources, thus leading to a remarkable increase of quality of
the separated signals.
The sound spatialization is based on a simplified model of the head-related
transfer functions, generalized to any multi-loudspeaker configuration using a
transaural technique applied to the best pair of loudspeakers for each sound source.
Although this quite simple technique does not compete with the 3D accuracy of
Ambisonics or holophony (Wave Field Synthesis), it is very flexible (no specific
loudspeaker configuration) and suitable for a large audience (no hot-spot effect)
with sufficient sound quality.
The resulting software system is able to separate 5-source stereo mixtures
(read from audio CD or 16-bit PCM files) in real time and it enables the user to
remix the piece of music during restitution with basic functions such as volume
and spatialization control. The system has been demonstrated in several coun-
tries with excellent feedback from the users / listeners, with a clear potential in
terms of musical creativity, pedagogy, and entertainment.

For now, the mixing model imposed by the informed source separation is
generally over-simplistic when professional / commercial music production is
at stake. Extending the source separation technique to high-quality convolutive
mixing is part of our future research.
As shown in [14], the model we use for the spatialization is more general, and
can be used as well to localize audio sources. Thus we would like to add the
automatic detection of the speaker configuration to our system, from a pair of
microphones placed in the audience, as well as the automatic fine tuning of the
spatialization coefficients to improve the 3D sound effect.
Regarding performance, many operations act on separate data and thus
could easily be parallelized on modern hardware architectures. Last but not least,
we are also porting the whole application to mobile touch devices, such as smart
phones and tablets. Indeed, we believe that these devices are perfect targets for
a system in between music listening and gaming, and gestural interfaces with
direct interaction to move the sources are very intuitive.

Acknowledgments
This research was partly supported by the French ANR (Agence Nationale de la
Recherche) DReaM project (ANR-09-CORD-006).

References
1. Algazi, V.R., Duda, R.O., Thompson, D.M., Avendano, C.: The CIPIC HRTF
database. In: Proceedings of the IEEE Workshop on Applications of Signal Pro-
cessing to Audio and Acoustics (WASPAA), New Paltz, New York, pp. 99–102
(2001)
2. Araki, S., Sawada, H., Makino, S.: K-means based underdetermined blind speech
separation. In: Makino, S., et al. (eds.) Blind Source Separation, pp. 243–270.
Springer, Heidelberg (2007)
3. Araki, S., Sawada, H., Mukai, R., Makino, S.: Underdetermined blind sparse source
separation for arbitrarily arranged multiple sensors. Signal Processing 87(8), 1833–
1847 (2007)
4. Bass, H., Sutherland, L., Zuckerwar, A., Blackstock, D., Hester, D.: Atmospheric
absorption of sound: Further developments. Journal of the Acoustical Society of
America 97(1), 680–683 (1995)
5. Berg, R.E., Stork, D.G.: The Physics of Sound, 2nd edn. Prentice Hall, Englewood
Cliffs (1994)
6. Blauert, J.: Spatial Hearing. revised edn. MIT Press, Cambridge (1997); Transla-
tion by J.S. Allen
7. Bofill, P., Zibulevski, M.: Underdetermined blind source separation using sparse
representations. Signal Processing 81(11), 2353–2362 (2001)
8. Chen, B., Wornell, G.: Quantization index modulation: A class of provably good
methods for digital watermarking and information embedding. IEEE Transactions
on Information Theory 47(4), 1423–1443 (2001)
9. Chowning, J.M.: The simulation of moving sound sources. Journal of the Audio
Engineering Society 19(1), 2–6 (1971)

10. International Organization for Standardization, Geneva, Switzerland: ISO 9613-1:1993:
Acoustics – Attenuation of Sound During Propagation Outdoors – Part 1:
Calculation of the Absorption of Sound by the Atmosphere (1993)
11. ISO/IEC JTC1/SC29/WG11 MPEG: Information technology – Generic coding of
moving pictures and associated audio information – Part 7: Advanced Audio Coding
(AAC). IS 13818-7(E) (2004)
12. ITU-R: Method for objective measurements of perceived audio quality (PEAQ)
Recommendation BS1387-1 (2001)
13. Kuhn, G.F.: Model for the interaural time differences in the azimuthal plane. Jour-
nal of the Acoustical Society of America 62(1), 157–167 (1977)
14. Mouba, J., Marchand, S., Mansencal, B., Rivet, J.M.: RetroSpat: a perception-
based system for semi-automatic diffusion of acousmatic music. In: Proceedings of
the Sound and Music Computing (SMC) Conference, Berlin, pp. 33–40 (2008)
15. O’Grady, P., Pearlmutter, B.A., Rickard, S.: Survey of sparse and non-sparse meth-
ods in source separation. International Journal of Imaging Systems and Technol-
ogy 15(1), 18–33 (2005)
16. Parvaix, M., Girin, L.: Informed source separation of underdetermined instantaneous
stereo mixtures using source index embedding. In: IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas (2010)
17. Parvaix, M., Girin, L.: Informed source separation of linear instantaneous under-
determined audio mixtures by source index embedding. IEEE Transactions on Au-
dio, Speech, and Language Processing (accepted, pending publication, 2011)
18. Pinel, J., Girin, L., Baras, C.: A high-rate data hiding technique for uncompressed
audio signals. IEEE Transactions on Audio, Speech, and Language Processing
(submitted)
19. Pinel, J., Girin, L., Baras, C., Parvaix, M.: A high-capacity watermarking technique
for audio signals based on MDCT-domain quantization. In: International Congress
on Acoustics (ICA), Sydney, Australia (2010)
20. Plumbley, M.D., Blumensath, T., Daudet, L., Gribonval, R., Davies, M.E.: Sparse
representations in audio and music: From coding to source separation. Proceedings
of the IEEE 98(6), 995–1005 (2010)
21. Princen, J., Bradley, A.: Analysis/synthesis filter bank design based on time do-
main aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal
Processing 34(5), 1153–1161 (1986)
22. Strutt (Lord Rayleigh), J.W.: Acoustical observations i. Philosophical Magazine 3,
456–457 (1877)
23. Strutt (Lord Rayleigh), J.W.: On the acoustic shadow of a sphere. Philosophical
Transactions of the Royal Society of London 203A, 87–97 (1904)
24. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J.,
Colomes, C.: PEAQ - the ITU standard for objective measurement of perceived
audio quality. Journal of the Audio Engineering Society 48(1), 3–29 (2000)
25. Tournery, C., Faller, C.: Improved time delay analysis/synthesis for parametric
stereo audio coding. Journal of the Audio Engineering Society 29(5), 490–498
(2006)
26. Vincent, E., Gribonval, R., Plumbley, M.D.: Oracle estimators for the benchmark-
ing of source separation algorithms. Signal Processing 87, 1933–1950 (2007)
27. Viste, H.: Binaural Localization and Separation Techniques. Ph.D. thesis, École
Polytechnique Fédérale de Lausanne, Switzerland (2004)
28. Woodworth, R.S.: Experimental Psychology. Holt, New York (1954)
29. Yılmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Transactions on Signal Processing 52(7), 1830–1847 (2004)
Pitch Gestures in Generative Modeling of Music

Kristoffer Jensen

Aalborg University Esbjerg, Niels Bohr Vej 8,


6700 Esbjerg, Denmark
[email protected]

Abstract. Generative models of music are in need of performance and gesture
additions, i.e. the inclusion of subtle temporal and dynamic alterations and of
gestures, so as to render the music musical. While much of the research
regarding music generation is based on music theory, the work presented here is
based on temporal perception, which is divided into three parts: the
immediate (subchunk), the short-term memory (chunk), and the superchunk. By
review of the relevant temporal perception literature, the necessary performance
elements to add to the metrical generative model, related to the chunk memory,
are obtained. In particular, the pitch gestures are modeled as rising, falling, or as
arches with positive or negative peaks.

Keywords: gesture; human cognition; perception; chunking; music generation.

1 Introduction
Music generation has more and more uses in today's media. Be it in computer games,
interactive music performances, or interactive films, the emotional effect of the
music is primordial in the appreciation of the media. While traditionally the music
has been generated as pre-recorded loops that are mixed on-the-fly, or recorded with
traditional orchestras, a better understanding of generative music and better models
are believed to push interactive generative music into multimedia. Papadopoulos
and Wiggins (1999) gave an early overview of the methods of algorithmic
composition, deploring “that the music that they produce is meaningless: the
computers do not have feelings, moods or intentions”. While vast progress has been
made in the decade since this statement, there is still room for improvement.
The cognitive understanding of musical time perception is the basis of the work
presented here. According to Kühl (2007), this memory can be separated into three
time-scales, the short, microtemporal, related to microstructure, the mesotemporal,
related to gesture, and the macrotemporal, related to form. These time-scales are
named (Kühl and Jensen 2008) subchunk, chunk and superchunk, and subchunks
extend from 30 ms to 300 ms; the conscious mesolevel of chunks from 300 ms to 3
sec; and the reflective macrolevel of superchunks from 3 sec to roughly 30−40 sec.
The subchunk is related to individual notes, the chunk to meter and gesture, and the
superchunk is related to form. The superchunk was analyzed and used in a
generative model in Kühl and Jensen (2008), and the chunks were analyzed in Jensen
and Kühl (2009). Further analysis of the implications of how temporal perception is


related to durations and timing of existing music, together with anatomical and perceptual findings
from the literature, is given in section 2 along with an overview of the previous work
in rhythm. Section 3 presents the proposed model on the inclusion of pitch gestures in
music generation using statistical methods, and the section 4 discusses the integration
of the pitch gesture in the generative music model. Finally, section 5 offers a
conclusion.

2 Cognitive and Perceptual Aspects of Rhythm


According to Snyder (2000), a beat is single point in time, while the pulse is recurring
beats. Accent gives salience to beat, and meter is the organization of beats into a
cyclical structure. This may or may not be different to the rhythmic grouping, which
is generally seen as a phrase bounded by accented notes. Lerdahl and Jackendoff
(1983) give many examples of grouping and meter, and show that these are two
independent elements: Grouping – segmentation on different levels – is concerned with
elements that have duration, and Meter – the regular alternation of strong and weak beats – is
concerned with durationless elements. While grouping and meter are independent, the
percept is more stable when they are congruent.
The accentuation of some of the beats gives perceptual salience to the beat (Patel
and Peretz 1997). This accenting can be done (Handel 1989) by, for instance, an
intensity rise, by increasing the duration or the interval between the beats, or by
increasing the frequency difference between the notes.
Samson et al. (2000) show that the left temporal lobe processes rapid auditory
sequences, while there is also activity in the frontal lobe. The specialized skills related to
rhythm are developed in the early years; for instance, Malbrán (2000) shows how
8-year-old children can perform precise tapping. However, while the tapping is more
precise at high tempo, drifting is ubiquitous. Gordon (1987) has determined that the
perceptual attack time (PAT) is most often located at the point of the largest rise of
the amplitude of the sound. However, in the experiment, the subjects had problems
synchronizing many of the sounds, and Gordon concludes that the PAT is more vague
for non-percussive sounds, and spectral cues may also interfere in the determination
of the attack. Zwicker and Fastl (1999) introduced the notion of subjective duration,
and showed that the subjective duration is longer than the objective durations for
durations below 100ms. Even more subjective deviations are found, if pauses are
compared to tones or noises. Zwicker and Fastl found that long sounds (above 1
second) have the same subjective duration as pauses, while shorter pauses have
significantly longer subjective durations than sounds: approximately 4 times longer
for a 3.2 kHz tone, while a 200 Hz tone and white noise have approximately half the
subjective duration of pauses. This is true for durations of around 100-200 ms,
while the difference evens out and disappears at durations of 1 sec. Finally, Zwicker
and Fastl related the subjective duration to temporal masking, and give indications
that musicians would play tones shorter than indicated in order to fulfill the subjective
durations of the notated music. Fraisse (1982) gives an overview of his important
research in rhythm perception. He states that the range in which synchronization is
possible is between 200 and 1800 msec (33-300 BPM). Fraisse has furthermore
analyzed classical music, and found two main duration classes that he calls temps longs

(>400 msec) and temps courts, with two-to-one ratios found only between temps longs
and temps courts. As for natural tempo, when subjects are asked to reproduce temporal
intervals, they tend to overestimate short intervals (making them longer) and under-
estimate long intervals (making them shorter). At an interval of about 500 msec to
600 msec, there is little over- or under-estimation. However, there are large
differences across individuals: the spontaneous tempo is found to be between 1.1 and 5
taps per second, with 1.7 taps per second occurring most often. There are also many
spontaneous motor movements that occur at the rate of approximately 2/sec, such as
walking, sucking in the newborn, and rocking.
Friberg (1991) and Widmer (2002) give rules for how the dynamics and timing
should be changed according to the musical position of the notes. Dynamic changes
include a 6 dB increase (doubling), and deviations of up to 100 msec to the duration,
depending on the musical position of the notes. With respect to these timing changes, Snyder
(2000) indicates the categorical perception of beats, measures and patterns. The
perception of timing deviations is an example of within-category distinctions.
Even with large deviations from the nominal score, the notes are recognized as falling
on the beats.
As for melodic perception, Thomassen (1982) investigated the role of intervals as
melodic accents. In a controlled experiment, he modeled the anticipation using an
attention span of three notes, and found that the accent perception is described ‘fairly
well’. The first of two opposite frequency changes gives the strongest accentuation.
Two changes in the same direction are equally effective. The larger of two changes
is more powerful, and frequency rises are more powerful than frequency falls.

3 Model of Pitch Gestures


Music is typically composed, giving intended and inherent emotions in the structural
aspects, which is then enhanced and altered by the performers, who change the
dynamics, articulations, vibrato, and timing to render the music enjoyable and
musical. In this work, the gestures in the pitch contour are investigated. Jensen and
Kühl (2009) investigated the gestures of music through a simple model, with positive
or negative slope, and with positive or negative arches, as shown in figure 1. For the
songs analyzed, Jensen and Kühl found more negative than positive slopes and
slightly more positive than negative arches. Huron (1996) analyzed the Essen Folk
music database, and found - by averaging all melodies - positive arches. Further
analyses were done by comparing the first and last note to the mean of the
intermediate notes, revealing more positive than negative arches (39% and 10%
respectively), and more negative than positive slopes (29% and 19% respectively).
According to Thomassen (1982), the falling slopes have less powerful accents, and they
would thus require less attention.
The generative model is made through statistical models based on data from a
musical database (The Digital Tradition 2010). From this database, note and interval
occurrences are counted. These counts are then normalized and used as probability
density functions for notes and intervals, respectively. These statistics are shown in
figure 2. As can be seen, the interval distribution is not symmetrical. This corroborates the
finding in Jensen and Kühl (2009) that more falling than rising slopes are found in the

Fig. 1. Different shapes of a chunk. Positive (a-c) or negative arches (g-i), rising (a,d,g) or
falling slopes (c,f,i).

Fig. 2. Note (top) and interval probability density function obtained from The Digital Tradition
folk database

pitch of music. According to Vos and Troost (1989), the smaller intervals occur more
often in descending form, while the larger ones occur mainly in ascending form.
However, since the slope and arch are modelled in this work, the pdf of the intervals
is mirrored and added around zero, and subsequently weighted and copied back to
recreate the full interval pdf. This makes it possible to create a melodic contour with
given slope and arch characteristics, as detailed below.
In order to generate pitch contours with gestures, the model in figure 1 is used.
For the pitch contour, only the neutral gesture (e) in figure 1, the falling and rising
slope (d) and (f), and the positive and negative arches (b) and (h) are modeled here.
The gestures are obtained by weighting the positive and negative slope of the interval
probability density function with a weight,

pdf i = [w⋅ pdf i+ ,(1 − w)⋅ pdf i+ ]. (1)

Here, pdfi+ is the mirrored/added positive interval pdf, and w is the weight. If
w = 0.5, a neutral gesture is obtained; if w < 0.5, a positive slope is obtained; and if
w > 0.5, a negative slope is obtained. In order to obtain an arch, the value of the
weight is changed to w = 1 − w in the middle of the gesture.
In order to obtain a musical scale, the probability density function for the intervals
(pdfi) is multiplied with a suitable scale pdf, pdfs, such as the one illustrated in
figure 2 (top),
pdf = shift( pdf i ,n0 )⋅ pdf s ⋅ wr . (2)

As pdfs is only defined for one octave, it is circularly repeated. The interval
probabilities, pdfi, are shifted for each note n0. This is done under the hypothesis that
the intervals and scale notes are independent. So as to approximately retain the
register, the pdf is further multiplied by a register weight wr. This weight is one
for one octave, and decreases exponentially on both sides, in order to lower the
possibility of obtaining notes far from the original register.
In order to obtain successive notes, the cumulative distribution function, cdf, is
calculated from eq. (2). If r is a random variable with uniform distribution in the
interval (0,1), then the next note n0 can be found as the index of the first occurrence of cdf > r.
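As an illustration, a minimal Python sketch of this note-generation step (Eqs. (1) and (2) followed by the cdf sampling) is given below; the MIDI-like note numbering (0-127), the width of the unity-weight register region and the exponential decay rate of wr are assumptions of the sketch, not specifications of the paper.

import numpy as np

def next_note(n0, pdf_i_pos, pdf_scale, w, register_center, n_notes=128, decay=0.1):
    """Draw the next note given the previous note n0.
    pdf_i_pos: mirrored/added positive-interval pdf (index 0 = unison),
    pdf_scale: 12-element scale pdf, w: gesture weight of Eq. (1)."""
    K = len(pdf_i_pos) - 1
    # Eq. (1): weight the negative (w) and positive (1 - w) interval branches
    pdf_i = np.concatenate((w * pdf_i_pos[:0:-1], (1.0 - w) * pdf_i_pos))
    # Eq. (2): shift the interval pdf to n0, multiply by scale and register weights
    pdf = np.zeros(n_notes)
    for d, p in zip(range(-K, K + 1), pdf_i):
        if 0 <= n0 + d < n_notes:
            pdf[n0 + d] += p
    pdf *= np.tile(pdf_scale, n_notes // 12 + 1)[:n_notes]
    pdf *= np.exp(-decay * np.maximum(0.0, np.abs(np.arange(n_notes) - register_center) - 6))
    cdf = np.cumsum(pdf) / pdf.sum()
    return int(np.searchsorted(cdf, np.random.rand()))   # first index with cdf > r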
Examples of pitch contours obtained by setting w=0, and w=1, respectively, are
shown in figure 3. The rising and falling pitches are reset after each gesture, in order
to stay at the same register throughout the melody.
The positive and negative slopes are easily recognized when listening to the
resulting melodies, because of the abrupt pitch fall at the end of each gesture. The
arches, in comparison, are more in need of loudness and/or brightness variations in
order to make them perceptually recognizable. Without this, a positive slope can be
confused with a negative arch that is shifted in time, or with a positive or negative slope,
likewise shifted in time. Normally, an emphasis at the beginning of each gesture is
sufficient for the slopes, while the arches may be in need of an emphasis at the peak
of the arch as well.

Fig. 3. Pitch contours of four melodies with positive arch, rising slope, negative arch and
falling slope

4 Recreating Pitch Contours in Meter


In previous work (Kühl and Jensen 2008), a generative model that produces tonal
music with structural changes was presented. This model, which creates note values
based on a statistical model, also introduces changes at the structural level (every 30
seconds, approximately). These changes are introduced based on the analysis of music
using the musigram visualization tools (Kühl and Jensen 2008).
With respect to chroma, an observation was made that only a subset of the full scale
notes were used at each structural element. This subset was modified, by removing and
inserting new notes from the list of possible notes, at each structural boundary. The
timbre changes include varying the loudness and brightness between loud/bright and
soft/dark structural elements. The main rhythm changes were based on the
identification of short elements (10 seconds) with no discernible rhythm. A tempo drift
of up to 10% and the insertion of faster rhythmic elements (Tatum) at structural
boundaries were also identified. These structural changes were implemented in a
generative model, whose flowchart can be seen in figure 4. While the structural
elements certainly were beneficial for the long-term interest of the music, the lack of
short-term changes (chunks) and of a rhythm model impeded the quality of the music.
The meter, which improves the resulting quality, is included in this generative model by
adjusting the loudness and brightness of each tone according to its accent. The pitch
contour is made through the model introduced in the previous section.

Fig. 4. The generative model including meter, gesture and form. Structural changes to the note
values, the intensity and the rhythm are made approximately every 30 seconds, and gesture
changes are made on average every seven notes

The notes are created using a simple envelope model and the synthesis method
dubbed brightness creation function (bcf, Jensen 1999), which creates a sound with
exponentially decreasing amplitudes and allows continuous control of the
brightness. The accent affects the note so that the loudness and brightness are doubled,
and the duration is increased by 25%, with 75% of the elongation made by advancing the
start of the note, as found in Jensen (2010).
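A minimal sketch of this accent rule follows (Python; the note representation, a dictionary with onset and duration in seconds and linear loudness/brightness values, is an assumption of the sketch).

def apply_accent(note):
    """Emphasize an accented note: loudness and brightness doubled, duration
    increased by 25%, with 75% of the elongation advancing the onset."""
    extra = 0.25 * note["duration"]
    note["onset"] -= 0.75 * extra    # advance the start by 75% of the elongation
    note["duration"] += extra        # 25% longer in total
    note["loudness"] *= 2.0          # a doubling, i.e. about +6 dB
    note["brightness"] *= 2.0
    return note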
These findings are put into a generative model of tonal music. A subset of notes
(3-5) is chosen at each new form (superchunk), together with a new dynamic level. At
the chunk level, new notes are created in a metrical loop, and the gestures are added
to the pitch contour and used for additional gesture emphasis. Finally, at the
microtemporal (subchunk) level, expressive deviations are added in order to render
the loops musical. The interaction of the rigid meter with the looser pitch gestures
lends the generated notes a more musical sense, through the uncertainty and the double
stream that result. The pure rising and falling pitch gestures are still clearly
perceptible, while the arches are less present. By setting w in eq. (1) to an intermediate
value in (0,1), e.g. 0.2 or 0.8, more realistic and agreeable rising and falling gestures
result. Still, the arches are more natural to the ear, while the rising and falling
demand more attention, in particular perhaps the rising gestures.

5 Conclusion
The automatic generation of music is in need of a model to render the music expressive.
This model is found using knowledge from studies of the time perception of music, and
from further studies of the cognitive and perceptual aspects of rhythm. Indeed, the
generative model consists of three sources, corresponding to the immediate
microtemporal, the present mesotemporal and the long-term memory macrotemporal.
This corresponds to the note, the gesture and the form in music. While a single stream
in each of the sources may not be sufficient, so far the model incorporates the
macrotemporal superchunk, the metrical mesotemporal chunk and the microtemporal
expressive enhancements. The work presented here has introduced gestures in the
pitch contour, corresponding to the rising and falling slopes, and to the positive and
negative arches, which adds a perceptual stream to the more rigid meter stream.
The normal beat is given by different researchers to be approximately 100 BPM,
and Fraisse (1982) furthermore shows the existence of two main note durations, one
above and one below 0.4 sec, with a ratio of two. Indications as to subjective time,
given by Zwicker and Fastl (1999), are yet to be investigated, but these may well
create uneven temporal intervals in conflict with the pulse.
The inclusion of the pitch gesture model certainly, in the author’s opinion, renders
the music more enjoyable, but more work remains before the generative model is
ready for general-purpose uses.

References
1. Fraisse, P.: Rhythm and Tempo. In: Deutsch, D. (ed.) The Psychology of Music, 1st edn.,
pp. 149–180. Academic Press, New York (1982)
2. Friberg, A.: Performance Rules for Computer-Controlled Contemporary Keyboard Music.
Computer Music Journal 15(2), 49–55 (1991)
3. Gordon, J.W.: The perceptual attack time of musical tones. Journal of the Acoustical
Society of America, 88–105 (1987)
4. Handel, S.: Listening. MIT Press, Cambridge (1989)
5. Huron, D.: The Melodic Arch in Western Folk songs. Computing in Musicology 10, 3–23
(1996)
6. Jensen, K.: Timbre Models of Musical Sounds, PhD Dissertation, DIKU Report 99/7
(1999)
7. Jensen, K.: Investigation on Meter in Generative Modeling of Music. In: Proceedings of
the CMMR, Malaga, June 21-24 (2010)
8. Jensen, K., Kühl, O.: Towards a model of musical chunks. In: Ystad, S., Kronland-
Martinet, R., Jensen, K. (eds.) CMMR 2008. LNCS, vol. 5493, pp. 81–92. Springer,
Heidelberg (2009)
9. Kühl, O., Jensen, K.: Retrieving and recreating musical form. In: Kronland-Martinet, R.,
Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 270–282. Springer,
Heidelberg (2008)
10. Kühl, O.: Musical Semantics. Peter Lang, Bern (2007)
11. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. The MIT Press,
Cambridge (1983)

12. Malbrán, S.: Phases in Children’s Rhythmic Development. In: Zatorre, R., Peretz, I. (eds.)
The Biological Foundations of Music. Annals of the New York Academy of Sciences
(2000)
13. Papadopoulos, G., Wiggins, G.: AI methods for algorithmic composition: a survey, a
critical view and future prospects. In: AISB Symposium on Musical Creativity, pp.
110–117 (1999)
14. Patel, A., Peretz, I.: Is music autonomous from language? A neuropsychological appraisal.
In: Deliège, I., Sloboda, J. (eds.) Perception and cognition of music, pp. 191–215.
Psychology Press, Hove (1997)
15. Samson, S., Ehrlé, N., Baulac, M.: Cerebral Substrates for Musical Temporal Processes.
In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New
York Academy of Sciences (2000)
16. Snyder, B.: Music and Memory. An Introduction. The MIT Press, Cambridge (2000)
17. The Digital Tradition (2010), http://www.mudcat.org/AboutDigiTrad.cfm
(visited December 1, 2010)
18. Thomassen, J.M.: Melodic accent: Experiments and a tentative model. J. Acoust. Soc.
Am. 71(6), 1596–1605 (1982)
19. Vos, P.G., Troost, J.M.: Ascending and Descending Melodic Intervals: Statistical Findings
and Their Perceptual Relevance. Music Perception 6(4), 383–396 (1989)
20. Widmer, G.: Machine discoveries: A few simple, robust local expression principles.
Journal of New Music Research 31, 37–50 (2002)
21. Zwicker, E., Fastl, H.: Psychoacoustics: facts and models, 2nd edn. Springer series in
information sciences. Springer, Berlin (1999)
An Entropy Based Method for Local
Time-Adaptation of the Spectrogram

Marco Liuni1,2, , Axel Röbel2 , Marco Romito1 , and Xavier Rodet2


1
Universitá di Firenze, Dip. di Matematica ”U. Dini”
Viale Morgagni, 67/a - 50134 Florence - ITALY
2
IRCAM - CNRS STMS, Analysis/Synthesis Team
1, Place Igor-Stravinsky - 75004 Paris - FRANCE
{marco.liuni,axel.roebel,xavier.rodet}@ircam.fr
[email protected]
http://www.ircam.fr/anasyn.html

Abstract. We propose a method for automatic local time-adaptation of


the spectrogram of audio signals: it is based on the decomposition of a
signal within a Gabor multi-frame through the STFT operator. The spar-
sity of the analysis in every individual frame of the multi-frame is eval-
uated through Rényi entropy measures: the best local resolution is
determined by minimizing the entropy values. The overall spectrogram of the
signal we obtain thus provides local optimal resolution adaptively evolv-
ing over time. We give examples of the performance of our algorithm with
an instrumental sound and a synthetic one, showing the improvement in
spectrogram displaying obtained with an automatic adaptation of the res-
olution. The analysis operator is invertible, thus leading to a perfect re-
construction of the original signal through the analysis coefficients.

Keywords: adaptive spectrogram, sound representation, sound analy-


sis, sound synthesis, Rényi entropy, sparsity measures, frame theory.

1 Introduction
Far from being restricted to entertainment, sound processing techniques are re-
quired in many different domains: they find applications in medical sciences,
security instruments, and communications, among others. The most challenging class
of signals to consider is indeed music: the completely new perspective opened
by contemporary music, which assigns a fundamental role to concepts such as noise and
timbre, gives musical potential to every sound.
The standard techniques of digital analysis are based on the decomposition
of the signal in a system of elementary functions, and the choice of a specific
system necessarily has an influence on the result. Traditional methods based on
single sets of atomic functions have important limits: a Gabor frame imposes a
fixed resolution over the whole time-frequency plane, while a wavelet frame gives a
strictly determined variation of the resolution; moreover, the user is frequently

This work is supported by grants from Region Ile-de-France.


asked to define the analysis window features himself, which in general is not a
simple task even for experienced users. This motivates the search for adaptive
methods of sound analysis and synthesis, and for algorithms whose parameters
are designed to change according to the analyzed signal features. Our research
is focused on the development of mathematical models and tools based on the
local automatic adaptation of the system of functions used for the decomposition
of the signal: we are interested in a complete framework for analysis, spectral
transformation and re-synthesis; thus we need to define an efficient strategy to
reconstruct the signal through the adapted decomposition, which must give a
perfect recovery of the input if no transformation is applied.
Here we propose a method for local automatic time-adaptation of the Short
Time Fourier Transform window function, through a minimization of the Rényi
entropy [22] of the spectrogram; we then define a re-synthesis technique with
an extension of the method proposed in [11]. Our approach can be presented
schematically in three parts:

1. a model for signal analysis exploiting concepts of Harmonic Analysis, and


Frame Theory in particular: it is a generally highly redundant decomposing
system belonging to the class of multiple Gabor frames [8],[14];
2. a sparsity measure defined on time-frequency localized subsets of the analysis
coefficients, in order to determine local optimal concentration;
3. a reduced representation obtained from the original analysis using the in-
formation about optimal concentration, and a synthesis method through an
expansion in the reduced system obtained.

We have realized a first implementation of this scheme in two different versions:


for both of them a sparsity measure is applied on subsets of analysis coefficients
covering the whole frequency dimension, thus defining a time-adapted analysis
of the signal. The main difference between the two concerns the first part of the
model, that is the single frames composing the multiple Gabor frame. This is a
key point as the first and third part of the scheme are strictly linked: the frame
used for re-synthesis is a reduction of the original multi-frame, so the entire
model depends on how the analysis multi-frame is designed. The section Frame
Theory in Sound Analysis and Synthesis treats this part of our research in more
details.
The second point of the scheme is related to the measure applied on the coeffi-
cients of the analysis within the multi-frame to determine local best resolutions.
We consider measures borrowed from Information Theory and Probability The-
ory according to the interpretation of the analysis within a frame as a probability
density [4]: our model is based on a class of entropy measures known as Rényi
entropies which extend the classical Shannon entropy. The fundamental idea
is that minimizing the complexity or information over a set of time-frequency
representations of the same signal is equivalent to maximizing the concentra-
tion and peakiness of the analysis, thus selecting the best resolution tradeoff
[1]: in the section Rényi Entropy of Spectrograms we describe how a sparsity
measure can consequently be defined through an information measure. Finally,

in the fourth section we provide a description of our algorithm and examples of


adapted spectrograms for different sounds.
Some examples of this approach can be found in the literature: the idea of
gathering a sparsity measure from Rényi entropies is detailed in [1], and in [13]
a local time-frequency adaptive framework is presented exploiting this concept,
even if no methods for perfect reconstruction are provided. In [21] sparsity is
obtained through a regression model; a recent development in this sense is con-
tained in [14], where a class of methods for analysis adaptation is obtained sepa-
rately in the time and frequency dimension, together with perfect reconstruction
formulas; however, no strategies for automatization are employed, and adaptation
has to be managed by the user. The model conceived in [18] belongs to this
same class but presents several novelties in the construction of the Gabor multi-
frame and in the method for automatic local time-adaptation. In [15] another
time-frequency adaptive spectrogram is defined considering a sparsity measure
called energy smearing, without taking into account the re-synthesis task. The
concept of quilted frame, recently introduced in [9], is the first promising effort
to establish a unified mathematical model for all the various frameworks cited
above.

2 Frame Theory in Sound Analysis and Synthesis


When analyzing a signal through its decomposition, the features of the repre-
sentation are influenced by the decomposing functions; the Frame Theory (see
[3],[12] for detailed mathematical descriptions) allows a unified approach when
dealing with different bases and systems, studying the properties of the operators
that they identify. The concept of frame extends the one of orthonormal basis in
a Hilbert space, and it provides a theory for the discretization of time-frequency
densities and operators [8], [20], [2]. Both the STFT and the Wavelet transform
can be interpreted within this setting (see [16] for a comprehensive survey of
theory and applications).
Here we summarize the basic definitions and theorems, and outline the fun-
damental step consisting in the introduction of Multiple Gabor Frames, which
is comprehensively treated in [8]. The problem of standard frames is that the
decomposing atoms are defined from the same original function, thus impos-
ing a limit on the type of information that one can deduce from the analysis
coefficients; if we were able to consider frames where several families of atoms
coexist, then we would have an analysis with variable information, at the price
of a higher redundancy.

2.1 Basic Definitions and Results


Given a Hilbert space H seen as a vector space on C, with its own scalar product,
we consider in H a set of vectors {φγ }γ∈Γ where the index set Γ may be infinite
and γ can also be a multi-index. The set {φγ }γ∈Γ is a frame for H if there exist
two positive non zero constants A and B, called frame bounds, such that for all
f ∈ H,
A Method for Local Time-Adaptation of the Spectrogram 63


Af 2 ≤ | f, φγ
|2 ≤ Bf 2 . (1)
γ∈Γ

We are interested in the case H = L2 (R) and Γ countable, as it represents


the standard situation where a signal f is decomposed through a countable set
of given functions {φk }k∈Z . The frame bounds A and B are the infimum and
supremum, respectively, of the eigenvalues of the frame operator U, defined as
Uf = Σk∈Z ⟨f, φk⟩ φk .   (2)

For any frame {φk}k∈Z there exist dual frames {φ̃k}k∈Z such that for all f ∈ L2(R)
f = Σk∈Z ⟨f, φk⟩ φ̃k = Σk∈Z ⟨f, φ̃k⟩ φk ,   (3)
so that given a frame it is always possible to perfectly reconstruct a signal f
using the coefficients of its decomposition through the frame. The inverse of the
frame operator allows the calculation of the canonical dual frame
φ̃k = U−1 φk (4)
which guarantees minimal-norm coefficients in the expansion.
A Gabor frame is obtained by time-shifting and frequency-transposing a win-
dow function g according to a regular grid. They are particularly interesting in
the applications as the analysis coefficients are simply given by sampling the
STFT of f with window g according to the nodes of a specified lattice. Given a
time step a and a frequency step b we write {un }n∈Z = an and {ξk }k∈Z = bk;
these two sequences generate the nodes of the time-frequency lattice Λ for the
frame {gn,k }(n,k)∈Z2 defined as
gn,k(t) = g(t − un) e^(2πi ξk t) ;   (5)
the nodes are the centers of the Heisenberg boxes associated to the windows in
the frame. The lattice has to satisfy certain conditions for {gn,k } to be a frame
[7], which impose limits on the choice of the time and frequency steps: for certain
choices [6] which are often adopted in standard applications, the frame operator
takes the form of a multiplication,
Uf(t) = b⁻¹ ( Σn∈Z |g(t − un)|² ) f(t) ,   (6)

and the dual frame is easily calculated by means of a straight multiplication of


the atoms in the original frame. The relation between the steps a, b and the
frame bounds A, B in this case is clear by (6), as the frame condition implies

0 < A ≤ b⁻¹ Σn∈Z |g(t − un)|² ≤ B < ∞ .   (7)

Thus we see that the frame bounds provide also information on the redundancy
of the decomposition of the signal within the frame.
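For illustration, a short Python sketch of the canonical dual window in this case (Eqs. (4) and (6)) follows, for a periodized discrete signal; the discrete hop a (in samples), the frequency step b and the function name are assumptions of the sketch, which supposes that condition (7) holds.

import numpy as np

def canonical_dual_window(g, a, b, length):
    """Canonical dual of the window g for a Gabor frame with hop a (samples)
    and frequency step b, on a periodized signal of `length` samples.
    The frame operator is the multiplication by b^(-1) * sum_n |g(t - u_n)|^2,
    so the dual window is g divided by that function on its support."""
    diag = np.zeros(length)
    for n in range(0, length, a):
        idx = (n + np.arange(len(g))) % length      # periodic time shift u_n = n a
        diag[idx] += np.abs(g) ** 2 / b             # diagonal of U, Eq. (6)
    return g / diag[:len(g)]                        # Eq. (4): U^(-1) applied to the atom at u_0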

2.2 Multiple Gabor Frames

In our adaptive framework, we look for a method to achieve an analysis with


multiple resolutions: thus we need to combine the information coming from the
decompositions of a signal in several frames of different window functions. Mul-
tiple Gabor frames have been introduced in [22] to provide the original Gabor
analysis with flexible multi-resolution techniques: given a set of indices L ⊆ Z and
different frames {gˡn,k}(n,k)∈Z² with l ∈ L, a multiple Gabor frame is obtained
as the union of the single given frames. The different gˡ do not necessarily share
the same type or shape: in our method an original window is modified with a
finite number of scalings
gˡ(t) = (1/√l) g(t/l) ;   (8)
then all the scaled versions are used to build |L| different frames which constitute
the initial multi-frame.
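A discrete version of this scaling can be sketched as follows (Python; the linear-interpolation resampling is an assumption of the sketch, since the paper does not specify the discretization).

import numpy as np

def scaled_window(g, l):
    """Return a discrete version of g^l(t) = (1/sqrt(l)) g(t/l), Eq. (8):
    the window is stretched by the factor l and normalized by sqrt(l)."""
    n = int(round(l * len(g)))
    t = np.linspace(0.0, len(g) - 1.0, n)        # resampled support, l times longer
    return np.interp(t, np.arange(len(g)), g) / np.sqrt(l)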
A Gabor multi-frame has in general a significant redundancy which lowers the
readability of the analysis. A possible strategy to overcome this limit is proposed
in [14] where nonstationary Gabor frames are introduced, actually allowing the
choice of a different window for each time location of a global irregular lattice
Λ, or alternatively for each frequency location. This way, the window chosen is a
function of time or frequency position in the time-frequency space, not both. In
most applications, for this kind of frame there exist fast FFT based methods for
the analysis and re-synthesis steps. Referring to the time case, with the abuse
of notation gn(l) we indicate the window g l centered at a certain time n(l) = un
which is a function of the chosen window itself. Thus, a nonstationary Gabor
frame is given by the set of atoms

{gn(l) e^(2πi bl k t) , (n(l), bl k) ∈ Λ} ,   (9)

where bl is the frequency step associated to the window g l and k ∈ Z . If we


suppose that the windows g l have limited time support and a sufficiently small
frequency step bl, the frame operator U takes a similar form to the one in (6),
Uf(t) = Σn(l) (1/bl) |gn(l)(t)|² f(t) .   (10)


Here, if N = l b1l |gn(l) (s)|2  1 then U is invertible and the set (9) is a frame
whose dual frame is given by
1
g̃n(l),k (t) = gn(l) (t)e2πibl kt . (11)
N
Nonstationary Gabor frames belong to the recently introduced class of quilted
frames [9]: in this kind of decomposing system the choice of the analysis window
depends on both the time and the frequency location, causing more difficulties

for an analytic fast computation of a dual frame as in (11): future improvements


of our research concern the employment of such a decomposition model for au-
tomatic local adaptation of the spectrogram resolution both in the time and the
frequency dimension.

3 Rényi Entropy of Spectrograms


We consider the discrete spectrogram of a signal as a sampling of the square of
its continuous version

$PS_f(u,\xi) = |S_f(u,\xi)|^2 = \left| \int f(t)\, g(t-u)\, e^{-2\pi i \xi t}\, dt \right|^2 ,$  (12)

where f is a signal, g is a window function and $S_f(u,\xi)$ is the STFT of f through g.
Such a sampling is obtained according to a regular lattice Λab , considering a
Gabor frame (5),
PSf [n, k] = |Sf [un , ξk ]|2 . (13)
With an appropriate normalization both the continuous and discrete spectro-
gram can be interpreted as probability densities. Thanks to this interpretation,
some techniques belonging to the domain of Probability and Information The-
ory can be applied to our problem: in particular, the concept of entropy can be
extended to give a sparsity measure of a time-frequency density. A promising
approach [1] takes into account Rényi entropies, a generalization of the Shannon
entropy: the application to our problem is related to the concept that mini-
mizing the complexity or information of a set of time-frequency representations
of a same signal is equivalent to maximizing the concentration, peakiness, and
therefore the sparsity of the analysis. Thus we will consider as best analysis the
sparsest one, according to the minimal entropy evaluation.
Given a signal f and its spectrogram $PS_f$ as in (12), the Rényi entropy of order $\alpha > 0$, $\alpha \neq 1$, of $PS_f$ is defined as follows

$H^R_\alpha(PS_f) = \frac{1}{1-\alpha} \log_2 \iint_R \left( \frac{PS_f(u,\xi)}{\iint_R PS_f(u',\xi')\, du'\, d\xi'} \right)^{\alpha} du\, d\xi ,$  (14)


where R ⊆ R2 and we omit its indication if equality holds. Given a discrete


spectrogram with time step a and frequency step b as in (13), we consider R as
a rectangle of the time-frequency plane R = [t1 , t2 ] × [ν1 , ν2 ] ⊆ R2 . It identifies
a sequence of points G ⊆ Λab where G = {(n, k) ∈ Z2 : t1 ≤ na ≤ t2 , ν1 ≤ kb ≤
ν2 }. As a discretization of the original continuous spectrogram, every sample in
PSf [n, k] is related to a time-frequency region of area ab; we thus obtain the
discrete Rényi entropy measure directly from (14),

$H^G_\alpha[PS_f] = \frac{1}{1-\alpha} \log_2 \sum_{[n,k]\in G} \left( \frac{PS_f[n,k]}{\sum_{[n',k']\in G} PS_f[n',k']} \right)^{\alpha} + \log_2(ab) .$  (15)

We will focus on discretized spectrograms with a finite number of coefficients,


as dealing with digital signal processing requires working with finite sampled signals and distributions.
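A direct implementation of the discrete measure (15) only requires normalizing the spectrogram coefficients over the region G and adding the $\log_2(ab)$ correction; the following Python sketch (our illustration, with arbitrary test data and lattice steps) makes this explicit.

```python
# Sketch: discrete Rényi entropy of Eq. (15) for the spectrogram coefficients
# inside a region G. PS is a 2-D array of |STFT|^2 values restricted to G,
# and a, b are the time and frequency steps of the lattice (the numerical
# values below are arbitrary assumptions).
import numpy as np

def renyi_entropy(PS, alpha, a, b):
    p = PS / PS.sum()                           # normalize to a probability density
    if np.isclose(alpha, 1.0):                  # limit case: Shannon entropy
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    else:
        h = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
    return h + np.log2(a * b)

PS = np.abs(np.random.randn(128, 64)) ** 2      # placeholder spectrogram region
print(renyi_entropy(PS, alpha=0.7, a=256.0 / 44100, b=44100.0 / 4096))
```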
Among the general properties of Rényi entropies [17], [19] and [23] we recall
in particular those directly related with our problem. It is easy to show that for
every finite discrete probability density P the entropy Hα (P ) tends to coincide
with the Shannon entropy of P as the order α tends to one. Moreover, $H_\alpha(P)$ is a non-increasing function of α, so

$\alpha_1 < \alpha_2 \;\Rightarrow\; H_{\alpha_1}(P) \ge H_{\alpha_2}(P) .$  (16)

As we are working with finite discrete densities we can also consider the case α = 0, which is simply the logarithm of the number of non-zero elements in P; as a consequence $H_0(P) \ge H_\alpha(P)$ for every admissible order α.
A third basic fact is that for every order α the Rényi entropy Hα is maximum
when P is uniformly distributed, while it is minimum and equal to zero when P
has a single non-zero value.
All of these results give useful information on the values of different measures
on a single density P as in (15), while the relations between the entropies of two
different densities P and Q are in general hard to determine analytically; in our
problem, P and Q are two spectrograms of a signal in the same time-frequency
area, based on two window functions with different scaling as in (8). In some
basic cases such a relation is achievable, as shown in the following example.

3.1 Best Window for Sinusoids

When the spectrograms of a signal through different window functions do not


depend on time, it is easy to compare their entropies: let PSs be the sampled
spectrogram of a sinusoid s over a finite region G with a window function g of
compact support; then PSs is simply a translation in the frequency domain of
ĝ, the Fourier transform of the window, and it is therefore time-independent.
We choose a bounded set L of admissible scaling factors, so that the discretized
support of the scaled windows g l still remains inside G for any l ∈ L. It is not
hard to prove that the entropy of a spectrogram taken with such a scaled version
of g is given by
$H^G_\alpha(PS^l_s) = H^G_\alpha(PS_s) - \log_2 l .$  (17)

The sparsity measure we are using chooses as best window the one which mini-
mizes the entropy measure: we deduce from (17) that it is the one obtained with
the largest scaling factor available, therefore with the largest time-support. This
is coherent with our expectation, as stationary signals, such as sinusoids, are best analyzed with a high frequency resolution, because time-independence allows a coarse time resolution. Moreover, this is true for any order α used for the entropy
calculus. Symmetric considerations apply whenever the spectrogram of a signal
does not depend on frequency, as for impulses.

3.2 The α Parameter


The α parameter in (14) introduces a biasing on the spectral coefficients; to give a qualitative description of this biasing, we consider a collection of simple spectrograms composed of a variable amount of large and small coefficients. We realize a vector D of length N = 100 by generating numbers between 0 and 1 with a normal random distribution; then we consider the vectors $D_M$, $1 \le M \le N$, such that

$D_M[k] = \begin{cases} D[k] & \text{if } k \le M \\ D[k]/20 & \text{if } k > M \end{cases}$  (18)

and then normalize to obtain a unitary sum. We then apply Rényi entropy
measures with α varying between 0 and 30: as we see from figure 1, there is a
relation between M and the slope of the entropy curves for the different values
of α. For α = 0, H0 [DM ] is the logarithm of the number of non-zero coefficients
and it is therefore constant; when α increases, we see that densities with a small
amount of large coefficients gradually decrease their entropy, faster than the
almost flat vectors corresponding to larger values of M . This means that by
increasing α we emphasize the difference between the entropy values of a peaky
distribution and that of a nearly flat one. The sparsity measure we consider selects as best analysis the one with minimal entropy, so reducing α raises the probability that less peaky distributions are chosen as sparsest: in principle, this is desirable as weaker components of the signal, such as partials, have to be taken into account in the sparsity evaluation. At the same time, this principle should be applied with care, as a small coefficient in a spectrogram could be determined by a partial as well as by a noise component; with an extremely small α, the chosen best window could vary with the noise level within the sound, without a reliable relation to spectral concentration.
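The experiment of (18) and Figure 1 can be reproduced qualitatively with a few lines of Python; the sketch below is our illustration, and the choice of a uniform random generator for D is an assumption (the text describes a normal distribution of values between 0 and 1).

```python
# Sketch of the experiment of Eq. (18): Rényi entropies of the D_M vectors for
# several orders alpha. A uniform generator is used for D here; the larger M,
# the flatter the density and the slower the decrease of entropy with alpha.
import numpy as np

N = 100
D = np.random.rand(N)                      # values in (0, 1)
alphas = np.linspace(0.1, 30.0, 60)

def renyi(p, alpha):
    p = p / p.sum()
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

H = np.zeros((N, len(alphas)))
for M in range(1, N + 1):
    DM = np.where(np.arange(N) < M, D, D / 20.0)   # Eq. (18), normalized in renyi()
    for j, alpha in enumerate(alphas):
        H[M - 1, j] = renyi(DM, alpha)
# H[M-1, :] reproduces one entropy curve of Figure 1 for each M.
```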

6
entropy

3 0
5
10
2 15
0 20
20 alpha
40 25
60
80 30
M 100

Fig. 1. Rényi entropy evaluations of the DM vectors with varying α; the distribution
becomes flatter as M increases

3.3 Time and Frequency Steps


A last remark regards the dependency of (15) on the time and frequency step a
and b used for the discretization of the spectrogram. When considering signals as
finite vectors, a signal and its Fourier Transform have the same length. Therefore
in the STFT the window length determines the number of frequency points, while the sampling rate sets the frequency values: the definition of b is thus implicit in the window choice. Actually, the FFT algorithm allows one to specify a number
of frequency points larger than the signal length: further frequency values are
obtained as an interpolation between the original ones by properly adding zero
values to the signal. If the sampling rate is fixed, this procedure causes a smaller
b as a consequence of a larger number of frequency points. We have numerically
verified that such a variation of b has no impact on the entropy calculus, so that
the FFT size can be set according to implementation needs.
Regarding the time step a, we are working on the analytical demonstration of a largely verified empirical observation: as long as the decomposing system is a frame the
entropy measure is invariant to redundancy variation, so the choice of a can be
ruled by considerations on the invertibility of the decomposing frame without
losing coherence between the information measure of the different analyses. This
is a key point, as it states that the sparsity measure obtained allows a total
independence between the hop sizes of the different analyses: with the imple-
mentation of proper structures to handle multi-hop STFTs we have obtained a
more efficient algorithm in comparison with those imposing a fixed hop size, as
[15] and the first version of the one we have realized.

4 Algorithm and Examples


We now summarize the main operations of the algorithm we have developed
providing examples of its application. For the calculation of the spectrograms
we use a Hanning window

$h(t) = \cos^2(\pi t)\, \chi_{[-\frac{1}{2},\frac{1}{2}]}(t) ,$  (19)

with χ the indicator function of the specified interval, but it is obviously possible
to generalize the results thus obtained to the entire class of compactly supported
window functions. In both the versions of our algorithm we create a multiple
Gabor frame as in (5), using as mother functions some scaled version of h,
obtained as in (8) with a finite set of positive real scaling factors L.
We consider consecutive segments of the signal, and for each segment we
calculate |L| spectrograms with the |L| scaled windows: the length of the anal-
ysis segment and the overlap between two consecutive segments are given as
parameters.
In the first version of the algorithm the different frames composing the multi-
frame have the same time step a and frequency step b: this guarantees that for
each signal segment the different frames have Heisenberg boxes whose centers
lay on a same lattice on the time-frequency plane, as illustrated in figure 2. To

Fig. 2. An analysis segment: time locations of the Heisenberg boxes associated to the multi-frame used in the first version of our algorithm.

Fig. 3. Two different spectrograms of a B4 note played by a marimba, with Hanning windows of sizes 512 (top) and 4096 (bottom) samples.

guarantee that all the |L| scaled windows constitute a frame when translated
and modulated according to this global lattice, the time step a must be set
with the hop size assigned to the smallest window frame. On the other hand, as
the FFT of a discrete signal has the same number of points as the signal itself, the frequency step b has to be set according to the FFT size of the largest window analysis: for
the smaller ones, a zero-padding is performed.

Fig. 4. Example of an adaptive analysis performed by the first version of our algorithm with four Hanning windows of different sizes (512, 1024, 2048 and 4096 samples) on a B4 note played by a marimba: on top, the best window chosen as a function of time; at the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis segment contains twenty-four analysis frames with a sixteen-frame overlap between consecutive segments.

Each signal segment identifies a time-frequency rectangle G for the entropy


evaluation: the horizontal edge is the time interval of the considered segment,
while the vertical one is the whole frequency lattice. For each spectrogram, the
rectangle G defines a subset of coefficients belonging to G itself. The |L| different
subsets do not correspond to the same part of signal, as windows have different
time supports. Therefore, a preliminary weighting of the signal has to be per-
formed before the calculations of the local spectrograms: this step is necessary to
balance the influence on the entropy calculus between coefficients which regard
parts of signal shared or not shared by the different analysis frames.
After the pre-weighting, we calculate the entropy of every spectrogram as in
(15). Having the |L| entropy values corresponding to the different local spec-
trograms, the sparsest local analysis is defined as the one with minimum Rényi
entropy: the window associated to the sparsest local analysis is chosen as best
window at all the time points contained in G.
The global time-adapted analysis of the signal is finally realized by suitably assembling the slices of local sparsest analyses: they are obtained with
a further spectrogram calculation of the unweighted signal, employing the best
windows selected at each time point.
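The selection loop of this first version can be sketched as follows (a simplified Python illustration under assumed parameters: a common hop equal to half the smallest window, an FFT size equal to the largest window, and no pre-weighting of the segment borders).

```python
# Simplified sketch of the first version of the adaptive analysis: for each
# signal segment, compute |L| spectrograms with scaled Hann windows sharing a
# common hop and FFT size, measure their Rényi entropies, and keep the window
# giving the minimal (sparsest) value. Pre-weighting of the segment borders is
# omitted, and the segment is assumed to be longer than the largest window.
import numpy as np

def spectrogram(x, win, hop, n_fft):
    frames = [x[i:i + len(win)] * win
              for i in range(0, len(x) - len(win) + 1, hop)]
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

def renyi(PS, alpha=0.7):
    p = PS / PS.sum()
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def best_window(segment, sizes=(512, 1024, 2048, 4096), alpha=0.7):
    hop, n_fft = min(sizes) // 2, max(sizes)   # common time and frequency steps
    entropies = []
    for size in sizes:
        PS = spectrogram(segment, np.hanning(size), hop, n_fft)
        # the +log2(ab) term of (15) is the same for every window here,
        # so it does not affect the argmin and is omitted
        entropies.append(renyi(PS, alpha))
    return sizes[int(np.argmin(entropies))]    # window of the sparsest analysis
```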
In figure 4 we give an example of an adaptive analysis performed by our first
algorithm with four Hanning windows of different sizes on a real instrumental
sound, a B4 note played by a marimba: this sound combines the need for a

Fig. 5. An analysis segment: time locations of the Heisenberg boxes associated to the multi-frame used in the second version of our algorithm.

good time resolution at the strike with that of a good frequency resolution on
the harmonic resonance. This is fully provided by the algorithm, as shown in
the adaptive spectrogram at the bottom of the figure 4. Moreover, we see that
the pre-echo of the analysis at the bottom of figure 3 is completely removed in
the adapted spectrogram.
The main difference in the second version of our algorithm concerns the indi-
vidual frames composing the multi-frame, which have the same frequency step
b but different time steps {al : l ∈ L}: the smallest and largest window sizes are
given as parameters together with |L|, the number of different windows com-
posing the multi-frame, and the global overlap needed for the analyses. The
algorithm fixes the intermediate sizes so that, for each signal segment, the dif-
ferent frames have the same overlap between consecutive windows, and so the
same redundancy.
This choice highly reduces the computational cost by avoiding unnecessary
small hop sizes for the larger windows, and as we have observed in the previous
section it does not affect the entropy evaluation. Such a structure generates an
irregular time disposition of the multi-frame elements in each signal segment,
as illustrated in figure 5; in this way we also avoid the problem of unshared
parts of signal between the systems, but we still have a different influence of the
boundary parts depending on the analysis frame: the beginning and the end of
the signal segment have a higher energy when windowed in the smaller frames.
This is avoided with a preliminary weighting: the beginning and the end of each
signal segment are windowed respectively with the first and second half of the
largest analysis window.
As for the first implementation, the weighting does not concern the decomposi-
tion for re-synthesis purpose, but only the analyses used for entropy evaluations.


Fig. 6. Example of an adaptive analysis performed by the second version of our algo-
rithm with eight Hanning windows of different sizes from 512 to 4096 samples, on a
B4 note played by a marimba sampled at 44.1kHz: on top, the best window chosen
as a function of time; at the bottom, the adaptive spectrogram. The entropy order is
α = 0.7 and each analysis segment contains four frames of the largest window analysis
with a two-frames overlap between consecutive segments.

After the pre-weighting, the algorithm follows the same steps described above:
calculation of the |L| local spectrograms, evaluation of their entropy, selection of
the window providing minimum entropy, computation of the adapted spectro-
gram with the best window at each time point, thus creating an analysis with
time-varying resolution and hop size.
In figure 6 we give a first example of an adaptive analysis performed by the
second version of our algorithm with eight Hanning windows of different sizes:
the sound is still the B4 note of a marimba, and we can see that the two versions
give very similar results. Thus, if the considered application does not specifically
ask for a fixed hop size of the overall analysis, the second version is preferable
as it highly reduces the computational cost without affecting the best window
choice.
In figure 8 we give a second example with a synthetic sound, a sinusoid with si-
nusoidal frequency modulation: as figure 7 shows, a small window is best adapted
where the frequency variation is fast compared to the window length; on the
other hand, the largest window is better where the signal is almost stationary.

4.1 Re-synthesis Method


The re-synthesis method introduced in [11] gives a perfect reconstruction of the
signal as a weighted expansion of the coefficients of its STFT in the original
analysis frame. Let Sf [n, k] be the STFT of a signal f with window function h
and time step a; fixing n, through an iFFT we have a windowed segment of f

Fig. 7. Two different spectrograms of a sinusoid with sinusoidal frequency modulation, with Hanning windows of sizes 512 (top) and 4096 (bottom) samples.

$f_h(n,l) = h(na - l)\, f(l) ,$  (20)


whose time location depends on n. An immediate perfect reconstruction of f is given by

$f(l) = \frac{\sum_{n=-\infty}^{+\infty} h(na - l)\, f_h(n,l)}{\sum_{n=-\infty}^{+\infty} h^2(na - l)} .$  (21)

In our case, after the automatic selection step we have at hand a temporal sequence with the best windows at each time position; in the first version we have a
fixed hop for all the windows, in the second one every window has its own
time step. In both the cases we have thus reduced the initial multi-frame to
a nonstationary Gabor frame: we extend the same technique of (21) using a
variable window h and time step a according to the composition of the reduced
multi-frame, obtaining a perfect reconstruction as well. The interest of (21) is
that the given distribution does not need to be the STFT of a signal: for example,
a transformation S ∗ [n, k] of the STFT of a signal could be considered. In this
case, (21) gives the signal whose STFT has minimal least squares error with
S ∗ [n, k].
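For a fixed window and hop, (21) amounts to a weighted overlap-add of the (possibly modified) windowed segments; the following Python sketch is our illustration of this reconstruction for the fixed-window case, not the adaptive multi-window implementation.

```python
# Sketch of the reconstruction formula (21) for a fixed window h and hop a:
# weighted overlap-add of the (possibly modified) windowed segments, divided
# by the summed squared window. The adaptive case applies the same formula
# with the selected window and hop in each segment.
import numpy as np

def wola_reconstruct(frames, h, a, sig_len):
    """frames[n] corresponds to f_h(n, .) over the support of h."""
    num = np.zeros(sig_len)
    den = np.zeros(sig_len)
    for n, fr in enumerate(frames):
        s = n * a
        num[s:s + len(h)] += h * fr
        den[s:s + len(h)] += h ** 2
    den[den == 0] = 1.0                 # avoid division by zero at the borders
    return num / den

# Round-trip example: analysis frames f_h(n, l) = h(na - l) f(l), then Eq. (21)
x = np.random.randn(4096)
h, a = np.hanning(1024), 256
frames = [h * x[i:i + len(h)] for i in range(0, len(x) - len(h) + 1, a)]
x_rec = wola_reconstruct(frames, h, a, len(x))   # equals x wherever den > 0
```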
As seen by the equations (9) and (11), the theoretical existence and the math-
ematical definition of the canonical dual frame for a nonstationary Gabor frame
like the one we employ has been provided [14]: it is thus possible to define the
whole analysis and re-synthesis framework within the Gabor theory. We are at
present working on the interesting analogies between the two approaches, to
establish a unified interpretation and develop further extensions.

Fig. 8. Example of an adaptive analysis performed by the second version of our al-
gorithm with eight Hanning windows of different sizes from 512 to 4096 samples, on
a sinusoid with sinusoidal frequency modulation synthesized at 44.1 kHz: on top, the
best window chosen as a function of time; at the bottom, the adaptive spectrogram.
The entropy order is α = 0.7 and each analysis segment contains four frames of the
largest window analysis with a three-frames overlap between consecutive segments.

5 Conclusions
We have presented an algorithm for time-adaptation of the spectrogram reso-
lution, which can be easily integrated in existing frameworks for analysis, transformation and re-synthesis of an audio signal: the adaptation is locally obtained
through an entropy minimization within a finite set of resolutions, which can be
defined by the user or left as default. The user can also specify the time duration
and overlap of the analysis segments where entropy minimization is performed,
to privilege more or less discontinuous adapted analyses.
Future improvements of this method will concern the spectrogram adaptation
in both time and frequency dimensions: this will provide a decomposition of the
signal in several layers of analysis frames, thus requiring an extension of the
proposed technique for re-synthesis.

References
1. Baraniuk, R.G., Flandrin, P., Janssen, A.J.E.M., Michel, O.J.J.: Measuring Time-
Frequency Information Content Using the Rényi Entropies. IEEE Trans. Info. The-
ory 47(4) (2001)
2. Borichev, A., Gröchenig, K., Lyubarskii, Y.: Frame constants of gabor frames near
the critical density. J. Math. Pures Appl. 94(2) (2010)

3. Christensen, O.: An Introduction To Frames And Riesz Bases. Birkhäuser, Boston (2003)
4. Cohen, L.: Time-Frequency Distributions-A Review. Proceedings of the IEEE 77(7)
(1989)
5. Cohen, L.: Time-Frequency Analysis. Prentice-Hall, Upper Saddle River (1995)
6. Daubechies, I., Grossmann, A., Meyer, Y.: Painless nonorthogonal expansions. J.
Math. Phys. 27 (1986)
7. Daubechies, I.: The Wavelet Transform, Time-Frequency Localization and Signal
Analysis. IEEE Trans. Info. Theory 36(5) (1990)
8. Dörfler, M.: Gabor Analysis for a Class of Signals called Music. PhD thesis,
NuHAG, University of Vienna (2002)
9. Dörfler, M.: Quilted Gabor frames - a new concept for adaptive time-frequency
representation. eprint arXiv:0912.2363 (2010)
10. Flandrin, P.: Time-Frequency/ Time-Scale Analysis. Academic Press, San Diego
(1999)
11. Griffin, D.W., Lim, J.S.: Signal Estimation from Modified Short-Time Fourier
Transform. IEEE Trans. Acoust. Speech Signal Process. 32(2) (1984)
12. Gröchenig, K.: Foundations of Time-Frequency Analysis. Birkhäuser, Boston
(2001)
13. Jaillet, F.: Représentation et traitement temps-fréquence des signaux au-
dionumériques pour des applications de design sonore. PhD thesis, Université de
la Méditerranée - Aix-Marseille II (2005)
14. Jaillet, F., Balazs, P., Dörfler, M.: Nonstationary Gabor Frames. INRIA, a CCSD electronic archive server based on P.A.O.L (2009), http://hal.inria.fr/oai/oai.php
15. Lukin, A., Todd, J.: Adaptive Time-Frequency Resolution for Analysis and Pro-
cessing of Audio. Audio Engineering Society Convention Paper (2006)
16. Mallat, S.: A wavelet tour on signal processing. Academic Press, San Diego (1999)
17. Rényi, A.: On Measures of Entropy and Information. In: Proc. Fourth Berkeley
Symp. on Math. Statist. and Prob., Berkeley, California, pp. 547–561 (1961)
18. Rudoy, D., Prabahan, B., Wolfe, P.: Superposition frames for adaptive time-
frequency analysis and fast reconstruction. IEEE Trans. Sig. Proc. 58(5) (2010)
19. Schlögl, F., Beck, C. (eds.): Thermodynamics of chaotic systems. Cambridge Uni-
versity Press, Cambridge (1993)
20. Sun, W.: Asymptotic properties of Gabor frame operators as sampling density
tends to infinity. J. Funct. Anal. 258(3) (2010)
21. Wolfe, P.J., Godsill, S.J., Dörfler, M.: Multi-Gabor Dictionaries for Audio Time-
Frequency Analysis. In: Proc. IEEE WASPAA (2001)
22. Zibulski, M., Zeevi, Y.Y.: Analysis of multiwindow Gabor-type schemes by frame
methods. Appl. Comput. Harmon. Anal. 4(2) (1997)
23. Zyczkowski, K.: Rényi Extrapolation of Shannon Entropy. Open Systems & Infor-
mation Dynamics 10(3) (2003)
Transcription of Musical Audio Using Poisson
Point Processes and Sequential MCMC

Pete Bunch and Simon Godsill

Signal Processing and Communications Laboratory


Department of Engineering
University of Cambridge
{pb404,sjg}@eng.cam.ac.uk
http://www-sigproc.eng.cam.ac.uk/~sjg

Abstract. In this paper models and algorithms are presented for tran-
scription of pitch and timings in polyphonic music extracts. The data
are decomposed framewise into the frequency domain, where a Poisson
point process model is used to write a polyphonic pitch likelihood func-
tion. From here Bayesian priors are incorporated both over time (to link
successive frames) and also within frames (to model the number of notes
present, their pitches, the number of harmonics for each note, and inhar-
monicity parameters for each note). Inference in the model is carried out
via Bayesian filtering using a powerful Sequential Markov chain Monte
Carlo (MCMC) algorithm that is an MCMC extension of particle fil-
tering. Initial results with guitar music, both laboratory test data and
commercial extracts, show promising levels of performance.

Keywords: Automated music transcription, multi-pitch estimation,


Bayesian filtering, Poisson point process, Markov chain Monte Carlo,
particle filter, spatio-temporal dynamical model.

1 Introduction
The audio signal generated by a musical instrument as it plays a note is com-
plex, containing multiple frequencies, each with a time-varying amplitude and
phase. However, the human brain perceives such a signal as a single note, with
associated “high-level” properties such as timbre (the musical “texture”) and
expression (loud, soft, etc.). A musician playing a piece of music takes as input
a score, which describes the music in terms of these high-level properties, and
produces a corresponding audio signal. An accomplished musician is also able to
reverse the process, listening to a musical audio signal and transcribing a score.
A desirable goal is to automate this transcription process. Further developments
in computer “understanding” of audio signals of this type can be of assistance
to musicologists; they can also play an important part in source separation sys-
tems, as well as in automated mark-up systems for content-based annotation of
music databases.
Perhaps the most important property to extract in the task of musical tran-
scription is the note or notes playing at each instant. This will be the primary


aim of this work. As a subsidiary objective, it can be desirable to infer other


high level properties, such as timbre, expression and tempo.
Music transcription has become a large and active field over recent years, see
e.g. [6], and recently probabilistic Bayesian approaches have attracted attention,
see e.g. [5,2,1] to list but a few of the many recent contributions to this important
area. The method considered in this paper is an enhanced form of a frequency
domain model using a Poisson point process first developed in musical modelling
applications in [8,1]. The steps of the process are as follows. The audio signal is
first divided into frames, and an over-sampled Fast Fourier Transform (FFT) is
performed on each frame to generate a frequency spectrum. The predominant
peaks are then extracted from the amplitude of the frequency data. A likelihood
function for the observed spectral peaks may then be formulated according to an
inhomogeneous Poisson point process model (see [8] for the static single frame
formulation), conditional on all of the unknown parameters (the number of notes
present, their pitches, the number of harmonics for each note, and inharmonicity
parameters for each note). A prior distribution for these parameters, including
their evolution over time frames, then completes a Bayesian spatio-temporal
state space model. Inference in this model is carried out using a specially modified
version of the sequential MCMC algorithm [7], in which information about the
previous frame is collapsed onto a single univariate marginal representation of
the multipitch estimation.
To summarise the new contributions of this paper, we here explicitly model
within the Poisson process framework the number of notes present, the number
of harmonics for each note and the inharmonicity parameter for each note, and
we model the temporal evolution of the notes over time frames, all within a fully
Bayesian sequential updating scheme, implemented with sequential MCMC. This
contrasts with, for example, the static single frame-based approach of our earlier
Poisson process transcription work [8].

2 Models and Algorithms


2.1 The Poisson Likelihood Model
When a note is played on a musical instrument, a vibration occurs at a unique
“fundamental frequency”. In addition, an array of “partial frequencies” is also
generated. To a first order approximation, these occur at integer multiples of
the fundamental frequency. In fact, a degree of inharmonicity will usually occur,
especially for plucked or struck string instruments [4] (including the guitars con-
sidered as examples in this work). The inclusion of inharmonicity in the Poisson
likelihood models here adopted has not been considered before to our knowledge.
In this paper, compared with [8], we introduce an additional inharmonicity pa-
rameter for each musical pitch. This is incorporated in a similar fashion to the
inharmonicity model in [1], in which an entirely different time domain signal
model was adopted.
We consider firstly a single frame of data, as in [8], then extend to the se-
quential modelling of many frames. Examining the spectrum of a single note,

Fig. 1. An example of a single note spectrum, with the associated median threshold
(using a window of ±4 frequency bins) and peaks identified by the peak detection
algorithm (circles)

such as that in Figure 1, it is evident that a substantial part of the information


about pitch is contained in the frequency and amplitude of the spectral peaks.
The amplitudes of these peaks will vary according to the volume of the note,
the timbre of the instrument, and with time (high frequency partials will decay
faster, interference will cause beating, etc.), and are thus challenging to model.
Here, then, for reasons of simplicity and robustness of the model, we consider
only the frequencies at which peaks are observed. The set of observed peaks is
constructed by locating frequencies which have an amplitude larger than those
on each side, and which also exceed a median filter threshold. See Figure 1 for an illustration of the procedure.
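A simple version of this peak picking can be written as follows (our Python sketch; the ±4-bin median window follows the caption of Figure 1, while the local-maximum test and the margin parameter are assumptions).

```python
# Sketch of the peak detection: local maxima of the magnitude spectrum of one
# frame that also exceed a running median threshold over +/- 4 frequency bins.
import numpy as np
from scipy.signal import medfilt

def detect_peaks(mag, median_bins=9, margin=1.0):
    """Return the indices of the detected spectral peaks."""
    thresh = margin * medfilt(mag, kernel_size=median_bins)
    is_local_max = np.r_[False,
                         (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]),
                         False]
    return np.where(is_local_max & (mag > thresh))[0]

def peaks_to_binary(mag):
    """Binary observation vector y_k used in the Poisson likelihood."""
    y = np.zeros(len(mag), dtype=int)
    y[detect_peaks(mag)] = 1
    return y
```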
For the Poisson likelihood model, the occurrence of peaks in the frequency
domain is assumed to follow an inhomogeneous Poisson process, in which an
intensity function μk gives the mean value of a Poisson distribution at the kth
frequency bin (μk is the integral, over the kth frequency bin width, of an intensity
function defined in continuous frequency, μ(f )). The principal advantage of such
a model is that we do not have to perform data association: there is no need
to identify uniquely which spectral peak belongs to which note harmonic. A
consequence of this simplification is that each harmonic in each musical note is
deemed capable of generating more than one peak in the spectrum. Examining
the k th FFT frequency bin, with Poisson intensity μk and in which Zk peaks
occur, the probability of observing n spectral peaks is given by the Poisson
distribution:
$P(Z_k = n) = \frac{e^{-\mu_k}\, \mu_k^n}{n!}$  (1)
In fact, since it is not feasible to observe more than a single peak in each
frequency bin, we here consider only the binary detection of either zero peaks,
or ‘one or more’ peaks, as in [8]:

Zero peaks: $P(Z_k = 0) = e^{-\mu_k}$;  one or more peaks: $P(Z_k \ge 1) = 1 - e^{-\mu_k}$  (2)
A single frame of data can thus be expressed as a binary vector where each
term indicates the presence or absence of a peak in the corresponding frequency bin. As the bin observations are independent (following from the Poisson
process assumption), the likelihood of the observed spectrum is given by:

$P(Y|\mu) = \prod_{k=1}^{K} \left[ y_k (1 - e^{-\mu_k}) + (1 - y_k)\, e^{-\mu_k} \right]$  (3)

where Y = {y1 , y2 , ..., yK } are the observed peak data in the K frequency
bins, such that yk = 1 if a peak is observed in the k th bin, and yk = 0 oth-
erwise.
It only remains to formulate the intensity function μ(f), and hence $\mu_k = \int_{f \in k\text{th bin}} \mu(f)\, df$. For this purpose, the Gaussian mixture model of Peel-
ing et al.[8] is used. Note that in this formulation we can regard each harmonic
of each note to be an independent Poisson process itself, and hence by the union
property of Poisson processes, all of the individual Poisson intensities add to
give a single overall intensity μ, as follows:

$\mu(f) = \sum_{j=1}^{N} \mu_j(f) + \mu_c$  (4)
 

$\mu_j(f) = \sum_{h=1}^{H_j} \frac{A}{\sqrt{2\pi\sigma_{j,h}^2}}\, \exp\left( -\frac{(f - f_{j,h})^2}{2\sigma_{j,h}^2} \right)$  (5)

where j indicates the note number, h indicates the partial number, and N and
Hj are the numbers of notes and harmonics in each note, respectively. μc is a
constant that accounts for detected “clutter” peaks due to noise and non-musical
sounds. $\sigma_{j,h}^2 = \kappa^2 h^2$ sets the variance of each Gaussian. A and κ are constant parameters, chosen so as to give good performance on a set of test pieces. $f_{j,h}$ is the frequency of the hth partial of the jth note, given by the inharmonic model [4]:

$f_{j,h} = f_{0,j}\, h\, \sqrt{1 + B_j h^2}$  (6)

$f_{0,j}$ is the fundamental frequency of the jth note. $B_j$ is the inharmonicity parameter for the note (of the order $10^{-4}$).
Three parameters for each note are variable and to be determined by the
inference engine: the fundamental, the number of partials, and the inharmonicity.
Moreover, the number of notes N is also treated as unknown in the fully Bayesian
framework.
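To make the model concrete, the following Python sketch evaluates the intensity of (4)-(6) for a given set of notes and the binary-peak log-likelihood of (3); the numerical values of A, κ and μc, as well as the example note, are illustrative placeholders rather than the values used by the authors.

```python
# Sketch of the likelihood model: Poisson intensity mu(f) of Eqs. (4)-(6) for
# a set of notes, and the binary-peak log-likelihood of Eq. (3). The constants
# A, kappa and mu_c, and the example note, are illustrative placeholders.
import numpy as np

def intensity(freqs, notes, A=0.05, kappa=8.0, mu_c=1e-3):
    """notes: list of (f0, H, B) = (fundamental, number of partials,
    inharmonicity); freqs: centre frequencies (Hz) of the bins."""
    mu = np.full_like(freqs, mu_c, dtype=float)
    for f0, H, B in notes:
        for h in range(1, H + 1):
            f_h = f0 * h * np.sqrt(1.0 + B * h * h)        # Eq. (6)
            sigma2 = (kappa * h) ** 2                      # sigma_{j,h}^2 = kappa^2 h^2
            mu += A / np.sqrt(2 * np.pi * sigma2) * \
                np.exp(-(freqs - f_h) ** 2 / (2 * sigma2))  # Eq. (5)
    return mu

def log_likelihood(y, freqs, notes):
    """Log of Eq. (3): y is the binary peak indicator per frequency bin."""
    mu = intensity(freqs, notes)
    return np.sum(np.log(np.where(y == 1, 1.0 - np.exp(-mu), np.exp(-mu))))

# Example: peaks of an ideal A3 (220 Hz) note evaluated under a one-note model
freqs = np.arange(2048) * 44100.0 / 4096
y = np.zeros(2048, dtype=int)
y[np.round(220.0 * np.arange(1, 11) * 4096 / 44100).astype(int)] = 1
print(log_likelihood(y, freqs, [(220.0, 10, 1e-4)]))
```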

2.2 Prior Distributions and Sequential MCMC Inference


The prior P (θ) over the unknown parameters θ in each time frame may be
decomposed, assuming parameters of different notes are independent, as:


$P(\theta) = P(N) \times \prod_{j=1}^{N} P(f_{0,j}) \times P(H_j | f_{0,j}) \times P(B_j | H_j, f_{0,j})$  (7)
In fact, we have here assumed all priors to be uniform over their expected ranges,
except for f0,j and N , which are stochastically linked to their values in previous
frames. To consider this linkage explicitly, we now introduce a frame number
label t and the corresponding parameters for frame t as θt , with frame peak data
Yt . In order to carry out optimal sequential updating we require a transition
density p(θ t |θt−1 ), and assume that the {θt } process is Markovian. Then we can
write the required sequential update as:
$p(\theta_{t-1:t} | Y_{1:t}) \propto p(\theta_{t-1} | Y_{1:t-1})\, p(\theta_t | \theta_{t-1})\, p(Y_t | \theta_t)$  (8)
To see how this can be implemented in a sequential MCMC framework, assume
that at time t − 1 the inference problem is solved and a set of $M \gg 1$ Monte Carlo (dependent) samples $\{\theta_{t-1}^{(i)}\}$ are available from the previous time's target
distribution p(θt−1 |Y1:t−1 ). These samples are then formed into an empirical
distribution p̂(θt−1 ) which is used as an approximation to p(θ t−1 |Y1:t−1 ) in Eq.
(8). This enables the (approximated) time updated distribution p(θ t−1:t |Y1:t ) to
be evaluated pointwise, and hence a new MCMC chain can be run with Eq. (8)
as its target distribution. The converged samples from this chain are then used
to approximate the posterior distribution at time t, and the whole procedure
repeats as time step t increases.
The implementation of the MCMC at each time step is quite complex, since
it will involve updating all elements of the parameter vector θ t , including the
number of notes, the fundamental frequencies, the number of harmonics in each
note and the inharmonicity parameter for each note. This is carried out via a
combination of Gibbs sampling and Metropolis-within-Gibbs sampling, using a
Reversible Jump formulation wherever the parameter dimension (i.e. the number
of notes in the frame) needs to change, see [7] for further details of how such
schemes can be implemented in tracking and finance applications and [3] for
general information about MCMC. In order to enhance the practical performance
we modified the approximating density at t − 1, p̂(θ t−1 ) to be a univariate
density over one single fundamental frequency, which can be thought of as the
posterior distribution of fundamental frequency at time t − 1 with all the other
parameters marginalised, including the number of notes, and a univariate density
over the number of notes. This collapsing of the posterior distribution onto a
univariate marginal, although introducing an additional approximation into the
updating formula, was found to enhance the MCMC exploration at the next
time step significantly, since it avoids combinatorial updating issues that increase
dramatically with the dimension of the full parameter vector θt .
Having carried out the MCMC sampling at each time step, the fundamental
frequencies and their associated parameters (inharmonicity and number of har-
monics, if required) may be estimated. This estimation is based on extracting
maxima from the collapsed univariate distribution over fundamental frequency,
as described in the previous paragraph.

Fig. 2. Reversible Jump MCMC results for three guitar extracts: (a) 2-note chords, (b) 3-note chords, (c) Tears in Heaven. Dots indicate note estimates. The line below indicates the estimate of the number of notes. Crosses in panels (a) and (b) indicate notes estimated by the MCMC algorithm but removed by post-processing. A manually obtained ground truth is shown overlaid in panel (c).

3 Results

The methods have been evaluated on a selection of guitar music extracts, recorded
both in the laboratory and taken from commercial recordings. See Fig. 2 in which
three guitar extracts, two lab-generated (a) and (b) and one from a commercial
recording (c) are processed. Note that a few spurious note estimates arise, par-
ticularly around instants of note change, and many of these have been removed
by a post-processing stage which simply eliminates note estimates which last
for a single frame. The results are quite accurate, agreeing well with manually
obtained transcriptions.
When two notes an octave apart are played together, the upper note is not
found. See final chord of panel (a) in Figure 2. This is attributable to the two
notes sharing many of the same partials, making discrimination difficult based
on peak frequencies alone.
In the case of strong notes, the algorithm often correctly identifies up to 35
partial frequencies. In this regard, the use of inharmonicity modelling has proved
succesful: Without this feature, the estimate of the number of harmonics is often
lower, due to the inaccurate partial frequencies predicted by the linear model.
The effect of the sequential formulation is to provide a degree of smoothing
when compared to the frame-wise algorithm. Fewer single-frame spurious notes
appear, although these are not entirely removed, as shown in Figure 2. Octave
errors towards the end of notes are also reduced.

4 Conclusions and Future Work

The new algorithms have shown significant promise, especially given that the
likelihood function takes account only of peak frequencies and not amplitudes
or other information that may be useful for a transcription system. The good
performance so far obtained is a result of several novel modelling and algorith-
mic features, notably the formulation of a flexible frame-based model that can
account robustly for inharmonicities, unknown numbers of notes and unknown
numbers of harmonics in each note. A further key feature is the ability to link
frames together via a probabilistic model; this makes the algorithm more robust
in estimation of continuous fundamental frequency tracks from the data. A final
important component is the implementation through sequential MCMC, which
allows us to obtain reasonably accurate inferences from the models as posed.
The models may be improved in several ways, and work is underway to address
these issues. A major point is that the current Poisson model accounts only
for the frequencies of the peaks present. It is likely that performance may be
improved by including the peak amplitudes in the model. For example, this
might make it possible to distinguish more robustly when two notes an octave
apart are being played. Improvements are also envisaged in the dynamical prior
linking one frame to the next, which is currently quite crudely formulated. Thus,
further improvements will be possible if the dependency between frames is more
carefully considered, incorporating melodic and harmonic principles to generate

likely note and chord transitions over time. Ideally also, the algorithm should be
able to run in real time, processing a piece of music as it is played. Currently,
however, the Matlab-based processing runs at many times real time and we will
study the parallel processing possibilities (as a simple starting point, the MCMC
runs can be split into several shorter parallel chains at each time frame within
a parallel architecture).

References
1. Cemgil, A., Godsill, S.J., Peeling, P., Whiteley, N.: Bayesian statistical methods for
audio and music processing. In: O’Hagan, A., West, M. (eds.) Handbook of Applied
Bayesian Analysis, OUP (2010)
2. Davy, M., Godsill, S., Idier, J.: Bayesian analysis of polyphonic western tonal music.
Journal of the Acoustical Society of America 119(4) (April 2006)
3. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.): Markov Chain Monte Carlo
in Practice. Chapman and Hall, Boca Raton (1996)
4. Godsill, S.J., Davy, M.: Bayesian computational models for inharmonicity in musical
instruments. In: Proc. of IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, New Paltz, NY (October 2005)
5. Kashino, K., Nakadai, K., Kinoshita, T., Tanaka, H.: Application of the Bayesian
probability network to music scene analysis. In: Rosenthal, D.F., Okuno, H. (eds.)
Computational Audio Scene Analysis, pp. 115–137. Lawrence Erlbaum Associates,
Mahwah (1998)
6. Klapuri, A., Davy, M.: Signal processing methods for music transcription. Springer,
Heidelberg (2006)
7. Pang, S.K., Godsill, S.J., Li, J., Septier, F.: Sequential inference for dynamically
evolving groups of objects. To appear: Barber, Cemgil, Chiappa (eds.) Inference and
Learning in Dynamic Models, CUP (2009)
8. Peeling, P.H., Li, C., Godsill, S.J.: Poisson point process modeling for poly-
phonic music transcription. Journal of the Acoustical Society of America Express
Letters 121(4), EL168–EL175 (2007)
Single Channel Music Sound Separation Based
on Spectrogram Decomposition and Note
Classification

Wenwu Wang and Hafiz Mustafa

Centre for Vision, Speech and Signal Processing (CVSSP)


University of Surrey, GU2 7XH, UK
{w.wang,hm00045}@surrey.ac.uk
http://www.surrey.ac.uk/cvssp

Abstract. Separating multiple music sources from a single channel mix-


ture is a challenging problem. We present a new approach to this problem
based on non-negative matrix factorization (NMF) and note classifica-
tion, assuming that the instruments used to play the sound signals are
known a priori. The spectrogram of the mixture signal is first decomposed
into building components (musical notes) using an NMF algorithm. The
Mel frequency cepstrum coefficients (MFCCs) of both the decomposed
components and the signals in the training dataset are extracted. The
mean squared errors (MSEs) between the MFCC feature space of the
decomposed music component and those of the training signals are used
as the similarity measures for the decomposed music notes. The notes are
then labelled to the corresponding type of instruments by the K nearest
neighbors (K-NN) classification algorithm based on the MSEs. Finally,
the source signals are reconstructed from the classified notes and the
weighting matrices obtained from the NMF algorithm. Simulations are
provided to show the performance of the proposed system.

Keywords: Non-negative matrix factorization, single-channel sound


separation, Mel frequency cepstrum coefficients, instrument classifica-
tion, K nearest neighbors, unsupervised learning.

1 Introduction

Recovering multiple unknown sources from a one-microphone signal, which is


an observed mixture of these sources, is referred to as the problem of single-
channel (or monaural) sound source separation. The single-channel problem is
an extreme case of under-determined separation problems, which are inherently
ill-posed, i.e., more unknown variables than the number of equations. To solve
the problem, additional assumptions (or constraints) about the sources or the
propagating channels are necessary. For an underdetermined system with two

The work of W. Wang was supported in part by an Academic Fellowship of the
RCUK/EPSRC (Grant number: EP/C509307/1).


microphone recordings, it is possible to separate the sources based on spatial


diversity using determined independent component analysis (ICA) algorithms
and an iterative procedure [17]. However, unlike the techniques in e.g. ADRess
[2] and DUET [18] that require at least two mixtures, the cues resulting from the
sensor diversity are not available in the single channel case, and thus separation
is difficult to achieve based on ICA algorithms.
Due to the demand from several applications such as audio coding, music in-
formation retrieval, music editing and digital library, this problem has attracted
increasing research interest in recent years [14]. A number of methods have been
proposed to tackle this problem. According to the recent review by Li et al. [14],
these methods can be approximately divided into three categories: (1) signal
modelling based on traditional signal processing techniques, such as sinusoidal
modelling of the sources, e.g. [6], [23], [24]; (2) learning techniques based on sta-
tistical tools, such as independent subspace analysis [4] and non-negative matrix
(or tensor) factorization, e.g. [19], [20], [27], [28], [25], [8], [30]; (3) psychoacous-
tical mechanism of human auditory perception, such as computational auditory
scene analysis (CASA), e.g. [15], [3], [26], [32], [14]. Sinusoidal modelling methods
try to decompose the signal into a combination of sinusoids, and then estimate
their parameters (frequencies, amplitudes, and phases) from the mixture. These
methods have been used particularly for harmonic sounds. The learning based
techniques do not exploit explicitly the harmonic structure of the signals, in-
stead they use the statistical information that is estimated from the data, such
as the independence or sparsity of the separated components. The CASA based
techniques build separation systems on the basis of the perceptual theory by
exploiting the psychoacoustical cues that can be computed from the mixture,
such as common amplitude modulation.
In this paper, a new algorithm is proposed for the problem of single-channel
music source separation. The algorithm is based mainly on the combination of
note decomposition with note classification. The note decomposition is achieved
by a non-negative matrix factorization (NMF) algorithm. NMF has been pre-
viously used for music sound separation and transcription, see e.g. [11], [1], [7],
[20], [29], [30]. In this work, we first use the NMF algorithm in [25] to decom-
pose the spectrogram of the music mixture into building components (musical
notes). Then, Mel Frequency Cepstrum Coefficients (MFCCs) feature vectors
are extracted from the segmented frames of each decomposed note. To divide
the separated notes into their corresponding instrument categories, the K near-
est neighbor (NN) classifier [10] is used. The K-NN classifier is an algorithm
that is simple to implement and also provides good classification performance.
The source signals are reconstructed by combining the notes having same class
labels. The remainder of the paper is organized as follows. The proposed sepa-
ration system is described in Section 2 in detail. Some preliminary experimental
results are shown in Section 3. Discussions about the proposed method are given
in Section 4. Finally, Section 5 summarises the paper.

2 The Proposed Separation System


This section describes the details of the processes in our proposed sound source
separation system. First, the single-channel mixture of music sources is decom-
posed into basic building blocks (musical notes) by applying the NMF algorithm.
The NMF algorithm describes the mixture in the form of basis functions and
their corresponding weights (coefficients) which represent the strength of each
basis function in the mixture. The next step is to extract the feature vectors
of the musical notes and then classify the notes into different source streams.
Finally, the source signals are reconstructed by combining the notes with the
same class labels. In this work, we assume that the instruments used to generate
the music sources are known a priori. In particular, two kinds of instruments,
i.e. piano and violin, were used in our study. The block diagram of our proposed
system is depicted in Figure 1.

2.1 Music Decomposition by NMF


In many data analysis tasks, it is a fundamental problem to find a suitable rep-
resentation of the data so that the underlying hidden structure of the data may
be revealed or displayed explicitly. NMF is a data-adaptive linear representa-
tion technique for 2-D matrices that was shown to have such potentials. Given
a non-negative data matrix X, the objective of NMF is to find two non-negative
matrices W and H [12], such that

X = WH (1)

In this work, X is an S × T matrix representing the mixture signal, W is the


basis matrix of dimension S × R, and H is the weighting coefficient matrix of dimension R × T.

Fig. 1. Block diagram of the proposed system

The number of bases used to represent the original matrix is


described by R, i.e. the decomposition rank. Due to non-negativity constraints,
this representation is purely additive. Many algorithms can be used to find the
suitable pair of W and H such that the error of the approximation is minimised,
see e.g. [12], [13], [7], [20] and [30]. In this work, we use the algorithm proposed in
[25] for the note decomposition. In comparison to the classical algorithm in [12],
this algorithm considers additional constraints from the structure of the signal.
Due to the non-negativity constraints, the time-domain signal (with negative
values) needs to be transformed into another domain so that only non-negative
values are present in X for an NMF algorithm to be applied. In this work, the
music sound is transformed into the frequency domain using, e.g. the short-time
Fourier transform (STFT). The matrix X is generated as the spectrogram of
the signal, and in our study, the frame size of each segment equals 40 ms, and a 50 percent overlap between adjacent frames is used. An example of
matrix X generated from music signals is shown in Figure 2, where two music
sources with each having a music note repeating twice were mixed together. One
of the sources contains musical note G4, and the other is composed of note A3.
The idea of decomposing the mixture signal into individual music components is
based on the observation that a music signal may be represented by a set of basic
building blocks such as musical notes or other general harmonic structures. The
basic building blocks are also known as basis vectors and the decomposition of the
single-channel mixture into basis vectors is the first step towards the separation
of multiple source signals from the single-channel mixture. If different sources
in the mixture represent different basis vectors, then the separation problem
can be regarded as a problem of classification of basis vectors into different
categories. The source signals can be obtained by combining the basis vectors in
each category.
The above mixture (or NMF) model can be equally written as


$X = \sum_{r=1}^{R} w_r h_r$  (2)

Fig. 2. The contour plot of a sound mixture (i.e. the matrix X) containing two different musical notes G4 and A3.

Fig. 3. The contour plots of the individual musical notes which were obtained by applying an NMF algorithm to the sound mixture X. The separated notes G4 and A3 are shown in the left and right plot respectively.

where wr is the rth column of W = [w1 , w2 , . . . , wR ] which contains the


basis vectors, and hr is the rth row of H = [h1 , h2 , . . . , hR ]T which contains
the weights or coefficients of each basis function in matrix W, where the su-
perscript T is a matrix transpose. Many algorithms including those mentioned
above can be applied to obtain such basis functions and weighting coefficients.
For example, using the algorithm developed in [30], we can decompose the mix-
ture in Figure 2, and the resulting basis vectors (i.e. the decomposed notes) are
shown in Figure 3. From this figure, it can be observed that both note G4 and
A3 are successfully separated from the mixture.
As a prior knowledge, given the mixture of musical sounds containing two
sources (e.g. piano and violin), two different types of basis functions are learnt
from the decomposition by the NMF algorithm. The magnitude spectrograms
of the basis components (notes) of the two different sources in the mixture are
obtained by multiplying the columns of the basis matrix W to the corresponding
rows of the weight matrix H. The columns of matrix W contain the information
of musical notes in the mixture and corresponding rows of matrix H describe the
strength of these notes. Some rows in H do not contain useful information and are
therefore considered as noise. The noise components are considered separately
in the classification process to improve the quality of the separated sources.
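As an illustration of the decomposition step, the sketch below applies the standard multiplicative-update NMF of [12] (Euclidean cost) to a magnitude spectrogram; it is not the constrained algorithm of [25] used in this work, but it yields the same kind of W and H factors from which the note components are formed.

```python
# Sketch: standard multiplicative-update NMF with Euclidean cost (as in [12])
# applied to a magnitude spectrogram X (S x T), giving a basis matrix W (S x R)
# and a weight matrix H (R x T). The paper uses the constrained algorithm of
# [25]; this unconstrained version only illustrates the decomposition step.
import numpy as np

def nmf(X, R, n_iter=200, eps=1e-9):
    S, T = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((S, R)) + eps
    H = rng.random((R, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)      # update the weights
        W *= (X @ H.T) / (W @ H @ H.T + eps)      # update the basis vectors
    return W, H

# Each separated component r is then the rank-one spectrogram
# W[:, r:r+1] @ H[r:r+1, :], i.e. one musical note to be classified.
```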

2.2 Feature Extraction


Feature extraction is a special form of dimensionality reduction by transforming
the high dimensional data into a lower dimensional feature space. It is used in
both the training and classification processes in our proposed system. The audio
features that we used in this work are the MFCCs. The MFCCs are extracted
on a frame-by-frame basis. In the training process, the MFCCs are extracted
from a training database, and the feature vectors are then formed from these
coefficients. In the classification stage, the MFCCs are extracted similarly from

Fig. 4. The 13-dimensional MFCC feature vectors calculated from two selected frames of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Violin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the four plots, the solid and dashed lines represent the two frames (i.e. the 400th and 900th frame), respectively.

Fig. 5. The 20-dimensional MFCC feature vectors calculated from two selected frames of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Violin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the four plots, the solid and dashed lines represent the two frames (i.e. the 400th and 900th frame), respectively.

Fig. 6. The 7-dimensional MFCC feature vectors calculated from two selected frames of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Violin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the four plots, the solid and dashed lines represent the two frames (i.e. the 400th and 900th frame), respectively.

the decomposed notes obtained by the NMF algorithm. In our experiments, a
frame size of 40 ms is used, which equals 1764 samples at a sampling frequency
of 44100 Hz. Examples of such feature vectors are shown in Figure 4, where the
four audio files ("Piano.ff.A0.wav", "Piano.ff.B0.wav", "Violin.pizz.mf.sulG.C4B4.wav",
and "Violin.pizz.pp.sulG.C4B4.wav") were chosen from The University of Iowa
Musical Instrument Samples Database [21] and the feature vectors are 13-dimensional.
Feature vectors of different dimensions have also been examined in this work.
Figures 5 and 6 show the 20-dimensional and 7-dimensional MFCC feature vectors
computed from the same audio frames of the same audio signals as those in
Figure 4. In comparison to Figure 4, it can be observed that the feature vectors
in Figures 5 and 6 have similar shapes, even though the higher-dimensional
feature vectors show more detail about the signal. However, increasing the feature
dimension inevitably incurs a higher computational cost. In our study, we therefore
compute a 13-dimensional MFCC vector for each frame in the experiments, which
offers a good trade-off between classification performance and computational efficiency.
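A minimal sketch of this frame-wise MFCC extraction, assuming librosa as the feature extractor (the paper does not state which implementation was used); the 40 ms frame, 50% overlap and 13 coefficients follow the values given in the text.

```python
# Frame-wise 13-dimensional MFCCs: 40 ms frames (1764 samples at 44.1 kHz), 50% overlap.
import librosa

def mfcc_features(path, sr=44100, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    frame = int(0.040 * sr)                 # 1764 samples
    hop = frame // 2                        # 50 percent overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame, hop_length=hop)
    return mfcc.T                           # one 13-dimensional vector per frame

feat = mfcc_features("Piano.ff.A0.wav")     # shape: (num_frames, 13)
```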

2.3 Classification of Musical Notes


The main objective of classification is to extract patterns on the basis of given
conditions and to separate one class from another. The K-NN classifier, which uses
a classification rule that does not require knowledge of the distribution of
measurements in the different classes, is used in this paper for the separation
of piano and violin notes. The basic steps in music note classification include
preprocessing, feature extraction or selection, classifier design and optimization.
The main steps used in our system are detailed in Table 1.

Table 1. The musical note classification algorithm

1) Calculate the 13-D MFCC feature vectors of all the musical examples
in the training database with class labels. This creates a feature space
for the training data.
2) Extract similarly the MFCC feature vectors of all separated components
whose class labels need to be determined.
3) Assign all the feature vectors of the separated components to the
appropriate classes via the K-NN algorithm.
4) The majority vote of feature vectors determines the class label of the
separated components.
5) Optimize the classification results by different choices of K.
The main disadvantage of the classification technique based on simple “ma-
jority voting” is that the classes with more frequent examples tend to come up in
the K-nearest neighbors when the neighbors are computed from a large number
of training examples [5]. Therefore, the class with more frequent training exam-
ples tends to dominate the prediction of the new vector. One possible technique
to solve this problem is to weight the classification based on the distance from
the test pattern to all of its K nearest neighbors.

2.4 K-NN Classifier


This section briefly describes the K-NN classifier used in our algorithm. K-NN
is a simple technique for pattern classification and is particularly important for
non-parametric distributions. The K-NN classifier labels an unknown pattern
x by the majority vote of its K-nearest neighbors [5], [9]. The K-NN classifier
belongs to a class of techniques based on non-parametric probability density
estimation. Suppose that we need to estimate the density function P(x) from
a given dataset. In our case, each signal in the dataset is segmented into 999
frames, and a feature vector of 13 MFCC coefficients is computed for each frame.
Therefore, the total number of examples in the training dataset is 52947. Similarly,
an unknown pattern x is also a 13-dimensional MFCC feature vector whose label
needs to be determined based on the majority vote of the nearest neighbors. The
volume V around an unknown pattern x is selected such that the number of nearest
neighbors (training examples) within V is 30. We are dealing with a two-class
problem with prior probability P(\omega_i). The measurement distribution of the
patterns in class \omega_i is denoted by P(x | \omega_i). The posterior class
probability P(\omega_i | x) decides the label of an unknown feature vector of the
separated note. The approximation of P(x) is given by the relation [5], [10]

P(x) \approx \frac{K}{NV}    (3)

where N is the total number of examples in the dataset, V is the volume sur-
rounding unknown pattern x and K is the number of examples within V . The
class prior probability depends on the number of examples in the dataset
P(\omega_i) = \frac{N_i}{N}    (4)
and the measurement distribution of patterns in class \omega_i is defined as

P(x \mid \omega_i) = \frac{K_i}{N_i V}    (5)
According to Bayes' theorem, the posterior probability becomes

P(\omega_i \mid x) = \frac{P(x \mid \omega_i) \, P(\omega_i)}{P(x)}    (6)

Based on the above equations, we have [10]

P(\omega_i \mid x) = \frac{K_i}{K}    (7)

The discriminant function g_i(x) = \frac{K_i}{K} assigns the class label to an unknown pattern x based on the majority of examples K_i of class \omega_i in volume V.
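For concreteness, a minimal sketch of this majority-vote rule, assuming Euclidean distance between MFCC vectors (which ranks neighbours in the same order as the per-frame MSE used later in Section 3); the function and variable names are illustrative.

```python
# Majority-vote K-NN of Eq. (7): the class with the largest K_i among the
# K nearest training frames wins.
import numpy as np

def knn_label(x, train_feats, train_labels, K=30):
    # x: (13,) MFCC vector of one frame of a separated component
    # train_feats: (N, 13) training vectors; train_labels: (N,) ints, 0=piano, 1=violin
    d2 = np.sum((train_feats - x) ** 2, axis=1)   # squared distances to all frames
    nearest = np.argsort(d2)[:K]                  # indices of the K nearest frames
    votes = np.bincount(train_labels[nearest], minlength=2)   # K_i per class
    return int(np.argmax(votes))                  # maximising K_i maximises K_i / K
```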

2.5 Parameter Selection


The most important parameter in the K-NN algorithm is the user-defined con-
stant K. The best value of K depends upon the given data for classification [5].
In general, the effect of noise on classification may be reduced by selecting a
higher value of K. However, a large value of K makes the boundaries between
classes less distinct [31]. To select a good value of K, heuristic techniques such as
cross-validation may be used. In the presence of noisy or irrelevant features, the
performance of the K-NN classifier may degrade severely [5]. Scaling the features
according to their importance is another important issue, and much effort has been
devoted to selecting or scaling the features in the best possible way. The
optimal classification results are achieved for most datasets by selecting K = 10
or more.

2.6 Data Preparation


For the classification of the components separated from the mixture, the features,
i.e. the MFCCs, are extracted from all the signals in the training dataset, and each
feature vector is labelled according to its class (piano or violin). The feature
vectors of the separated components, whose labels are unknown, are the ones that
need to be classified. Each feature vector consists of 13 MFCCs. When computing
the MFCCs, the training signals and the separated components are all divided
into frames with each having a length of 40 ms and 50 percent overlap between

the frames is used to avoid discontinuities between the neighboring frames. The
similarity measure of the feature vectors of the separated components to the
feature vectors obtained from the training process determines which class the
separated notes belong to. This is achieved by the K-NN classifier. If majority
vote goes to the piano, then a piano label is assigned to the separated component
and vice-versa.

2.7 Phase Generation and Source Reconstruction


The factorization of the magnitude spectrogram by the NMF algorithm provides
frequency-domain basis functions. Therefore, the source signals are reconstructed
from the frequency-domain bases in this paper, which requires phase
information. Several phase generation methods have been suggested
in the literature. When the components do not overlap each other significantly
in time and frequency, the phases of the original mixture spectrogram produce
good synthesis quality [23]. In the mixture of piano and violin signals, significant
overlapping occurs between musical notes in the time domain but the degree of
overlapping is relatively low in the frequency domain. Based on this observation,
the phases of the original mixture spectrogram are used to reconstruct the source
signals in this work. The reconstruction process can be summarised briefly as
follows. First, the phase information is added to each classified component to
obtain its complex spectrum. Then the classified components from the above
sections are combined to the individual source streams, and finally the inverse
discrete Fourier Transform (IDFT) and the overlap-and-add technique are ap-
plied to obtain the time-domain signal. When the magnitude spectra are used
as the basis functions, the frame-wise spectra are obtained as the product of the
basis function with its gain. If the power spectra are used, a square root needs
to be taken. If the frequency resolution is non-linear, additional processing is
required for the re-synthesis using the IDFT.
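A minimal sketch of this reconstruction step, assuming magnitude-spectrum basis functions with a linear frequency resolution and librosa's ISTFT (which performs the overlap-and-add); the variable names are illustrative.

```python
# Attach the mixture phase to each classified component's magnitude spectrogram,
# invert by ISTFT (overlap-and-add), and sum components belonging to the same source.
import numpy as np
import librosa

def reconstruct_sources(component_mags, labels, X_mix, hop_length):
    # component_mags: list of F x N magnitude spectrograms; labels: "piano"/"violin"
    # X_mix: complex F x N STFT of the mixture
    phase = np.exp(1j * np.angle(X_mix))
    sources = {}
    for mag, lab in zip(component_mags, labels):
        y = librosa.istft(mag * phase, hop_length=hop_length)
        sources[lab] = sources.get(lab, 0) + y
    return sources
```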

3 Evaluations

Two music sources (played by two different instruments, i.e. piano and violin),
with different numbers of notes overlapping each other in the time domain, were
used to artificially generate an instantaneous mixture signal. The lengths of
piano and violin source signals are both 20 seconds, containing 6 and 5 notes
respectively. The K-NN classifier constant K was selected as K = 30. The signal-
to-noise ratio (SNR), defined as follows, was used to measure the quality of both
the separated notes and the whole source signal,
SNR(m, j) = \frac{\sum_{s,t} [X_m]_{s,t}^2}{\sum_{s,t} ([X_m]_{s,t} - [X_j]_{s,t})^2}    (8)

where s and t are the row and column indices of the matrix respectively. The
SNR was computed based on the magnitude spectrograms Xm and Xj of the
mth reference and the j th separated component to prevent the reconstruction


Fig. 7. The collection of the audio features from a typical piano signal (i.e. “Pi-
ano.ff.A0.wav”) in the training process. In total, 999 frames of features were computed.


Fig. 8. The collection of the audio features from a typical violin signal (i.e. “Vio-
lin.pizz.pp.sulG.C4B4.wav”) in the training process. In total, 999 frames of features
were computed.

process from affecting the quality [22]. For the same note, j = m. In general,
higher SNR values represent better separation quality of the separated notes
and source signals, and vice versa. The training database used in the classification
process was provided by the McGill University Master Samples Collection [16]
and the University of Iowa website [21]. It contains 53 music signals, 29 of which
are piano signals; the rest are violin signals. All the signals were sampled
at 44100 Hz. The reference source signals were stored for the measurement of
separation quality.
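A minimal sketch of the SNR measure of Eq. (8), computed on the magnitude spectrograms of a reference note and a separated component and expressed in dB, which is how the results below are reported.

```python
# SNR of Eq. (8) between the m-th reference and the j-th separated component,
# evaluated on their magnitude spectrograms and converted to dB.
import numpy as np

def snr_db(X_m, X_j):
    num = np.sum(X_m ** 2)
    den = np.sum((X_m - X_j) ** 2)
    return 10.0 * np.log10(num / den)
```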
For the purpose of training, the signals were firstly segmented into frames,
and then the MFCC feature vectors were computed from these frames. In total,


Fig. 9. The collection of the audio features from a separated music component in the
testing process. Similar to the training process, 999 frames of features were computed.


Fig. 10. The MSEs between the feature vector of a frame of the music component to
be classified and those from the training data. The frame indices in the horizontal axis
are ranked from the lower to the higher. The frame index 28971 is the highest frame
number of the piano signals. Therefore, on this plot, to the left of this frame are those
from piano signals, and to the right are those from the violin signals.

999 frames were computed for each signal. Figures 7 and 8 show the collection of
the features from the typical piano and violin signals (i.e. “Piano.ff.A0.wav” and
“Violin.pizz.pp.sulG.C4B4.wav”) respectively. In both figures, it can be seen that
there exist features whose coefficients are all zeros, due to the silent parts of the
signals. Before running the training algorithm, we performed feature selection by
removing such frames of features. In the testing stage, the MFCC feature vectors
of the individual music components that were separated by the NMF algorithm
were calculated. Figure 9 shows the feature space of the 15th separated component


Fig. 11. The MSE values obtained in Figure 10 were sorted from the lower to the
higher. The frame indices in the horizontal axis, associated with the MSEs, are shuffled
accordingly.


Fig. 12. The MSE values of the K nearest neighbors (i.e. the frames with the K minimal
MSEs) are selected based on the K-NN clustering. In this experiment, K was set to 30.

(the final component in our experiment). To determine whether this component


belongs to piano or violin, we measured the mean squared error (MSE) between
the feature space of the separated component and the feature spaces obtained
from the training data. Figure 10 shows the MSEs between the feature vector of a
frame (the final frame in this experiment) of the separated component and those
obtained in the training data. Then we sort the MSEs according to their values
along all these frames. The sorted MSEs are shown in Figure 11, where the frame
indices were shuffled accordingly. After this, we applied the K-NN algorithm to
obtain the 30 neighbors that are nearest to the separated component. The MSEs
of these frames are shown in Figure 12. Their corresponding frame indices are
shown in Figure 13, from which we can see that all the frame indices are greater


Fig. 13. The frame indices of the 30 nearest neighbors to the frame of the decomposed
music note obtained in Figure 12. In our experiment, the maximum frame index for
the piano signals is 28971, shown by the dashed line, while the frame indices of violin
signals are all greater than 28971. Therefore, this typical audio frame under testing
can be classified as a violin signal.


Fig. 14. A separation example of the proposed system. (a) and (b) are the piano and
violin sources respectively, (c) is the single channel mixture of these two sources, and
(d) and (e) are the separated sources respectively. The vertical axes are the amplitude
of the signals.

than 28971, which was the highest index number of the piano signals in the
training data. As a result, this component was classified as a violin signal.
Figure 14 shows a separation example of the proposed system, where (a) and
(b) are the piano and violin sources respectively, (c) is the single channel mixture

of these two sources, and (d) and (e) are the separated sources respectively. From
this figure, we can observe that, although most notes are correctly separated and
classified into the corresponding sources, there exist notes that were wrongly
classified. The separated note with the highest SNR is the first note of the
violin signal, for which the SNR equals 9.7 dB, while the highest SNR of a note
within the piano signal is 6.4 dB. The average SNRs for piano and violin
are respectively 3.7 dB and 1.3 dB. According to our observation, the separation
quality of the notes varies from note to note. On average, the separation quality
of the piano signal is better than that of the violin signal.

4 Discussions

At the moment, for the components separated by the NMF algorithm, we calculate
their MFCC features in the same way as for the signals in the training
data. As a result, the evaluation of the MSEs becomes straightforward, which
consequently facilitates the K-NN classification. It is however possible to use the
dictionary returned by the NMF algorithm (and possibly the activation coeffi-
cients as well) as a set of features. In such a case, the NMF algorithm needs to
be applied to the training data in the same way as the separated components
obtained in the testing and classification process. Similar to principal compo-
nent analysis (PCA) which has been widely used to generate features in many
classification system, using NMF components directly as features has a great
potential. As compared to using the MFCC features, the computational cost
associated with the NMF features could be higher due to the iterations required
for the NMF algorithms to converge. However, its applicability as a feature for
classification deserves further investigation in the future.
Another important issue in applying NMF algorithms is the selection of the
mode of the NMF model (i.e. the rank R). In our study, this determines the
number of components that will be learned from the signal. In general, for a
higher rank R, the NMF algorithm learns the components that are more likely
corresponding to individual notes. However, there is a trade-off between the de-
composition rank and the computational load, as a larger R incurs a higher
computational cost. Also, it is known that NMF produces not only harmonic
dictionary components but also sometimes ad-hoc spectral shapes correspond-
ing to drums, transients, residual noise, etc. In our recognition system, these
components were treated equally as the harmonic components. In other words,
the feature vectors of these components were calculated and evaluated in the
same way as the harmonic components. The final decision was made from the
labelling scores and the K-NN classification results.
We note that many classification algorithms could also be applied for labelling
the separated components, such as the Gaussian Mixture Models (GMMs), which
have been used in both automatic speech/speaker recognition and music infor-
mation retrieval. In this work, we choose the K-NN algorithm due to its simplicity.
Moreover, the performance of the single channel source separation system de-
veloped here is largely dependent on the separated components provided by the

NMF algorithm. Although the music components obtained by the NMF algo-
rithm are somehow sparse, their sparsity is not explicitly controlled. Also, we
did not explicitly use information from the music signals, such as the pitch
information and harmonic structure. According to Li et al. [14], the information of
pitch and common amplitude modulation can be used to improve the separation
quality.

5 Conclusions

We have presented a new system for the single channel music sound separation
problem. The system essentially integrates two techniques, automatic note de-
composition using NMF, and note classification based on the K-NN algorithm. A
main assumption with the proposed system is that we have the prior knowledge
about the type of instruments used for producing the music sounds. The simu-
lation results show that the system produces a reasonable performance for this
challenging source separation problem. Future work includes using more robust
classification algorithms to improve the note classification accuracy, and incorpo-
rating pitch and common amplitude modulation information into the learning
algorithm to improve the separation performance of the proposed system.

References
1. Abdallah, S.A., Plumbley, M.D.: Polyphonic Transcription by Non-Negative Sparse
Coding of Power Spectra. In: International Conference on Music Information Re-
trieval, Barcelona, Spain (October 2004)
2. Barry, D., Lawlor, B., Coyle, E.: Real-time Sound Source Separation: Azimuth
Discrimination and Re-synthesis, AES (2004)
3. Brown, G.J., Cooke, M.P.: Perceptual Grouping of Musical Sounds: A Computa-
tional Model. J. New Music Res. 23, 107–132 (1994)
4. Casey, M.A., Westner, W.: Separation of Mixed Audio Sources by Independent
Subspace Analysis. In: Proc. Int. Comput. Music Conf. (2000)
5. Devijver, P.A., Kittler, J.: Pattern Recognition - A Statistical Approach. Prentice
Hall International, Englewood Cliffs (1982)
6. Every, M.R., Szymanski, J.E.: Separation of Synchronous Pitched Notes by Spec-
tral Filtering of Harmonics. IEEE Trans. Audio Speech Lang. Process. 14, 1845–
1856 (2006)
7. Fevotte, C., Bertin, N., Durrieu, J.-L.: Nonnegative Matrix Factorization With the
Itakura-Saito Divergence. With Application to Music Analysis. Neural Computa-
tion 21, 793–830 (2009)
8. FitzGerald, D., Cranitch, M., Coyle, E.: Extended Nonnegative Tensor Factorisation
Models for Musical Sound Source Separation. Computational Intelligence and
Neuroscience, Article ID 872425, 15 pages (2008)
9. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press Inc., London (1990)

10. Gutierrez-Osuna, R.: Lecture 12: K Nearest Neighbor Classifier,


http://research.cs.tamu.edu/prism/lectures (accessed January 17, 2010)
11. Hoyer, P.: Non-Negative Sparse Coding. In: IEEE Workshop on Networks for Signal
Processing XII, Martigny, Switzerland (2002)
12. Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-Negative Matrix
Factorization. Nature 401, 788–791 (1999)
13. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Neu-
ral Information Processing Systems, Denver (2001)
14. Li, Y., Woodruff, J., Wang, D.L.: Monaural Musical Sound Separation Based on
Pitch and Common Amplitude Modulation. IEEE Transactions on Audio, Speech,
and Language Processing 17, 1361–1371 (2009)
15. Mellinger, D.K.: Event Formation and Separation in Musical Sound. PhD disser-
tation, Dept. of Comput. Sci., Stanford Univ., Stanford, CA (1991)
16. Opolko, F., Wapnick, J.: McGill University master samples, McGill Univ., Mon-
treal, QC, Canada, Tech. Rep. (1987)
17. Pedersen, M.S., Wang, D.L., Larsen, J., Kjems, U.: Two-Microphone Separation
of Speech Mixtures. IEEE Trans. on Neural Networks 19, 475–492 (2008)
18. Rickard, S., Balan, R., Rosca, J.: Real-time Time-Frequency based Blind Source
Separation. In: 3rd International Conference on Independent Component Analysis
and Blind Source Separation, San Diego, CA (December 2001)
19. Smaragdis, P., Brown, J.C.: Non-negative Matrix Factorization for Polyphonic Mu-
sic Transcription. In: Proc. IEEE Int. Workshop Application on Signal Process.
Audio Acoust., pp. 177–180 (2003)
20. Smaragdis, P.: Non-negative matrix factor deconvolution; extraction of multiple
sound sources from monophonic inputs. In: Puntonet, C.G., Prieto, A.G. (eds.)
ICA 2004. LNCS, vol. 3195, pp. 494–499. Springer, Heidelberg (2004)
21. The University of Iowa Musical Instrument Samples Database,
http://theremin.music.uiowa.edu
22. Virtanen, T.: Sound Source Separation Using Sparse Coding with Temporal Con-
tinuity Objective. In: International Computer Music Conference, Singapore (2003)
23. Virtanen, T.: Separation of Sound Sources by Convolutive Sparse Coding. In: Pro-
ceedings of ISCA Tutorial and Research Workshop on Statistical and Perceptual
Audio Processing, Jeju, Korea (2004)
24. Virtanen, T.: Sound Source Separation in Monaural Music Signals. PhD disserta-
tion, Tampere Univ. of Technol., Tampere, Finland (2006)
25. Virtanen, T.: Monaural Sound Source Separation by Non-Negative Matrix Factor-
ization with Temporal Continuity and Sparseness Criteria. IEEE Transactions on
Audio, Speech, and Language Processing 15, 1066–1073 (2007)
26. Wang, D.L., Brown, G.J.: Computational Auditory Scene Analysis: Principles, Al-
gorithms, and Applications. Wiley/IEEE Press (2006)
27. Wang, B., Plumbley, M.D.: Investigating Single-Channel Audio Source Separation
Methods based on Non-negative Matrix Factorization. In: Nandi, Zhu (eds.) Pro-
ceedings of the ICA Research Network International Workshop, pp. 17–20 (2006)
28. Wang, B., Plumbley, M.D.: Single Channel Audio Separation by Non-negative
Matrix Factorization. In: Digital Music Research Network One-day Workshop
(DMRN+1), London (2006)

29. Wang, W., Luo, Y., Chambers, J.A., Sanei, S.: Note Onset Detection via
Non-negative Factorization of Magnitude Spectrum. EURASIP Journal on
Advances in Signal Processing, Article ID 231367, 15 pages (June 2008);
doi:10.1155/2008/231367
30. Wang, W., Cichocki, A., Chambers, J.A.: A Multiplicative Algorithm for Convo-
lutive Non-negative Matrix Factorization Based on Squared Euclidean Distance.
IEEE Transactions on Signal Processing 57, 2858–2864 (2009)
31. Webb, A.: Statistical Pattern Recognition, 2nd edn. Wiley, New York (2005)
32. Woodruff, J., Pardo, B.: Using Pitch, Amplitude Modulation and Spatial Cues for
Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP J.
Adv. Signal Process. (2007)
Notes on Nonnegative Tensor Factorization of
the Spectrogram for Audio Source Separation:
Statistical Insights and Towards Self-Clustering
of the Spatial Cues

Cédric Févotte1,∗ and Alexey Ozerov2


1 CNRS LTCI, Telecom ParisTech - Paris, France
[email protected]
2 IRISA, INRIA - Rennes, France
[email protected]

Abstract. Nonnegative tensor factorization (NTF) of multichannel


spectrograms under PARAFAC structure has recently been proposed by
FitzGerald et al. as a means of performing blind source separation (BSS)
of multichannel audio data. In this paper we investigate the statistical
source models implied by this approach. We show that it implicitly as-
sumes a nonpoint-source model contrasting with usual BSS assumptions
and we clarify the links between the measure of fit chosen for the NTF
and the implied statistical distribution of the sources. While the original
approach of FitzGerald et al. requires a posterior clustering of the spatial
cues to group the NTF components into sources, we discuss means of
performing the clustering within the factorization. In the results section
we test the impact of the simplifying nonpoint-source assumption on
underdetermined linear instantaneous mixtures of musical sources and
discuss the limits of the approach for such mixtures.

Keywords: Nonnegative tensor factorization (NTF), audio source


separation, nonpoint-source models, multiplicative parameter updates.

1 Introduction

Nonnegative matrix factorization (NMF) is an unsupervised data decomposi-


tion technique with growing popularity in the fields of machine learning and
signal/image processing [8]. Much research about this topic has been driven by
applications in audio, where the data matrix is taken as the magnitude or power
spectrogram of a sound signal. NMF was for example applied with success to
automatic music transcription [15] and audio source separation [19,14]. The fac-
torization amounts to decomposing the spectrogram data into a sum of rank-1

This work was supported in part by project ANR-09-JCJC-0073-01 TANGERINE
(Theory and applications of nonnegative matrix factorization) and by the Quaero
Programme, funded by OSEO, French State agency for innovation.


spectrograms, each of which being the expression of an elementary spectral pat-


tern amplitude-modulated in time.
However, while most music recordings are available in multichannel format
(typically, stereo), NMF in its standard setting is only suited to single-channel
data. Extensions to multichannel data have been considered, either by stacking
up the spectrograms of each channel into a single matrix [11] or by equiva-
lently considering nonnegative tensor factorization (NTF) under a parallel fac-
tor analysis (PARAFAC) structure, where the channel spectrograms form the
slices of a 3-valence tensor [5,6]. Let Xi be the short-time Fourier transform
(STFT) of channel i, a complex-valued matrix of dimensions F × N , where
i = 1, . . . , I and I is the number of channels (I = 2 in the stereo case). The
latter approaches boil down to assuming that the magnitude spectrograms |Xi |
are approximated by a linear combination of nonnegative rank-1 “elementary”
spectrograms |Ck | = wk hTk such that


|X_i| \approx \sum_{k=1}^{K} q_{ik} |C_k|    (1)

and |Ck | is the matrix containing the modulus of the coefficients of some “latent”
components whose precise meaning we will attempt to clarify in this paper.
Equivalently, Eq. (1) writes


|x_{ifn}| \approx \sum_{k=1}^{K} q_{ik} w_{fk} h_{nk}    (2)

where {xif n } are the coefficients of Xi . Introducing the nonnegative matrices


Q = {qik }, W = {wf k }, H = {hnk }, whose columns are respectively denoted
qk , wk and hk , the following optimization problem needs to be solved

\min_{Q,W,H} \sum_{ifn} d(|x_{ifn}| \mid \hat{v}_{ifn}) \quad \text{subject to} \quad Q, W, H \geq 0    (3)

with

\hat{v}_{ifn} \stackrel{\text{def}}{=} \sum_{k=1}^{K} q_{ik} w_{fk} h_{nk}    (4)

and where the constraint A ≥ 0 means that the coefficients of matrix A are non-
negative, and d(x|y) is a scalar cost function, taken as the generalized Kullback-
Leibler (KL) divergence in [5] or as the Euclidean distance in [11]. Complex-
valued STFT estimates Ĉk are subsequently constructed using the phase of the
observations (typically, ĉkf n is given the phase of xif n , where i = argmax{qik }i
[6]) and then inverted to produce time-domain components. The components
pertaining to same “sources” (e.g, instruments) can then be grouped either man-
ually or via clustering of the estimated spatial cues {qk }k .
In this paper we build on these previous works and bring the following
contributions :

– We recast the approach of [5] into a statistical framework, based on a gen-


erative statistical model of the multichannel observations X. In particular
we discuss NTF of the power spectrogram |X|2 with the Itakura-Saito (IS)
divergence and NTF of the magnitude spectrogram |X| with the KL diver-
gence.
– We describe a NTF with a novel structure, that allows to take care of the
clustering of the components within the decomposition, as opposed to after.
The paper is organized as follows. Section 2 describes the generative and statis-
tical source models implied by NTF. Section 3 describes new and existing multi-
plicative algorithms for standard NTF and for “Cluster NTF”. Section 4 reports
experimental source separation results on musical data; we test in particular the
impact of the simplifying nonpoint-source assumption on underdetermined lin-
ear instantaneous mixtures of musical sources and point out the limits of the
approach for such mixtures. We conclude in Section 5. This article builds on
related publications [10,3].

2 Statistical Models to NTF


2.1 Models of Multichannel Audio
Assume a multichannel audio recording with I channels x(t) = [x1 (t), . . . , xI (t)]T ,
also referred to as “observations” or “data”, generated as a linear mixture of
sound source signals. The term “source” refers to the production system, for
example a musical instrument, and the term “source signal” refers to the signal
produced by that source. When the intended meaning is clear from the context
we will simply refer to the source signals as “the sources”.
Under the linear mixing assumption, the multichannel data can be expressed
as
x(t) = \sum_{j=1}^{J} s_j(t)    (5)

where J is the number of sources and sj (t) = [s1j (t), . . . sij (t), . . . , sIj (t)]T is the
multichannel contribution of source j to the data. Under the common assump-
tions of point-sources and linear instantaneous mixing, we have
sij (t) = sj (t) aij (6)

where the coefficients {a_{ij}} define an I × J mixing matrix A, with columns denoted
[a1 , . . . , aJ ]. In the following we will show that the NTF techniques described
in this paper correspond to maximum likelihood (ML) estimation of source and
mixing parameters in a model where the point-source assumption is dropped
and replaced by
s_{ij}(t) = s_j^{(i)}(t) \, a_{ij}    (7)

where the signals s_j^{(i)}(t), i = 1, . . . , I are assumed to share a certain “resemblance”,
as modelled by being two different realizations of the same random

process characterizing their time-frequency behavior, as opposed to being the same


realization. Dropping the point-source assumption may also be viewed as ig-
noring some mutual information between the channels (assumption of sources
contributing to each channel with equal statistics instead of contributing the
same signal ). Of course, when the data has been generated from point-sources,
dropping this assumption will usually lead to a suboptimal but typically faster
separation algorithm, and the results section will illustrate this point.
In this work we further model the source contributions as a sum of elementary
components themselves, so that
s_j^{(i)}(t) = \sum_{k \in K_j} c_k^{(i)}(t)    (8)

where [K1 , . . . , KJ ] denotes a nontrivial partition of [1, . . . , K]. As will become


more clear in the following, the components c_k^{(i)}(t) will be characterized by a
spectral shape w_k and a vector of activation coefficients h_k, through a statistical
model. Finally, we obtain

x_i(t) = \sum_{k=1}^{K} m_{ik} c_k^{(i)}(t)    (9)

where mik is defined as mik = aij if and only if k ∈ Kj . By linearity of STFT,


model (8) writes equivalently

x_{ifn} = \sum_{k=1}^{K} m_{ik} c_{kfn}^{(i)}    (10)

where x_{ifn} and c_{kfn}^{(i)} are the complex-valued STFTs of x_i(t) and c_k^{(i)}(t), and
where f = 1, . . . , F is a frequency bin index and n = 1, . . . , N is a time frame
index.

2.2 A Statistical Interpretation of KL-NTF


Denote V the I × F × N tensor with coefficients vif n = |xif n | and Q the I × K
matrix with elements |mik |. Let us assume so far for ease of presentation that
J = K, i.e, mik = aik , so that M is a matrix with no particular structure. Then
it can be easily shown that the approach of [5], briefly described in Section 1
and consisting in solving

\min_{Q,W,H} \sum_{ifn} d_{KL}(v_{ifn} \mid \hat{v}_{ifn}) \quad \text{subject to} \quad Q, W, H \geq 0    (11)

with v̂if n defined by Eq. (4), is equivalent to ML estimation of Q, W and H in


the following generative model :
|x_{ifn}| = \sum_k |m_{ik}| \, |c_{kfn}^{(i)}|    (12)

|c_{kfn}^{(i)}| \sim \mathcal{P}(w_{fk} h_{nk})    (13)

where P(λ) denotes the Poisson distribution, defined in Appendix A, and the
KL divergence dKL (·|·) is defined as
d_{KL}(x|y) = x \log \frac{x}{y} + y - x.    (14)
The link between KL-NMF/KL-NTF and inference in composite models with
Poisson components has been established in many previous publications, see,
e.g, [2,12]. In our opinion, model (12)-(13) suffers from two drawbacks. First, the
linearity of the mixing model is assumed on the magnitude of the STFT frames -
see Eq. (12) - instead of the frames themselves - see Eq. (10) -, which inherently
assumes that the components \{c_{kfn}^{(i)}\}_k have the same phase and that the mixing
parameters {mik }k have the same sign, or that only one component is active in
every time-frequency tile (t, f ). Second, the Poisson distribution is formally only
defined on integers, which impairs rigorous statistical interpretation of KL-NTF
on non-countable data such as audio spectra.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square
Error (MMSE) estimates of the component amplitudes are given by

\widehat{|c_{kfn}^{(i)}|} \stackrel{\text{def}}{=} E\{ |c_{kfn}^{(i)}| \mid Q, W, H, |X| \}    (15)
          = \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} \, |x_{ifn}|    (16)

Then, time-domain components c_k^{(i)}(t) are reconstructed through inverse-STFT
of \hat{c}_{kfn}^{(i)} = \widehat{|c_{kfn}^{(i)}|} \, \arg(x_{ifn}), where arg(x) denotes the phase of complex-valued x.

2.3 A Statistical Interpretation of IS-NTF


To remedy the drawbacks of the KL-NTF model for audio we describe a new
model based on IS-NTF of the power spectrogram, along the line of [4] and also
introduced in [10]. The model reads
x_{ifn} = \sum_k m_{ik} c_{kfn}^{(i)}    (17)

c_{kfn}^{(i)} \sim \mathcal{N}_c(0 \mid w_{fk} h_{nk})    (18)

where Nc (μ, σ 2 ) denotes the proper complex Gaussian distribution, defined in


Appendix A. Denoting now V = |X|2 and Q = |M|2 , it can be shown that ML
estimation of Q, W and H in model (17)-(18) amounts to solving

\min_{Q,W,H} \sum_{ifn} d_{IS}(v_{ifn} \mid \hat{v}_{ifn}) \quad \text{subject to} \quad Q, W, H \geq 0    (19)

where dIS (·|·) denotes the IS divergence defined as


d_{IS}(x|y) = \frac{x}{y} - \log \frac{x}{y} - 1.    (20)

Note that our notations are abusive in the sense that the mixing parameters
|mik | and the components |ckf n | appearing through their modulus in Eq. (12)
are in no way the modulus of the mixing parameters and the components ap-
pearing in Eq. (17). Similarly, the matrices W and H represent different types of
quantities in every case; in Eq. (13) their product is homogeneous to component
magnitudes while in Eq. (18) their product is homogeneous to component variances.
Formally we should have introduced variables |c_{kfn}^{KL}|, W^{KL}, H^{KL} to be
distinguished from variables c_{kfn}^{IS}, W^{IS}, H^{IS}, but we have not in
order to avoid cluttering the notations. The difference between these quantities
should be clear from the context.
Model (17)-(18) is a truly generative model in the sense that the linear mix-
ing assumption is made on the STFT frames themselves, which is a realistic
assumption in audio. Eq. (18) defines a Gaussian variance model of c_{kfn}^{(i)}; the
zero-mean assumption reflects the property that the audio frames taken as the
input of the STFT can be considered centered, for typical window sizes of about
20 ms or more. The proper Gaussian assumption means that the phase of c_{kfn}^{(i)}
is assumed to be a uniform random variable [9], i.e., the phase is taken into the
model, but in a noninformative way. This contrasts with model (12)-(13), which
simply discards the phase information.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square
Error (MMSE) estimates of the components are given by

\hat{c}_{kfn}^{(i)} \stackrel{\text{def}}{=} E\{ c_{kfn}^{(i)} \mid Q, W, H, X \}    (21)
          = \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} \, x_{ifn}    (22)

We would like to underline that the MMSE estimator of components in the STFT
domain (21) is equivalent (thanks to the linearity of the STFT and its inverse) to
the MMSE estimator of components in the time domain, while the MMSE
estimator of STFT magnitudes (15) for KL-NTF is not consistent with time-domain
MMSE. Equivalence of an estimator with time-domain signal squared
error minimization is an attractive property, at least because it is consistent with
a popular objective source separation measure such as the signal to distortion ratio
(SDR) defined in [16].
The differences between the two models, termed “KL-NTF.mag” and “IS-NTF.pow”,
are summarized in Table 1.

Table 1. Statistical models and optimization problems underlying KL-NTF.mag and IS-NTF.pow

                      KL-NTF.mag                                                   IS-NTF.pow
Model
  Mixing model        |x_{ifn}| = \sum_k |m_{ik}| |c_{kfn}^{(i)}|                  x_{ifn} = \sum_k m_{ik} c_{kfn}^{(i)}
  Comp. distribution  |c_{kfn}^{(i)}| \sim \mathcal{P}(w_{fk} h_{nk})              c_{kfn}^{(i)} \sim \mathcal{N}_c(0 \mid w_{fk} h_{nk})
ML estimation
  Data                V = |X|                                                      V = |X|^2
  Parameters          W, H, Q = |M|                                                W, H, Q = |M|^2
  Approximate         \hat{v}_{ifn} = \sum_k q_{ik} w_{fk} h_{nk}  (common to both models)
  Optimization        \min_{Q,W,H \geq 0} \sum_{ifn} d_{KL}(v_{ifn}|\hat{v}_{ifn})   \min_{Q,W,H \geq 0} \sum_{ifn} d_{IS}(v_{ifn}|\hat{v}_{ifn})
Reconstruction        \widehat{|c_{kfn}^{(i)}|} = \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} |x_{ifn}|   \hat{c}_{kfn}^{(i)} = \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} x_{ifn}
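As an illustration of the reconstruction formula (22), the following NumPy sketch (illustrative, not the authors' code) computes the MMSE component estimates by Wiener-like weighting of the mixture STFT with the modelled variances.

```python
# MMSE component estimates of Eq. (22): each component receives the fraction
# q_ik w_fk h_nk / sum_l q_il w_fl h_nl of the mixture STFT coefficient x_ifn.
import numpy as np

def mmse_components(X, Q, W, H, eps=1e-12):
    # X: complex I x F x N mixture STFT; Q: I x K, W: F x K, H: N x K
    V_hat = np.einsum('ik,fk,nk->ifn', Q, W, H) + eps      # modelled variances
    gains = np.einsum('ik,fk,nk->kifn', Q, W, H) / V_hat   # Wiener-like gains
    return gains * X[None, ...]                            # K x I x F x N estimates
```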

3 Algorithms for NTF


3.1 Standard NTF
We are now left with an optimization problem of the form
\min_{Q,W,H} D(V|\hat{V}) \stackrel{\text{def}}{=} \sum_{ifn} d(v_{ifn} \mid \hat{v}_{ifn}) \quad \text{subject to} \quad Q, W, H \geq 0    (23)

where \hat{v}_{ifn} = \sum_k q_{ik} w_{fk} h_{nk}, and d(x|y) is the cost function, either the KL or IS
divergence in our case. Furthermore we impose \|q_k\|_1 = 1 and \|w_k\|_1 = 1, so as
to remove obvious scale indeterminacies between the three loading matrices Q,
W and H. With these conventions, the columns of Q convey normalized mix-
ing proportions (spatial cues) between the channels, the columns of W convey
normalized frequency shapes and all time-dependent amplitude information is
relegated into H.
As common practice in NMF and NTF, we employ multiplicative algorithms
for the minimization of D(V|V̂). These algorithms essentially consist of updating
each scalar parameter θ by multiplying its value at previous iteration by the
ratio of the negative and positive parts of the derivative of the criterion w.r.t.
this parameter, namely
\theta \leftarrow \theta \cdot \frac{[\nabla_\theta D(V|\hat{V})]_-}{[\nabla_\theta D(V|\hat{V})]_+},    (24)
where ∇θ D(V|V̂) = [∇θ D(V|V̂)]+ −[∇θ D(V|V̂)]− and the summands are both
nonnegative [4]. This scheme automatically ensures the nonnegativity of the pa-
rameter updates, provided initialization with a nonnegative value. The derivative
of the criterion w.r.t. the scalar parameter \theta writes

\nabla_\theta D(V|\hat{V}) = \sum_{ifn} \nabla_\theta \hat{v}_{ifn} \; d'(v_{ifn} \mid \hat{v}_{ifn})    (25)

where d'(x|y) = \nabla_y d(x|y). As such, we get

\nabla_{q_{ik}} D(V|\hat{V}) = \sum_{fn} w_{fk} h_{nk} \, d'(v_{ifn} \mid \hat{v}_{ifn})    (26)

\nabla_{w_{fk}} D(V|\hat{V}) = \sum_{in} q_{ik} h_{nk} \, d'(v_{ifn} \mid \hat{v}_{ifn})    (27)

\nabla_{h_{nk}} D(V|\hat{V}) = \sum_{if} q_{ik} w_{fk} \, d'(v_{ifn} \mid \hat{v}_{ifn})    (28)

We note in the following G the I × F × N tensor with entries g_{ifn} = d'(v_{ifn} \mid \hat{v}_{ifn}).
For the KL and IS cost functions we have

d'_{KL}(x|y) = 1 - \frac{x}{y}    (29)

d'_{IS}(x|y) = \frac{1}{y} - \frac{x}{y^2}    (30)
Let A and B be F × K and N × K matrices. We denote A ◦ B the F × N × K
tensor with elements af k bnk , i.e, each frontal slice k contains the outer product
ak bTk .1 Now we note < S, T >KS ,KT the contracted product between tensors S
and T, defined in Appendix B, where KS and KT are the sets of mode indices
over which the summation takes place. With these definitions we get

∇Q D(V|V̂) = < G, W ◦ H >{2,3},{1,2} (31)


∇W D(V|V̂) = < G, Q ◦ H >{1,3},{1,2} (32)
∇H D(V|V̂) = < G, Q ◦ W >{1,2},{1,2} (33)

and multiplicative updates are obtained as


Q \leftarrow Q \,.\, \frac{< G_-, W \circ H >_{\{2,3\},\{1,2\}}}{< G_+, W \circ H >_{\{2,3\},\{1,2\}}}    (34)

W \leftarrow W \,.\, \frac{< G_-, Q \circ H >_{\{1,3\},\{1,2\}}}{< G_+, Q \circ H >_{\{1,3\},\{1,2\}}}    (35)

H \leftarrow H \,.\, \frac{< G_-, Q \circ W >_{\{1,2\},\{1,2\}}}{< G_+, Q \circ W >_{\{1,2\},\{1,2\}}}    (36)
The resulting algorithm can easily be shown to nonincrease the cost function at
each iteration by generalizing existing proofs for KL-NMF [13] and for IS-NMF
[1]. In our implementation, normalization of the variables is carried out at the
end of every iteration by dividing each column of Q by its \ell_1 norm and scaling
the columns of W accordingly, then dividing each column of W by its \ell_1 norm
and scaling the columns of H accordingly.
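For concreteness, a minimal NumPy sketch of the multiplicative updates (34)-(36) for the KL cost, where G_- = V/V̂ and G_+ is the all-ones tensor (cf. Eq. (29)), including the normalization just described; this is an illustrative reimplementation, not the authors' code.

```python
# Multiplicative KL-NTF of an I x F x N nonnegative tensor V into Q (I x K),
# W (F x K), H (N x K). The contracted products of (34)-(36) reduce to einsums.
import numpy as np

def kl_ntf(V, K, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    I, F, N = V.shape
    Q, W, H = rng.random((I, K)), rng.random((F, K)), rng.random((N, K))
    for _ in range(n_iter):
        # Q update: <G_-, W o H> / <G_+, W o H>, with G_- = V / V_hat and G_+ = 1
        G = V / (np.einsum('ik,fk,nk->ifn', Q, W, H) + eps)
        Q *= np.einsum('ifn,fk,nk->ik', G, W, H) / (W.sum(0) * H.sum(0) + eps)
        # W update
        G = V / (np.einsum('ik,fk,nk->ifn', Q, W, H) + eps)
        W *= np.einsum('ifn,ik,nk->fk', G, Q, H) / (Q.sum(0) * H.sum(0) + eps)
        # H update
        G = V / (np.einsum('ik,fk,nk->ifn', Q, W, H) + eps)
        H *= np.einsum('ifn,ik,fk->nk', G, Q, W) / (Q.sum(0) * W.sum(0) + eps)
        # normalize columns of Q (l1 norm), rescale W; then columns of W, rescale H
        s = Q.sum(0) + eps; Q /= s; W *= s
        s = W.sum(0) + eps; W /= s; H *= s
    return Q, W, H
```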

3.2 Cluster NTF


For ease of presentation of the statistical composite models inherent to NTF, we
have assumed in Section 2.2 and onwards that K = J, i.e., that one source sj (t)
is one elementary component ck (t) with its own mixing parameters {aik }i . We
now turn back to our more general model (9), where each source sj (t) is a sum
of elementary components {ck (t)}k∈Kj sharing same mixing parameters {aik }i ,
i.e, mik = aij iff k ∈ Kj . As such, we can express M as

M = AL (37)
¹ This is similar to the Khatri-Rao product of A and B, which returns a matrix of
dimensions F N × K with column k equal to the Kronecker product of a_k and b_k.

where A is the I × J mixing matrix and L is a J × K “labelling matrix” with


only one nonzero value per column, i.e., such that

ljk = 1 iff k ∈ Kj (38)
ljk = 0 otherwise. (39)

This specific structure of M transfers equivalently to Q, so that

Q = DL (40)

where

D = |A| \quad \text{in KL-NTF.mag}    (41)

D = |A|^2 \quad \text{in IS-NTF.pow}    (42)

The structure of Q defines a new NTF, which we refer to as Cluster NTF,


denoted cNTF. The minimization problem (23) is unchanged except for the fact
that the minimization over Q is replaced by a minimization over D. As such,
the derivatives w.r.t. wf k , hnk do not change and the derivatives over dij write

\nabla_{d_{ij}} D(V|\hat{V}) = \sum_{fn} \Big( \sum_k l_{jk} w_{fk} h_{nk} \Big) \, d'(v_{ifn} \mid \hat{v}_{ifn})    (43)
                   = \sum_k l_{jk} \sum_{fn} w_{fk} h_{nk} \, d'(v_{ifn} \mid \hat{v}_{ifn})    (44)

i.e.,

\nabla_D D(V|\hat{V}) = < G, W \circ H >_{\{2,3\},\{1,2\}} \, L^T    (45)
so that multiplicative updates for D can be obtained as
D \leftarrow D \,.\, \frac{< G_-, W \circ H >_{\{2,3\},\{1,2\}} \, L^T}{< G_+, W \circ H >_{\{2,3\},\{1,2\}} \, L^T}    (46)
As before, we normalize the columns of D by their \ell_1 norm at the end of every
iteration, and scale the columns of W accordingly.
In our Matlab implementation the resulting multiplicative algorithm for IS-
cNTF.pow is 4 times faster than the one presented in [10] (for linear instanta-
neous mixtures), which was based on sequential updates of the matrices [qk ]k∈Kj ,
[wk ]k∈Kj , [hk ]k∈Kj . The Matlab code of this new algorithm as well as the
other algorithms described in this paper can be found online at http://perso.
telecom-paristech.fr/~fevotte/Samples/cmmr10/.
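The Cluster-NTF structure Q = DL and the update (46) can be sketched as follows (an illustrative NumPy reimplementation for the KL cost; the partition format and names are assumptions, and the released Matlab code linked above is the reference implementation).

```python
# Build the J x K labelling matrix L of Eqs. (38)-(39) and apply one
# multiplicative update (46) to D for the KL cost (G_+ = 1, G_- = V / V_hat).
import numpy as np

def label_matrix(partition, K):
    # partition: list of J index lists, e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    L = np.zeros((len(partition), K))
    for j, Kj in enumerate(partition):
        L[j, Kj] = 1.0
    return L

def update_D(D, L, V, W, H, eps=1e-12):
    Q = D @ L                                               # I x K spatial cues
    G = V / (np.einsum('ik,fk,nk->ifn', Q, W, H) + eps)     # G_-
    num = np.einsum('ifn,fk,nk->ik', G, W, H) @ L.T         # <G_-, W o H> L^T
    den = np.tile(W.sum(0) * H.sum(0), (V.shape[0], 1)) @ L.T + eps
    return D * num / den
```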

4 Results
We consider source separation of simple audio mixtures taken from the Signal
Separation Evaluation Campaign (SiSEC 2008) website. More specifically, we
used some “development data” from the “underdetermined speech and music
mixtures task” [18]. We considered the following datasets :

– wdrums, a linear instantaneous stereo mixture (with positive mixing coeffi-


cients) of 2 drum sources and 1 bass line,
– nodrums, a linear instantaneous stereo mixture (with positive mixing co-
efficients) of 1 rhythmic acoustic guitar, 1 electric lead guitar and 1 bass
line.
The signals are of length 10 sec and sampled at 16 kHz. We applied a STFT
with sine bell of length 64 ms (1024 samples) leading to F = 513 and N = 314.
We applied the following algorithms to the two datasets :
– KL-NTF.mag with K = 9,
– IS-NTF.pow with K = 9,

– KL-cNTF.mag with J = 3 and 3 components per source, leading to K = 9,
– IS-cNTF.pow with J = 3 and 3 components per source, leading to K = 9.

Fig. 1. Mixing parameters estimation and ground truth. Top: wdrums dataset. Bottom:
nodrums dataset. Left: results of KL-NTF.mag and KL-cNTF.mag; ground truth mixing
vectors {|a_j|}_j (red), mixing vectors {d_j}_j estimated with KL-cNTF.mag (blue),
spatial cues {q_k}_k given by KL-NTF.mag (dashed, black). Right: results of IS-NTF.pow
and IS-cNTF.pow; ground truth mixing vectors {|a_j|^2}_j (red), mixing vectors {d_j}_j
estimated with IS-cNTF.pow (blue), spatial cues {q_k}_k given by IS-NTF.pow
(dashed, black).

Each of the four algorithms was run 10 times from 10 random initializations for 1000
iterations. For every algorithm we then selected the solution Q, W and H yielding the
smallest cost value. Time-domain components were reconstructed as discussed
in Section 2.2 for KL-NTF.mag and KL-cNTF.mag and as in Section 2.3 for
IS-NTF.pow and IS-cNTF.pow. Given these reconstructed components, source
estimates were formed as follows :

– For KL-cNTF.mag and IS-cNTF.pow, sources are immediately computed


using Eq. (8), because the partition K1 , . . . , KJ is known.
– For KL-NTF.mag and IS-NTF.pow, we used the approach of [5,6] consisting
of applying the K-means algorithm to Q (with J clusters) so as to label every
component k to a source j, and each of the J sources is then reconstructed
as the sum of its assigned components (a sketch of this grouping step is given below).
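A minimal sketch of this posterior grouping step, assuming scikit-learn's K-means and a hypothetical array `components` holding the K reconstructed time-domain components (the implementation accompanying the paper is in Matlab).

```python
# Cluster the spatial cues (columns of Q) into J sources, then sum the
# time-domain components assigned to each cluster.
import numpy as np
from sklearn.cluster import KMeans

def group_components(Q, components, J=3):
    # Q: I x K spatial cues; components: K x T reconstructed time-domain components
    labels = KMeans(n_clusters=J, n_init=10, random_state=0).fit_predict(Q.T)
    return [components[labels == j].sum(axis=0) for j in range(J)]
```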

Note that we are here not reconstructing the original single-channel sources
s_j(t) but their multichannel contribution [s_j^{(1)}(t), . . . , s_j^{(I)}(t)] to the multichan-
nel data (i.e, their spatial image). The quality of the source image estimates
was assessed using the standard Signal to Distortion Ratio (SDR), source Im-
age to Spatial distortion Ratio (ISR), Source to Interference Ratio (SIR) and
Source to Artifacts Ratio (SAR) defined in [17]. The numerical results are
reported in Table 2. The source estimates may also be listened to online at
http://perso.telecom-paristech.fr/~fevotte/Samples/cmmr10/. Figure 1
displays estimated spatial cues together with ground truth mixing matrix, for
every method and dataset.

Discussion. On dataset wdrums best results are obtained with IS-cNTF.pow.


Top right plot of Figure 1 shows that the spatial cues returned by D reasonably
fit the original mixing matrix |A|2 . The slightly better results of IS-cNTF.pow
compared to IS-NTF.pow illustrate the benefit of performing clustering of the
spatial cues within the decomposition as opposed to after. On this dataset
KL-cNTF.mag fails to adequately estimate the mixing matrix. Top left plot of
Figure 1 shows that the spatial cues corresponding to the bass and hi-hat are
correctly captured, but it appears that two columns of D are “spent” on rep-
resenting the same direction (bass, s3 ), suggesting that more components are
needed to represent the bass, and failing to capture the drums, which are poorly
estimated. KL-NTF.mag performs better (and as such, one spatial cue qk is cor-
rectly fitted to the drums direction) but overall not as well as IS-NTF.pow and
IS-cNTF.pow.
On dataset nodrums best results are obtained with KL-NTF.mag. None of
the other methods adequately fits the ground truth spatial cues. KL-cNTF.mag
suffers from the same problem as on dataset wdrums: two columns of D are spent
on the bass. In contrast, none of the spatial cues estimated by IS-NTF.pow
and IS-cNTF.pow accurately captures the bass direction, and ŝ1 and ŝ2 both

Table 2. SDR, ISR, SIR and SAR of source estimates for the two considered datasets.
Higher values indicate better results. Values in bold font indicate the results with best
average SDR.

wdrums dataset:
                        s1 (Hi-hat)   s2 (Drums)   s3 (Bass)
KL-NTF.mag    SDR          -0.2          0.4         17.9
              ISR          15.5          0.7         31.5
              SIR           1.4         -0.9         18.9
              SAR           7.4         -3.5         25.7
KL-cNTF.mag   SDR         -0.02        -14.2          1.9
              ISR          15.3          2.8          2.1
              SIR           1.5        -15.0         18.9
              SAR           7.8         13.2          9.2
IS-NTF.pow    SDR          12.7          1.2         17.4
              ISR          17.3          1.7         36.6
              SIR          21.1         14.3         18.0
              SAR          15.2          2.7         27.3
IS-cNTF.pow   SDR          13.1          1.8         18.0
              ISR          17.0          2.5         35.4
              SIR          22.0         13.7         18.7
              SAR          15.9          3.4         26.5

nodrums dataset:
                        s1 (Bass)   s2 (Lead G.)   s3 (Rhythmic G.)
KL-NTF.mag    SDR          13.2         -1.8            1.0
              ISR          22.7          1.0            1.2
              SIR          13.9         -9.3            6.1
              SAR          24.2          7.4            2.6
KL-cNTF.mag   SDR           5.8         -9.9            3.1
              ISR           8.0          0.7            6.3
              SIR          13.5        -15.3            2.9
              SAR           8.3          2.7            9.9
IS-NTF.pow    SDR           5.0        -10.0           -0.2
              ISR           7.2          1.9            4.2
              SIR          12.3        -13.5            0.3
              SAR           7.2          3.3           -0.1
IS-cNTF.pow   SDR           3.9        -10.2           -1.9
              ISR           6.2          3.3            4.6
              SIR          10.6        -10.9           -3.7
              SAR           3.7          1.0            1.5

contain much bass and lead guitar.² Results from all four methods on this dataset
are overall much worse than with dataset wdrums, corroborating an established
idea that percussive signals are favorably modeled by NMF models [7].
Increasing the number of total components K did not seem to solve the observed
deficiencies of the 4 approaches on this dataset.

² The numerical evaluation criteria were computed using the bss_eval.m function
available from the SiSEC website. The function automatically pairs source estimates
with ground truth signals according to best mean SIR. This resulted here in pairing
the left, middle and right blue directions with respectively the left, middle and right
red directions, i.e., preserving the panning order.

5 Conclusions
In this paper we have attempted to clarify the statistical models latent to audio
source separation using PARAFAC-NTF of the magnitude or power spectro-
gram. In particular we have emphasized that the PARAFAC-NTF does not op-
timally exploits interchannel redundancy in the presence of point-sources. This
still may be sufficient to estimate spatial cues correctly in linear instantaneous
mixtures, in particular when the NMF model suits the sources well, as seen from
the results on dataset wdrums, but may also lead to incorrect results in other cases,
as seen from the results on dataset nodrums. In contrast, methods fully exploiting
interchannel dependencies, such as the EM algorithm based on model (17)-(18)
with c_{kfn}^{(i)} = c_{kfn} in [10], can successfully estimate the mixing matrix in both
datasets. The latter method is however about 10 times computationally more
demanding than IS-cNTF.pow.
In this paper we have considered a variant of PARAFAC-NTF in which the
loading matrix Q is given a structure such that Q = DL. We have assumed that
L is a known labelling matrix that reflects the partition K_1, . . . , K_J. An important
perspective of this work is to leave the labelling matrix free and automatically
estimate it from the data, either under the constraint that every column l_k of L
may contain only one nonzero entry, akin to hard clustering, i.e., \|l_k\|_0 = 1, or
more generally under the constraint that \|l_k\|_0 is small, akin to soft clustering.
This should be made feasible using NTF under sparse \ell_1-constraints and is left
for future work.

References
1. Cao, Y., Eggermont, P.P.B., Terebey, S.: Cross Burg entropy maximization and its
application to ringing suppression in image reconstruction. IEEE Transactions on
Image Processing 8(2), 286–292 (1999)
2. Cemgil, A.T.: Bayesian inference for nonnegative matrix factorisation models.
Computational Intelligence and Neuroscience (Article ID 785152), 17 pages (2009);
doi:10.1155/2009/785152
3. Févotte, C.: Itakura-Saito nonnegative factorizations of the power spectrogram
for music signal decomposition. In: Wang, W. (ed.) Machine Audition: Principles,
Algorithms and Systems, ch. 11. IGI Global Press (August 2010), http://perso.
telecom-paristech.fr/~fevotte/Chapters/isnmf.pdf
4. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with
the Itakura-Saito divergence. With application to music analysis. Neural Com-
putation 21(3), 793–830 (2009), http://www.tsi.enst.fr/~fevotte/Journals/
neco09_is-nmf.pdf
5. FitzGerald, D., Cranitch, M., Coyle, E.: Non-negative tensor factorisation for sound
source separation. In: Proc. of the Irish Signals and Systems Conference, Dublin,
Ireland (September 2005)
6. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisa-
tion models for musical sound source separation. Computational Intelligence and
Neuroscience (Article ID 872425), 15 pages (2008)
7. Helén, M., Virtanen, T.: Separation of drums from polyphonic music using non-
negative matrix factorization and support vector machine. In: Proc. 13th European
Signal Processing Conference (EUSIPCO 2005) (2005)
8. Lee, D.D., Seung, H.S.: Learning the parts of objects with nonnegative matrix
factorization. Nature 401, 788–791 (1999)
9. Neeser, F.D., Massey, J.L.: Proper complex random processes with applications to
information theory. IEEE Transactions on Information Theory 39(4), 1293–1302
(1993)

10. Ozerov, A., Févotte, C.: Multichannel nonnegative matrix factorization in convolu-
tive mixtures for audio source separation. IEEE Transactions on Audio, Speech and
Language Processing 18(3), 550–563 (2010), http://www.tsi.enst.fr/~fevotte/
Journals/ieee_asl_multinmf.pdf
11. Parry, R.M., Essa, I.: Estimating the spatial position of spectral components in
audio. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006.
LNCS, vol. 3889, pp. 666–673. Springer, Heidelberg (2006)
12. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to
statistics and computer vision. In: Proc. 22nd International Conference on Machine
Learning, pp. 792–799. ACM, Bonn (2005)
13. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomogra-
phy. IEEE Transactions on Medical Imaging 1(2), 113–122 (1982)
14. Smaragdis, P.: Convolutive speech bases and their application to speech separation.
IEEE Transactions on Audio, Speech, and Language Processing 15(1), 1–12 (2007)
15. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music
transcription. In: IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (WASPAA 2003) (October 2003)
16. Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind au-
dio source separation. IEEE Transactions on Audio, Speech and Language Pro-
cessing 14(4), 1462–1469 (2006), http://www.tsi.enst.fr/~fevotte/Journals/
ieee_asl_bsseval.pdf
17. Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.P.: First stereo audio source
separation evaluation campaign: Data, algorithms and results. In: Davies, M.E.,
James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666,
pp. 552–559. Springer, Heidelberg (2007)
18. Vincent, E., Araki, S., Bofill, P.: Signal Separation Evaluation Campaign.
In: (SiSEC 2008) / Under-determined speech and music mixtures task re-
sults (2008), http://www.irisa.fr/metiss/SiSEC08/SiSEC_underdetermined/
dev2_eval.html
19. Virtanen, T.: Monaural sound source separation by non-negative matrix factor-
ization with temporal continuity and sparseness criteria. IEEE Transactions on
Audio, Speech and Language Processing 15(3), 1066–1074 (2007)

A Standard Distributions
Proper complex Gaussian: \mathcal{N}_c(x \mid \mu, \Sigma) = |\pi \Sigma|^{-1} \exp\left( -(x - \mu)^H \Sigma^{-1} (x - \mu) \right)

Poisson: \mathcal{P}(x \mid \lambda) = \exp(-\lambda) \, \frac{\lambda^x}{x!}

B Contracted Tensor Product


Let $S$ be a tensor of size $I_1 \times \ldots \times I_M \times J_1 \times \ldots \times J_N$ and $T$ be a tensor of size $I_1 \times \ldots \times I_M \times K_1 \times \ldots \times K_P$. Then, the contracted product $\langle S, T\rangle_{\{1,\ldots,M\},\{1,\ldots,M\}}$ is a tensor of size $J_1 \times \ldots \times J_N \times K_1 \times \ldots \times K_P$, given by

$$\langle S, T\rangle_{\{1,\ldots,M\},\{1,\ldots,M\}} = \sum_{i_1=1}^{I_1}\cdots\sum_{i_M=1}^{I_M} s_{i_1,\ldots,i_M,j_1,\ldots,j_N}\, t_{i_1,\ldots,i_M,k_1,\ldots,k_P} \qquad (47)$$

The contracted tensor product should be thought of as a form of generalized dot product of two tensors along common modes of the same dimensions.
What Signal Processing Can Do for the Music

Isabel Barbancho, Lorenzo J. Tardón, Ana M. Barbancho, Andrés Ortiz,
Simone Sammartino, and Cristina de la Bandera

Grupo de Aplicación de las Tecnologías de la Información y Comunicaciones,
Departamento de Ingeniería de Comunicaciones,
E.T.S. Ingeniería de Telecomunicación, Campus de Teatinos s/n,
Universidad de Málaga, Spain
[email protected]
http://webpersonal.uma.es/~IBP/index.htm

Abstract. In this paper, several examples of what signal processing can do in the music context are presented. In this contribution, music content includes not only audio files but also scores. Using advanced signal processing techniques, we have developed new tools that help us handle music information, preserve, develop and disseminate our cultural music assets, and improve our learning and education systems.

Keywords: Music Signal Processing, Music Analysis, Music Transcription, Music Information Retrieval, Optical Music Recognition, Pitch Detection.

1 Introduction
Signal processing techniques are a powerful set of mathematical tools that allow the information required for a certain purpose to be extracted from a signal. They can be used for any type of signal: communication signals, medical signals, speech signals, multimedia signals, etc. In this contribution, we focus on the application of signal processing techniques to music information: audio and scores.
Signal processing techniques can be used for music database exploration. In
this field, we present an adaptive 3D environment for music content exploration that allows musical contents to be explored in a novel way. The songs are
analyzed and a series of numerical descriptors are computed to characterize
their spectral content. Six main musical genres are defined as axes of a multidi-
mensional framework, where the songs are projected. A three-dimensional sub-
domain is defined by choosing three of the six genres at a time and the user is
allowed to navigate in this space, browsing, exploring and analyzing the elements
of this musical universe. Also, inside this field of music database exploration, a
novel method for music similarity evaluation is presented. The evaluation of mu-
sic similarity is one of the core components of the field of Music Information
Retrieval (MIR). In this study, rhythmic and spectral analyses are combined to
extract the tonal profile of musical compositions and evaluate music similarity.
Music signal processing can also be used for the preservation of cultural heritage. In this sense, we have developed a complete system with an interactive

graphical user interface for Optical Music Recognition (OMR), specially adapted
for scores written in white mensural notation. Color photographs of ancient scores taken at the Archivo de la Catedral de Málaga have been used as input to the system. A series of pre-processing steps aim to improve their quality
and return binary images to be processed. The music symbols are extracted and
classified, so that the system is able to transcribe the ancient music notation
into modern notation and make it sound.
Music signal processing can also focus on developing tools for technology-enhanced learning and revolutionary learning appliances. In this sense, we present different applications we have developed to help learning different instruments: piano, violin and guitar. The graphical tool for piano learning is able to detect whether a person is playing the proper piano chord. It shows the user the time and frequency response of each frame of piano sound under analysis, together with a piano keyboard in which the played notes are highlighted, as well as the names of the played notes. The core of the tool is a polyphonic transcription system able to detect the played notes, based on the use of spectral patterns of the piano notes. The tool is useful both for users with musical knowledge and for users without it. The violin learning tool is based on a transcription system able to detect the pitch and duration of the violin notes and to identify the different expressiveness techniques: détaché with and without vibrato, pizzicato, tremolo, spiccato and flageolett-töne. The interface is a pedagogical tool to aid violin learning. For the guitar, we have developed a system able to perform string and fret estimation of guitar notes in real time. The system works in three modes: it can estimate the string and fret of a single note played on a guitar, it can recognize strummed chords from a predefined list, and it can also make a free estimation if no information about what is being played is given. Also, we have developed a lightweight pitch detector for embedded systems to be used in toys. The detector is based on a neural network in which the signal preprocessing is a frequency analysis. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm is the selected technique for the frequency analysis because it is a light alternative to FFT computation and is very well suited when only a few spectral points are enough to extract the relevant information.
Therefore, the outline of the paper is as follows. In Section 2, tools related to music content management are presented. Section 3 is devoted to the tool directly related to the preservation of cultural heritage. Section 4 presents the different tools developed for technology-enhanced music learning. Finally, the conclusions are presented in Section 5.

2 Music Content Management


The huge amount of digital musical content available through different databases makes it necessary to have intelligent music signal processing tools that help us manage all this information.
In subsection 2.1, a novel tool for navigating through music content is presented. This 3D navigation environment makes it easier to look for inter-related musical contents, and it also gives users the opportunity to get to know certain types of music that they would not have found through more traditional ways of searching musical contents.
In order to use a 3D environment such as the one presented, or other types of methods for music information retrieval, the evaluation of music similarity is one
of the core components. In subsection 2.2, the rhythmic and spectral analyses of
music contents are combined to extract the tonal profile of musical compositions
and evaluate music similarity.

2.1 3D Environment for Music Content Exploration


Interactive music exploration is an open problem [31], with increasing interest due to the growing possibilities of accessing large music databases. Efforts to automate and simplify access to musical contents require analyzing the songs to obtain numerical descriptors in the time or frequency domains that can be used to measure and compare differences and similarities among them. We have developed an adaptive 3D environment that allows intuitive music exploration and browsing through its graphical interface. Music analysis is based on the Mel frequency cepstral coefficients (MFCCs) [27]; a multidimensional space is built and each song is represented as a sphere in a 3D environment with tools to navigate, listen to and query the music space.
The MFCCs are essentially based on the short-term Fourier transform. The windowed spectrum of the original signal is computed, a Mel filter bank is applied to obtain a logarithmic frequency representation, and the resulting spectrum is processed with a discrete cosine transform (DCT). The Mel coefficients then have to be clustered into a few groups in order to achieve a compact representation of the global spectral content of the signal. Here, the popular k-means clustering method has been employed, and the centroid of the most populated cluster has been taken as a compact vectorial representation of the spectral content of the whole piece. This approach has been applied to a large number of samples for each of the six selected genres, and a predominant vector has been computed for each genre. These vectors are used as pseudo-orthonormal reference vectors for the projection of the songs. In particular, for each song, the six coordinates have been obtained by computing the scalar product between the predominant vector of the song itself and those of the six genres, conveniently normalized to unit norm. A minimal sketch of this descriptor computation is given below.
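As an illustration only, the sketch below assumes the librosa and scikit-learn Python libraries; the helper names (song_descriptor, genre_axes, song_coordinates), the MFCC and clustering parameters, and the averaging/normalization details are assumptions for the example, not the exact processing used by the authors.

import numpy as np
import librosa
from sklearn.cluster import KMeans

def song_descriptor(path, n_mfcc=13, n_clusters=8):
    """Compact spectral descriptor of a song: the centroid of the most
    populated k-means cluster of its frame-wise MFCC vectors."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(mfcc)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_[np.argmax(counts)]

def genre_axes(genre_song_paths):
    """One predominant vector per genre (here: mean descriptor over example
    songs), normalized to unit norm to act as a pseudo-orthonormal axis."""
    axes = {}
    for genre, paths in genre_song_paths.items():
        v = np.mean([song_descriptor(p) for p in paths], axis=0)
        axes[genre] = v / np.linalg.norm(v)
    return axes

def song_coordinates(path, axes):
    """Project a song onto the genre axes via scalar products."""
    d = song_descriptor(path)
    d = d / np.linalg.norm(d)
    return {genre: float(np.dot(d, v)) for genre, v in axes.items()}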
The graphical user interface comprises a main window with different functional panels (Figure 1). In the main panel, the representation of the songs in a 3D framework is shown: three orthogonal axes, representing the three selected genres, are centered in the coordinate range, and the set of songs is represented as blue spheres with their corresponding titles. A set of other panels with different functions completes the window. During the exploration of the space, the user is informed in real time about the closest songs and can listen to them.

Fig. 1. The graphical user interface for the 3D exploration of musical audio

2.2 Evaluation of Music Similarity Based on Tonal Behavior

The evaluation of music similarity is one of the core components of the field of Music Information Retrieval (MIR). Similarity is often computed on the basis of
the extraction of low-level time and frequency descriptors [25] or on the compu-
tation of rhythmic patterns [21]. Logan and Salomon [26] use the Mel Frequency
Cepstral Coefficients (MFCCs) as a main tool to compare audio tracks, based
on their spectral content. Ellis et al. [13] adopt the cross-correlation of rhythmic
patterns to identify common parts among songs.
In this study, rhythmic and spectral analyses are combined to extract the tonal profile of musical compositions and evaluate music similarity. The processing stage comprises two main steps: the computation of the main rhythmic meter of the song and the estimation of the distribution of contributions of tonalities to the overall tonal content of the composition. The calculation of the cross-correlation of the rhythmic pattern of the envelope of the raw signal allows a quantitative estimation of the main melodic motif of the song. This temporal unit is employed as a basis for the temporal segmentation of the signal, aimed at extracting the pitch class profile of the song [14] and, consequently, the vector of tonality contributions. Finally, this tonal behavior vector is employed as the main feature to describe the song and is used to evaluate similarity.
Estimation of the melodic cell. In order to characterize the main melodic motif of the track, the songs are analyzed to estimate the tempo. Rather than a true quantitative metrical analysis of the rhythmic pattern, the method aims at delivering measures for guiding the temporal segmentation of the musical signal, and at subsequently improving the representation of the song dynamics. This is intended to optimize the computation of the tonal content of the audio signal by supplying the reference temporal frame for the audio windowing. The aim of the tempo induction is to estimate the width of the window used for windowing, so that the stage for the computation of the tonal content of the song
attains improved performance. In particular, the window should be wide enough to include a single melodic cell, e.g., a single chord. Usually, the distribution of tone contributions within a single melodic cell is uniform and coherent with the chord content. The chord notes are played once per melodic cell, so that, by evaluating the tone content of each single cell, we can have a reliable idea of the global contribution of the single tonalities for the whole track. There are clearly exceptions to these assumptions, such as arpeggios, solos, etc. Both the width and the phase (temporal location) of the window are extremely important for achieving the best performance of the spectral analysis.
A series of frequency analysis stages are performed on the raw signal in order to obtain the most robust estimate of the window. The signal is half-wave rectified and low-pass filtered, and its envelope is computed. The main window value is assumed to be best estimated by the average temporal distance between the points of the first-order derivative of the envelope showing the highest difference between crests and troughs. The steps are schematically listed below (a code sketch follows the list):

1. The raw signal is half-wave rectified and filtered with a low-pass Butterworth filter, with a cut-off frequency of 100 Hz [12].
2. The envelope of the filtered signal is computed, using a low-pass Butterworth filter with a cut-off frequency of 1 Hz.
3. The first-order derivative is computed on the envelope.
4. The zero-crossing points of the derivative are found (the crests and the troughs of the envelope).
5. The difference between crests and troughs is computed and its empirical cumulative distribution is evaluated.
6. Only the values exceeding the 75th percentile of the cumulative distribution are kept.
7. The temporal distances among the selected troughs (or crests) are computed and the average value is calculated.
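A minimal sketch of these steps, assuming NumPy and SciPy, is given below; the filter orders, the crest/trough pairing and the function name estimate_melodic_cell are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def estimate_melodic_cell(x, fs, percentile=75.0):
    """Estimate the melodic-cell (window) width in seconds and its phase
    (time of the first selected trough), following steps 1-7 above."""
    # 1. Half-wave rectification and 100 Hz low-pass Butterworth filter
    b, a = butter(4, 100.0 / (fs / 2.0), btype='low')
    filtered = filtfilt(b, a, np.maximum(x, 0.0))
    # 2. Envelope via a 1 Hz low-pass Butterworth filter
    b, a = butter(2, 1.0 / (fs / 2.0), btype='low')
    envelope = filtfilt(b, a, np.abs(filtered))
    # 3.-4. First-order derivative and its zero crossings (crests and troughs)
    deriv = np.diff(envelope)
    sign_change = np.diff(np.sign(deriv))
    crests = np.where(sign_change < 0)[0] + 1    # derivative goes + to -
    troughs = np.where(sign_change > 0)[0] + 1   # derivative goes - to +
    n = min(len(crests), len(troughs))
    # 5.-6. Crest-trough differences; keep those above the given percentile
    diffs = np.abs(envelope[crests[:n]] - envelope[troughs[:n]])
    keep = diffs >= np.percentile(diffs, percentile)
    selected = troughs[:n][keep]
    # 7. Average temporal distance between the selected troughs
    if len(selected) < 2:
        return None, None
    width = float(np.mean(np.diff(selected))) / fs
    phase = float(selected[0]) / fs              # starting point for windowing
    return width, phase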

A further fundamental parameter is the phase of the detected tempo. It ensures the correct matching between the windowing of the signal and the extent of the melodic cell, which helps to minimize temporal shifting. This is achieved by locating the position of the first trough detected in the signal as the starting point for the windowing stage.
The algorithm described has been employed to obtain an estimation of the melodic cell that is used in the subsequent steps of the computation of the tonal content. An objective evaluation of the performance of the method is hard to achieve because of the fuzzy perception of the main motif of the song by the human ear. Moreover, strict metrical regularity is rarely found in modern music [20], and slight variations in the rhythm throughout a whole composition are barely perceived by the listener. Nevertheless, a set of 30 songs has been selected from a set of 5 genres (Pop, Classical, Electro-Disco, Heavy Metal and Jazz). The songs have been analyzed by experienced listeners and the width of their main metric unit has been manually quantified. Then, the results obtained by the automatic procedure described have been compared.
Table 1. Relative and absolute differences between the widths of the melodic window manually evaluated by the listeners and the ones automatically computed by the proposed algorithm

Genre            Relative Difference (%)    Absolute Difference (s)
Pop                      14.6                       0.60
Classical                21.2                       0.99
Electro-Disco             6.2                       0.34
Heavy Metal              18.4                       0.54
Jazz                     14.8                       0.58
Mean                     17.3                       0.68

In Table 1, the differences between the widths of the window manually measured
and automatically computed are shown.
The best results are obtained for the Electro-Disco tracks (6.2%), where the clear drummed bass background is well detected and the pulse coincides most of the time with the tempo. The worst results are related to the lack of a clear driving bass in Classical music (21.2%), where changes in time can be frequent and a uniform tempo measure is hardly detectable. However, the beats, or lower-level metrical features, are most of the time submultiples of such a tempo value, which makes them usable for the melodic cell computation.
Tonal behavior. Most music similarity systems aim at imitating the human perception of a song. This capacity is complex to analyze. The human brain carries out a series of subconscious processes, such as the computation of the rhythm, the instrumental richness, the musical complexity, the tonality, the mode, the musical form or structure, the presence of modulations, etc., even without any technical musical knowledge [29].
A novel technique for the determination of the tonal behavior of music signals, based on the extraction of the pattern of tonality contributions, is presented. The main process is based on the calculation of the contributions of each note of the chromatic scale (Pitch Class Profile - PCP) and the computation of the possible matching tonalities. The outcome is a vector reflecting the variation of the spectral contribution of each tonality throughout the entire piece. The song is time-windowed with non-overlapping windows, whose width is determined on the basis of the tempo induction algorithm.
The Pitch Class Profile is based on the contribution of the twelve semitone pitch classes to the whole spectrum. Fujishima [14] employed the PCPs as the main tool for chord recognition, while İzmirli [22] defined them as a 'Chroma Template' and used them for audio key finding. Gómez and Herrera [16] applied machine learning methods to the 'Harmonic Pitch Class Profile' to estimate the tonalities of polyphonic audio tracks.
The spectrum of the whole audio is analyzed, and the distribution of the
strengths of all the tones is evaluated. The different octaves are grouped to
measure the contribution of the 12 basic tones. A detailed description follows.
The signal spectrum, computed by the discrete Fourier transform, is simplified making use of the MIDI numbers as in [8].
The PCP is a 12-dimensional vector (from C to B) obtained by summing the spectral amplitudes for each tone, spanning the seven octaves (from C1 to B7, or 24 to 107 in MIDI numbers). That is, the first element of the PCP vector is the sum of the strengths of the pitches from tone C1 to tone C7, the second one from tone C#1 to tone C#7, and so on.
Each k-th element of the PCP vector, with $k \in \{1, 2, \ldots, 12\}$, is computed as follows:

$$PCP_t(k) = \sum_{i=1}^{7} X_s\big(k + (i-1)\cdot 12\big) \qquad (1)$$
where Xs is the simplified spectrum, the index k covers the twelve semitone
pitches and i is used to index each octave. The subscript t stands for the temporal
frame for which the PCP is computed.
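As a rough illustration (not the authors' code), the PCP of one frame could be computed from the simplified, MIDI-binned spectrum as follows; the argument midi_spectrum is assumed to hold X_s indexed by MIDI note numbers 24-107 (C1-B7).

import numpy as np

def pitch_class_profile(midi_spectrum):
    """Fold a MIDI-binned spectrum (notes 24..107, i.e. C1..B7) into the
    12-dimensional Pitch Class Profile of Eq. (1)."""
    # midi_spectrum[n] is assumed to hold the strength of MIDI note 24 + n
    pcp = np.zeros(12)
    for k in range(12):                 # semitone pitch classes C..B
        for i in range(7):              # the seven octaves C1..B7
            pcp[k] += midi_spectrum[k + 12 * i]
    return pcp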
In order to estimate the predominant tonality of a track, it is important to define a series of PCPs for all the possible tonalities, to be compared with its own PCP. The shape of the PCP mainly depends on the modality of the tonality (major or minor). Hence, by assembling only two global profiles, for the major and minor modes, and by shifting each of them twelve times according to the tonic pitch of the twelve possible tonalities of each mode, 24 tonality profiles are obtained.
Krumhansl [24] defined the profiles empirically, on the basis of a series of listening sessions carried out with a group of undergraduates from Harvard University, who had to evaluate the correspondence between test tracks and probe tones. The author presented two global profiles, one for the major and one for the minor mode, representing the global contribution of each tone to all the tonalities for each mode. More recently, Temperley [35] presented a modified, less biased version of the Krumhansl profiles. In this context, we propose a revised version of the Krumhansl profiles with the aim of avoiding any bias of the system toward a particular mode. Basically, the two mode profiles are normalized to show the same sum of values and, then, each profile is divided by its corresponding maximum.
For each windowed frame of the track, the squared Euclidean distance between the PCP of the frame and each tonality profile is computed to define a 24-element vector. Each element of the vector is the sum of the squared differences between the amplitudes of the PCP and the tonality profile. The squared distance is defined as follows:

$$D_t(k) = \begin{cases} \displaystyle\sum_{j=0}^{11} \big[P_M(j+1) - PCP_t\big((j+k-1) \bmod 12 + 1\big)\big]^2, & 1 \le k \le 12 \\[2ex] \displaystyle\sum_{j=0}^{11} \big[P_m(j+1) - PCP_t\big((j+k-1) \bmod 12 + 1\big)\big]^2, & 13 \le k \le 24 \end{cases} \qquad (2)$$

where $D_t(k)$ is the squared distance computed at time t for the k-th tonality, with $k \in \{1, 2, \ldots, 24\}$, and $P_M$/$P_m$ are, respectively, the major and minor profiles.
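A sketch of this frame-wise tonality scoring is given below; the arrays major_profile and minor_profile are placeholders for the normalized Krumhansl-derived profiles described above, and the rotation-based indexing simply restates Eq. (2) in 0-based form.

import numpy as np

def tonality_distances(pcp, major_profile, minor_profile):
    """Squared Euclidean distances between a frame PCP and the 24
    rotated tonality profiles (12 major + 12 minor), Eq. (2)."""
    d = np.zeros(24)
    for k in range(12):
        rotated = np.roll(pcp, -k)          # align the PCP with tonic k
        d[k] = np.sum((major_profile - rotated) ** 2)       # major keys
        d[12 + k] = np.sum((minor_profile - rotated) ** 2)  # minor keys
    return d

def predominant_tonality(pcp, major_profile, minor_profile):
    """Index (0-23) of the most likely tonality: argmin of the distances."""
    return int(np.argmin(tonality_distances(pcp, major_profile, minor_profile)))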
The predominant tonality of each frame corresponds to the minimum of the distance vector $D_t(k)$, where the index k, with k ∈ {1, ..., 12}, refers to the twelve major tonalities (from C to B) and k, with k ∈ {13, ..., 24}, refers to the twelve minor tonalities (from c to b). Usually, major and minor tonalities are represented with capital and lower-case letters, respectively.

Fig. 2. An example of the tonal behavior of the Beatles' song "I'll be back", where the main tonality is E major (normalized amplitude of each tonality, from C to B and c to b)
The empirical distribution of all the predominant tonalities, estimated through-
out the entire piece, is calculated in order to represent the tonality contributions
to the tonal content of the song. This is defined as the ‘tonal behavior’ of the com-
position. In Figure 2, an example of the distribution of the tonality contributions
for the Beatles’ song “I’ll be back” is shown.
Music similarity. The vectors describing the tonal behavior of the songs are employed to measure their reciprocal degree of similarity. In fact, the human brain is able to detect the main melodic pattern, even by means of subconscious processes, and its perception of musical similarity is partially based on it [24]. The tonal similarity between songs is computed as the Euclidean distance between the calculated tonal vectors, following the equation:
$$TS_{AB} = \| T_A - T_B \| \qquad (3)$$
where $TS_{AB}$ stands for the coefficient of tonal similarity between songs A and B, and $T_A$ and $T_B$ are the empirical tonality distributions for songs A and B, respectively.
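For completeness, a short sketch of Eq. (3) and of the profile normalization mentioned earlier (equal sums, then division by the maximum) is given below; the target sum used for rescaling is an arbitrary choice for the example.

import numpy as np

def normalize_profile(profile, target_sum=1.0):
    """Rescale a Krumhansl-style profile to a common sum, then divide by its maximum."""
    p = np.asarray(profile, dtype=float)
    p = p * (target_sum / p.sum())
    return p / p.max()

def tonal_similarity(t_a, t_b):
    """Eq. (3): Euclidean distance between the tonal behavior vectors of two songs."""
    return float(np.linalg.norm(np.asarray(t_a, dtype=float) - np.asarray(t_b, dtype=float)))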
A robust evaluation of the performance of the proposed music similarity method is very hard to achieve. The judgment of similarity among audio files is a very subjective issue, reflecting the complex reality of human perception. Nevertheless, a series of tests have been performed on some
predetermined lists of songs.
Four lists of 11 songs have been submitted to a group of ten listeners. They
were instructed to sort the songs according to their perceptual similarity and tonal
similarity. For each list, a reference song was defined and the remaining 10 songs
had to be sorted with respect to their degree of similarity with the reference one.
A series of 10-element lists were returned by the users, as well as by the automatic
method. Two kinds of experimental approaches were carried out: in the first ex-
periment, the users had to listen to the songs and sort them according to a global
perception of their degree of similarity. In the second framework, they were asked
to focus only on the tonal content. The latter was the hardest target to obtain,
because of the complexity of discerning the parameters to be taken into account
when listening to a song and evaluating its similarity with respect to other songs.
The degree of coherence between the manually sorted lists and the automatically produced ones was obtained. A weighted matching score was computed for each pair of lists: the reciprocal distance of the songs (in terms of their position index in the lists) was calculated. Such distances were linearly weighted, so that the first songs in the lists carried more importance than the last ones. In fact, it is easier to evaluate which is the most similar song among pieces that are similar to the reference one than to perform the same selection among very different songs. The weights help to compensate for this bias.
Let Lα and Lβ represent two different ordered lists of n songs, for the same
reference song. The matching score C has been computed as follows:
$$C = \sum_{i=1}^{n} |i - j| \cdot \omega_i \qquad (4)$$

where i and j are the indexes for lists $L_\alpha$ and $L_\beta$, respectively, such that j is the index of the j-th song in list $L_\beta$ with $L_\alpha(i) \equiv L_\beta(j)$. The absolute difference is linearly weighted by the weights $\omega_i$, normalized so as to sum to one, $\sum_{i=1}^{n} \omega_i = 1$. Finally, the scores are transformed so as to be expressed as a percentage of the maximum score attainable.
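One possible implementation of Eq. (4) is sketched below; the exact linearly decreasing weighting scheme and the use of the fully reversed ordering as the maximum attainable score are assumptions, since the paper only states that the weights are linear and normalized to sum to one.

import numpy as np

def matching_score(list_a, list_b):
    """Weighted matching score of Eq. (4) between two ordered lists of the
    same n songs, expressed as a percentage of the maximum attainable score."""
    n = len(list_a)
    # Linearly decreasing weights summing to one (assumed weighting scheme)
    w = np.arange(n, 0, -1, dtype=float)
    w /= w.sum()
    pos_b = {song: j for j, song in enumerate(list_b)}
    c = sum(abs(i - pos_b[song]) * w[i] for i, song in enumerate(list_a))
    # Maximum attainable score taken as the fully reversed ordering (assumption)
    c_max = sum(abs(i - (n - 1 - i)) * w[i] for i in range(n))
    return 100.0 * c / c_max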
The efficiency of the automatic method was evaluated by measuring its coherence with the users' responses. The closer the two sets of values, the better the performance of the automatic method. As expected, the evaluation of the automatic method in the first experimental framework did not return reliable results because of the extreme deviation of the marks, due to the limited relevance of the tone distribution in the subjective judgment of the song. As mentioned before, the tonal behavior of the song is only one of the parameters taken into account subconsciously by the human ear. Nevertheless, when the users were asked to evaluate the same songs only by their tonal content, the scores drastically decreased, revealing the extreme lack of abstraction of the human ear. In Table 2, the results for both experimental frameworks are shown.
The differences between the results of the two experiments are evident. Concerning the first experiment, the mean correspondence score is 74.2% among the users' lists and 60.1% among the users' lists and the automatic list. That is, the automatic method poorly reproduces the choices made by the users when a global evaluation of music similarity is taken into account. Conversely, in the second experiment, better results were obtained. The mean correspondence score for the users' lists decreases to 61.1%, approaching the value returned by the users' and automatic lists together, 59.8%. The performance of the system can be considered similar to the behavior of an average human user regarding the perception of tonal similarities.
Table 2. Means and standard deviations of the correspondence scores obtained by computing equation (4). The rows 'Auto+Users' and 'Users' refer to the correspondence scores computed among the users' lists together with the automatic list and among only the users' lists, respectively. 'Experiment 1' is done by listening to and sorting the songs on the basis of a global perception of the track, while 'Experiment 2' is performed trying to take into account only the tone distributions.

                            Experiment 1          Experiment 2
Lists     Method          Mean    St. Dev.      Mean    St. Dev.
List A    Auto+Users      67.6      7.1         66.6      8.8
          Users           72.3     13.2         57.9     11.5
List B    Auto+Users      63.6      1.9         66.3      8.8
          Users           81.8      9.6         66.0     10.5
List C    Auto+Users      61.5      4.9         55.6     10.2
          Users           77.2      8.2         57.1     12.6
List D    Auto+Users      47.8      8.6         51.0      9.3
          Users           65.7     15.4         63.4     14.4
Means     Auto+Users      60.1      5.6         59.8      9.2
          Users           74.2     11.6         61.1     12.5

Fig. 3. A snapshot of some of the main windows of the interface of the OMR system

3 Cultural Heritage Preservation and Diffusion


Another important use of music signal processing is the preservation and diffusion of the music heritage. In this sense, we have paid special attention to the
musical heritage preserved in the Archivo de la Catedral de Málaga, where handwritten musical scores of the 17th and early 18th centuries written in white mensural notation are kept. The aim of the tools we have developed is to give new life to that music, making it easier for people to get to know the music of that time. Therefore, in this section, the OMR system we have developed is presented.

3.1 A Prototype for an OMR System


OMR (Optical Music Recognition) systems are essentially based on the conversion of a digitized music score into an electronic format. The computer must 'read' the document (in this case a manuscript), 'interpret' it and transcribe its content (notes, time information, execution symbols, etc.) into an electronic format. This task can be addressed to recover important ancient documents and to improve their availability to the music community.
In the OMR framework, the recognition of ancient handwritten scores is a real challenge. The manuscripts are often in a very poor state of conservation, due to their age and preservation conditions. The handwritten symbols are not uniform, and additional symbols may have been added manually a posteriori by other authors. The digital acquisition of the scores and the lighting conditions during exposure can cause inconsistencies in the background of the image. All these conditions make the development of an efficient OMR system a very hard task. Although the system workflow can be generalized, the specific algorithms cannot be blindly used for different authors but have to be trained for each case.
We have developed a complete OMR system [34] for two styles of writing scores in white mensural notation. Figure 3 shows a snapshot of its graphical user interface. In the main window, a series of tools are supplied to follow a complete workflow based on a number of steps: the pre-processing of the image, the partition of the score into single staves and the processing of the staves with the extraction, classification and transcription of the musical neums. Each tool corresponds to an individual window that allows the user to interact with it to complete the stage.
The preprocessing of the image, aimed at feeding the system with the cleanest black and white image of the score, is divided into the following tasks: the clipping of the region of interest of the image [9], the automatic blanking of red frontispieces, the conversion from RGB to grayscale, the compensation of the lighting conditions, the binarization of the image [17] and the correction of image tilt [36]. After partitioning the score into single staves, the staff lines are tracked and blanked and the symbols are extracted and classified. In particular, a series of multidimensional feature vectors are computed on the geometrical extent of the symbols and a series of corresponding classifiers are employed to relate the neums to their corresponding musical symbols. At any moment, the interface allows the user to carefully follow each processing stage. A generic sketch of such a preprocessing chain is given below.
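Purely as an illustration of such a preprocessing chain (the authors' specific algorithms are those of [34], [9], [17], [36]), a generic OpenCV sketch could look as follows; the red-ink blanking rule and the omission of the tilt-correction step are simplifying assumptions.

import cv2
import numpy as np

def preprocess_score(path):
    """Generic score preprocessing: blank reddish ink, convert to grayscale,
    compensate uneven lighting, and binarize (tilt correction omitted here)."""
    img = cv2.imread(path)                          # BGR color photograph
    # Blank red frontispieces: push strongly red pixels to white (simplified rule)
    bgr = img.astype(np.int16)
    red_mask = (bgr[:, :, 2] - np.maximum(bgr[:, :, 0], bgr[:, :, 1])) > 40
    img[red_mask] = 255
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Lighting compensation: divide by a heavily blurred background estimate
    background = cv2.GaussianBlur(gray, (51, 51), 0)
    norm = cv2.divide(gray, background, scale=255)
    # Binarization (Otsu threshold); symbols become black on a white background
    _, binary = cv2.threshold(norm, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary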
4 Tools for Technology-Enhanced Learning and Revolutionary Learning Appliances
Music signal processing tools also make possible the development of new interactive methods for music learning using a computer or a toy. In this sense, we have developed a number of specialized tools to help learn how to play the piano, the violin and the guitar. These tools are presented in sections 4.1, 4.2 and 4.3, respectively. It is worth mentioning that, for the development of these tools, the very particular characteristics of each instrument have been taken into account. In fact, the people who developed these tools are able to play the instruments. This has helped make the tools especially useful because, during development, we observed the main difficulties of each instrument. Finally, thinking about developing toys or other small embedded systems with musical intelligence, in subsection 4.4 we present a lightweight pitch detector designed for this purpose.

4.1 Tool for Piano Learning


The piano is a musical instrument that is widely used in all kinds of music and as an aid to composition due to its versatility and ubiquity. This instrument is played by means of a keyboard and allows very rich polyphonic sounds to be produced. Piano learning involves several difficulties that come from the instrument's ability to generate sound with a high polyphony number. These difficulties are easily observed when musical skills are limited or when trying to transcribe the piano's sound when it is used for composition. Therefore, it is useful to have a system that determines the notes sounding on a piano in each time frame and represents them in a simple form that can be easily understood; this is the aim of the tool presented here. The core of the designed tool is a polyphonic transcription system able to detect the played notes using spectral patterns of the piano notes [6], [4].
The approach used in the proposed tool to perform the polyphonic transcription is rather different from the proposals that can be found in the literature [23]. In our case, the audio signal to be analyzed is considered to have certain similarities to code division multiple access (CDMA) communications signals. Our model considers the spectral patterns of the different piano notes [4]. Therefore, in order to detect the notes that sound during each time frame, we have considered a suitable modification of a CDMA multiuser detection technique to cope with the polyphonic nature of piano music and with the different energies of the piano notes, in the same way as an advanced CDMA receiver [5].
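The actual CDMA-inspired multiuser detector is described in [4], [5]; purely to illustrate the underlying idea of matching a frame spectrum against per-note spectral patterns with successive cancellation, a naive sketch could look as follows (the greedy loop, threshold and variable names are assumptions, not the authors' algorithm).

import numpy as np

def detect_notes(frame_spectrum, note_patterns, max_notes=6, threshold=0.1):
    """Greedy detection of piano notes in one frame: repeatedly pick the
    note pattern best correlated with the residual spectrum and subtract
    its contribution (successive cancellation, illustrative only).
    note_patterns: array of shape (n_notes, n_bins), rows with unit norm."""
    residual = np.asarray(frame_spectrum, dtype=float).copy()
    detected = []
    for _ in range(max_notes):
        scores = note_patterns @ residual       # correlation with each pattern
        best = int(np.argmax(scores))
        gain = scores[best]
        if gain < threshold * np.linalg.norm(frame_spectrum):
            break                               # remaining energy too weak
        detected.append(best)
        residual = np.maximum(residual - gain * note_patterns[best], 0.0)
    return detected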
A snapshot of the main windows of the interface is presented in Figure 4. The
designed graphical user interface is divided into three parts:
– The management items of the tool are three main buttons: one button to acquire the piano music to analyze, another button to start the system and a final button to reset the system.
– The time and frequency response of each frame of piano sound under analysis
are in the middle part of the window.
– A piano keyboard in which the played notes are highlighted, as well as the names of the played notes, is shown at the bottom.

Fig. 4. A snapshot of the main windows of the interface of the tool for piano learning

4.2 Tool for Violin Learning


The violin is one of the most complex instruments, often used by children for their first approach to music learning. The main characteristic of the violin is its great expressiveness, due to the wide range of interpretation techniques. The system we have developed is able to detect not only the played pitch, as other transcription systems do [11], but also the technique employed [8]. The signal envelope and the frequency spectrum are considered in the time and frequency domains, respectively. The descriptors employed for the detection system have been computed by analyzing a large number of violin recordings, from the Musical Instrument Sound Data Base RWC-MDB-1-2001-W05 [18] and other home-made recordings. Different playing techniques have been performed in order to train the system for its expressiveness capability. The graphical interface is aimed at facilitating violin learning for any user.
For the signal processing tool, a graphical interface has been developed. The
main window presents two options for the user, the theory section (Teorı́a) and
the practical section (Práctica). In the section Teorı́a the user is encouraged to
learn all the concepts about the violin’s history, the violin’s parts and the play-
ing posture (left and right hand), while the section Práctica is mainly based on
an expressiveness transcription system [8]. Here, the user starts with the ‘basic
study' sub-section, where the main violin positions are presented, illustrating the placement of the left hand on the fingerboard, with the aim of attaining good intonation. Hence, the user can record the melody corresponding to the

selected position and ask the application to correct it, returning the errors made.
Otherwise, in the ‘free practice’ sub-section, any kind of violin recording can be
analyzed for its melodic content, detecting the pitch, the duration of the notes
and the techniques employed (e.g.:détaché with and without vibrato, pizzicato,
tremolo, spiccato, flageolett-töne). The user can also visualize the envelope and
the spectrum of the each note and listen to the MIDI transcription generated.
In Figure 5, some snapshots of the interface are shown. The overall performance
attained by our system in the detection and correction of the notes and expres-
siveness is 95.4%.

4.3 Tool for String and Fret Estimation in Guitar

The guitar is one of the most popular musical instruments nowadays. In contrast to other instruments such as the piano, on the guitar the same note can be played by plucking different strings at different positions. Therefore, the algorithms used for piano transcription [10] cannot be used for the guitar. In guitar transcription it is important to estimate the string used to play a note [7].

Fig. 5. Three snapshots of the interface for violin learning are shown. Clockwise from
top left: the main window, the analysis window and a plot of the MIDI melody.
Fig. 6. Graphical interface of the tool for guitar learning: (a) main window; (b) single note estimation with tuner
The system presented in this demonstration is able to estimate, with a very low error probability, the string and the fret of a single played note. In order to keep a low error probability when a chord is strummed on the guitar, the system chooses the chord most likely to have been played from a predefined list. The system works
with classical guitars as well as acoustic or electric guitars. The sound has to
be captured with a microphone connected to the computer soundcard. It is also
possible to plug a cable from an electric guitar to the sound card directly.
The graphical interface consists of a main window (Figure 6(a)) with a pop-up
menu where you can choose the type of guitar you want to use with the interface.
The main window includes a panel (Estimación) with three push buttons, where
you can choose between three estimation modes:

– The mode Nota única (Figure 6(b)) estimates the string and fret of a single note that is being played and includes a tuner (afinador).
– The mode Acorde predeterminado estimates strummed chords that are being
played. The system estimates the chord by choosing the most likely one from
a predefined list.
– The last mode, Acorde libre, makes a free estimation of what is being played.
In this mode the system does not have the information of how many notes
are being played, so this piece of information is also estimated.

Each mode includes a window that shows the microphone input, a window with the Fourier transform of the sound sample, a start button, a stop button and an exit button (Salir). At the bottom of the screen there is a panel that represents
a guitar. Each row stands for a string on the guitar and the frets are numbered
from one to twelve. The current estimation of the sound sample, either note or
chord, is shown on the panel with a red dot.

4.4 Lightweight Pitch Detector for Embedded Systems Using Neural Networks

Pitch detection could be defined as the act of listening to a melody and writing down the music notation of the piece, that is, deciding which notes were played [28]. Basically, this is a pattern recognition problem over time, where each pattern corresponds to features characterizing a musical note (e.g., the fundamental frequency). Nowadays, there exists a wide range of applications for pitch detection: educational applications, music-retrieval systems, automatic music analysis systems, music games, etc. The main problem of pitch detection systems is the computational complexity required, especially if they are polyphonic [23]. Artificial intelligence techniques often provide an efficient and lightweight alternative for classification and recognition tasks. These techniques can be used, in some cases, to avoid other processing algorithms, reducing the computational complexity, speeding up or improving the efficiency of the system [3], [33]. This is the case in audio processing and music transcription.
When only a small amount of memory and processing power are available,
FFT-based detection techniques can be too costly to implement. In this case,
artificial intelligence techniques, such as neural networks sized to be implemented in a small system, can provide the necessary accuracy. There are two alternatives [3], [33] in neural networks. The first one is unsupervised training. This is the case of some networks that have been specially designed for pattern classification, such as self-organizing maps. However, the computational complexity of this implementation is too high for a low-cost microcontroller. The other alternative is supervised training neural networks. This is the case of perceptron-type networks. In these networks, the synaptic weights connecting each neuron are modified as each new training vector is presented. Once the network is trained, the weights can be statically stored to classify new network inputs. The training algorithm can be run on a different machine from the one where the network propagation algorithm is executed. Hence the only limitation comes from the available memory.
In the proposed system, we focus on the design of a lightweight pitch detector for embedded systems based on a neural network in which the signal preprocessing is a frequency analysis. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm [15] is the selected technique for the frequency analysis because it is a light alternative to FFT computation if we are only interested in some of the spectral points.

Fig. 7. Block diagram of the pitch detector for an embedded system

Figure 7 shows the block diagram of the detection system. This figure shows
the hardware connected to the microcontroller’s A/D input, which consists of a
preamplifier in order to accommodate the input from the electret microphone
into the A/D input range and an anti-aliasing filter. The anti-aliasing filter pro-
vides 58 dB of attenuation at cutoff, which is enough to ensure the anti-aliasing
function. After the filter, the internal A/D converter of the microcontroller is
used. After conversion, a buffer memory is required in order to store enough sam-
ples for the preprocessing block. The output of the preprocessing block is used for
pitch detection using a neural network. Finally, an I2C (Inter-Integrated Circuit)
[32] interface is used for connecting the microcontroller with other boards.
We use the open source Arduino environment [1] with the AVR ATMEGA168
microcontroller [2] for development and testing of the pitch detection implemen-
tation. The system will be configured to detect the notes between A3 (220Hz)
and G#5 (830.6Hz), following the well-tempered scale, as it is the system mainly
used in Western music. This range of notes has been selected because one of the
applications of the proposed system is the detection of vocal music of children
and adolescents.
The aim of the preprocessing stage is to transform the samples of the audio signal from the time domain to the frequency domain. The Goertzel algorithm [15], [30] is a light alternative to FFT computation when the interest is focused only on some of the spectrum points, as in this case. Given the frequency range of the musical notes in which the system is going to work, along with the sampling restrictions of the selected processor, the selected sampling frequency is fs = 4 kHz and the number of input samples is N = 400, which gives a frequency resolution of 10 Hz, sufficient for the pitch detection system. On the other hand, in the preprocessing block, the number of frequencies at which the Goertzel algorithm is computed is 50, given by $f_p = 440 \cdot 2^{p/12}$ Hz with $p = -24, -23, \ldots, 0, \ldots, 24, 25$, so that each note in the range of interest has at least one harmonic and one subharmonic, to improve the detection performance of notes with an octave or perfect fifth relation. Finally, the output of the preprocessing stage is a vector that contains the squared modulus of the 50 points of interest of the Goertzel algorithm: the points of the power spectrum of the input audio signal at the frequencies of interest.
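A compact sketch of this preprocessing stage is shown below in floating-point Python for clarity (the embedded implementation uses fixed-point arithmetic on the microcontroller); the Goertzel recursion is standard, and the framing into a single 400-sample buffer follows the parameters quoted above.

import numpy as np

FS = 4000                 # sampling frequency (Hz)
N = 400                   # samples per analysis buffer (10 Hz resolution)
# 50 analysis frequencies: 440 * 2^(p/12) Hz for p = -24 .. 25
FREQS = [440.0 * 2.0 ** (p / 12.0) for p in range(-24, 26)]

def goertzel_power(x, f, fs=FS):
    """Squared magnitude of one spectral point at frequency f (Goertzel recursion)."""
    w = 2.0 * np.pi * f / fs
    coeff = 2.0 * np.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def preprocess(buffer_400):
    """Feature vector fed to the neural network: power at the 50 frequencies."""
    return np.array([goertzel_power(buffer_400, f) for f in FREQS])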
For the algorithm implemented using fixed-point arithmetic, the execution time is less than 3 ms on a 16 MIPS AVR microcontroller. The number of points of the Goertzel algorithm is limited by the available memory. Eq. 5 gives the number of bytes required to implement the algorithm.
$$n_{bytes} = 2\left(\frac{N}{4} + 2N + m\right) \qquad (5)$$
In this expression, m represents the number of desired frequency points. Thus, with m = 50 points and N = 400, the algorithm requires 1900 bytes of RAM memory for signal input/processing/output buffering. Since the microcontroller has 1024 bytes of RAM memory, it is necessary to use an external high-speed SPI RAM memory in order to have enough memory for buffering audio samples.
Once the Goertzel algorithm has been performed and the points are stored in the RAM memory, a recognition algorithm has to be executed for pitch detection. A useful alternative to spectral processing techniques consists of using artificial intelligence techniques. We use a statically trained neural network, storing the network weight vectors in an EEPROM memory. Thus, the network training is performed on a computer implementing the same algorithm, and the embedded system only runs the network. Figure 8 depicts the structure of the neural network used for pitch recognition. It is a multilayer feed-forward perceptron with a back-propagation training algorithm.
In our approach, sigmoidal activation has been used for each neuron, with no neuron bias. This provides a fuzzy set of values, y_j, at the output of each neural layer. The fuzzy set is controlled by the shape factor, α, of the sigmoid function, which is set to 0.8, and it is applied to a threshold-based decision function. Hence, outputs below 0.5 do not activate output neurons, while values above 0.5 do.
Fig. 8. Neural network structure for note identification in an embedded system

Fig. 9. Learning test, validation test and ideal output of the designed neural network

The neural network parameters, such as the number of neurons in the hidden layer or the shape factor of the sigmoid function, have been determined experimentally. The neural network has been trained
by running the BPN (Back Propagation Neural Network) on a PC. Once the
network convergence is achieved, the weight vectors are stored. Regarding the
output layer of the neural network, we use five neurons to encode 24 different
outputs corresponding to each note in two octaves (A3 − G#5 notation).
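For illustration, the run-time forward pass of such a network could be sketched as follows; the hidden-layer size is left as a parameter, and the 5-bit binary output coding and note ordering are assumptions for the example (the paper states only that five output neurons encode the 24 notes), with the weights coming from the off-line training stage.

import numpy as np

ALPHA = 0.8   # sigmoid shape factor used in the text

def sigmoid(z, alpha=ALPHA):
    return 1.0 / (1.0 + np.exp(-alpha * z))

def forward(features_50, w_hidden, w_output):
    """Bias-free forward pass: 50 Goertzel powers in, 5 thresholded outputs out.
    w_hidden has shape (n_hidden, 50); w_output has shape (5, n_hidden)."""
    h = sigmoid(w_hidden @ np.asarray(features_50, dtype=float))
    y = sigmoid(w_output @ h)
    return (y > 0.5).astype(int)        # outputs above 0.5 activate the neuron

def decode_note(bits):
    """Assumed 5-bit binary coding of the 24 notes A3..G#5 (placeholder mapping)."""
    index = int("".join(str(int(b)) for b in bits), 2)
    names = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]
    notes, octave = [], 3
    for i in range(24):                 # A3, A#3, B3, C4, ..., G#5
        if names[i % 12] == "C":
            octave += 1
        notes.append(names[i % 12] + str(octave))
    return notes[index] if index < len(notes) else None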
The training and the evaluation of the proposed system have been done using
independent note samples taken from the Musical Instrument Data Base RWC
[19]. The selected instruments have been piano and human voice. The training
of the neural network has been performed using 27 samples for each note in the
range of interest. Thus, we used 648 input vectors to train the network. This
way, the network convergence was achieved with an error of 0.5%.
In Figure 9, we show the learning characteristic of the network when simulating
the network with the training vectors. At the same time, we show the validation
test using 96 input vectors (4 per note), which correspond to about 15% of new
inputs. As shown in Figure 9, the inputs are correctly classified due to the small
difference among the outputs for the ideal, learning and validation inputs.

5 Conclusions

Nowadays, it is a requirement that all types of information be widely available in digital form in digital libraries, together with intelligent techniques for the creation and management of the digital information; thus, contents will be plentiful, open, interactive and reusable. It becomes necessary to link contents, knowledge and learning in such a way that information will be produced, stored, handled, transmitted and preserved to ensure long-term accessibility to everyone, regardless of the special requirements of certain communities (e-inclusion). Among the many different types of information, music happens to be one of the most widely demanded, due to its cultural interest, its use for entertainment, or even therapeutic reasons.
Throughout this paper, we have presented several applications of music signal processing techniques. It is clear that the use of such tools can be very enriching from several points of view: Music Content Management, Cultural Heritage Preservation and Diffusion, Tools for Technology-Enhanced Learning and Revolutionary Learning Appliances, etc. Now that we have technology at our side at every moment (mobile phones, e-books, computers, etc.), all the tools we have developed can be easily used. There are still a lot of open issues and things that should be improved, but more and more, technology helps music.

Acknowledgments

This work has been funded by the Ministerio de Ciencia e Innovación of the Span-
ish Government under Project No. TIN2010-21089-C03-02, by the Junta de
Andalucı́a under Project No. P07-TIC-02783 and by the Ministerio de Industria,
Turismo y Comercio of the Spanish Government under Project No. TSI-020501-
2008-117. The authors are grateful to the person in charge of the Archivo de la
Catedral de Málaga, who allowed the utilization of the data sets used in this work.

References
1. Arduino board, http://www.arduino.cc (last viewed February 2011)
2. Atmel Corporation website, http://www.atmel.com (last viewed February 2011)
3. Aliev, R.: Soft Computing and its Applications. World Scientific Publishing Com-
pany, Singapore (2001)
4. Barbancho, A.M., Barbancho, I., Fernandez, J., Tardón, L.J.: Polyphony number
estimator for piano recordings using different spectral patterns. In: 128th Audio
Engineering Society Convention (AES 2010), London, UK (2010)
5. Barbancho, A.M., Tardón, L., Barbancho, I.: CDMA systems physical function level
simulation. In: IASTED International Conference on Advances in Communication,
Rodas, Greece (2001)
6. Barbancho, A.M., Tardón, L.J., Barbancho, I.: PIC detector for piano chords.
EURASIP Journal on Advances in Signal Processing (2010)
7. Barbancho, I., Tardón, L.J., Barbancho, A.M., Sammartino, S.: Pitch and played
string estimation in classic and acoustic guitars. In: Proc. of the 126th Audio
Engineering Society Convention (AES 126th), Munich, Germany (May 2009)
8. Barbancho, I., Bandera, C., Barbancho, A.M., Tardón, L.J.: Transcription and
expressiveness detection system for violin music. In: IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), Taipei, Taiwan, pp. 189–192 (2009)
9. Barbancho, I., Segura, C., Tardón, L.J., Barbancho, A.M.: Automatic selection of
the region of interest in ancient scores. In: IEEE Mediterranean Electrotechnical
Conference (MELECON 2010), Valletta, Malta (May 2010)
10. Bello, J.: Automatic piano transcription using frequency and time-domain informa-
tion. IEEE Transactions on Audio, Speech and Language Processing 14(6), 2242–
2251 (2006)
11. Boo, W., Wang, Y., Loscos, A.: A violin music transcriber for personalized learning.
In: IEEE Int. Conf. on Multimedia and Expo (ICME), Toronto, Canada, pp. 2081–
2084 (2006)
12. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: Proceedings of the International Conference on Music Information
Retrieval (ISMIR 2003), October 26-30, pp. 159–165. John Hopkins University,
Baltimore, USA (2003)
13. Ellis, D.P.W., Cotton, C.V., Mandel, M.I.: Cross-correlation of beat-synchronous
representations for music similarity. In: Proceedings of the IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, USA, pp.
57–60 (2008), http://mr-pc.org/work/icassp08.pdf (last viewed February 2011)
14. Fujishima, T.: Realtime chord recognition of musical sound: a system using com-
mon lisp music. In: Proc. International Computer Music Association, ICMC 1999,
pp. 464–467 (1999), http://ci.nii.ac.jp/naid/10013545881/en/ (last viewed,
February 2011)
15. Goertzel, G.: An algorithm for the evaluation of finite trigonometric series. The
American Mathematical Monthly 65(1), 34–35 (1958)
16. Gómez, E., Herrera, P.: Estimating the tonality of polyphonic audio files: Cognitive
versus machine learning modelling strategies. In: Proc. Music Information Retrieval
Conference (ISMIR 2004), Barcelona, Spain, pp. 92–95 (2004)
17. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall
Inc., Upper Saddle River (2006)
18. Goto, M.: Development of the RWC music database. In: 18th Int. Congress on
Acoustics., pp. I-553–I-556 (2004)
19. Goto, M.: Development of the RWC music database. In: Proc. of the 18th Interna-
tional Congress on Acoustics ICA 2004, Kyoto, Japan, pp. 553–556 (April 2004)
20. Gouyon, F.: A computational approach to rhythm description — Audio features
for the computation of rhythm periodicity functions and their use in tempo in-
duction and music content processing. Ph.D. thesis, Ph.D. Dissertation. UPF
(2005), http://www.mtg.upf.edu/files/publications/9d0455-PhD-Gouyon.pdf
(last viewed February 2011)
21. Holzapfel, A., Stylianou, Y.: Rhythmic similarity of music based on dynamic pe-
riodicity warping. In: IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2008, Las Vegas, USA, March 31- April 4, pp. 2217–
2220 (2008)
22. Izmirli, Ö.: Audio key finding using low-dimensional spaces. In: Proc. Music Infor-
mation Retrieval Conference, ISMIR 2006, Victoria, Canada, pp. 127–132 (2006)
23. Klapuri, A.: Automatic music transcription as we know it today. Journal of New
Music Research 33(3), 269–282 (2004)
24. Krumhansl, C.L., Kessler, E.J.: Tracing the dynamic changes in perceived tonal
organization in a spatial representation of musical keys. Psychological Review 89,
334–368 (1982)
25. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Individualization of mu-
sic similarity perception via feature subset selection. In: Proc. Int. Conference on
Systems, Man and Cybernetics, Massachusetts, USA, vol. 1, pp. 552–556 (2004)
26. Logan, B., Salomon, A.: A music similarity function based on signal analysis.
In: IEEE International Conference on Multimedia and Expo., ICME 2001,Tokyo,
Japan, pp. 745–748 (August 2001)
27. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proc. Music
Information Retrieval Conference(ISMIR 2000) (2000)
28. Marolt, M.: A connectionist approach to automatic transcription of polyphonic
piano music. IEEE Transactions on Multimedia 6(3), 439–449 (2004)
29. Ockelford, A.: On Similarity, Derivation and the Cognition of Musical Structure.
Psychology of Music 32(1), 23–74 (2004), http://pom.sagepub.com/cgi/content/
abstract/32/1/23 (last viewed February 2011)
30. Oppenheim, A., Schafer, R.: Discrete-Time Signal Processing. Prentice-Hall, En-
glewood Cliffs (1989)
31. Pampalk, E.: Islands of music - analysis, organization, and visualization of music
archives. Vienna University of Technology, Tech. rep. (2001)
32. Philips: The I2C bus specification v.2.1 (2000), http://www.nxp.com (last viewed
February 2011)
33. Prasad, B., Mahadeva, S.: Speech, Audio, Image and Biomedical Signal Processing
using Neural Networks. Springer, Heidelberg (2004)
34. Tardón, L.J., Sammartino, S., Barbancho, I., Gómez, V., Oliver, A.J.: Optical
music recognition for scores written in white mensural notation. EURASIP Jour-
nal on Image and Video Processing 2009, Article ID 843401, 23 pages (2009),
doi:10.1155/2009/843401
35. Temperley, D.: The Cognition of Basic Musical Structures. The MIT Press, Cam-
bridge (2004)
36. William, W.K.P.: Digital image processing, 2nd edn. John Wiley & Sons Inc., New
York (1991)
Speech/Music Discrimination in Audio Podcast
Using Structural Segmentation and Timbre
Recognition

Mathieu Barthet, Steven Hargreaves, and Mark Sandler

Centre for Digital Music, Queen Mary University of London, Mile End Road, London
E1 4NS, United Kingdom
{mathieu.barthet,steven.hargreaves,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic/
(Correspondence should be addressed to Mathieu Barthet.)
Abstract. We propose two speech/music discrimination methods using
timbre models and measure their performances on a 3 hour long database
of radio podcasts from the BBC. In the first method, the machine es-
timated classifications obtained with an automatic timbre recognition
(ATR) model are post-processed using median filtering. The classifica-
tion system (LSF/K-means) was trained using two different taxonomic
levels, a high-level one (speech, music), and a lower-level one (male and
female speech, classical, jazz, rock & pop). The second method combines
automatic structural segmentation and timbre recognition (ASS/ATR).
The ASS evaluates the similarity between feature distributions (MFCC,
RMS) using HMM and soft K-means algorithms. Both methods were
evaluated at the semantic (relative correct overlap, RCO) and temporal
(boundary retrieval F-measure) levels. The ASS/ATR method obtained
the best results (average RCO of 94.5% and boundary F-measure of
50.1%). These performances compared favourably with those ob-
tained by an SVM-based technique providing a good benchmark of the
state of the art.

Keywords: Speech/Music Discrimination, Audio Podcast, Timbre
Recognition, Structural Segmentation, Line Spectral Frequencies, K-means
clustering, Mel-Frequency Cepstral Coefficients, Hidden Markov Models.

1 Introduction

Increasing amounts of broadcast material are being made available in the pod-
cast format which is defined in reference [52] as a “digital audio or video file
that is episodic; downloadable; programme-driven, mainly with a host and/or
theme; and convenient, usually via an automated feed with computer software”
(the word podcast comes from the contraction of webcast, a digital media file
distributed over the Internet using streaming technology, and iPod, the portable
media player by Apple). New technologies have indeed emerged allowing users
to access audio podcast material either online (on radio websites such as the
one from the BBC used in this study: http://www.bbc.co.uk/podcasts), or
offline, after downloading the content on personal computers or mobile devices
using dedicated services. A drawback of the podcast format, however, is its lack of
indexes for individual songs and sections, such as speech. This makes navigation
through podcasts a difficult, manual process, and software built on top of auto-
mated podcast segmentation methods would therefore be of considerable help
for end-users. Automatic segmentation of podcasts is a challenging task in speech
processing and music information retrieval since the nature of the content from
which they are composed is very broad. A non-exhaustive list of the types of content
commonly found in podcasts includes: spoken parts of various types depending
on the characteristics of the speakers (language, gender, number, etc.) and the
recording conditions (reverberation, telephonic transmission, etc.), music tracks
often belonging to disparate musical genres (classical, rock, jazz, pop, electro,
etc.) and which may include a predominant singing voice (source of confusion
since the latter intrinsically shares properties with the spoken voice), jingles and
commercials which are usually complex sound mixtures including voice, music,
and sound effects. One step of the process of automatically segmenting and an-
notating podcasts therefore is to segregate sections of speech from sections of
music. In this study, we propose two computational models for speech/music
discrimination based on structural segmentation and/or timbre recognition and
evaluate their performances in the classification of audio podcast content. In
addition to their use with audio broadcast material (e.g. music shows, inter-
views) as assessed in this article, speech/music discrimination models may also
be of interest to enhance navigation into archival sound recordings that con-
tain both spoken word and music (e.g. ethnomusicology interviews available on
the online sound archive from the British Library: https://sounds.bl.uk/). While
speech/music discrimination models find a direct application in automatic audio
indexation, they may also be used as a preprocessing stage to enhance numerous
speech processing and music information retrieval tasks such as speech and mu-
sic coding, automatic speaker recognition (ASR), chord recognition, or musical
instrument recognition.
The speech/music discrimination methods proposed in this study rely on
timbre models (based on various features such as the line spectral frequen-
cies [LSF], and the mel-frequency cepstral coefficients [MFCC]), and machine
learning techniques (K-means clustering and hidden Markov models [HMM]).
The first proposed method comprises an automatic timbre recognition (ATR)
stage using the model proposed in [7] and [16] trained here with speech and
music content. The results of the timbre recognition system are then post-
processed using a median filter to minimize the undesired inter-class switches.
The second method utilizes the automatic structural segmentation (ASS) model
proposed in [35] to divide the signal into a set of segments which are homoge-
neous with respect to timbre before applying the timbre recognition procedure.
A database of classical music, jazz, and popular music podcasts from the BBC
was manually annotated for training and testing purposes (approximately 2.5
hours of speech and music). The methods were both evaluated at the semantic
level to measure the accuracy of the machine estimated classifications, and at the
temporal level to measure the accuracy of the machine estimated boundaries be-
tween speech and music sections. Whilst studies on speech/music discrimination
techniques usually provide the first type of evaluation (classification accuracy),
boundary retrieval performances are not reported to our knowledge, despite their
interest. The results of the proposed methods were also compared with those ob-
tained with a state-of-the-art’s speech/music discrimination algorithm based on
support vector machine (SVM) [44].
The remainder of the article is organized as follows. In section 2, a review
of related works on speech/music discrimination is proposed. In section 3, we give
a brief overview of timbre research in psychoacoustics, speech processing and
music information retrieval, and then describe the architecture of the proposed
timbre-based methods. Section 4 details the protocols and databases used in the
experiments, and specifies the measures used to evaluate the algorithms. The
results of the experiments are given and discussed in section 5. Finally, section
6 is devoted to the summary and the conclusions of this work.

2 Related Work

Speech/music discrimination is a special case of audio content classification
reduced to two classes. Most audio content classification methods are based on the
following stages: (i) the extraction of (psycho)acoustical variables aimed at char-
acterizing the classes to be discriminated (these variables are commonly referred
to as descriptors or features), (ii) a feature selection stage in order to further im-
prove the performances of the classifier, that can be either done a priori based on
some heuristics on the disparities between the classes to discern, or a posteriori
using an automated selection technique, and (iii) a classification system relying
either on generative methods modeling the distributions in the feature space,
or discriminative methods which determine the boundaries between classes. The
seminal works on speech/music discrimination by Saunders [46], and Scheirer
and Slaney [48], developed descriptors quantifying various acoustical specifici-
ties of speech and music which were then widely used in the studies on the same
subject. In [46], Saunders proposed five features suitable for speech/music dis-
crimination and whose quick computation in the time domain directly from the
waveform allowed for a real-time implementation of the algorithm; four of them
are based on the zero-crossing rate (ZCR) measure (a correlate of the spectral
centroid or center of mass of the power spectral distribution that characterizes
the dominant frequency in the signal [33]), and the other was an energy contour
(or envelope) dip measure (number of energy minima below a threshold defined
relatively to the peak energy in the analyzed segment). The zero crossing rates
were computed on a short-term basis (frame-by-frame) and then integrated on a
longer-term basis with measures of the skewness of their distribution (standard
deviation of the derivative, the third central moment about the mean, number
of zero crossings exceeding a threshold, and the difference of the zero crossing
samples above and below the mean). When both the ZCR and energy-based
features were used jointly with a supervised machine learning technique relying
on a multivariate-Gaussian classifier, a 98% accuracy was obtained on average
(speech and music) using 2.4 s-long audio segments. The good performance of
the algorithm can be explained by the fact that the zero-crossing rate is a good
candidate to discern unvoiced speech (fricatives) with a modulated noise spec-
trum (relatively high ZCR) from voiced speech (vowels) with a quasi-harmonic
spectrum (relatively low ZCR): speech signals whose characteristic structure is
a succession of syllables made of short periods of fricatives and long periods of
vowels present a marked rise in the ZCR during the periods of fricativity, which
do not appear in music signals, which are largely tonal (this however depends on
the musical genre which is considered). Secondly, the energy contour dip mea-
sure characterizes the differences between speech (whose systematic changeovers
between voiced vowels and fricatives produce marked and frequent change in the
energy envelope), and music (which tends to have a more stable energy envelope)
well. However, the algorithm proposed by Saunders is limited in time resolution
(2.4 s). In [48], Scheirer and Slaney proposed a multifeature approach and ex-
amined various powerful classification methods. Their system relied on the 13
following features and, in some cases, their variance: 4 Hz modulation energy
(characterizing the syllabic rate in speech [30]), the percentage of low-energy
frames (more silences are present in speech than in music), the spectral rolloff,
defined as the 95th percentile of the power spectral distribution (good candidate
to discriminate voiced from unvoiced sounds), the spectral centroid (often higher
for music with percussive sounds than for speech whose pitches stay in a fairly
low range), the spectral flux, which is a measure of the fluctuation of the short-
term spectrum (music tends to have a higher rate of spectral flux change than
speech), the zero-crossing rate as in [46], the cepstrum resynthesis residual mag-
nitude (the residual is lower for unvoiced speech than for voiced speech or mu-
sic), and a pulse metric (indicating whether or not the signal contains a marked
beat, as is the case in some popular music). Various classification frameworks
were tested by the authors, a multidimensional Gaussian maximum a posteriori
(MAP) estimator as in [46], a Gaussian mixture model (GMM), a k-nearest-
neighbour estimator (k-NN), and a spatial partitioning scheme (k-d tree), and
all led to similar performances. The best average recognition accuracy using the
spatial partitioning classification was 94.2% on a frame-by-frame basis, and
98.6% when integrating over 2.4 s long segments of sound, the latter results being
similar to those obtained by Saunders. Some authors used extensions or corre-
lates of the previous descriptors for the speech/music discrimination task such
as the higher order crossings (HOC) which is the zero-crossing rate of filtered
versions of the signal [37] [20] originally proposed by Kedem [33], the spectral
flatness (quantifying how tonal or noisy a sound is) and the spectral spread (the
second central moment of the spectrum) defined in the MPEG-7 standard [9],
and a rhythmic pulse computed in the MPEG compressed domain [32]. Carey
et al. introduced the use of the fundamental frequency f0 (strongly correlated
to the perceptual attribute of pitch) and its derivative in order to characterize
some prosodic aspects of the signals (f0 changes in speech are more evenly dis-
tributed than in music where they are strongly concentrated about zero due to
steady notes, or large due to shifts between notes) [14]. The authors obtained a
recognition accuracy of 96% using the f0 -based features with a Gaussian mix-
ture model classifier. Descriptors quantifying the shape of the spectral envelope
were also widely used, such as the Mel Frequency Cepstral Coefficients (MFCC)
[23] [25] [2], and the Linear Prediction Coefficients (LPC) [23] [1]. El-Maleh et
al. [20] used descriptors quantifying the formant structure of the spectral enve-
lope, the line spectral frequencies (LSF), as in this study (see section 3.1). By
coupling the LSF and HOC features with a quadratic Gaussian classifier, the au-
thors obtained a 95.9% average recognition accuracy with decisions made over 1
s long audio segments, a procedure which performed slightly better than the algo-
rithm by Scheirer and Slaney tested on the same dataset (an accuracy increase
of approximately 2%). Contrary to the studies described above that relied on
generative methods, Ramona and Richard [44] developed a discriminative classi-
fication system relying on support vector machines (SVM) and median filtering
post-processing, and compared diverse hierarchical and multi-class approaches
depending on the grouping of the learning classes (speech only, music only, speech
with musical background, and music with singing voice). The most relevant fea-
tures amongst a large collection of about 600 features are selected using the
inertia ratio maximization with feature space projection (IRMFSP) technique
introduced in [42] and integrated on 1 s long segments. The method provided an
F-measure of 96.9% with a feature vector dimension of 50. Those results repre-
sent an error reduction of about 50% compared to the results gathered by the
French ESTER evaluation campaign [22]. As will be further shown in section
5, we obtained performances that compare favorably with those provided by this
algorithm. Surprisingly, all the mentioned studies evaluated the speech/music
classes recognition accuracy, but none, to our knowledge, evaluated the bound-
ary retrieval performance commonly used to evaluate structural segmentation
algorithms [35] (see section 4.3), which we also investigate in this work.
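As an illustration of the classic low-level descriptors surveyed in this section, the following minimal NumPy sketch computes a frame-wise zero-crossing rate, a percentage of low-energy frames, and a spectral centroid; the frame size, hop size, and low-energy threshold are example values chosen here, not those of the cited studies.

```python
# Illustrative NumPy sketch of three classic speech/music features.
# Frame/hop sizes and the low-energy threshold are example values only.
import numpy as np

def frame_signal(x, frame=1024, hop=256):
    """Slice a mono signal into overlapping Hamming-windowed frames."""
    idx = np.arange(0, len(x) - frame, hop)[:, None] + np.arange(frame)
    return x[idx] * np.hamming(frame)

def zero_crossing_rate(frames):
    """Fraction of sign changes within each frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def low_energy_rate(frames, factor=0.5):
    """Share of frames whose RMS falls below a fraction of the mean RMS."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms < factor * rms.mean()))

def spectral_centroid(frames, sr=44100):
    """Centre of mass of the magnitude spectrum of each frame, in Hz."""
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
```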

3 Classification Frameworks

We propose two audio classification frameworks based on timbre models, applied
in this work to the speech/music discrimination task. The architecture of both
systems is represented in Figure 1. The first system (see Figure 1(a)) is based
on the automatic timbre recognition (ATR) algorithm described in [7], initially
developed for musical instrument recognition, and a post-processing step aiming
at reducing the undesired inter-class switches (smoothing by median filtering).
This method will be denoted ATR. The second system (see Figure 1(b)) was
designed to test whether the performances of the automatic timbre recognition
system would be improved by using a pre-processing step which divides the
signal into segments of homogenous timbre. To address this issue, the signal is
first processed with an automatic structural segmentation (ASS) procedure [35].
Automatic timbre recognition (ATR) is then applied to the retrieved segments
[Fig. 1 block diagrams: (a) classification based on automatic timbre recognition (ATR); (b) classification based on automatic structural segmentation and timbre recognition (ASS/ATR).]

Fig. 1. Architecture of the two proposed audio segmentation systems. The tuning pa-
rameters of the systems’ components are also reported: number of line spectral frequen-
cies (LSF), number of codevectors K, latency L for the automatic timbre recognition
module, size of the sliding window W used in the median filtering (post-processing),
maximal number S of segment types, and minimal duration D of segments for the
automatic structural segmentation module.

and the segment-level classification decisions are obtained after a post-processing
step whose role is to determine the classes most frequently identified within the
segments. This method will be denoted ASS/ATR. In the remainder of this
section we will first present the various acoustical correlates of timbre used by
the systems, and then describe both methods in more detail.

3.1 Acoustical Correlates of Timbre

The two proposed systems rely on the assumption that speech and music can
be discriminated based on their differences in timbre. Exhaustive computational
models of timbre have not yet been found and the common definition used by
scholars remains vague: “timbre is that attribute of auditory sensation in terms
of which a listener can judge that two sounds similarly presented and having the
same loudness and pitch are dissimilar; Timbre depends primarily upon the spec-
trum of the stimulus, but it also depends on the waveform, the sound pressure,
the frequency location of the spectrum, and the temporal characteristics of the
stimulus.” [3]. Research in psychoacoustics [24] [10], [51], analysis/synthesis [45],
music perception [4] [5], speech recognition [19], and music information retrieval
[17] have however developed acoustical correlates of timbre characterizing some
of the facets of this complex and multidimensional variable.

The Two-fold Nature of Timbre: from Identity to Quality. One of the
pioneers of timbre research, the French researcher and electroacoustic music com-
poser Schaeffer, put forward a relevant paradox about timbre wondering how a
musical instrument’s timbre could be defined considering that each of its tones
also possessed a specific timbre [47]. Cognitive categorization theories shed
light on Schaeffer's paradox, showing that sounds (objects, respectively) could
be categorized either in terms of the sources from which they are generated, or
simply as sounds (objects, respectively), in terms of the properties that charac-
terize them [15]. These principles have been applied to timbre by Handel who
described timbre perception as being both guided by our ability to recognize
various physical factors that determine the acoustical signal produced by musi-
cal instruments [27] (later coined “source” mode of timbre perception by Hadja
et al. [26]), and by our ability to analyze the acoustic properties of sound ob-
jects perceived by the ear, traditionally modeled as a time-evolving frequency
analyser (later coined as “interpretative” mode of timbre perception in [26]). In
order to refer to that two-fold nature of timbre, we like to use the terms timbre
identity and timbre quality, which were proposed in reference [38]. The timbre
identity and quality facets of timbre perception have several properties: they
are not independent but intrinsically linked together (e.g. we can hear a guitar
tone and recognize the guitar, or we can hear a guitar tone and hear the sound
for itself without thinking of the instrument), they are a function of the sounds
we listen to (in some music, like the “musique concrète”, the sound sources
are deliberately hidden by the composer, hence the notion of timbre identity is
different and may refer to the technique employed by the musician, e.g. a spe-
cific filter), and they have a variable range (music lovers are often able to
recognize the performer behind the instrument, extending the notion of identity
to the very start of the chain of sound production, the musician who controls
the instrument). Based on these considerations, we include the notion of sound
texture, as that produced by layers of instruments in music, into the definitions
of timbre. The notion of timbre identity in music may then be closely linked to
a specific band, a sound engineer, or the musical genre, the latter being largely
related to the instrumentation.

The Formant Theory of Timbre. Contrary to the classical theory of mu-
sical timbre advocated in the late 19th century by Helmholtz [29], timbre does
not only depend on the relative proportion between the harmonic components
of a (quasi-)harmonic sound; two straightforward experiments indeed show that
timbre is highly altered when a sound a) is reversed in time, b) is pitch-shifted
by frequency translation of the spectrum, despite the fact that in both cases the
relative energy ratios between harmonics are kept. The works from the phoneti-
cian Slawson proved that the timbre of voiced sounds was mostly characterized
by the invariance of their spectral envelope through pitch changes, and therefore
a mostly fixed formant1 structure, i.e. zones of high spectral energy (however, in
the case of large pitch changes the formant structure needs to be slightly shifted
for the timbral identity of the sounds to remain unchanged): “The popular no-
tion that a particular timbre depends upon the presence of certain overtones (if
that notion is interpreted as the “relative pitch” theory of timbre) is seen [...] to
lead not to invariance but to large differences in musical timbre with changes in
fundamental frequency. The “fixed pitch” or formant theory of timbre is seen in
those same results to give much better predictions of the minimum differences in
musical timbre with changes in fundamental frequency. The results [...] suggest
that the formant theory may have to be modified slightly. A precise determina-
tion of minimum differences in musical timbre may require a small shift of the
lower resonances, or possibly the whole spectrum envelope, when the fundamen-
tal frequency changes drastically.” [49]. The findings by Slawson have causal
and cognitive explanations. Sounds produced by the voice (spoken or sung) and
most musical instruments present a formant structure closely linked to reso-
nances generated by one or several components implicated in their production
(e.g. the vocal tract for the voice, the body for the guitar, the mouthpiece for
the trumpet). It seems therefore legitimate from the perceptual point of view to
suggest that the auditory system relies on the formant structure of the spectral
envelope to discriminate such sounds (e.g. two distinct male voices of same pitch,
loudness, and duration), as proposed by the “source” or identity mode of timbre
perception hypothesis mentioned earlier.
The timbre models used in this study to discriminate speech and music rely
on features modeling the spectral envelope (see the next section). In these timbre
models, the temporal dynamics of timbre are captured up to a certain extent by
performing signal analysis on successive frames where the signal is assumed to
be stationary, and by the use of hidden Markov models (HMM), as described in
section 3.3. Temporal (e.g. attack time) and spectro-temporal parameters (e.g.
spectral flux) have also shown to be major correlates of timbre spaces but these
findings were obtained in studies which did not include speech sounds but only
musical instrument tones either produced on different instruments (e.g. [40]),
or within the same instrument (e.g. [6]). In situations where we discriminate
timbres from various sources either implicitly (e.g. in everyday life’s situations)
or explicitly (e.g. in a controlled experiment situation), it is most probable that
the auditory system uses different acoustical clues depending on the typological
differences of the considered sources. Hence, the descriptors used to account for
timbre differences between musical instruments’ tones may not be adapted for
the discrimination between speech and music sounds. While subtle timbre differences
are possible within a single instrument, large timbre differences are expected to
occur between disparate classes such as speech and music, and those are li-
able to be captured by spectral envelope correlates. Music generally being a
mixture of musical instrument sounds playing either synchronously in a poly-
phonic way, or solo, may exhibit complex formant structures induced by its
individual components (instruments), as well as the recording conditions (e.g.
room acoustics). Some composers like Schoenberg have explicitly used very sub-
tle instrumentation rules to produce melodies that were not shaped by changes
of pitches like in traditional Western art music but by changes of timbre (the lat-
ter were called Klangfarbenmelodie by Schoenberg, which literally means “color
melodies”). Hence, if they occur, formant structures in music are likely to be
much different from those produced by the vocal system. However, some intrin-
sic cases of confusions consist in music containing a predominant singing voice
(e.g. in opera or choral music) since singing voice shares timbral properties with
the spoken voice. The podcast database with which we tested the algorithms
included such types of mixture. Conversely, the mix of a voice with a strong
musical background (e.g. in commercials, or jingles) can also be a source of con-
fusion in speech/music discrimination. This issue is addressed in [44], but not
directly in this study. Sundberg [50] showed the existence of a singer's or singing
formant around 3 kHz when analyzing performances by classically trained male
singers, which he attributed to a clustering of the third, fourth and fifth reso-
nances of the vocal tract. This difference between spoken and sung voices can
potentially be captured by features characterizing the spectral envelope, as the
ones presented in the next section.
1 In this article, a formant is considered as being a broad band of enhanced power
present within the spectral envelope.

Spectral Envelope Representations: LP, LSF, and MFCC. Spectral
envelopes can be obtained either from linear prediction (LP) [31] or from mel-
frequency cepstral coefficients (MFCC) [17] which both offer a good representa-
tion of the spectrum while keeping a small amount of features. Linear prediction
is based on the source-filter model of sound production developed for speech
coding and synthesis. Synthesis based on linear prediction coding is performed
by processing an excitation signal (e.g. modeling the glottal excitation in the
case of voice production) with an all-pole filter (e.g. modeling the resonances
of the vocal tract in the case of voice production). The coefficients of the filter
are computed on a frame-by-frame basis from the autocorrelation of the signal.
The frequency response of the LP filter hence represents the short-time spectral
envelope. Itakura derived from the coefficients of the inverse LP filter a set of
features, the line spectral frequencies (LSF), suitable for efficient speech coding
[31]. The LSF have the interesting property of being correlated in a pairwise
manner to the formant frequencies: two adjacent LSF localize a zone of high
energy in the spectrum. The automatic timbre recognition model described in
section 3.2 exploits this property of the LSF. MFCCs are computed from the
logarithm of the spectrum computed on a Mel-scale (a perceptual frequency
scale emphasizing low frequencies), either by taking the inverse Fourier trans-
form, or a discrete cosine transform. A spectral envelope can be represented by
considering the first 20 to 30 MFCC coefficients. In [51], Terasawa et al. have es-
tablished that MFCC parameters are a good perceptual representation of timbre
for static sounds. The automatic structural segmentation technique, component
of the second classification method described in section 3.3, was employed using
MFCC features.
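A minimal sketch of how line spectral frequencies can be derived from an autocorrelation-based LP analysis of one windowed frame is given below. It assumes NumPy/SciPy and is only an illustration of the LP-to-LSF conversion, not the implementation used in the system described here.

```python
# Sketch of an LP-to-LSF conversion for one windowed frame (NumPy/SciPy).
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=8):
    """Autocorrelation-method LPC; returns A(z) = [1, a_1, ..., a_p]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # normal equations
    return np.concatenate(([1.0], -a))

def line_spectral_frequencies(a):
    """LSFs in radians: angles in (0, pi) of the roots of P(z) and Q(z)."""
    ext = np.concatenate((a, [0.0]))
    p_poly = ext + ext[::-1]          # symmetric (palindromic) polynomial
    q_poly = ext - ext[::-1]          # antisymmetric polynomial
    roots = np.concatenate((np.roots(p_poly[::-1]), np.roots(q_poly[::-1])))
    angles = np.angle(roots)
    # keep one angle per conjugate pair, drop the trivial roots at z = +/-1
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])
```

Two adjacent values returned by this sketch bracket a zone of high spectral energy, which is the pairwise formant-localisation property of the LSF exploited by the recognition model.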

3.2 Classification Based on Automatic Timbre Recognition (ATR)

The method is based on the timbre recognition system proposed in [7] and [16],
which we describe in the remainder of this section.

Feature Extraction. The algorithm relies on a frequency domain represen-
tation of the signal using short-term spectra (see Figure 2). The signal is first
decomposed into overlapping frames of equal size obtained by multiplying blocks
of audio data with a Hamming window to further minimize spectral distortion.
The fast Fourier transform (FFT) is then computed on a frame-by-frame basis.
The LSF features described above are extracted using the short-term spectra.

Classifier. The classification process is based on the unsupervised K-means
clustering technique both at the training and the testing stages. The principle
of K-means clustering is to partition n-dimensional space (here the LSF feature
space) into K distinct regions (or clusters), which are characterized by their
centres (called codevectors). The collection of the K codevectors (LSF vectors)
constitutes a codebook, whose function, within this context, is to capture the
most relevant features to characterize the timbre of an audio signal segment.
Hence, to a certain extent, the K-means clustering can here be viewed both as a
classifier and a technique of feature selection in time. The clustering of the fea-
ture space is performed according to the Linde-Buzo-Gray (LBG) algorithm [36].

Fig. 2. Automatic timbre recognition system based on line spectral frequencies and
K-means clustering

During the training stage, each class is attributed an optimized codebook by
performing the K-means clustering on all the associated training data. During
the testing stage, the K-means clustering is applied to blocks of audio data or
decision horizons (collection of overlapping frames), the duration of which can
be varied to modify the latency L of the classification (see Figure 2). The inter-
mediate classification decision is obtained by finding the class which minimizes a
codebook-to-codebook distortion measure based on the Euclidean distance [16].
As will be discussed in section 4.1, we tested various speech and music training
class taxonomies (e.g. separating male and female voice for the speech class) to
further enhance the performance of the recognition.
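The following sketch illustrates this codebook logic with scikit-learn's K-means standing in for the LBG algorithm of [36]; the function names and the mean nearest-codevector distance used below are assumptions made for the example, not the authors' exact distortion measure.

```python
# Sketch of per-class codebook training and minimum-distortion classification.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(training_lsf, K=32):
    """training_lsf: dict mapping class name -> (n_frames, n_lsf) array."""
    return {label: KMeans(n_clusters=K, n_init=4).fit(feats).cluster_centers_
            for label, feats in training_lsf.items()}

def codebook_distortion(cb_test, cb_class):
    """Mean distance from each test codevector to its nearest class codevector."""
    d = np.linalg.norm(cb_test[:, None, :] - cb_class[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def classify_block(block_lsf, codebooks, K=32):
    """block_lsf: LSF vectors of one decision horizon, shape (n_frames, n_lsf)."""
    cb_test = KMeans(n_clusters=K, n_init=4).fit(block_lsf).cluster_centers_
    return min(codebooks, key=lambda c: codebook_distortion(cb_test, codebooks[c]))
```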

Post-processing. Given that one of our ultimate goals is to be able to accu-
rately locate the temporal start and end positions of speech and music sections,
relatively short decision horizons are required (a 1 s latency was
used in the experiments). A drawback with this method though is that even if
the LSF/K-means based algorithm achieves high levels of class recognition ac-
curacy (for example, it might correctly classify music sections 90% of the time,
see section 5), there can be undesirable switches from one type of retrieved class
to another. This sometimes rapid variation between speech and music classifica-
tions makes it difficult to accurately identify the start and end points of speech
and music sections. Choosing longer classification intervals though decreases the
resolution with which we are able to pinpoint any potential start or end time.
In an attempt to alleviate this problem, we performed some post-processing on
the initial results obtained with the LSF/K-means based algorithm. All “music”
sections are attributed a numerical class index of 0, and all “speech” sections a
class index of 1. The results are resampled at 0.1 s intervals and then processed
through a median filter. Median filtering is a nonlinear digital filtering technique
which has been widely used in digital image processing and speech/music in-
formation retrieval to remove noise, e.g. in the peak-picking stage of an onset
detector in [8], or for the same purposes as in this work, to enhance speech/music
discrimination in [32] and [44]. Median filtering has the effect of smoothing out
regions of high variation. The size W of the sliding window used in the median
filtering process was empirically tuned (see section 5). Contiguous short-term
classifications of same types (speech or music) are then merged together to form
segment-level classifications. Figure 3 shows a comparison between podcasts’
ground truth annotations and typical results of classification before and after
post-processing.
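A possible implementation of this smoothing step, assuming NumPy/SciPy, the 0.1 s resampling grid mentioned above, and a window length W to be tuned empirically (20 s in section 5), could look as follows; the 0/1 class encoding matches the description in the text.

```python
# Sketch of the smoothing step: 1 s class decisions (music = 0, speech = 1)
# are resampled on a 0.1 s grid and median filtered over a W-second window.
import numpy as np
from scipy.signal import medfilt

def smooth_decisions(times, labels, step=0.1, window_s=20.0):
    """times: 1-D array of decision times (s); labels: 0/1 class indices."""
    times, labels = np.asarray(times), np.asarray(labels, dtype=float)
    grid = np.arange(times[0], times[-1], step)
    resampled = labels[np.searchsorted(times, grid, side="right") - 1]
    kernel = int(round(window_s / step)) | 1        # medfilt needs an odd length
    return grid, medfilt(resampled, kernel_size=kernel)
```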

Software Implementation. The intermediate classification decisions were ob-
tained with the Vamp [13] musical instrument recognition plugin [7] trained for
music and speech classes. The plugin works interactively with the Sonic Visu-
aliser host application developed to analyse and visualise music-related informa-
tion from audio files [12]. The latency L of the classification (duration of the
decision horizon) can be varied between 0.5 s and 10 s. In this study, we used a
1 s long latency in order to keep a good time resolution/performance ratio [7].
The median filtering post-processing was performed in Matlab. An example of
detection of a transition from a speech to a music part within Sonic Visualiser
is shown in Figure 4.

Fig. 3. Podcast ground truth annotations (a), classification results at 1 s intervals (b)
and post-processed results (c)

3.3 Classification Based on Automatic Structural Segmentation and
Timbre Recognition (ASS/ATR)

Automatic Structural Segmentation. We used the structural segmentation
technique based on constrained clustering initially proposed in [35] for automatic
music structure extraction (chorus, verse, etc.). The technique is thoroughly
described in [21], study in which it is applied to studio recordings’ intelligent
editing.
The technique relies on the assumption that the distributions of timbre fea-
tures are similar over music structural elements of same type. The high-level
song structure is hence determined upon structural/timbral similarity. In this
study we extend the application of the technique to audio broadcast content
(speech and music parts) without focusing on the structural fluctuations within
the parts themselves. The legitimacy of porting the technique to speech/music
discrimination relies on the fact that a higher level of similarity is expected
between the various spoken parts, on the one hand, and between the various music
parts, on the other hand.

Fig. 4. Example of detection of a transition between speech and music sections in a
podcast using the Vamp timbre recognition transform jointly with Sonic Visualiser
The algorithm, implemented as a Vamp plugin [43], is based on a frequency-
domain representation of the audio signal using either a constant-Q transform,
a chromagram or mel-frequency cepstral coefficients (MFCC). For the reasons
mentioned earlier in section 3.1, we chose the MFCCs as underlying features in
this study. The extracted features are normalised in accordance with the MPEG-
7 standard (normalized audio spectrum envelope [NASE] descriptor [34]), by ex-
pressing the spectrum in the decibel scale and normalizing each spectral vector
by the root mean square (RMS) energy envelope. This stage is followed by the
extraction of 20 principal components per block of audio data using principal
component analysis. The 20 PCA components and the RMS envelope consti-
tute a sequence of 21-dimensional feature vectors. A 40-state hidden Markov
model (HMM) is then trained on the whole sequence of features (Baum-Welch
algorithm), each state of the HMM being associated to a specific timbre quality.
After training and decoding (Viterbi algorithm) the HMM, the signal is assigned
a sequence of timbre features according to specific timbre quality distributions
for each possible structural segment. The minimal duration D of expected struc-
tural segments can be tuned. The segmentation is then computed by clustering
timbre quality histograms. A series of histograms are created using a sliding
window and are then grouped into S clusters with an adapted soft K-means
algorithm. Each of these clusters will correspond to a specific type of segment in
the analyzed signal. The reference histograms describing the timbre distribution
for each segment are updated during clustering in an iterative way. The final
segmentation is obtained from the final cluster assignments.
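For illustration only, the sketch below approximates the front end of this segmentation chain (MFCC-type features, 20 principal components plus RMS, a 40-state Gaussian HMM fitted by EM and decoded with Viterbi) using librosa, scikit-learn and hmmlearn, which are assumptions of the example; the histogram clustering into S segment types performed by the plugin of [35] [43] is not reproduced, and none of this is the actual Vamp implementation.

```python
# Much-simplified approximation of the segmentation front end.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

def timbre_state_sequence(audio_path, n_states=40):
    """Return one timbre-state label per analysis frame."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)      # (40, n_frames)
    rms = librosa.feature.rms(y=y)                          # (1, n_frames)
    pcs = PCA(n_components=20).fit_transform(mfcc.T)        # (n_frames, 20)
    feats = np.hstack([pcs, rms.T])                         # (n_frames, 21)
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    hmm.fit(feats)                                          # EM training
    return hmm.predict(feats)                               # Viterbi path
```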

Automatic Timbre Recognition. Once the signal has been divided into seg-
ments assumed to be homogeneous in timbre, the latter are processed with the
automatic timbre recognition technique described in section 3.2 (see Figure 1(b)).
This yields intermediate classification decisions defined on a short-term basis
(depending on the latency L used in the ATR model).

Post-processing. Segment-level classifications are then obtained by choosing
the class that appears most frequently amongst the short-term classification
decisions made within the segments.
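This majority-vote decision can be sketched as follows; the segment and frame containers are assumptions made for the example.

```python
# Sketch of the segment-level decision: the class most frequently returned by
# the short-term recogniser inside each segment wins.
from collections import Counter

def label_segments(segments, frame_times, frame_labels):
    """segments: list of (start, end) in s; frame_times/frame_labels: parallel lists."""
    decisions = []
    for start, end in segments:
        inside = [lab for t, lab in zip(frame_times, frame_labels) if start <= t < end]
        majority = Counter(inside).most_common(1)[0][0] if inside else None
        decisions.append(((start, end), majority))
    return decisions
```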

Software Implementation. The automatic structural segmenter Vamp plugin
[43] [35] was run from the terminal using the batch tool for feature extraction
Sonic Annotator [11]. Each of the retrieved segments were then processed with
the automatic timbre recognition Vamp plugin [7] previously trained for speech
and music classes using a Python script. The segment-level classification deci-
sions were also computed using a Python script.

4 Experiments
Several experiments were conducted to evaluate and compare the performances
of the speech/music discrimination ATR and ASS/ATR methods respectively
presented in sections 3.2 and 3.3. In this section, we first describe the experi-
mental protocols, and the training and testing databases. The evaluation mea-
sures computed to assess the class identification and boundary accuracy of the
systems are then specified.

4.1 Protocols
Influence of the Training Class Taxonomy. In a first set of experiments,
we evaluated the precision of the ATR model according to the taxonomy used to
represent speech and music content in the training data. The classes associated
to the two taxonomic levels schematized in Figure 5 were tested to train the
ATR model. The first level corresponds to a coarse division of the audio content
into two classes: speech and music. Given that common spectral differences may
be observed between male and female speech signals due to vocal tract morphol-
ogy changes, and that musical genres are often associated with different sound
textures or timbres due to changes of instrumentation, we sought to establish
whether there was any benefit to be gained by training the LSF/K-means algo-
rithm on a wider, more specific set of classes. Five classes were chosen: two to
represent speech (male speech and female speech), and three to represent music
according to the genre (classical, jazz, and rock & pop). The classifications ob-
tained using the algorithm trained on the second, wider, set of classes are later
[Fig. 5 taxonomy tree – Level I: speech, music; Level II: speech m, speech f, classical, jazz, rock & pop]

Fig. 5. Taxonomy used to train the automatic timbre recognition model in the
speech/music discrimination task. The first taxonomic level is associated to a training
stage with two classes: speech and music. The second taxonomic level is associated to
a training stage with five classes: male speech (speech m), female speech (speech f),
classical, jazz, and rock & pop music.

mapped back down to either speech or music in order to be able to evaluate
their correlation with ground truth data and also so that we can compare the
two methods. To make a fair comparison between the two methods, we kept
the same training excerpts in both cases and hence kept constant the content
duration for the speech and music classes.
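The level-II to level-I mapping amounts to a simple lookup, as in the sketch below; the label strings are hypothetical identifiers chosen for the example.

```python
# Minimal lookup folding the five level-II training classes back to the two
# level-I classes; the label strings are hypothetical identifiers.
LEVEL_II_TO_I = {
    "speech_m": "speech", "speech_f": "speech",
    "classical": "music", "jazz": "music", "rock_pop": "music",
}

def to_level_one(labels):
    return [LEVEL_II_TO_I[lab] for lab in labels]
```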

Comparison Between ATR and ASS/ATR. A second set of experiments
was performed to compare the performances of the ATR and ASS/ATR meth-
ods. In these experiments the automatic timbre recognition model was trained
with five classes (second taxonomic level), the case which led to the best perfor-
mances (see section 5.1). The number of clusters used in the K-means classifier
of the ATR method was kept constant and tuned to a value that yielded good
results in a musical instrument recognition task (K=32) [16]. In order to find
the best configuration, the number of line spectral frequencies was varied in the
feature extraction stage (LSF={8;16;24;32}) since the number of formants in
speech and music spectra is not known a priori and is not expected to be the
same. If voice is typically associated with four or five formants (hence 8 or 10
LSFs), this number may be higher in music due to the superpositions of various
instruments’ sounds. The parameter of the automatic structural segmentation
algorithm setting the minimal duration of retrieved segments was set to 2 s
since shorter events are not expected, and longer durations could decrease the
boundary retrieval accuracy. Since the content of audio podcasts can be broad
(see next section 4.2), the maximal number of segments S of the ASS was varied
between 7 and 12. Classification tests were also performed with the algorithm
proposed by Ramona and Richard in [44] which provides a good benchmark
of the state-of-the-art performance for speech/music discriminators. This algo-
rithm, which relies on a feature-based approach with a support vector machine
classifier (previously described in section 2), is however computationally expen-
sive since a large collection of about 600 features of various types (temporal,
spectral, cepstral, and perceptual) is computed in the training stage.

4.2 Database
The training data used in the automatic timbre recognition system consisted of
a number of audio clips extracted from a wide variety of radio podcasts from
BBC 6 Music (mostly pop) and BBC Radio 3 (mostly classical and jazz) pro-
grammes. The clips were manually auditioned and then classified as either speech
or music when the ATR model was trained with two classes, or as male speech,
female speech, classical music, jazz music, and rock & pop music when the ATR
model was trained with five classes. These manual classifications constituted the
ground truth annotations further used in the algorithm evaluations. All speech
was English language, and the training audio clips, whose durations are shown
in Table 1, gathered approximately 30 min. of speech, and 15 min. of music.
For testing purposes, four podcasts different from the ones used for training
(hence containing different speakers and music excerpts) were manually anno-
tated using terms from the following vocabulary: speech, multi-voice speech,
music, silence, jingle, efx (effects), tone, tones, beats. Mixtures of these terms
were also employed (e.g. “speech + music”, to represent speech with background
music). The music class included cases where a singing voice was predominant
(opera and choral music). More detailed descriptions of the podcast material
used for testing are given in Tables 2 and 3.

4.3 Evaluation Measures


We evaluated the speech/music discrimination methods with regards to two
aspects: (i) their ability to correctly identify the considered classes (semantic
level), and (ii) their ability to correctly retrieve the boundary locations between
classes (temporal level).

Relative Correct Overlap. Several evaluation measures have been proposed
to assess the performances of audio content classifiers depending on the time scale
considered to perform the comparison between the machine estimated classifi-
cations and the ground-truth annotations used as reference [28]. The accuracy
of the models can indeed be measured on a frame-level basis by resampling the
ground-truth annotations at the frequency used to make the estimations, or on

Table 1. Audio training data durations. Durations are expressed in the following
format: HH:MM:SS (hours:mn:s).

Training class                            Total duration of audio clips (HH:MM:SS)
Two class training    Speech              00:27:46
                      Music               00:14:30
Five class training   Male speech         00:19:48
                      Female speech       00:07:58
                      Total speech        00:27:46
                      Classical           00:03:50
                      Jazz                00:07:00
                      Rock & Pop          00:03:40
                      Total music         00:14:30

Table 2. Podcast content

Podcast   Nature of content
1         Male speech, rock & pop songs, jingles, and small amount of electronic music
2         Speech and classical music (orchestral and opera)
3         Speech, classical music (choral, solo piano, and solo organ), and folk music
4         Speech, and punk, rock & pop with jingles

Table 3. Audio testing data durations. Durations are expressed in the following format:
HH:MM:SS (hours:mn:s).

Podcast   Total duration   Speech duration   Music duration
1         00:51:43         00:38:52          00:07:47
2         00:53:32         00:32:02          00:18:46
3         00:45:00         00:18:09          00:05:40
4         00:47:08         00:06:50          00:32:31
Total     03:17:23         01:35:53          01:04:43

a segment-level basis by considering the relative proportion of correctly identi-
fied segments. We applied the latter segment-based method by computing the
relative correct overlap (RCO) measure used to evaluate algorithms in the mu-
sic information retrieval evaluation exchange (MIREX) competition [39]. The
relative correct overlap is defined as the cumulated duration of segments where
the correct class has been identified normalized by the total duration of the
annotated segments:

RCO = |{estimated segments} ∩ {annotated segments}| / |{annotated segments}|    (1)

where {·} denotes a set of segments, and |·| their duration. When comparing
the machine estimated segments with the manually annotated ones, any sections
not labelled as speech (male or female), multi-voice speech, or music (classical,
jazz, rock & pop) were disregarded due to their ambiguity (e.g. jingle). The
durations of these disregarded parts are stated in the results section.
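For concreteness, a direct implementation of Eq. (1) on lists of (start, end, label) segments could look as follows; the segment representation is an assumption of the example.

```python
# Direct reading of Eq. (1): correct overlap duration over annotated duration.
# Segments are assumed to be (start, end, label) tuples with times in seconds.
def relative_correct_overlap(estimated, annotated):
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    correct = sum(overlap(e, a) for e in estimated for a in annotated if e[2] == a[2])
    total = sum(a[1] - a[0] for a in annotated)
    return correct / total if total else 0.0

# Example: a 5 s labelling error over 90 s of annotations gives RCO ~ 0.944.
# relative_correct_overlap([(0, 50, "speech"), (50, 90, "music")],
#                          [(0, 45, "speech"), (45, 90, "music")])
```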

Boundary Retrieval F-measure. In order to assess the precision with which
the algorithms are able to detect the time location of transitions from one class to
another (i.e. start/end of speech and music sections), we computed the boundary
retrieval F-measure proposed in [35] and used in MIREX to evaluate the tempo-
ral accuracy of automatic structural segmentation methods [41]. The boundary
retrieval F-measure, denoted F in the following, is defined as the harmonic mean
between the boundary retrieval precision P and recall R (F = 2PR / (P + R)). The
boundary retrieval precision and recall are obtained by counting the numbers of
correctly detected boundaries (true positives tp), false detections (false positives
fp), and missed detections (false negatives fn) as follows:

P = tp / (tp + fp)    (2)

R = tp / (tp + fn)    (3)
Hence, the precision and the recall can be viewed as measures of exactness
and completeness, respectively. As in [35] and [41], the number of true positives
were determined using a tolerance window of duration ΔT = 3 s: a retrieved
boundary at time position l is considered to be a “hit” (correct) if it lies within
ΔT/2 of the corresponding annotated boundary l_a, i.e. l_a − ΔT/2 ≤ l ≤ l_a + ΔT/2.
This method to compute the F-measure is also
used in onset detector evaluation [18] (the tolerance window in the latter case
being much shorter). Before comparing the manually and the machine estimated
boundaries, a post-processing was performed on the ground-truth annotations in
order to remove the internal boundaries between two or more successive segments
whose type was discarded in the classification process (e.g. the boundary between
a jingle and a sound effect section).
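A sketch of this boundary matching and of Eqs. (2)-(3), with the ±ΔT/2 = 1.5 s tolerance and a greedy nearest-boundary matching (the matching strategy is an assumption of the example), is given below.

```python
# Sketch of Eqs. (2)-(3) with a +/- Delta_T/2 tolerance and greedy matching;
# boundary lists hold times in seconds.
def boundary_prf(estimated, annotated, delta_t=3.0):
    unmatched = list(annotated)
    tp = 0
    for b in estimated:
        hits = [a for a in unmatched if abs(a - b) <= delta_t / 2]
        if hits:
            unmatched.remove(min(hits, key=lambda a: abs(a - b)))
            tp += 1
    fp, fn = len(estimated) - tp, len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```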

5 Results and Discussion

In this section, we present and discuss the results obtained for the two sets
of experiments described in section 4.1. In both sets of experiments, all audio
training clips were extracted from 128 kbps, 44.1 kHz, 16 bit stereo mp3 files
(mixed down to mono) and the podcasts used in the testing stage were full
duration mp3 files of the same format.

5.1 Influence of the Training Class Taxonomies in the ATR Model

Analysis Parameters. The LSF/K-means algorithm in the automatic timbre
recognition model was run with a window of length 1024 samples (ap-
proximately 20 ms), and hop length of 256 samples (approximately 5 ms). A
combination of 24 line spectral frequencies and 32 codevectors was used as in
[7]. During testing, the intermediate classifications were made with a latency of
1 s. The post-processing of the machine estimated annotations was performed
by resampling the data with a sampling period of 0.1 s, and processing them
with a median filter using a 20 s long window.

Performances. Table 4 shows the relative correct overlap (RCO) performances
of the speech/music discriminator based on automatic timbre recognition for
each of the four podcasts used in the test set, as well as the overall results
(podcasts 1 to 4 combined). The sections that were neither speech nor music
and were disregarded lasted in overall 36 min. 50 s. The RCO measures are
given both when the model was trained on only two classes (music and speech),
and when it was trained on five classes (male speech, female speech, classical,

Table 4. Influence of the training class taxonomy on the performances of the automatic
timbre recognition model assessed at the semantic level with the relative correct overlap
(RCO) measure

ATR model - RCO measure (%)
Podcast            Speech                      Music
                   Two class    Five class     Two class    Five class
1 (Rock & pop)     90.5         91.9           93.7         94.5
2 (Classical)      91.8         93.0           97.8         99.4
3 (Classical)      88.3         91.0           76.1         82.7
4 (Rock & pop)     48.7         63.6           99.8         99.9
Overall            85.2         89.2           96.6         97.8

jazz, and rock & pop). We see from this table that training the ATR model on
five classes instead of two improved classification performances in all cases, but
most notably for the speech classifications of podcast number 4 (an increase of
14.9% from 48.7% to 63.6%) and for the music classifications of podcast number
3 (up from 76.1% to 82.7%, an increase of 6.6%). In all other cases, the increase
is more modest; being between 0.1% and 2.7%. The combined results show an
increased RCO of 4% for speech and 1.2% for music when trained on five classes
instead of two.

5.2 Comparison between ATR and ASS/ATR

Analysis Parameters. The automatic timbre model was trained with five
classes since this configuration gave the best RCO performances. Regarding the
ATR method, the short-term analysis was performed with a window of 1024
samples, a hop size of 256 samples, and K = 32 codevectors, as in the first set
of experiments. However, in this set of experiments the number of line spectral
frequencies LSF were varied between 8 and 32 by steps of 8, and the duration of
the median filtering windows were tuned accordingly based on experimenting.
The automatic structural segmenter Vamp plugin was used with the default
window and hop sizes (26460 samples, i.e. 0.6 s, and 8820 samples, i.e. 0.2 s,
respectively), parameters defined based on typical beat-length in music [35].
Five different numbers of segments S were tested (S = {5;7;8;10;12}). The best
relative correct overlap and boundary retrieval performances were obtained with
S = 8 and S = 7, respectively.

Relative Correct Overlap Performances. Table 5 presents a comparison of
the relative correct overlap (RCO) results obtained for the proposed speech/music
discriminators based on automatic timbre recognition ATR, and on automatic
structural segmentation and timbre recognition ASS/ATR. The performances
obtained with the algorithm from Ramona and Richard [44] are also reported.
The ATR and ASS/ATR methods obtain very similar relative correct overlaps.
For both methods, the best configuration is obtained with the lowest number
of features (LSF = 8) yielding high average RCOs, 94.4% for ATR, and 94.5%
for ASS/ATR. The algorithm from [44] obtains a slightly higher average RCO
(increase of approximately 3%) but may require more computations than the

Table 5. Comparison of the relative correct overlap performances for the ATR and
ASS/ATR methods, as well as the SVM-based algorithm from [44]. For each method,
the best average result (combining speech and music) is indicated in bold

RCO (%)              ATR (LSF number)               ASS/ATR (LSF number)           SVM [44]
Podcast   Class      8      16     24     32        8      16     24     32        n/a
1         speech     94.8   94.8   94.7   94.3      96.9   95.8   96.9   96.9      97.5
1         music      94.9   92.4   90.8   92.8      84.3   82.5   82.3   86.3      94.1
2         speech     94.2   95.4   92.9   92.8      96.3   96.3   96.3   96.1      97.6
2         music      98.8   98.7   98.8   98.1      97.1   94.2   96.5   96.9      99.9
3         speech     96.7   96.9   93.5   92.0      96.4   95.3   93.6   93.5      97.2
3         music      96.1   79.0   76.8   77.4      92.3   85.8   77.5   83.5      96.9
4         speech     55.3   51.9   56.4   58.9      61.8   48.5   60.2   65.6      88.6
4         music      99.5   99.5   99.9   99.5      99.7   100    100    100       99.5
Overall   speech     90.3   90.0   89.5   89.5      92.8   89.4   92.0   92.8      96.8
Overall   music      98.5   96.1   96.1   96.3      96.2   94.4   94.3   95.8      98.8
Average              94.4   93.1   92.8   92.9      94.5   91.9   93.2   94.3      97.3

ATR method (the computation time has not been measured in these experi-
ments). The lower performances obtained by the three compared methods for
the speech class of the fourth podcast should be put into perspective given the very
short proportion of spoken excerpts within this podcast (see Table 3), which hence does
not affect the overall results much. The good performances obtained with a low
dimensional LSF vector can be explained by the fact that the voice has a limited
number of formants that are therefore well characterized by a small number of
line spectral frequencies (LSF = 8 corresponds to the characterization of 4 for-
mants). Improving the recognition accuracy for the speech class diminishes the
confusions made with the music class, which explains the concurrent increase of
RCO for the music class when LSF = 8. When considering the class identifi-
cation accuracy, the ATR method conducted with a low number of LSF hence
appears interesting since it is not computationally expensive relative to the
performances of modern CPUs (linear predictive filter determination, computa-
tion of 8 LSFs, K-means clustering and distance computation). For the feature
vectors of higher dimensions, the higher-order LSFs may contain information
associated with the noise in the case of the voice which would explain the drop
of overall performances obtained with LSF = 16 and LSF = 24. However the
RCOs obtained when LSF = 32 are very close to that obtained when LSF =
8. In this case, the higher number of LSF may be adapted to capture the more
complex formant structures of music.

Boundary Retrieval Performances. The boundary retrieval performance
measures (F-measure, precision P, and recall R) obtained for the ATR, ASS/ATR,
and SVM-based method from [44] are reported in Table 6.
As opposed to the relative correct overlap evaluation where the ATR and
ASS/ATR methods obtained similar performances, the ASS/ATR method clearly
Table 6. Comparison of the boundary retrieval measures (F-measure, precision P, and
recall R) for the ATR and ASS/ATR methods, as well as the SVM-based algorithm
from [44]. For each method, the best overall result is indicated in bold.

Boundary retrieval        ATR (LSF number)        ASS/ATR (LSF number)            SVM [44]
Podcast   Measures (%)    8      16     24        8      16     24     32         n/a
1         P               40.0   45.7   31.0      43.6   36.0   37.2   34.1       36.0
1         R               21.3   34.0   19.1      36.2   38.3   34.0   31.9       57.4
1         F               27.8   39.0   23.7      39.5   37.1   35.6   33.0       44.3
2         P               61.5   69.0   74.1      72.7   35.3   84.6   71.9       58.2
2         R               37.5   31.3   31.3      37.5   37.5   34.4   35.9       60.9
2         F               46.6   43.0   44.0      49.5   36.4   48.9   47.9       59.5
3         P               69.2   54.5   56.7      75.0   68.0   60.4   64.0       67.3
3         R               24.3   32.4   23.0      44.6   45.9   43.2   43.2       50.0
3         F               36.0   40.7   32.7      55.9   54.8   50.4   51.6       57.4
4         P               11.7   12.3   21.7      56.7   57.1   57.7   48.5       28.6
4         R               21.9   21.9   15.6      53.1   50.0   46.9   50.0       50.0
4         F               15.2   15.7   18.2      54.8   53.3   51.7   49.2       57.4
Overall   P               23.3   40.6   46.8      62.3   46.9   57.4   54.1       47.0
Overall   R               27.2   30.9   23.5      41.9   42.4   39.2   39.6       54.8
Overall   F               32.2   35.1   31.3      50.1   44.6   46.6   45.7       50.6

As opposed to the relative correct overlap evaluation, where the ATR and ASS/ATR methods obtained similar performances, the ASS/ATR method clearly outclassed the ATR method regarding boundary retrieval accuracy. The best overall F-measure of the ASS/ATR method (50.1% with LSF = 8) is approximately 15 percentage points higher than the one obtained with the ATR method (35.1% for LSF = 16). This shows the benefit of using the automatic structural segmenter prior to the timbre recognition stage to locate the transitions between the speech and music sections. As in the previous set of experiments, the best configuration is obtained with a small number of LSF features (ASS/ATR method with LSF = 8), which stems from the fact that the boundary positions are a consequence of the classification decisions. For all the tested podcasts, the ASS/ATR method yields a better precision than the SVM-based algorithm. The most notable difference occurs for the second podcast, where the precision of the ASS/ATR method (72.7%) is approximately 14 percentage points higher than the one obtained with the SVM-based algorithm (58.2%). The resulting increase in overall precision achieved with the ASS/ATR method (62.3%) compared with the SVM-based method (47.0%) is approximately 15 percentage points. The SVM-based method however obtains a better overall boundary recall (54.8%) than the ASS/ATR method (42.4%), which makes the boundary F-measures of both methods very close (50.6% and 50.1%, respectively).

6 Summary and Conclusions


We proposed two methods for speech/music discrimination based on timbre models and machine learning techniques and compared their performances on audio podcasts. The first method (ATR) relies on automatic timbre recognition
(LSF/K-means) and median filtering. The second method (ASS/ATR) performs an automatic structural segmentation (MFCC, RMS / HMM, K-means) before applying the timbre recognition system. The algorithms were tested on more than 2.5 hours of speech and music content extracted from popular and classical music podcasts from the BBC. Some of the music tracks contained a predominant singing voice, which can be a source of confusion with the spoken voice. The algorithms were evaluated both at the semantic level, to measure the quality of the retrieved segment-type labels (classification relative correct overlap), and at the temporal level, to measure the accuracy of the retrieved boundaries between sections (boundary retrieval F-measure). Both methods obtained similar and relatively high segment-type labeling performances. The ASS/ATR method led to an RCO of 92.8% for speech and 96.2% for music, yielding an average performance of 94.5%. The boundary retrieval performances were higher for the ASS/ATR method (F-measure = 50.1%), showing the benefit of using a structural segmentation technique to locate transitions between different timbral qualities. The results were compared against the SVM-based algorithm proposed in [44], which provides a good benchmark for state-of-the-art speech/music discriminators. The performances obtained by the ASS/ATR method were approximately 3 percentage points lower than those obtained with the SVM-based method for the segment-type labeling evaluation, but led to a better boundary retrieval precision (approximately 15 percentage points higher).
The boundary retrieval scores were clearly lower for the three compared methods, relative to the segment-type labeling performances, which were fairly high, with up to 100% of correct identifications in some cases. Future work will be dedicated to refining the accuracy of the section boundaries, either by performing a new analysis of the feature variations locally around the retrieved boundaries, or by including descriptors complementary to the timbre ones, e.g. rhythmic information such as the tempo, whose fluctuations around speech/music transitions may provide complementary cues to detect them accurately. The discrimination of intricate mixtures of music, speech, and sometimes strong post-production sound effects (e.g. in the case of jingles) will also be investigated.

Acknowledgments. This work was partly funded by the Musicology for the
Masses (M4M) project (EPSRC grant EP/I001832/1, http://www.elec.qmul.
ac.uk/digitalmusic/m4m/), the Online Music Recognition and Searching 2
(OMRAS2) project (EPSRC grant EP/E017614/1, http://www.omras2.org/),
and a studentship (EPSRC grant EP/505054/1). The authors wish to thank
Matthew Davies from the Centre for Digital Music for sharing his F-measure
computation Matlab toolbox, as well as György Fazekas for fruitful discussions
on the structural segmenter. Many thanks to Mathieu Ramona from the Institut
de Recherche et Coordination Acoustique Musique (IRCAM) for sending us the
results obtained with his speech/music segmentation algorithm.

References
1. Ajmera, J., McCowan, I., Bourlard, H.: Robust HMM-Based Speech/Music Seg-
mentation. In: Proc. ICASSP 2002, vol. 1, pp. 297–300 (2002)
2. Alexandre-Cortizo, E., Rosa-Zurera, M., Lopez-Ferreras, F.: Application of
Fisher Linear Discriminant Analysis to Speech Music Classification. In: Proc.
EUROCON 2005, vol. 2, pp. 1666–1669 (2005)
3. ANSI: USA Standard Acoustical Terminology. American National Standards In-
stitute, New York (1960)
4. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Acoustical Correlates
of Timbre and Expressiveness in Clarinet Performance. Music Perception 28(2),
135–153 (2010)
5. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Analysis-by-Synthesis
of Timbre, Timing, and Dynamics in Expressive Clarinet Performance. Music Per-
ception 28(3), 265–278 (2011)
6. Barthet, M., Guillemain, P., Kronland-Martinet, R., Ystad, S.: From Clarinet Con-
trol to Timbre Perception. Acta Acustica United with Acustica 96(4), 678–689
(2010)
7. Barthet, M., Sandler, M.: Time-Dependent Automatic Musical Instrument Recog-
nition in Solo Recordings. In: 7th Int. Symposium on Computer Music Modeling
and Retrieval (CMMR 2010), Malaga, Spain, pp. 183–194 (2010)
8. Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., Sandler, M.: A
Tutorial on Onset Detection in Music Signals. IEEE Transactions on Speech and
Audio Processing (2005)
9. Burred, J.J., Lerch, A.: Hierarchical Automatic Audio Signal Classification. Journal
of the Audio Engineering Society 52(7/8), 724–739 (2004)
10. Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic Correlates of Timbre
Space Dimensions: A Confirmatory Study Using Synthetic Tones. J. Acoust. Soc.
Am. 118(1), 471–482 (2005)
11. Cannam, C.: Queen Mary University of London: Sonic Annotator, http://omras2.
org/SonicAnnotator
12. Cannam, C.: Queen Mary University of London: Sonic Visualiser, http://www.
sonicvisualiser.org/
13. Cannam, C.: Queen Mary University of London: Vamp Audio Analysis Plugin
System, http://www.vamp-plugins.org/
14. Carey, M., Parris, E., Lloyd-Thomas, H.: A Comparison of Features for Speech,
Music Discrimination. In: Proc. of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), vol. 1, pp. 149–152 (1999)
15. Castellengo, M., Dubois, D.: Timbre ou Timbres? Propriété du Signal, de
l’Instrument, ou Construction Cognitive (Timbre or Timbres? Property of the
Signal, the Instrument, or Cognitive Construction?). In: Proc. of the Conf. on
Interdisciplinary Musicology (CIM 2005), Montréal, Québec, Canada (2005)
16. Chétry, N., Davies, M., Sandler, M.: Musical Instrument Identification using LSF
and K-Means. In: Proc. AES 118th Convention (2005)
17. Childers, D., Skinner, D., Kemerait, R.: The Cepstrum: A Guide to Processing.
Proc. of the IEEE 65, 1428–1443 (1977)
18. Davies, M.E.P., Degara, N., Plumbley, M.D.: Evaluation Methods for Musical Au-
dio Beat Tracking Algorithms. Technical report C4DM-TR-09-06, Queen Mary
University of London, Centre for Digital Music (2009), http://www.eecs.qmul.
ac.uk/~matthewd/pdfs/DaviesDegaraPlumbley09-evaluation-tr.pdf

19. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Mono-
syllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing ASSP-28(4), 357–366 (1980)
20. El-Maleh, K., Klein, M., Petrucci, G., Kabal, P.: Speech/Music Discrimination for
Multimedia Applications. In: Proc. ICASSP 2000, vol. 6, pp. 2445–2448 (2000)
21. Fazekas, G., Sandler, M.: Intelligent Editing of Studio Recordings With the Help
of Automatic Music Structure Extraction. In: Proc. of the AES 122nd Convention,
Vienna, Austria (2007)
22. Galliano, S., Georois, E., Mostefa, D., Choukri, K., Bonastre, J.F., Gravier, G.:
The ESTER Phase II Evaluation Campaign for the Rich Transcription of French
Broadcast News. In: Proc. Interspeech (2005)
23. Gauvain, J.L., Lamel, L., Adda, G.: Audio Partitioning and Transcription for
Broadcast Data Indexation. Multimedia Tools and Applications 14(2), 187–200
(2001)
24. Grey, J.M., Gordon, J.W.: Perception of Spectral Modifications on Orchestral In-
strument Tones. Computer Music Journal 11(1), 24–31 (1978)
25. Hain, T., Johnson, S., Tuerk, A., Woodland, P.C., Young, S.: Segment Generation
and Clustering in the HTK Broadcast News Transcription System. In: Proc. of the
DARPA Broadcast News Transcription and Understanding Workshop, pp. 133–137
(1998)
26. Hajda, J.M., Kendall, R.A., Carterette, E.C., Harshberger, M.L.: Methodologi-
cal Issues in Timbre Research. In: Deliége, I., Sloboda, J. (eds.) Perception and
Cognition of Music, 2nd edn., pp. 253–306. Psychology Press, New York (1997)
27. Handel, S.: Hearing. In: Timbre Perception and Auditory Object Identification,
2nd edn., pp. 425–461. Academic Press, San Diego (1995)
28. Harte, C.: Towards Automatic Extraction of Harmony Information From Music
Signals. Ph.D. thesis, Queen Mary University of London (2010)
29. Helmholtz, H.v.: On the Sensations of Tone. Dover, New York (1954); (from the
works of 1877). English trad. with notes and appendix from E.J. Ellis
30. Houtgast, T., Steeneken, H.J.M.: The Modulation Transfer Function in Room
Acoustics as a Predictor of Speech Intelligibility. Acustica 28, 66–73 (1973)
31. Itakura, F.: Line Spectrum Representation of Linear Predictive Coefficients of
Speech Signals. J. Acoust. Soc. Am. 57(S35) (1975)
32. Jarina, R., O’Connor, N., Marlow, S., Murphy, N.: Rhythm Detection For Speech-
Music Discrimination In MPEG Compressed Domain. In: Proc. of the IEEE 14th
International Conference on Digital Signal Processing (DSP), Santorini (2002)
33. Kedem, B.: Spectral Analysis and Discrimination by Zero-Crossings. Proc.
IEEE 74, 1477–1493 (1986)
34. Kim, H.G., Berdahl, E., Moreau, N., Sikora, T.: Speaker Recognition Using MPEG-
7 Descriptors. In: Proc. of EUROSPEECH (2003)
35. Levy, M., Sandler, M.: Structural Segmentation of Musical Audio by Constrained
Clustering. IEEE. Transac. on Audio, Speech, and Language Proc. 16(2), 318–326
(2008)
36. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE
Transactions on Communications 28, 702–710 (1980)
37. Lu, L., Jiang, H., Zhang, H.J.: A Robust Audio Classification and Segmentation
Method. In: Proc. ACM International Multimedia Conference, vol. 9, pp. 203–211
(2001)
38. Marozeau, J., de Cheveigné, A., McAdams, S., Winsberg, S.: The Dependency
of Timbre on Fundamental Frequency. Journal of the Acoustical Society of
America 114(5), 2946–2957 (2003)

39. Mauch, M.: Automatic Chord Transcription from Audio using Computational
Models of Musical Context. Ph.D. thesis, Queen Mary University of London (2010)
40. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual
Scaling of Synthesized Musical Timbres: Common Dimensions, Specificities, and
Latent Subject Classes. Psychological Research 58, 177–192 (1995)
41. Music Information Retrieval Evaluation Exchange Wiki: Structural Segmentation
(2010), http://www.music-ir.org/mirex/wiki/2010:Structural_Segmentation
42. Peeters, G.: Automatic Classification of Large Musical Instrument Databases Us-
ing Hierarchical Classifiers with Inertia Ratio Maximization. In: Proc. AES 115th
Convention, New York (2003)
43. Queen Mary University of London: QM Vamp Plugins, http://www.omras2.org/
SonicAnnotator
44. Ramona, M., Richard, G.: Comparison of Different Strategies for a SVM-Based
Audio Segmentation. In: Proc. of the 17th European Signal Processing Conference
(EUSIPCO 2009), pp. 20–24 (2009)
45. Risset, J.C., Wessel, D.L.: Exploration of Timbre by Analysis and Synthesis. In:
Deutsch, D. (ed.) Psychology of Music, 2nd edn. Academic Press, London (1999)
46. Saunders, J.: Real-Time Discrimination of Broadcast Speech Music. In: Proc.
ICASSP 1996, vol. 2, pp. 993–996 (1996)
47. Schaeffer, P.: Traité des Objets Musicaux (Treaty of Musical Objects). Éditions
du seuil (1966)
48. Scheirer, E., Slaney, M.: Construction and Evaluation of a Robust Multifeature
Speech/Music Discriminator. In: Proc. ICASSP 1997, vol. 2, pp. 1331–1334 (1997)
49. Slawson, A.W.: Vowel Quality and Musical Timbre as Functions of Spectrum En-
velope and Fundamental Frequency. J. Acoust. Soc. Am. 43(1) (1968)
50. Sundberg, J.: Articulatory Interpretation of the ‘Singing Formant’. J. Acoust. Soc.
Am. 55, 838–844 (1974)
51. Terasawa, H., Slaney, M., Berger, J.: A Statistical Model of Timbre Perception.
In: ISCA Tutorial and Research Workshop on Statistical And Perceptual Audition
(SAPA 2006), pp. 18–23 (2006)
Computer Music Cloud

Jesús L. Alvaro (1) and Beatriz Barros (2)

(1) Computer Music Lab, Madrid, Spain
    [email protected]
    http://cml.fauno.org
(2) Departamento de Lenguajes y Ciencias de la Computación,
    Universidad de Málaga, Spain
    [email protected]

Abstract. The present paper puts forward a proposal for a computer music (CM) composition system on the Web. Setting off from the CM composition paradigm used so far, and on the basis of the current shift of computer technology toward cloud computing, a new paradigm opens up for the CM composition domain. An experience of a computer music cloud (CMC) is described: the whole music system is split into several web services sharing a unique music representation. MusicJSON is proposed as the interchange music data format, based on the solid and flexible EvMusic representation. A web-browser-based graphic environment is developed as the user interface to the Computer Music Cloud, in the form of music web applications.

Keywords: Music Representation, Cloud Computing, Computer Music,


Knowledge Representation, Music Composition, UML, Distributed Data,
Distributed Computing, Creativity, AI.

1 Computer Aided Composition

Computers offer composers multiple advantages, from score notation to sound


synthesis, algorithmic composition and music artificial intelligence (MAI) exper-
imentation. Fig. 1 shows the basic structure of a generalized CM composition
system. In this figure, music composition is intentionally divided into two func-
tional processes: a composer-level computation and a performance-level compo-
sition [8]. Music computation systems, like algorithmic music programs, are used
to produce music materials in an intermediate format, usually a standard MIDI file (SMF). These composition materials are combined and post-produced for performance by means of a music application, which finally produces a score or a sound rendering of the composition. In some composition systems the intermediate
format is not so evident, because the same application carries out both func-
tions, but in terms of music representation, some symbols representing music
entities are used for computation. This internal music representation determines
the creative capabilities of the system.


Fig. 1. Basic Structure of a CM Composition System

There are many different approaches for a CM composition system, as well


as multiple languages and music representations. As shown in Fig. 3, our CM
system has substantially evolved during the last 12 years [1]. Apart from musical
and creative requirements, these changes have progressively accommodated tech-
nology changes and turned the system into a distributed computing approach.
Computer-assisted music composition and platforms have evolved for 50 years.
Mainframes were used at the beginning and personal computers (PCs) arrived
in the 1980s, bringing computation to the general public. With the development of networks, the Internet has gained more and more importance in present-day Information Technology (IT) and now dominates the landscape, making the geographical location of IT resources irrelevant. This paper is aimed at presenting a proposal for computer music (CM) composition on the Web. Starting from the CM paradigm used so far, and on the basis of the current shift of computer technology toward cloud computing, a new paradigm opens up for the CM domain. The
paper is organized as follows: the next section describes the concept of cloud
computing, thus locating the work within the field of ITs. Then, section 3 in-
troduces the EV representation, which is the basis of the proposed composition
paradigm, explained in section 4. Next, an example is sketched in section 5, while
section 6 presents the MusicJSON music format: the interchangeable music data
format based on the solid and flexible EvMusic representation. The paper ends
with some conclusions and ideas for future research.

2 Cloud Computing
IT continues to evolve. Cloud computing, a new term defined in various different ways [8], involves a new paradigm in which computer infrastructure and software are provided as a service [5]. These services themselves have been referred to as Software as a Service (SaaS). Google Apps is a clear example of SaaS [10]. Computation infrastructure is also offered as a service (IaaS), thus enabling the user to run custom software. Several providers currently offer resizable
compute capacity as a Public Cloud, such as the Amazon Elastic Compute Cloud
(EC2) [4] and the Google AppEngine [9].
This situation offers new possibilities for both software developers and users.
For instance, this paper was written and revised in GoogleDocs [11], a Google

web service offering word processing capabilities online. The information is no


longer stored in local hard discs but in Google servers. The only software users
need is a standard web browser. Could this computing in the cloud approach
be useful for music composition? What can it offer? What type of services does
a music composition cloud consist of? What music representation should they
share? What data exchange format should be used?

3 EvMusic Representation

The first step when planning a composition system should be choosing a proper music representation, since the chosen representation will set the boundaries of the system's capabilities. Accordingly, our CM research developed a solid and versatile representation for music composition. The EvMetamodel [3] was used to model the music knowledge representation behind EvMusic. An in-depth analysis of music knowledge was first carried out to ensure that the representation meets music composition requirements. This multilevel representation is not only compatible with traditional notation but also capable of representing highly abstract music elements. It can also represent symbolic pitch entities [1] from both music theory and algorithmic composition, keeping the door open to the representation of higher-level symbolic music elements conceived by the composer's creativity. It is based on real composition experience and was designed to support CM composition, including experiences in musical artificial intelligence (MAI).
The current music representation is described in a platform-independent UML format [15]. It is therefore not confined to its original LISP system, but can be used in any system or language: a valuable feature when approaching a cloud system.

Fig. 2. UML class diagram of the EvMetamodel



Fig. 2 is a UML class diagram of the representation core of the EvMetamodel, the base representation for the time dimension. The three main classes are shown: event, parameter and dynamic object. High-level music elements are represented as subclasses of metaevent, the interface which provides the develop functionality. The special dynamic object changes is also shown. It is a very useful option for the graphic editing of parameters, since it represents a dynamic object as a sequence of parameter-change events which can easily be moved in time.
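As a rough illustration of these core classes, the sketch below renders them in Python. The class and attribute names are loose interpretations of the UML diagram; the actual EvMusic implementation is in LISP and is not reproduced here.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Event:
    """An object placed on the time axis; it may contain nested child events."""
    pos: float
    pars: Dict[str, object] = field(default_factory=dict)
    events: List["Event"] = field(default_factory=list)

class DynamicObject:
    """A parameter value that evolves in time."""
    def value_at(self, t: float):
        raise NotImplementedError

@dataclass
class Changes(DynamicObject):
    """Dynamic object expressed as a sequence of parameter-change events,
    which makes graphic editing (moving individual changes in time) easy."""
    change_events: List[Event] = field(default_factory=list)

    def value_at(self, t: float):
        current = None
        for ev in sorted(self.change_events, key=lambda e: e.pos):
            if ev.pos <= t:
                current = ev.pars.get("value", current)
        return current

class MetaEvent(Event):
    """High-abstraction element; develop() turns it into lower-level events."""
    def develop(self) -> List[Event]:
        raise NotImplementedError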

4 Composing in the Cloud: A New Paradigm


Our CM system has undergone several changes since its beginnings, back in 1997. Fig. 3 shows the evolution of formats, platforms and technologies toward the current proposal. This figure follows the same horizontal scheme as Figure 1: the first column indicates the user interface for music input, the second shows the music computation system and its evolution over recent years, and the music intermediate format is reported in the central column. Post-production utilities and their final results are shown in the last two columns, respectively.
Fig. 3 clearly shows the evolution undergone by the system: first, a process of service diversification and specialisation, mainly at the post-production stage; second, as a trend in user input, CM is demanding graphic environments. Finally, technologies and formats have undergone multiple changes. The reason behind most of these changes can be found in external technology advances and the need to accommodate these new situations.

Fig. 3. Evolution of our CM Composition System



At times, the needed tool or library was simply not available; at other times, the available tool was suitable at that moment but offered no long-term availability. As stated above, the recent shift of IT toward cloud computing brings new opportunities for evolution. In CM, system development can benefit from distributed and specialized computing. Splitting the system into several specialized services avoids the limitations imposed by a single programming language or platform: individual music services can be developed and evolved independently of each other, and each component service can be implemented on the most appropriate platform for its particular task, without being conditioned by the technologies required to implement the other services. In the previous paradigm, all services were performed by one single system, and the selection of technologies for a particular task affected or even conditioned the implementation of the other tasks. The new approach frees the system design, making it more platform-independent. In addition, widely available tools can be used for specific tasks, thus benefiting from tool development in other areas such as database storage and web application design.

4.1 Computer Music Cloud (CMC)


Fig. 4 shows a Computer Music Cloud (CMC) for composition as an evolution of the scheme in Fig. 3. The system is distributed across specialized online services. The user interface is now a web application running in a standard browser. A storage service is used as an editing memory. A dedicated intelligent service is allocated to music calculation and development. Output formats such as MIDI, graphic scores and sound files are rendered by independent services exclusively devoted to these tasks. The web application includes user sessions to allow multiple users to use the system. Both public and user libraries are also provided for music objects. Intermediate music elements can be stored in the library and also serialized into a MusicJSON file, as described below.
An advantage of this approach is that a single service can serve different CMC systems. The design of new music systems is therefore facilitated by the joint operation of different services controlled by a web application. The key factor for successful integration lies in the use of a well-defined, suitable music representation for music data exchange.
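To make the data flow concrete, here is a minimal, hypothetical sketch of a client combining two such services over HTTP. The endpoint URLs, the response field names and the use of the Python requests library are all assumptions for illustration; the actual CMC client is a JavaScript web application.

import requests  # generic HTTP client, used here purely for illustration

# Hypothetical service endpoints; the real CMC service URLs are not given in the paper.
STORAGE_URL = "http://storage.example.org"
MIDI_URL = "http://midi.example.org"

def store_and_render(part_musicjson: dict) -> bytes:
    """Store a MusicJSON part in the editing storage service, then ask the
    MIDI output service to render it for quick audible feedback."""
    stored = requests.post(f"{STORAGE_URL}/objects", json=part_musicjson)
    stored.raise_for_status()
    ref = stored.json()["ref"]                       # assumed response field
    rendered = requests.get(f"{MIDI_URL}/render", params={"ref": ref})
    rendered.raise_for_status()
    return rendered.content                          # standard MIDI file bytes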

4.2 Music Web Services


In the usual cloud computing terminology, each of these services can be considered a Music-computing as a Service (MaaS) component. In their simplest form, they are servers receiving a request and performing a task. At the end of the task, the resulting objects are returned to the stream or stored in an exchange database. Access to this database is a valuable feature for services, since it allows complex services to be defined. They could even be converted into real MAI agents (i.e., intelligent devices which can perceive their environment, make decisions and act inside it) [17]. Storing the music composition as a virtual environment in a database allows music services to interact within the composition, thus opening a door toward a MAI system of distributed agents. The music services of the cloud are classified below according to their function.

Fig. 4. Basic Structure of a CMC

Input. This group includes the services aimed particularly at incorporating new
music elements and translating them from other input formats.

Agents. These are the services capable of inspecting and modifying the music composition, as well as introducing new elements. This type includes human user interfaces, but may also include other intelligent elements taking part in the composition by introducing decisions, suggestions or modifications. In our prototype, we have developed a web application acting as a user interface for editing music elements. This service is described in the next section.

Storage. At this step, only music object instances and relations are stored,
but the hypothetical model also includes procedural information. Three music
storage services are implemented in the prototype. The main lib stores shared music elements as global definitions; this content may be seen as some kind of music culture. User-related music objects are stored in the user lib; these include music objects defined by the composer, which can be reused in several parts and complete music sections or which represent the composer's style. The editing storage service is provided as temporary storage for the editing session: the piece is progressively composed in the database, and the composition environment (i.e., everything related to the piece under composition) lives there. This is the environment in which several composing agents can integrate and interact by reading and writing this database-stored score. All three storage services in this experience were written in Python and deployed in the cloud with Google AppEngine.
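A minimal sketch of what one such storage service could look like is given below. It deliberately uses Flask and an in-memory dictionary instead of the Google AppEngine framework and datastore used in the paper, and the routes and field names are illustrative assumptions only.

from flask import Flask, jsonify, request
import itertools

app = Flask(__name__)
records = {}                       # ref -> stored MusicJSON record (in memory only)
counter = itertools.count(1)

@app.route("/objects", methods=["POST"])
def store_object():
    """Store one music object record and return its unique reference."""
    ref = str(next(counter))
    records[ref] = dict(request.get_json(), ref=ref)
    return jsonify({"ref": ref}), 201

@app.route("/objects/<ref>", methods=["GET"])
def get_object(ref):
    """Retrieve a single record, e.g. for another service or editor."""
    return jsonify(records[ref]) if ref in records else ("not found", 404)

@app.route("/objects", methods=["GET"])
def list_children():
    """List records whose 'parent' field matches the query parameter,
    which is enough to rebuild a stored tree of nested events."""
    parent = request.args.get("parent")
    return jsonify([r for r in records.values() if r.get("parent") == parent])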

Development. The services in this group perform development processes. As


explained in [8], development is the process by which higher-abstraction symbolic
elements are turned into lower-abstraction elements. High-abstraction symbols

are implemented as meta-events and represent music objects such as motives,


segments and other composing abstractions [3]. In this prototype, the entire
EvMusic LISP implementation is provided as a service. Other intelligent services, such as constraint solvers or genetic algorithms, may also be incorporated in this group.

Output. These services produce output formats in response to requests from other services. They work in a two-level scheme. At the first level, they render formats for immediate composer feedback, such as MIDI playback or .PNG graphic notation of the element currently under edition. Composition products such as the whole score or the audio rendering are produced at the second level. In this experience, a MIDI service is implemented in Python and run on Google AppEngine; it returns an SMF for quick audible monitoring. A LISP-written score service is also available. It uses the FOMUS [16] and LilyPond [14] libraries and runs in an Ubuntu Linux image [18] on the Amazon cloud, producing graphic scores from the music data as .PNG and .PDF files. Also included in this output group, the MusicJSON serializer service produces a music exchange file, as described in the next section.

5 User Interface: A Music Web Application


What technologies are behind successful, widespread web cloud applications such as GoogleDocs? What type of exchange format do they use? JavaScript [13] is a dialect of the ECMAScript standard [6] supported by almost all web browsers, and it is the key tool behind these web clients. In this environment, the code for the client is downloaded from a server and then run in the web browser as a dynamic script that keeps communicating with web services. Thus, the browser window behaves as a user-interface window for the system.

5.1 EvEditor Client


The Extjs library [7] was chosen as a development framework for client side
implementation. The web application takes the shape of a desktop with windows
archetype (i.e., a well-tested approach and an intuitive interface environment
we would like to benefit from, but within a web-browser window). The main
objective of its implementation was not only producing a suitable music editor
for a music cloud example but also a whole framework for the development
of general-purpose object editors under an OOP approach. Once this aim was
reached, different editors could be subclassed with the customizations required
from the object type. Extjs is a powerful JavaScript library including a vast
collection of components for user interface creation. Elements are defined in a
hierarchic class system, which offers a great solution for our OOP approach.
That is, all EvEditor components are subclassed from ExtJS classes. As shown
in Fig. 5, EvEditor client is a web application consisting of three layers. The
bottom or data layer contains an editor proxy for data under current edition. It
duplicates the selected records in the remote storage service. The database in the

Fig. 5. Structure of the EvEditor client (proxy & data, component layer and DOM layers)

The database in the remote storage service is synched with the edits and updates that the editor writes to its proxy. Several editors can share the same proxy, so all listening editors are updated when the data in the proxy are modified. The intermediate layer is a symbolic zone for all components. It includes both graphic interface components, such as editor windows and container views, and robjects, i.e. representations of the objects under current edition. Interface components are subclassed from ExtJS components. Every editor is displayed in its own window on the working desktop and optionally contains a contentView displaying its child objects as robjects.

Fig. 6. Screen Capture of EvEditor

Fig. 6 is a browser-window capture showing the working desktop and some editor windows. The application menu is shown in the lower left-hand corner, including the user settings. The central area of the capture shows a diatonic sequence editor based on our TclTk editor [1]. A list editor and a form-based note editor are also shown. In the third layer, the DOM (Document Object Model) [19], all components are rendered as DOM elements (i.e., HTML document elements to be visualized).

5.2 Server Side


The client script code is dynamically provided by the server side of the web
application. It is written in Python and can run as a GoogleApp for integration
into the cloud. All user-related tasks are managed by this server side. It identifies
the user, and manages sessions, profiles and environments.

6 MusicJSON Music Format


As explained above, music services can be developed in the selected platform.
The only requirement for the service to be integrated into the music composition
cloud is that it must use the same music representation. EvMusic representation
is proposed for this purpose. This section describes how EvMusic objects are
stored in a database and transmitted over the cloud.

6.1 Database Object Storage


Database storage allows several services to share the same data and to collaborate in the composition process. The information stored in a database is organized in tables of records. To store EvMusic tree structures in a database, they must first be converted into records. For this purpose, the three main classes of the Ev representation are subclassed from a tree node class, as shown in Fig. 2. Thus, every object is identified by a unique reference and carries a parent attribute. This allows a large tree structure of nested events to be represented as a set of records that can be retrieved or updated individually.
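The following sketch illustrates the idea with plain Python dictionaries: a nested structure of events is flattened into records carrying a unique ref and a parent attribute, and can be rebuilt from them. The record layout is an illustration, not the actual datastore schema.

import itertools

_refs = itertools.count(1)

def to_records(obj, parent=None, out=None):
    """Flatten a nested event structure into flat records (ref + parent),
    so individual objects can be stored, retrieved or updated on their own."""
    out = [] if out is None else out
    record = {k: v for k, v in obj.items() if k != "events"}
    record["ref"], record["parent"] = next(_refs), parent
    out.append(record)
    for child in obj.get("events", []):
        to_records(child, parent=record["ref"], out=out)
    return out

def to_tree(records, parent=None):
    """Rebuild the nested structure from its stored records."""
    nodes = []
    for rec in records:
        if rec.get("parent") == parent:
            node = {k: v for k, v in rec.items() if k not in ("ref", "parent")}
            children = to_tree(records, parent=rec["ref"])
            if children:
                node["events"] = children
            nodes.append(node)
    return nodes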

6.2 MusicJSON Object Description


Web applications usually use XML and JSON (JavaScript Object Notation) for data exchange [12]. Both formats meet the requirements. However, two reasons supported our inclination for JSON: 1) the large tool library available for JSON at the time of this writing, and 2) the fact that JSON is offered as the exchange format by some of the main Internet web services, such as Google or Yahoo. The second reason itself stems from JSON's features, such as human readability and support for dynamic, unclosed objects, a very valuable property inherited from the prototype-based nature of JavaScript.
JSON can be used to describe EvMusic objects and to communicate among web music services; MusicJSON [2] is the name given to this use. As a simple example, the code below shows the description of the short music fragment shown in Fig. 7. As can be seen, the code is self-explanatory.

{"pos":0, "objclass": "part","track":1, "events":[


{"pos":0, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":0.5, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":1, "objclass": "pitch":"g4","dur":0.75,
"dyn":"mf","legato":"start"},
{"pos":1.75, "objclass": "pitch":"f#4","dur":0.25
"legato":"end"},
{"pos":2, "objclass": "note","pitch":"g4","dur":0.5,
"art":"stacatto"}
{"pos":2.5, "objclass": "note","pitch":"a4","dur":0.5,
"art":"stacatto"}
{"pos":3, "objclass":"nchord","dur":1,"pitches":[
{"objclass": "spitch","pitch": "d4"},
{"objclass": "spitch","pitch": "b4"}]
}
]
}

Fig. 7. Score notation of the example code
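As a quick sanity check of the example above, the short sketch below walks such a MusicJSON part and lists its note events (chords are expanded into one entry per pitch). It assumes that pos and dur are expressed in beats relative to the enclosing object, which is an interpretation for illustration, not a statement of the MusicJSON specification; the file name is also hypothetical.

import json

def flatten_notes(obj, offset=0.0):
    """Return (onset, pitch, duration) triples for every note in a MusicJSON part."""
    notes = []
    for ev in obj.get("events", []):
        pos = offset + ev.get("pos", 0.0)
        if ev.get("objclass") == "note":
            notes.append((pos, ev["pitch"], ev.get("dur", 0.0)))
        elif ev.get("objclass") == "nchord":
            for sp in ev.get("pitches", []):
                notes.append((pos, sp["pitch"], ev.get("dur", 0.0)))
        elif "events" in ev:                      # nested part or section
            notes.extend(flatten_notes(ev, offset=pos))
    return notes

part = json.loads(open("fragment.json").read())   # the example part, saved to a file
for onset, pitch, dur in flatten_notes(part):
    print(f"{onset:5.2f}  {pitch:4s}  {dur}")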

6.3 MusicJSON File


Every EvMusic object, from single notes to complex structures, can be serialized into a MusicJSON text and subsequently transmitted over the Web. In addition, MusicJSON can be used as an intermediate format for the local storage of compositions. The next code listing shows a draft example of the proposed EvMusic file description.
{"objclass":"evmusicfile","ver":"0911","content":
{"lib":{
"instruments":"http://evmusic.fauno.org/lib/main/instruments",
"pcstypes": "http://evmusic.fauno.org/lib/main/pcstypes",
"mypcs": "http://evmusic.fauno.org/lib/jesus/pcstypes",
"mymotives": "http://evmusic.fauno.org/lib/jesus/motives"
},
"def":{
"ma": {"objclass":"motive",
"symbol":[ 0,7, 5,4,2,0 ],
"slength": "+-+- +-++ +---"},
"flamenco": {"objclass":"pcstype",
"pcs":[ 0,5,7,13 ],},
},
"orc":{
"flauta": {"objclass":"instrument",
"value":"x.lib.instruments.flute",
"role":"r1"}
"cello": {"objclass":"instrument",
"value":"x.lib.instruments.cello",
"role":"r2"}
},
"score":[
{"pos": 0,"objclass":"section",
"pars":[
"tempo":120,"dyn":"mf","meter":"4/4",
...
],
"events":[
{"pos":0, "objclass": "part","track":1,"role":"i1",
"events":[
... ]
... ]},
{"pos": 60,"objclass":"section","ref":"s2",
...
},],}}}

The content of the file comprises four sections. The library section lists the libraries to be loaded, which contain object definitions; both main and user libraries can be addressed. The following section includes local object definitions; as an example, a motive and a chord type are defined. The next section establishes the instrumentation assignments by means of the arrangement object role. The last section is the score itself, where all events are placed in a tree structure using parts. Using MusicJSON as the intermediary communication format enables several music services to be connected, forming a cloud composition system.

7 Conclusion

The present paper puts forward an experience of music composition under a dis-
tributed computation approach as a viable solution for Computer Music Com-
position in the Cloud. The system is split into several music services hosted in
common IaaS providers such as Google or Amazon. Different music systems can
be built by joint operation of some of these music services in the cloud.
In order to cooperate and deal with music objects, each service in the music cloud must understand the same music knowledge. The music knowledge representation they share must therefore be standardized. The EvMusic representation is proposed for this purpose, since it is a solid multilevel representation successfully tested in real CM compositions in recent years.
Furthermore, MusicJSON is proposed as an exchange data format between
services. Example descriptions of music elements, as well as a file format for
local saving of a musical composition, are given. A graphic environment is also
proposed for the creation of user interfaces for object editing as a web application.
As an example, the EvEditor application is described.
This CMC approach opens multiple possibilities for derivative work. New music creation interfaces can be developed as web applications benefiting from upcoming web technologies such as the promising HTML5 standard [20]. The described music cloud, together with the EvMusic representation, provides a base environment for MAI research, where specialised agents can cooperate in a music composition environment sharing the same music representation.

References
1. Alvaro, J.L. : Symbolic Pitch: Composition Experiments in Music Representation.
Research Report, http://cml.fauno.org/symbolicpitch.html (retrieved Decem-
ber 10, 2010) (last viewed February 2011)
2. Alvaro, J.L., Barros, B.: MusicJSON: A Representation for the Computer Mu-
sic Cloud. In: Proceedings of the 7th Sound and Music Computer Conference,
Barcelona (2010)
3. Alvaro, J.L., Miranda, E.R., Barros, B.: Music knowledge analysis: Towards an
efficient representation for composition. In: Marı́n, R., Onaindı́a, E., Bugarı́n, A.,
Santos, J. (eds.) CAEPIA 2005. LNCS (LNAI), vol. 4177, pp. 331–341. Springer,
Heidelberg (2006)

4. Amazon Elastic Computing, http://aws.amazon.com/ec2/ (retrieved February 1, 2010) (last viewed February 2011)
5. Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A.,
Lee, G., Patterson, D. A., Rabkin, A., Stoica, I., Zaharia, M.: Above the Clouds:
A Berkeley View of Cloud Computing White Paper, http://www.eecs.berkeley.
edu/Pubs/TechRpts/2009/EECS-2009-28.pdf (retrieved February 1, 2010) (last
viewed February 2011)
6. ECMAScript Language Specification, http://www.ecma-international.org/
publications/standards/Ecma-262.htm (retrieved February 1, 2010) (last viewed
February 2011)
7. ExtJS Library, http://www.extjs.com/ (retrieved February 1, 2010) (last viewed
February 2011)
8. Geelan, J.: Twenty Experts Define Cloud Computing. Cloud Computing Jour-
nal (2008), http://cloudcomputing.sys-con.com/node/612375/print (retrieved
February 1, 2010) (last viewed February 2011)
9. Google AppEngine, http://code.google.com/appengine/ (retrieved February 1,
2010) (last viewed February 2011)
10. Google Apps, http://www.google.com/apps/ (retrieved February 1, 2010) (last
viewed February 2011)
11. Google Docs, http://docs.google.com/ (retrieved February 1, 2010 (last viewed
February 2011)
12. Introducing JSON, http://www.json.org/ (retrieved February 1, 2010) (last
viewed February 2011)
13. JavaScript, http://en.wikipedia.org/wiki/JavaScript (retrieved February 1,
2010) (last viewed February 2011)
14. Nienhuys, H.-W., Nieuwenhuizen J.: GNU Lilypond, http://www.lilypond.org
(retrieved February 1, 2010) (last viewed February 2011)
15. OMG: Unified Modeling Language: Superstructure. Version 2.1.1(2007), http://
www.omg.org/uml (retrieved February 1, 2010) (last viewed February 2011)
16. Psenicka, D.: FOMUS, a Music Notation Package for Computer Music Com-
posers, http://fomus.sourceforge.net/doc.html/index.html (retrieved Febru-
ary 1, 2010) (last viewed, February 2011)
17. Russell, S.J., Norvig, P.: Intelligent Agents. In: Artificial Intelligence: A Modern
Approach, ch. 2. Prentice-Hall, Englewood Cliffs (2002)
18. Ubuntu Server on Amazon EC2, http://www.ubuntu.com/cloud/public (re-
trieved February 1, 2010) (last viewed, February 2011)
19. Wood, L.: Programming the Web: The W3C DOM Specification. IEEE Internet
Computing 3(1), 48–54 (1999)
20. W3C: HTML5 A vocabulary and associated APIs for HTML and XHTML W3C
Editor’s Draft, http://dev.w3.org/html5/spec/ (retrieved February 1, 2010) (last
viewed, February 2011)
Abstract Sounds and Their Applications in
Audio and Perception Research

Adrien Merer (1), Sølvi Ystad (1), Richard Kronland-Martinet (1), and Mitsuko Aramaki (2,3)

(1) CNRS - Laboratoire de Mécanique et d'Acoustique,
    31 ch. Joseph Aiguier, Marseille, France
(2) CNRS - Institut de Neurosciences Cognitives de la Méditerranée,
    31 ch. Joseph Aiguier, Marseille, France
(3) Université Aix-Marseille, 38 bd. Charles Livon, Marseille, France
{merer,ystad,kronland}@lma.cnrs-mrs.fr
[email protected]

Abstract. Recognition of sound sources and events is an important process in sound perception and has been studied in many research domains. Conversely, sounds that cannot be recognized are rarely studied, except by electroacoustic music composers. Moreover, considerations on the recognition of sources might help to address the problem of stimulus selection and sound categorization in the context of perception research. This paper introduces what we call abstract sounds, together with their existing musical background, and shows their relevance for different applications.

Keywords: abstract sound, stimuli selection, acousmatic.

1 Introduction
How do sounds convey meaning? How can the acoustic characteristics that convey the relevant information in sounds be identified? These questions interest researchers within various research fields such as cognitive neuroscience, musicology, sound synthesis, sonification, etc. Recognition of sound sources, identification, discrimination and sonification all deal with the problem of linking signal properties to perceived information. In several domains (linguistics, music analysis), this problem is known as “semiotics” [21]. The analysis-by-synthesis approach [28] has made it possible to understand some important features that characterize the sound of vibrating objects or of interactions between objects. A similar approach was also adopted in [13], where the authors use vocal imitations in order to study human sound source identification, with the assumption that vocal imitations are simplifications of original sounds that still contain the relevant information.
Recently, there has been an important development in the use of sounds to convey information to a user (of a computer, a car, etc.) within a new research community called auditory display [19], which deals with topics related to sound design, sonification and augmented reality. In such cases, it is important to use sounds that are meaningful independently of cultural references, taking into account that the sounds are presented through loudspeakers concurrently with other audio/visual information.
Depending on their research topics, authors have focused on different sound categories (i.e. speech, environmental sounds, music, or calibrated synthesized stimuli). In [18], the author proposed a classification of everyday sounds according to the physical interactions from which the sound originates. When working within the synthesis and/or sonification domains, the aim is often to reproduce the acoustic properties responsible for the attribution of meaning; sound categories can thus be considered from the point of view of semiotics, i.e. focusing on the information that can be gathered from sounds.
In this way, we considered a specific category of sounds that we call “abstract
sounds”. This category includes any sound that cannot be associated with an
identifiable source. It includes environmental sounds that cannot be easily iden-
tified by listeners or that give rise to many different interpretations depending on
listeners and contexts. It also includes synthesized sounds, and laboratory gener-
ated sounds if they are not associated with a clear origin. For instance, alarm or
warning sounds cannot be considered as abstract sounds. In practice, recordings
with a microphone close to the sound source and some synthesis methods like
granular synthesis are especially efficient for creating abstract sounds. Note that
in this paper, we mainly consider acoustically complex stimuli since they best
meet our needs in the different applications (as discussed further).
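As an informal illustration of the latter point, the sketch below scrambles short grains of a recording, a simple granular-style transformation that tends to obscure the sound source while keeping its spectral colour. It is only one possible way of producing abstract-sounding material; the grain length, grain count, file names and the use of the soundfile library are illustrative assumptions, not the procedure used in the studies discussed here.

import numpy as np
import soundfile as sf   # assumed audio I/O library; any WAV reader would do

def granular_scramble(x, sr, grain_ms=60, n_grains=400, seed=0):
    """Overlap-add randomly picked, windowed grains of x in random order."""
    rng = np.random.default_rng(seed)
    glen = int(sr * grain_ms / 1000)
    window = np.hanning(glen)
    out = np.zeros(n_grains * glen // 2 + glen)
    for i in range(n_grains):
        start = rng.integers(0, len(x) - glen)
        out[i * glen // 2 : i * glen // 2 + glen] += x[start:start + glen] * window
    return out / (np.max(np.abs(out)) + 1e-9)   # normalize to avoid clipping

x, sr = sf.read("recording.wav")
if x.ndim > 1:
    x = x.mean(axis=1)          # mix down to mono
sf.write("abstract.wav", granular_scramble(x, sr), sr)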
Various labels that refer to abstract sounds can be found in the literature: “con-
fused” sounds [6], “strange” sounds [36], “sounds without meaning” [16]. Con-
versely, [34] uses the term “source-bonded” and the expression “source bonding”
for “the natural tendency to relate sounds to supposed sources and causes”.
Chion introduced “acousmatic sounds” [9] in the context of cinema and audio-
visual applications with the following definition: “sound one hears without seeing
their originating cause - an invisible sound source” (for more details see section 2).
The most common expression is “abstract sounds” [27,14,26], particularly within the domain of auditory display, in connection with “earcons” [7]. “Abstract” used as an adjective means “based on general ideas and not on any particular real person, thing or situation” and also “existing in thought or as an idea but not having a physical reality”1. For sounds, we can also consider the definition used for art: “not representing people or things in a realistic way”1. Abstract as a noun is “a short piece of writing containing the main ideas in a document”1, and thus carries the idea of essential attributes, which is suitable in the context of semiotics. In [4], the authors wrote: “Edworthy and Hellier (2006) suggested that abstract sounds can be interpreted very differently depending on the many possible meanings that can be linked to them, and in large depending on the surrounding environment and the listener.”
In fact, there is general agreement on the use of the adjective “abstract” for sounds that raise both the issue of source recognition and that of multiple possible interpretations.
1 Definitions from http://dictionary.cambridge.org/

This paper will first present the existing framework for the use of abstract
sounds by electroacoustic music composers and researchers. We will then discuss
some important aspects that should be considered when conducting listening
tests with a special emphasis on the specificities of abstract sounds. Finally, three
practical examples of experiments with abstract sounds in different research
domains will be presented.

2 The Acousmatic Approach

Even if the term “abstract sounds” was not used in the context of electroacoustic
music, it seems that this community was one of the first to consider the issue
related to the recognition of sound sources and to use such sounds. In 1966,
P. Schaeffer, who was both a musician and a researcher, wrote the Traité des
objets musicaux [29], in which he reported more than ten years of research on
electroacoustic music. With a multidisciplinary approach, he intended to carry
out fundamental music research that included both “Concrète”2 and traditional
music. One of the first concepts he introduced was the so-called “acousmatic” listening, related to the experience of listening to a sound without paying attention to the source or the event. The word “acousmatic” is at the origin of many discussions and is now mainly employed to describe a musical trend. The discussion about “acousmatic” listening was kept alive by a fundamental problem in Concrète music. Indeed, for music composers the problem is to create
new meaning from sounds that already carry information about their origins. In
compositions where sounds are organized according to their intrinsic properties,
thanks to the acousmatic approach, information on the origins of sounds is still
present and interacts with the composers’ goals.
There was an important divergence of points of view between Concrète and
Elektronische music (see [10] for a complete review), since the Elektronische
music composers used only electronically generated sounds and thus avoided the
problem of meaning [15]. Both Concrète and Elektronische music have developed
a research tradition on acoustics and perception, but only Schaeffer adopted a
scientific point of view. In [11], the author wrote: “Schaeffer’s decision to use
recorded sounds was based on his realization that such sounds were often rich
in harmonic and dynamic behaviors and thus had the largest potential for his
project of musical research”. This work was of importance for electroacoustic musicians, but is almost unknown to researchers in auditory perception, since there is no published translation of his book except for concomitant works [30] and Chion's Guide des objets musicaux3. As reported in [12], translating Schaeffer's writing is extremely difficult, since he used neologisms and very specific meanings of French words.
2 The term “concrete” refers to a composition method based on concrete material, i.e. recorded or synthesized sounds, as opposed to “abstract” music, which is composed in an abstract manner, i.e. from ideas written on a score, and becomes “concrete” afterwards.
3 Translation by J. Dack available at http://www.ears.dmu.ac.uk/spip.php?page=articleEars&id_article=3597

Fig. 1. Schaeffer's typology (translation from [12]). Note that some column labels are redundant, since the table must be read from the center to the borders. For instance, the “Non existent evolution” column in the right part of the table corresponds to endless iterations, whereas the “Non existent evolution” column in the left part concerns sustained sounds (with no amplitude variations).

However, there has recently been a growing interest in this book, in particular in the domain of music information retrieval, for morphological sound description [27,26,5]. These authors indicate that, in the case of what they call “abstract” sounds, classical approaches based on sound source recognition are not relevant, and they therefore base their algorithms on Schaeffer's morphology and typology classifications.
Morphology and typology were introduced as analysis and creation tools for composers, in an attempt to construct a music notation that includes electroacoustic music and therefore any sound. The typology classification (cf. figure 1) is based on a characterization of spectral (mass) and dynamical (facture4) “profiles” with respect to their complexity, and consists of twenty-eight categories. There are nine central categories of “balanced” sounds, for which the variations are neither too rapid and random nor too slow or nonexistent. These nine categories combine three facture profiles (sustained, impulsive or iterative) with three mass profiles (tonic, complex and varying). On both sides of the “balanced objects” in the table, there are nineteen additional categories for which the mass and facture profiles are very simple/repetitive or vary a lot.
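A compact way to see the combinatorial structure of the nine “balanced” classes is to enumerate the two profile dimensions, as in the short sketch below (the labels are paraphrased English renderings, not Schaeffer's original terms).

from itertools import product

FACTURE = ("sustained", "impulsive", "iterative")   # dynamical profile
MASS = ("tonic", "complex", "varying")              # spectral profile

# The nine central "balanced" typology classes; nineteen further classes
# with overly simple or overly varying profiles surround them in the table.
BALANCED = {f"{m}/{f}": (m, f) for m, f in product(MASS, FACTURE)}
print(len(BALANCED))   # 9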
Note that some automatic classification methods are available [26]. In [37] the
authors proposed an extension of Schaeffer’s typology that includes graphical
notations.
Since the 1950s, electroacoustic music composers have thus addressed the problem of the meaning of sounds and provided an interesting tool for classifying sounds with no a priori differentiation of the type of sound. For sound perception research, a classification of sounds according to these categories may be useful, since the categories are suitable for any sound.
4 As discussed in [12], even if facture is not a common English word, there is no better translation from French.

The next section will detail the use of such a classification for the design of listening tests.

3 Design of Listening Tests Using Abstract Sounds


The design of listening tests is a fundamental part of sound perception studies and implies considering different aspects of perception that are closely related to the intended measurements. For instance, it is important to design calibrated stimuli and experimental procedures in order to control as well as possible the main factors that affect the subjects' evaluations. We propose to discuss such aspects in the context of abstract sounds.

3.1 Stimuli
It is common to assume that perception differs as a function of sound categories (e.g. speech, environmental sounds, music); moreover, these categories are often the underlying elements that define a research area. Consequently, it is difficult to determine a general property of human perception from results collected across different studies. For instance, results concerning loudness obtained with elementary synthesized stimuli (sinusoids, noise, etc.) cannot be directly transferred to complex environmental sounds, as reported by [31]. Furthermore, listeners' judgements might differ for sounds belonging to the same category. For instance, within the environmental sound category, [14] have shown specific categorization strategies for sounds that involve human activity.
When there is no hypothesis regarding the signal properties, it is important
to gather sounds that present a large variety of acoustic characteristics as dis-
cussed in [33]. Schaeffer's typology offers an objective selection tool that can
help the experimenter to construct a very general sound corpus representative
of most existing sound characteristics by covering all the typology categories.
As a comparison, environmental sounds can be classified only in certain rows
of Schaeffer’s typology categories (mainly the “balanced” objects). Besides, ab-
stract sounds may constitute a good compromise in terms of acoustic properties
between elementary (sinusoids, noise, etc.) and ecological (speech, environmental
sounds and music) stimuli.
A corpus of abstract sounds can be obtained in different ways. Many databases available for audiovisual applications contain such sounds (see [33]). Different synthesis techniques (like granular or FM synthesis, etc.) are also efficient for creating abstract sounds. In [16] and further works [38,39], the authors presented techniques to transform any recognizable sound into an abstract sound while preserving several signal characteristics. Conversely, many transformations drastically alter the original (environmental or vocal) sounds when important acoustic attributes are modified. For instance, [25] has shown that applying high- and low-pass filtering influences the perceived naturalness of speech and music sounds. Since abstract sounds do not convey a univocal meaning, it is possible to use them in different ways according to the aim of the experiment. For instance, the same sound corpus can be evaluated in different contexts (by drawing the listener's
Abstract Sounds and Their Applications 181

attention to certain evocations) in order to study specific aspects of the infor-


mation conveyed by the sounds. In particular, we will see how the same set of
abstract sounds was used in 2 different studies described in sections 4.3 and 4.1.

3.2 Procedure

To control the design of stimuli, it is important to verify in a pre-test that the evaluated sounds are actually “abstract” for most listeners. In a musical context, D. Smalley [35] introduced the notion of “surrogacy” level (or degree) to quantify the ease of source recognition. This level is generally evaluated using identification tasks. In [6], the authors describe three methods: 1) free identification tasks, which consist of associating words or any other description with sounds [2]; 2) context-based ratings, which are comparisons between sounds and other stimuli; 3) attribute ratings, a generalization of the semantic differential method. The third method may be the most relevant since it provides graduated ratings on an unlimited number of scales. In particular, we will see in Section 4.3 that we evaluated the degree of recognition of abstract sounds (“the sound is easily recognizable or not”) by asking listeners to use a non-graduated scale ranging from “not recognizable” to “easily recognizable”.
Since abstract sounds are not easily associated with a source (and with the corresponding label), they can also be attributed several meanings that may depend on the type of experimental procedure and task. In particular, we will see that it is possible to take advantage of this variability of meaning to highlight, for example, differences between groups of listeners, as described in Section 4.1.

3.3 Type of Listening

In general, perception research distinguishes analytic and synthetic listening. Given a listening procedure, subjects may focus on different aspects of sounds since different concentration and attention levels are involved. From a different point of view, [17] introduced the term “everyday listening” (as opposed to “musical listening”) and argued that, even in laboratory experiments, listeners are naturally more interested in sound source properties than in intrinsic properties and therefore use “everyday listening”. [29] also introduced different types of listening (“hearing”, “listening”, “comprehending”, “understanding”) and asserted that when listening to a sound we switch from one type of listening to another. Even if different points of view are used to define these types of listening, they share the notions of attentional direction and intention when perceiving sounds. Abstract sounds might help listeners to focus on the intrinsic properties of sound and thus to adopt musical listening.
Another aspect that could influence the type of listening and therefore introduce variability in responses is the coexistence of several streams in a sound (auditory streams were introduced by Bregman [8] and describe our ability to group or separate different elements of a sound). If a sound is composed of several streams, listeners might alternately focus on different elements, which cannot be accurately controlled by the experimenter.
Since abstract sounds have no univocal meaning to be preserved, it is possible to apply transformations that favour one stream (and alter the original meaning). This is not the case for environmental sound recordings, for instance, since transformations can make them unrecognizable. Note that classifying sounds with several streams according to Schaeffer’s typology might be difficult since they present concomitant profiles associated with distinct categories.

4 Potentials of Abstract Sounds


As described in Section 2, the potential of abstract sounds was initially revealed in the musical context. In particular, their ability to evoke various emotions was thoroughly investigated by electroacoustic composers. In this section, we describe how abstract sounds can be used in different contexts by presenting studies linked to three research domains: sound synthesis, cognitive neuroscience and clinical diagnosis. Note that we only aim at giving an overview of some experiments that use abstract sounds, in order to discuss the motivations behind the different experimental approaches. Details of the material and methods can be found in the articles cited in the following sections.
The three experiments partially shared the same stimuli. We collected abstract sounds provided by electroacoustic composers, who constitute an original resource of interesting sounds since they have thousands of specially recorded or synthesized sounds, organized and indexed for inclusion in their compositions. From these databases, we selected a set of 200 sounds that best spread out over the typology table proposed by Schaeffer (cf. Table 1); some examples from [23] are available at http://www.sensons.cnrs-mrs.fr/CMMR07_semiotique/. A subset of these sounds was finally chosen according to the needs of each study presented in the following paragraphs.

4.1 Bizarre and Familiar Sounds


Abstract sounds are not often heard in everyday life and may even be completely novel for listeners. They might therefore be perceived as “strange” or “bizarre”. As mentioned above, listeners’ judgements of abstract sounds are highly subjective. In some cases, this subjectivity can be used to investigate specific aspects of human perception and, in particular, to highlight differences in sound evaluation between groups of listeners. Notably, the concept of “bizarre” is an important element of the standard classification of mental disorders (DSM-IV) for schizophrenia [1, p. 275]. Another frequently reported element is the existence of auditory hallucinations, i.e. perception without stimulation (“[...] auditory hallucinations are by far the most common and characteristic of Schizophrenia.” [1, p. 275]). From such considerations, we explored the perception of bizarre and familiar sounds in patients with schizophrenia by using both environmental sounds (for their familiar aspect) and abstract sounds (for their bizarre aspect). The procedure consisted in rating sounds on continuous scales according to a perceptual dimension labelled by an adjective (by contrast, the classical semantic differential method uses an adjective and an antonym to define the extremes of each
scale). Sounds were evaluated on six dimensions along linear scales: “familiar”,
“reassuring”, “pleasant”, “bizarre”, “frightening”, “invasive” (arguable translations of the French adjectives familier, rassurant, plaisant, bizarre, angoissant, envahissant). Concerning the
abstract sound corpus, we chose 20 sounds from the initial set of 200 through a pre-test on seven subjects, selecting the sounds that best spread over the space of measured variables (the perceptual dimensions). This preselection was validated by a second pre-test on fourteen subjects, which produced a similar distribution of the sounds along the perceptual dimensions.
Preliminary results showed that the selected sound corpus made it possible to highlight significant differences between patients with schizophrenia and control groups. Further analysis and testing (for instance, with brain imaging techniques) will be conducted in order to better understand these differences.

4.2 Reduction of Linguistic Mediation and Access to Different Meanings

Within the domain of cognitive neuroscience, a major issue is to determine whether similar neural networks are involved in the allocation of meaning for language and
other non-linguistic sounds. A well-known protocol largely used to investigate se-
mantic processing in language, i.e. the semantic priming paradigm [3], has been
applied to other stimuli such as pictures, odors and sounds and several studies high-
lighted the existence of a conceptual priming in a nonlinguistic context (see [32]
for a review). One difficulty that arises when considering non-linguistic stimuli is the potential effect of linguistic mediation. For instance, watching a picture of a bird or listening to birdsong might automatically activate the verbal label “bird”. In this case, the conceptual priming cannot be considered purely non-linguistic because of the implicit naming induced by the stimulus processing. Abstract sounds are suitable candidates for mitigating this problem, since they are not
easily associated with a recognizable source. In [32], the goals were to determine how meaning is attributed to a sound and whether there are similarities between the brain processing of sounds and that of words. For that purpose, a priming protocol was used with word/sound pairs, and the degree of congruence between the prime and the target was manipulated. To design the stimuli, seventy abstract sounds from the nine “balanced” (see Section 2) categories of Schaeffer’s typology table were evaluated in a
pre-test to define the word/sound pairs. The sounds were presented successively to
listeners who were asked to write the first words that came to their mind after lis-
tening. A large variety of words were given by listeners. One of the sounds obtained
for instance the following responses: “dry, wildness, peak, winter, icy, polar, cold”.
Nevertheless, for most sounds, it was possible to find a common word that was
accepted as coherent by more than 50% of the listeners. By associating these com-
mon words with the abstract sounds, we designed forty-five related word/sound
pairs. The non-related pairs were constructed by recombining words and sounds
randomly. This step allowed us to validate the abstract sounds since no label refer-
ring to the actual source was given. Indeed, when listeners were asked to explicitly label the abstract sounds, the labels collected related more to sound quality than to a source. In a first experiment, a written word (the prime) was visually presented before a sound (the target), and subjects had to decide whether or not the sound and the word fit together. In a second experiment, the presentation order was reversed (i.e.
sound presented before word). Results showed that participants were able to evalu-
ate the semiotic relation between the prime and the target in both sound-word and
word-sound presentations with relatively low inter-subject variability and good
consistency (see [32] for details on experimental data and related analysis). This
result indicated that abstract sounds are suitable for studying conceptual process-
ing. Moreover, their contextualization by the presentation of a word reduced the
variability of interpretations and led to a consensus between listeners. The study
also revealed similarities in the electrophysiological patterns (Event Related Po-
tentials) between abstract sounds and word targets, supporting the assumption
that similar processing is involved for linguistic and non-linguistic sounds.

4.3 Sound Synthesis

Intuitive control of synthesizers through high-level parameters is still an open problem in virtual reality and sound design. In both industrial and musical contexts, the challenge consists of creating sounds from a semantic description of their perceptual correlates. Indeed, as discussed above, abstract sounds can be rich from an acoustic point of view and make it possible to test different spectro-temporal characteristics at the same time. Thus they might be useful for identifying general signal properties characteristic of different sound categories. In addition, they are specifically designed for playback through loudspeakers (as is the case for synthesizers). For this purpose, we proposed a general methodology based on the evaluation and analysis of abstract sounds, aiming at identifying perceptually relevant signal characteristics and at proposing an intuitive synthesis control. Given a set of desired control parameters and a set of sounds, the proposed method consists of asking listeners to evaluate the sounds on scales defined by the control parameters. Sounds with the same or different values on a scale are then analyzed in order to identify signal correlates. Finally, using feature-based synthesis [20], signal transformations are defined to propose an intuitive control strategy.
In [23], we addressed the control of perceived movement evoked by mono-
phonic sounds. We first conducted a free categorization task asking subjects to
group sounds that evoke a similar movement and to label each category. The aim of this task was to identify sound categories and, subsequently, the perceptually relevant sound parameters specific to each category. Sixty-two abstract sounds were considered for this purpose. Based on the subjects’ responses, we identified six main categories of perceived movement: “rotate”, “fall down”, “approach”, “pass by”, “go away” and “go up”, and identified a set of sounds representative of each category. Note that, as in the previous studies, the labels given by the subjects did not refer to the sound source but rather to an evocation.
Based on this first study, we aimed at refining the perceptual characterization of
movements and identify relevant control parameters. For that, we selected 40
sounds among the initial corpus of 200 sounds. Note that in the case of
movement, we are aware that the recognition of the physical sound source can
introduce a bias in the evaluation. If the source can be easily identified, the
corresponding movement is more likely to be linked to the source: a car sound
only evokes horizontal movement and cannot fall or go up. Thus, we asked 29
listeners to evaluate the 40 sounds through a questionnaire including the two
following questions rated on a linear scale:
• “Is the sound source recognizable?” (rated on a non-graduated scale from
“not recognizable” to “easily recognizable”)
• “Is the sound natural?” (rated from “natural” to “synthetic”)
When the sources were judged “recognizable”, listeners were asked to write a
few words to describe the source.
We found a correspondence between the responses to the two questions: a source is perceived as natural as long as it is easily recognized (R = .89). Note that abstract sounds were judged to be “synthesized” even when they were actually recordings of vibrating bodies. Finally, we asked listeners to characterize the movements evoked by the sounds with a drawing interface that allowed combinations of the previously found elementary movements to be represented (a sound can rotate and go up at the same time) and whose drawing parameters correspond to potential control parameters of the synthesizer. Results showed that it was possible to determine
the relevant perceptual features and to propose an intuitive control strategy for
a synthesizer dedicated to movements evoked by sounds.

5 Conclusion
In this paper, we presented the advantages of using abstract sounds in audio and perception research, based on a review of studies in which we exploited their distinctive features. The richness of abstract sounds in terms of their acoustic characteristics and potential evocations opens up various perspectives. Indeed, they are generally perceived as “unrecognizable”, “synthetic” and “bizarre” depending on context and task, and these aspects can be relevant for helping listeners to focus on the intrinsic properties of sounds, for orienting the type of listening, for evoking specific emotions or for better investigating individual differences. Moreover, they constitute a good compromise between elementary and ecological stimuli.
We also addressed the design of the sound corpus and of specific procedures for listening tests using abstract sounds. In auditory perception research, sound categories based on well-identified sound sources are most often considered (verbal/non-verbal sounds, environmental sounds, music). The use of abstract sounds may allow more general sound categories to be defined on the basis of other criteria, such as listeners’ evocations or intrinsic sound properties. Based on empirical research from electroacoustic music, the sound typology proposed by P. Schaeffer should enable the definition of such new sound categories and may be relevant for future listening tests including any sound. Moreover, since abstract sounds convey multiple kinds of information (several meanings can be attributed to them), the procedure is important for orienting the type of listening towards the information that is actually of interest for the experiment.
Beyond these considerations, the resulting reflections may help us to address more general and fundamental questions related to the determination of the invariant signal morphologies responsible for evocations, and to what extent “universal” sound morphologies exist that do not depend on context and type of listening.

References
1. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). American Psychiatric Association (1994), http://www.psychiatryonline.com/DSMPDF/dsm-iv.pdf (last viewed February 2011)
2. Ballas, J.A.: Common factors in the identification of an assortment of brief every-
day sounds. Journal of Experimental Psychology: Human Perception and Perfor-
mance 19, 250–267 (1993)
3. Bentin, S., McCarthy, G., Wood, C.C.: Event-related potentials, lexical decision
and semantic priming. Electroencephalogr Clin. Neurophysiol. 60, 343–355 (1985)
4. Bergman, P., Skold, A., Vastfjall, D., Fransson, N.: Perceptual and emotional cat-
egorization of sound. The Journal of the Acoustical Society of America 126, 3156–
3167 (2009)
5. Bloit, J., Rasamimanana, N., Bevilacqua, F.: Towards morphological sound de-
scription using segmental models. In: DAFX, Milan, Italie (2009)
6. Bonebright, T.L., Miner, N.E., Goldsmith, T.E., Caudell, T.P.: Data collection
and analysis techniques for evaluating the perceptual qualities of auditory stimuli.
ACM Trans. Appl. Percept. 2, 505–516 (2005)
7. Bonebright, T.L., Nees, M.A.: Most earcons do not interfere with spoken passage
comprehension. Applied Cognitive Psychology 23, 431–445 (2009)
8. Bregman, A.S.: Auditory Scene Analysis. The MIT Press, Cambridge (1990)
9. Chion, M.: Audio-Vision: Sound on Screen. Columbia University Press, New York (1993)
10. Cross, L.: Electronic music, 1948-1953. Perspectives of New Music (1968)
11. Dack, J.: Abstract and concrete. Journal of Electroacoustic Music 14 (2002)
12. Dack, J., North, C.: Translating Pierre Schaeffer: Symbolism, literature and music.
In: Proceedings of EMS 2006 Conference, Beijing (2006)
13. Dessein, A., Lemaitre, G.: Free classification of vocal imitations of everyday sounds.
In: Sound And Music Computing (SMC 2009), Porto, Portugal, pp. 213–218 (2009)
14. Dubois, D., Guastavino, C., Raimbault, M.: A cognitive approach to urban sound-
scapes: Using verbal data to access everyday life auditory categories. Acta Acustica
United with Acustica 92, 865–874 (2006)
15. Eimert, H.: What is electronic music. Die Reihe 1 (1957)
16. Fastl, H.: Neutralizing the meaning of sound for sound quality evaluations. In:
Proc. Int. Congress on Acoustics ICA 2001, Rome, Italy, vol. 4, CD-ROM (2001)
17. Gaver, W.W.: How do we hear in the world? explorations of ecological acoustics.
Ecological Psychology 5, 285–313 (1993)
18. Gaver, W.W.: What in the world do we hear? an ecological approach to auditory
source perception. Ecological Psychology 5, 1–29 (1993)
19. Hermann, T.: Taxonomy and definitions for sonification and auditory display.
In: Proceedings of the 14th International Conference on Auditory Display, Paris,
France (2008)
20. Hoffman, M., Cook, P.R.: Feature-based synthesis: Mapping acoustic and percep-
tual features onto synthesis parameters. In: Proceedings of the 2006 International
Computer Music Conference (ICMC), New Orleans (2006)
21. Jekosch, U.: Assigning Meaning to Sounds - Semiotics in the Context of Product-Sound Design. In: Blauert, J. (ed.) Communication Acoustics, pp. 193–221. Springer (2005)
22. McKay, C., McEnnis, D., Fujinaga, I.: A large publicly accessible prototype audio
database for music research (2006)
23. Merer, A., Ystad, S., Kronland-Martinet, R., Aramaki, M.: Semiotics of sounds
evoking motions: Categorization and acoustic features. In: Kronland-Martinet, R.,
Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 139–158. Springer,
Heidelberg (2008)
24. Micoulaud-Franchi, J.A., Cermolacce, M., Vion-Dury, J.: Bizarre and familiar recognition troubles of auditory perception in patients with schizophrenia (2010) (in preparation)
25. Moore, B.C.J., Tan, C.T.: Perceived naturalness of spectrally distorted speech and
music. The Journal of the Acoustical Society of America 114, 408–419 (2003)
26. Peeters, G., Deruty, E.: Automatic morphological description of sounds. In: Acous-
tics 2008, Paris, France (2008)
27. Ricard, J., Herrera, P.: Morphological sound description computational model and
usability evaluation. In: AES 116th Convention (2004)
28. Risset, J.C., Wessel, D.L.: Exploration of timbre by analysis and synthesis. In:
Deutsch, D. (ed.) The psychology of music. Series in Cognition and Perception,
pp. 113–169. Academic Press, London (1999)
29. Schaeffer, P.: Traité des objets musicaux. Editions du seuil (1966)
30. Schaeffer, P., Reibel, G.: Solfège de l’objet sonore. INA-GRM (1967)
31. Schlauch, R.S.: 12 - Loudness. In: Ecological Psychoacoustics, pp. 318–341. Else-
vier, Amsterdam (2004)
32. Schön, D., Ystad, S., Kronland-Martinet, R., Besson, M.: The evocative power
of sounds: Conceptual priming between words and nonverbal sounds. Journal of
Cognitive Neuroscience 22, 1026–1035 (2010)
33. Shafiro, V., Gygi, B.: How to select stimuli for environmental sound research and
where to find them. Behavior Research Methods, Instruments, & Computers 36,
590–598 (2004)
34. Smalley, D.: Defining timbre — refining timbre. Contemporary Music Review 10,
35–48 (1994)
35. Smalley, D.: Space-form and the acousmatic image. Org. Sound 12, 35–58 (2007)
36. Tanaka, K., Matsubara, K., Sato, T.: Study of onomatopoeia expressing strange
sounds: Cases of impulse sounds and beat sounds. Transactions of the Japan Society
of Mechanical Engineers C 61, 4730–4735 (1995)
37. Thoresen, L., Hedman, A.: Spectromorphological analysis of sound objects: an adaptation of Pierre Schaeffer’s typomorphology. Organised Sound 12, 129–141 (2007)
38. Zeitler, A., Ellermeier, W., Fastl, H.: Significance of meaning in sound quality
evaluation. Fortschritte der Akustik, CFA/DAGA 4, 781–782 (2004)
39. Zeitler, A., Hellbrueck, J., Ellermeier, W., Fastl, H., Thoma, G., Zeller, P.: Method-
ological approaches to investigate the effects of meaning, expectations and context
in listening experiments. In: INTER-NOISE 2006, Honolulu, Hawaii (2006)
Pattern Induction and Matching in Music Signals

Anssi Klapuri

Centre for Digital Music, Queen Mary University of London
Mile End Road, E1 4NS London, United Kingdom
[email protected]
http://www.elec.qmul.ac.uk/people/anssik/

Abstract. This paper discusses techniques for pattern induction and matching in musical audio. At all levels of music - harmony, melody,
rhythm, and instrumentation - the temporal sequence of events can be
subdivided into shorter patterns that are sometimes repeated and trans-
formed. Methods are described for extracting such patterns from musical
audio signals (pattern induction) and computationally feasible methods
for retrieving similar patterns from a large database of songs (pattern
matching).

1 Introduction
Pattern induction and matching plays an important part in understanding the
structure of a given music piece and in detecting similarities between two differ-
ent music pieces. The term pattern is here used to refer to sequential structures
that can be characterized by a time series of feature vectors x1 , x2 , . . . , xT . The
vectors xt may represent acoustic features calculated at regularly time intervals
or discrete symbols with varying durations. Many different elements of music
can be represented in this form, including melodies, drum patterns, and chord
sequences, for example.
In order to focus on the desired aspect of music, such as the drums track or
the lead vocals, it is often necessary to extract that part from a polyphonic music
signal. Section 2 of this paper will discuss methods for separating meaningful
musical objects from polyphonic recordings.
Contrary to speech, there is no global dictionary of patterns or ”words” that
would be common to all music pieces, but in a certain sense, the dictionary of
patterns is created anew in each music piece. The term pattern induction here
refers to the process of learning to recognize sequential structures from repeated
exposure [63]. Repetition plays an important role here: rhythmic patterns are
repeated, melodic phrases recur and vary, and even entire sections, such as the
chorus in popular music, are repeated. This kind of self-reference is crucial for
imposing structure on a music piece and enables the induction of the underlying
prototypical patterns. Pattern induction will be discussed in Sec. 3.
Pattern matching, in turn, consists of searching a database of music for seg-
ments that are similar to a given query pattern. Since the target matches can
in principle be located at any temporal position and are not necessarily scaled
to the same length as the query pattern, temporal alignment of the query and
target patterns poses a significant computational challenge in large databases.
Given that the alignment problem can be solved, another pre-requisite for mean-
ingful pattern matching is to define a distance measure between musical patterns
of different kinds. These issues will be discussed in Sec. 4.
Pattern processing in music has several interesting applications, including
music information retrieval, music classification, cover song identification, and
creation of mash-ups by blending matching excerpts from different music pieces.
Given a large database of music, quite detailed queries can be made, such as
searching for a piece that would work as an accompaniment for a user-created
melody.

2 Extracting the Object of Interest from Music


There are various levels at which pattern induction and matching can take place
in music. At one extreme, a polyphonic music signal is considered as a coherent
whole and features describing its harmonic or timbral aspects, for example, are
calculated. In a more analytic approach, some part of the signal, such as the
melody or the drums, is extracted before the feature calculation. Both of these
approaches are valid from the perceptual viewpoint. Human listeners, especially
trained musicians, can switch between a ”holistic” listening mode and a more
analytic one where they focus on the part played by a particular instrument or
decompose music into its constituent elements and their relationships [8,3].
Even when a music signal is treated as a coherent whole, it is necessary to
transform the acoustic waveform into a series of feature vectors x1 , x2 , . . . , xT
that characterize the desired aspect of the signal. Among the most widely used
features are Mel-frequency cepstral coefficients (MFCCs) to represent the timbral
content of a signal in terms of its spectral energy distribution [73]. The local
harmonic content of a music signal, in turn, is often summarized using a 12-
dimensional chroma vector that represents the amount of spectral energy falling
at each of the 12 tones of an equally-tempered scale [5,50]. Rhythmic aspects are
conveniently represented by the modulation spectrum which encodes the pattern
of sub-band energy fluctuations within windows of approximately one second in
length [15,34]. Besides these, there are a number of other acoustic features, see
[60] for an overview.
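As a concrete illustration of this feature-extraction step, the sketch below computes frame-wise MFCC and chroma descriptors with the Python library librosa and stacks them into the per-frame vectors x1, x2, . . . , xT used throughout this paper. This is an illustrative sketch rather than the procedure of any cited work; the file name song.wav and all parameter values are placeholders.

import numpy as np
import librosa

# Illustrative feature extraction: timbre (MFCC) and harmony (chroma) per frame.
y, sr = librosa.load("song.wav")                    # "song.wav" is a placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, T) timbral description
chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # (12, T) pitch-class energy profile
X = np.vstack([mfcc, chroma]).T                     # rows of X are the feature vectors x_t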
Focusing pattern extraction on a certain instrument or part in polyphonic mu-
sic requires that the desired part be pulled apart from the rest before the feature
extraction. While this is not entirely straightforward in all cases, it enables mu-
sically more interesting pattern induction and matching, such as looking at the
melodic contour independently of the accompanying instruments. Some strate-
gies towards decomposing a music signal into its constituent parts are discussed
in the following.

2.1 Time-Frequency and Spatial Analysis

Musical sounds, like most natural sounds, tend to be sparse in the time-frequency
domain, meaning that the sounds can be approximated using a small number of
non-zero elements in the time-frequency domain. This facilitates sound source
separation and audio content analysis. Usually the short-time Fourier transform
(STFT) is used to represent a given signal in the time-frequency domain. A
viable alternative to the STFT is the constant-Q transform (CQT), where the center
frequencies of the frequency bins are geometrically spaced [9,68]. CQT is often
ideally suited for the analysis of music signals, since the fundamental frequencies
(F0s) of the tones in Western music are geometrically spaced.
Spatial information can sometimes be used to organize time-frequency com-
ponents to their respective sound sources [83]. In the case of stereophonic audio,
time-frequency components can be clustered based on the ratio of left-channel
amplitude to the right, for example. This simple principle has been demonstrated
to be quite effective for some music types, such as jazz [4], despite the fact that
overlapping partials partly undermine the idea. Duda et al. [18] used stereo
information to extract the lead vocals from complex audio for the purpose of
query-by-humming.
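A minimal sketch of this panning-based idea is given below, assuming a stereo input and the librosa STFT; the thresholds on the left/right magnitude ratio are arbitrary illustrative values and are not taken from the cited studies.

import numpy as np
import librosa

y, sr = librosa.load("song.wav", mono=False)     # stereo signal, shape (2, n_samples)
L = librosa.stft(y[0])                           # left-channel spectrogram
R = librosa.stft(y[1])                           # right-channel spectrogram
ratio = np.abs(L) / (np.abs(R) + 1e-10)          # left-to-right magnitude ratio per bin
mask = (ratio > 0.7) & (ratio < 1.4)             # keep bins panned close to the centre
centre = librosa.istft(mask * 0.5 * (L + R))     # rough estimate of centre-panned sources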

2.2 Separating Percussive Sounds from the Harmonic Part

It is often desirable to analyze the drum track of music separately from the
harmonic part. The sinusoids+noise model is the most widely-used technique
for this purpose [71]. It produces quite robust quality for the noise residual,
although the sinusoidal (harmonic) part often suffers quality degradation for
music with dense sets of sinusoids, such as orchestral music.
Ono et al. proposed a method which decomposes the power spectrogram X (of size F × T) of a mixture signal into a harmonic part H and a percussive part P so that X = H + P [52]. The decomposition is done by minimizing an objective
function that measures variation over time n for the harmonic part and varia-
tion over frequency k for the percussive part. The method is straightforward to
implement and produces good results.
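For experimentation, librosa ships a ready-made harmonic/percussive separation; note that it is based on median filtering of the spectrogram, so the sketch below is only a stand-in for, not an implementation of, the optimization-based method of Ono et al.

import librosa

y, sr = librosa.load("song.wav")                 # placeholder input file
# Median-filtering HPSS: related in spirit, but not identical, to Ono et al. [52]
harmonic, percussive = librosa.effects.hpss(y)
# 'percussive' approximates the drum track, 'harmonic' the pitched content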
Non-negative matrix factorization (NMF) is a technique that decomposes the
spectrogram of a music signal into a linear sum of components that have a
fixed spectrum and time-varying gains [41,76]. Helen and Virtanen used the
NMF to separate the magnitude spectrogram of a music signal into a couple of
dozen components and then used a support vector machine (SVM) to classify
each component either to pitched instruments or to drums, based on features
extracted from the spectrum and the gain function of each component [31].
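A minimal NMF decomposition of a magnitude spectrogram can be written with scikit-learn as below. The number of components is an arbitrary choice, and the pitched/percussive classification step of [31] is only indicated in a comment, not implemented.

import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("song.wav")
S = np.abs(librosa.stft(y))                  # non-negative magnitude spectrogram (F x T)
model = NMF(n_components=20, init="nndsvd", max_iter=400)
W = model.fit_transform(S)                   # fixed spectra of the components (F x 20)
H = model.components_                        # time-varying gains (20 x T)
# A classifier (e.g. an SVM on spectral and gain features, as in [31]) would then
# label each component as belonging to the pitched instruments or to the drums.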

2.3 Extracting Melody and Bass Line

Vocal melody is usually the main focus of attention for an average music listener,
especially in popular music. It tends to be the part that makes music memorable
and easily reproducible by singing or humming [69].

Several different methods have been proposed for the main melody extraction
from polyphonic music. The task was first considered by Goto [28] and later
various methods for melody tracking have been proposed by Paiva et al. [54],
Ellis and Poliner [22], Dressler [17], and Ryynänen and Klapuri [65]. Typically,
the methods are based on framewise pitch estimation followed by tracking or
streaming over time. Some methods involve a timbral model [28,46,22] or a
musicological model [67]. For comparative evaluations of the different methods,
see [61] and [www.music-ir.org/mirex/].
Melody extraction is closely related to vocals separation: extracting the melody facilitates lead vocals separation, and vice versa. Several different approaches
have been proposed for separating the vocals signal from polyphonic music, some
based on tracking the pitch of the main melody [24,45,78], some based on timbre
models for the singing voice and for the instrumental background [53,20], and
yet others utilizing stereo information [4,18].
Bass line is another essential part in many music types and usually contains
a great deal of repetition and note patterns that are rhythmically and tonally
interesting. Indeed, high-level features extracted from the bass line and the play-
ing style have been successfully used for music genre classification [1]. Methods
for extracting the bass line from polyphonic music have been proposed by Goto
[28], Hainsworth [30], and Ryynänen [67].

2.4 Instrument Separation from Polyphonic Music


For human listeners, it is natural to organize simultaneously occurring sounds
into their respective sound sources. When listening to music, people are often able
to focus on a given instrument – despite the fact that music intrinsically tries to
make co-occurring sounds “blend” as well as possible.
Separating the signals of individual instruments from a music recording has
been recently studied using various approaches. Some are based on grouping
sinusoidal components to sources (see e.g. [10]) whereas some others utilize a
structured signal model [19,2]. Some methods are based on supervised learning
of instrument-specific harmonic models [44], whereas recently several methods
have been proposed based on unsupervised methods [23,75,77]. Some methods do
not aim at separating time-domain signals, but extract the relevant information
(such as instrument identities) directly in some other domain [36].
Automatic instrument separation from a monaural or stereophonic recording
would enable pattern induction and matching for the individual instruments.
However, source separation from polyphonic music is extremely challenging and
the existing methods are generally not as reliable as those intended for melody
or bass line extraction.

3 Pattern Induction
Pattern induction deals with the problem of detecting repeated sequential struc-
tures in music and learning the pattern underlying these repetitions. In the
following, we discuss the problem of musical pattern induction from a general
perspective. We assume that a time series of feature vectors x1, x2, . . . , xT describing the desired characteristics of the input signal is given. The task of
pattern induction, then, is to detect repeated sequences in this data and to
learn a prototypical pattern that can be used to represent all its occurrences.
What makes this task challenging is that the data is generally multidimensional
and real-valued (as opposed to symbolic data), and furthermore, music seldom
repeats itself exactly, but variations and transformations are applied on each
occurrence of a given pattern.

3.1 Pattern Segmentation and Clustering


The basic idea of this approach is to subdivide the feature sequence x1 , x2 , . . . , xT
into shorter segments and then cluster these segments in order to find repeated
patterns. The clustering part requires that a distance measure between two fea-
ture segments is defined – a question that will be discussed separately in Sec. 4
for different types of features.
For pitch sequences, such as melody and bass lines, there are well-defined musicological rules describing how individual sounds are perceptually grouped into melodic phrases and further into larger musical entities in a hierarchical manner [43]. This process is called grouping and is based on relatively simple principles, such as preferring a phrase boundary at a point where the time or pitch interval between two consecutive notes is larger than in the immediate vicinity (see Fig. 1 for an example).

Fig. 1. A “piano-roll” representation for an excerpt from Mozart’s Turkish March. The vertical lines indicate a possible grouping of the component notes into phrases.

Pattern induction, then, proceeds by choosing a certain time scale, performing the phrase segmentation, cropping the pitch sequences according to the shortest phrase, clustering the phrases using for example k-means clustering, and finally using the pattern nearest to each cluster centroid as the “prototype” pattern for that cluster.
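The following sketch illustrates this segment-and-cluster procedure on a transcribed note sequence. Splitting at large inter-onset gaps is only a crude stand-in for full grouping rules, and the gap threshold and cluster count are arbitrary illustrative values.

import numpy as np
from sklearn.cluster import KMeans

def segment_and_cluster(onsets, pitches, gap=0.5, n_clusters=8):
    """Cut a note sequence at large inter-onset gaps and cluster the resulting phrases."""
    bounds = [0] + [i for i in range(1, len(onsets))
                    if onsets[i] - onsets[i - 1] > gap] + [len(onsets)]
    phrases = [np.asarray(pitches[b:e]) for b, e in zip(bounds[:-1], bounds[1:]) if e - b > 1]
    L = min(len(p) for p in phrases)                       # crop to the shortest phrase
    X = np.array([p[:L] - p[:L].mean() for p in phrases])  # mean removal: transposition-invariant
    km = KMeans(n_clusters=min(n_clusters, len(X)), n_init=10).fit(X)
    prototypes = [X[np.argmin(np.linalg.norm(X - c, axis=1))] for c in km.cluster_centers_]
    return km.labels_, prototypes                          # pattern nearest to each centroid

Here, as described in the text, the phrase closest to each cluster centroid serves as the prototype pattern of that cluster.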
A difficulty in implementing the phrase segmentation for audio signals is that
contrary to MIDI, note durations and rests are difficult to extract from audio.
Nevertheless, some methods produce discrete note sequences from music [67,82],
and thus enable segmenting the transcription result into phrases.
Musical meter is an alternative criterion for segmenting musical feature se-
quences into shorter parts for the purpose of clustering. Computational meter
analysis usually involves tracking the beat and locating bar lines in music. The
good news here is that meter analysis is a well-understood and feasible problem
for audio signals too (see e.g. [39]). Furthermore, melodic phrase boundaries of-
ten co-incide with strong beats, although this is not always the case. For melodic
patterns, for example, this segmenting rule effectively requires two patterns to be
similarly positioned with respect to the musical measure boundaries in order for
them to be similar, which may sometimes be too strong an assumption. However,
for drum patterns this requirement is well justified.
Bertin-Mahieux et al. performed harmonic pattern induction for a large
database of music in [7]. They calculated a 12-dimensional chroma vector for
each musical beat in the target songs. The beat-synchronous chromagram data
was then segmented at barline positions and the resulting beat-chroma patches
were vector quantized to obtain a couple of hundred prototype patterns.
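A rough reconstruction of this beat-chroma pipeline is sketched below with librosa and scikit-learn. Since downbeat (bar line) detection is omitted, the sketch simply assumes four beats per bar, which is a simplification relative to [7]; all parameter values are illustrative.

import numpy as np
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("song.wav")
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
beat_chroma = librosa.util.sync(chroma, beats, aggregate=np.median)   # (12, n_beats)
n_bars = beat_chroma.shape[1] // 4          # assume 4 beats per bar (a simplification)
patches = beat_chroma[:, :n_bars * 4].T.reshape(n_bars, 4 * 12)       # one patch per bar
codebook = KMeans(n_clusters=min(200, n_bars), n_init=4).fit(patches).cluster_centers_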
A third strategy is to avoid segmentation altogether by using shift-invariant
features. As an example, let us consider a sequence of one-dimensional features
x1 , x2 , . . . , xT . The sequence is first segmented into partly-overlapping frames
that have length approximately the same as the patterns being sought. Then the
sequence within each frame is Fourier transformed and the phase information is
discarded in order to make the features shift-invariant. The resulting magnitude
spectra are then clustered to find repeated patterns. The modulation spectrum
features (aka fluctuation patterns) mentioned in the beginning of Sec. 2 are an
example of such a shift-invariant feature [15,34].
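For a one-dimensional feature sequence, the magnitude-spectrum idea amounts to a few lines of numpy; the frame length and hop size below are illustrative and should roughly match the length of the patterns being sought.

import numpy as np

def shift_invariant_features(x, frame_len=64, hop=16):
    """Magnitude spectra of overlapping frames of a 1-D feature sequence.
    Discarding the phase makes each descriptor invariant to shifts within a frame."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))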

3.2 Self-distance Matrix


Pattern induction, in the sense defined in the beginning of this section, is possible
only if a pattern is repeated in a given feature sequence. The repetitions need not
be identical, but bear some similarity with each other. A self-distance matrix (aka
self-similarity matrix) offers a direct way of detecting these similarities. Given
a feature sequence x1 , x2 , . . . , xT and a distance function d that specifies the
distance between two feature vectors xi and xj , the self-distance matrix (SDM)
is defined as
D(i, j) = d(xi , xj ) (1)
for i, j ∈ {1, 2, . . . , T }. Frequently used distance measures include the Euclidean distance ‖xi − xj‖ and the cosine distance 0.5(1 − ⟨xi, xj⟩/(‖xi‖‖xj‖)). Repeated sequences appear in the SDM as off-diagonal stripes. Methods for detecting these will be discussed below.
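In code, Eq. (1) amounts to a pairwise distance computation over the feature sequence; the sketch below uses scipy with the cosine distance mentioned above and expects the (e.g. beat-synchronous) feature vectors as rows of an array.

import numpy as np
from scipy.spatial.distance import cdist

def self_distance_matrix(X, metric="cosine"):
    """X: array of shape (T, d) whose rows are the feature vectors x_1 ... x_T.
    Returns D with D[i, j] = d(x_i, x_j); repeats show up as off-diagonal stripes."""
    return cdist(X, X, metric=metric)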
An obvious difficulty in calculating the SDM is that when the length T of the
feature sequence is large, the number of distance computations T² may become
computationally prohibitive. A typical solution to overcome this is to use beat-
synchronized features: a beat tracking algorithm is applied and the features xt
are then calculated within (or averaged over) each inter-beat interval. Since the
average inter-beat interval is approximately 0.5 seconds – much larger than a
typical analysis frame size – this greatly reduces the number of elements in the
time sequence and in the SDM. An added benefit of using beat-synchronous
features is that this compensates for tempo fluctuations within the piece under
analysis. As a result, repeated sequences appear in the SDM as stripes that run
exactly parallel to the main diagonal. Figure 2 shows an example SDM calculated using beat-synchronous chroma features.

Fig. 2. A self-distance matrix for Chopin’s Etude Op 25 No 9, calculated using beat-synchronous chroma features. As the off-diagonal dark stripes indicate, the note sequence between 1 s and 5 s starts again at 5 s, and later at 28 s and 32 s in a varied form.
Self-distance matrices have been widely used for audio-based analysis of the
sectional form (structure) of music pieces [12,57]. In that domain, several dif-
ferent methods have been proposed for localizing the off-diagonal stripes that
indicate repeating sequences in the music [59,27,55]. Goto, for example, first
calculates a marginal histogram which indicates the diagonal bands that con-
tain considerable repetition, and then finds the beginning and end points of the
repeated segments in a second step [27]. Serra has proposed an interesting method
for detecting locally similar sections in two feature sequences [70].

3.3 Lempel-Ziv-Welch Family of Algorithms

Repeated patterns are heavily utilized in universal lossless data compression al-
gorithms. The Lempel-Ziv-Welch (LZW) algorithm, in particular, is based on
matching and replacing repeated patterns with code values [80]. Let us denote a
sequence of discrete symbols by s1 , s2 , . . . , sT . The algorithm initializes a dictio-
nary which contains codes for individual symbols that are possible at the input.
At the compression stage, the input symbols are gathered into a sequence until
the next character would make a sequence for which there is no code yet in the
dictionary, and a new code for that sequence is then added to the dictionary.
The usefulness of the LZW algorithm for musical pattern matching is limited
by the fact that it requires a sequence of discrete symbols as input, as opposed to
real-valued feature vectors. This means that a given feature vector sequence has to be vector-quantized before processing with the LZW. In practice, beat-synchronous feature extraction is also needed to ensure that the lengths of repeated
sequences are not affected by tempo fluctuation. Vector quantization (VQ, [25])
as such is not a problem, but choosing a suitable level of granularity becomes
very difficult: if the number of symbols is too large, then two repeats of a certain
pattern are quantized dissimilarly, and if the number of symbols is too small, too
much information is lost in the quantization and spurious repeats are detected.
Another inherent limitation of the LZW family of algorithms is that they re-
quire exact repetition. This is usually not appropriate in music, where variation
is more a rule than an exception. Moreover, the beginning and end times of the
learned patterns are arbitrarily determined by the order in which the input se-
quence is analyzed. Improvements over the LZW family of algorithms for musical
pattern induction have been considered e.g. by Lartillot et al. [40].

3.4 Markov Models for Sequence Prediction


Pattern induction is often used for the purpose of predicting a data sequence. N-
gram models are a popular choice for predicting a sequence of discrete symbols
s1 , s2 , . . . , sT [35]. In an N-gram, the preceding N − 1 symbols are used to deter-
mine the probabilities for different symbols to appear next, P (st |st−1 , . . . , st−N +1 ).
Increasing N gives more accurate predictions, but requires a very large amount of
training data to estimate the probabilities reliably. A better solution is to use a
variable-order Markov model (VMM) for which the context length N varies in
response to the available statistics in the training data [6]. This is a very desir-
able feature, and for note sequences, this means that both short and long note
sequences can be modeled within a single model, based on their occurrences in
the training data. Probabilistic predictions can be made even when patterns do
not repeat exactly.
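A fixed-order N-gram predictor over discrete symbols reduces to simple counting, as sketched below; a VMM such as the toolbox of Begleiter et al. [6] additionally varies the context length and backs off to shorter contexts, which this toy version does not do.

from collections import defaultdict, Counter

def train_ngram(symbols, n=3):
    """Count, for every (n-1)-symbol context, how often each symbol follows it."""
    model = defaultdict(Counter)
    for i in range(len(symbols) - n + 1):
        context = tuple(symbols[i:i + n - 1])
        model[context][symbols[i + n - 1]] += 1
    return model

def predict(model, context):
    """Most probable continuation of a context, or None if the context was never seen."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None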
Ryynänen and Klapuri used VMMs as a predictive model in a method that
transcribes bass lines in polyphonic music [66]. They used the VMM toolbox of
Begleiter et al. for VMM training and prediction [6].

3.5 Interaction between Pattern Induction and Source Separation


Music often introduces a certain pattern to the listener in a simpler form before
adding further “layers” of instrumentation at subsequent repetitions (and vari-
ations) of the pattern. Provided that the repetitions are detected via pattern
induction, this information can be fed back in order to improve the separation
and analysis of certain instruments or parts in the mixture signal. This idea was
used by Mauch et al. who used information about music structure to improve
recognition of chords in music [47].

4 Pattern Matching
This section considers the problem of searching a database of music for segments
that are similar to a given pattern. The query pattern is denoted by a feature
sequence y1, y2, . . . , yM, and for convenience, x1, x2, . . . , xT is used to denote a concatenation of the feature sequences extracted from all target music pieces.

Fig. 3. A matrix of distances used by DTW to find a time-alignment between different feature sequences. The vertical axis represents the time in a query excerpt (Queen’s Bohemian Rhapsody). The horizontal axis corresponds to the concatenation of features from three different excerpts: 1) the query itself, 2) “Target 1” (Bohemian Rhapsody performed by London Symphonium Orchestra) and 3) “Target 2” (It’s a Kind of Magic by Queen). Beginnings of the three targets are indicated below the matrix. Darker values indicate smaller distance.
Before discussing the similarity metrics between two music patterns, let us
consider the general computational challenges in comparing a query pattern
against a large database, an issue that is common to all types of musical patterns.

4.1 Temporal Alignment Problem in Pattern Comparison

Pattern matching in music is computationally demanding, because the query pattern can in principle occur at any position of the target data and because
the time-scale of the query pattern may differ from the potential matches in
the target data due to tempo differences. These two issues are here referred to
as the time-shift and time-scale problem, respectively. Brute-force matching of
the query pattern at all possible locations of the target data and using different
time-scaled versions of the query pattern would be computationally infeasible
for any database of significant size.
Dynamic time warping (DTW) is a technique that aims at solving both the
time-shift and time-scale problem simultaneously. In DTW, a matrix of distances
is computed so that element (i, j) of the matrix represents the pair-wise distance
between element i of the query pattern and element j in the target data (see
Fig. 3 for an example). Dynamic programming is then applied to find a path
of small distances from the first to the last row of the matrix, placing suitable
constraints on the geometry of the path. DTW has been used for melodic pattern
matching by Dannenberg [13], for structure analysis by Paulus [55], and for cover
song detection by Serra [70], to mention a few examples.
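A plain DTW distance between two feature sequences can be computed with a small dynamic-programming table, as sketched below; practical systems add slope constraints and, for locating matches inside long targets, a subsequence variant in which the path may start and end anywhere along the target axis.

import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(query, target):
    """Basic DTW between two feature sequences given as (time, dim) arrays."""
    C = cdist(query, target, metric="euclidean")        # local cost matrix
    D = np.full((C.shape[0] + 1, C.shape[1] + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, D.shape[0]):
        for j in range(1, D.shape[1]):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j],       # skip a query frame
                                            D[i, j - 1],       # skip a target frame
                                            D[i - 1, j - 1])   # match both frames
    return D[-1, -1]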
Beat-synchronous feature extraction is an efficient mechanism for dealing with
the time-scale problem, as already discussed in Sec. 3. To allow some further
flexibility in pattern scaling and to mitigate the effect of tempo estimation errors,
it is sometimes useful to further time-scale the beat-synchronized query pattern
by factors ½, 1, and 2, and match each of these separately.
A remaining problem to be solved is the temporal shift: if the target database
is very large, comparing the query pattern at every possible temporal position
in the database can be infeasible. Shift-invariant features are one way of dealing
with this problem: they can be used for approximate pattern matching to prune
the target data, after which the temporal alignment is computed for the best-
matching candidates. This allows the first stage of matching to be performed an
order of magnitude faster.
Another potential solution for the time-shift problem is to segment the target
database by meter analysis or grouping analysis, and then match the query
pattern only at temporal positions determined by estimated bar lines or group
boundaries. This approach was already discussed in Sec. 3.
Finally, efficient indexing techniques exist for dealing with extremely large
databases. In practice, these require that the time-scale problem is eliminated
(e.g. using beat-synchronous features) and the number of time-shifts is greatly
reduced (e.g. using shift-invariant features or pre-segmentation). If these con-
ditions are satisfied, the locality sensitive hashing (LSH) for example, enables
sublinear search complexity for retrieving the approximate nearest neighbours
of the query pattern from a large database [14]. Ryynänen et al. used LSH for
melodic pattern matching in [64].
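The following sketch conveys the flavour of hashing-based retrieval using sign random projections; it is a simplified stand-in for the p-stable scheme of [14], and the number of hash bits is an arbitrary illustrative choice.

import numpy as np

class RandomHyperplaneLSH:
    """Hash fixed-length pattern vectors into buckets for approximate nearest-neighbour search."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))   # random hyperplanes
        self.buckets = {}

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))    # sign pattern = hash key

    def index(self, patterns):
        for i, x in enumerate(patterns):
            self.buckets.setdefault(self._key(x), []).append(i)

    def query(self, x):
        return self.buckets.get(self._key(x), [])          # candidate indices only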

4.2 Melodic Pattern Matching

Melodic pattern matching is usually considered in the context of query-by-humming (QBH), where a user’s singing or humming is used as a query to
retrieve music with a matching melodic fragment. Typically, the user’s singing is
first transcribed into a pitch trajectory or a note sequence before the matching
takes place. QBH has been studied for more than 15 years and remains an active
research topic [26,48].
Research on QBH originated in the context of the retrieval from MIDI or
score databases. Matching approaches include string matching techniques [42],
hidden Markov models [49,33], dynamic programming [32,79], and efficient re-
cursive alignment [81]. A number of QBH systems have been evaluated in Music
Information Retrieval Evaluation eXchange (MIREX) [16].
Methods for the QBH of audio data have been proposed only quite recently
[51,72,18,29,64]. Typically, the methods extract the main melodies from the tar-
get musical audio (see Sec. 2.3) before the matching takes place. However, it
should be noted that a given query melody can in principle be matched di-
rectly against polyphonic audio data in the time-frequency or time-pitch do-
main. Some on-line services incorporating QBH are already available, see e.g.
[www.midomi.com], [www.musicline.de], [www.musipedia.org].
Matching two melodic patterns requires a proper definition of similarity. The
trivial assumption that two patterns are similar if they have identical pitches
is usually not appropriate. There are three main reasons that cause the query
pattern and the target matches to differ: 1) low quality of the sung queries (espe-
cially in the case of musically untrained users), 2) errors in extracting the main
melodies automatically from music recordings, and 3) musical variation, such as
fragmentation (elaboration) or consolidation (reduction) of a given melody [43].
One approach that works quite robustly in the presence of all these factors is
to calculate Euclidean distance between temporally aligned log-pitch trajecto-
ries. Musical key normalization can be implemented simply by normalizing the
two pitch contours to zero mean. More extensive review of research on melodic
similarity can be found in [74].
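This contour distance reduces to a few lines of numpy, assuming the two pitch trajectories have already been temporally aligned (e.g. resampled) to the same length and contain no unvoiced (zero) frames; the normalization choices are illustrative.

import numpy as np

def melody_distance(f0_query, f0_target):
    """Euclidean distance between temporally aligned log-pitch trajectories (F0 in Hz)."""
    a = np.log2(np.asarray(f0_query, dtype=float))
    b = np.log2(np.asarray(f0_target, dtype=float))
    a -= a.mean()                    # zero-mean contours: a crude key normalization
    b -= b.mean()
    return np.linalg.norm(a - b) / np.sqrt(len(a))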

4.3 Patterns in Polyphonic Pitch Data

Instead of using only the main melody for music retrieval, polyphonic pitch
data can be processed directly. Multipitch estimation algorithms (see [11,38] for
review) can be used to extract multiple pitch values in successive time frames, or
alternatively, a mapping from time-frequency to a time-pitch representation can
be employed [37]. Both of these approaches yield a representation in the time-
pitch plane, the difference being that multipitch estimation algorithms yield a
discrete set of pitch values, whereas mapping to a time-pitch plane yields a
more continuous representation. Matching a query pattern against a database
of music signals can be carried out by a two-dimensional correlation analysis in
the time-pitch plane.

4.4 Chord Sequences

Here we assume that chord information is represented as a discrete symbol sequence s1, s2, . . . , sT, where st indicates the chord identity at time frame t.

A#m D#m G#m C#m F#m Bm

F# B E A D G

F#m Bm Em Am Dm Gm

A D G C F A#

Am Dm Gm Cm Fm A#m

F A# D# G# C#

Fm A#m D#m G#m C#m

Fig. 4. Major and minor triads arranged in a two dimensional chord space. Here the
Euclidean distance between each two points can be used to approximate the distance
between chords. The dotted lines indicate the four distance parameters that define this
particular space.

Measuring the distance between two chord sequences requires that the distance
between each pair of different chords is defined. Often this distance is approxi-
mated by arranging chords in a one- or two-dimensional space, and then using
the geometric distance between chords in this space as the distance measure [62],
see Fig. 4 for an example. In the one-dimensional case, the circle of fifths is often
used.
It is often useful to compare two chord sequences in a key-invariant manner.
This can be done by expressing chords in relation to tonic (that is, using chord
degrees instead of the “absolute” chords), or by comparing all the 12 possible
transformations and choosing the minimum distance.
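Both ideas are illustrated in the sketch below, which uses the circle-of-fifths distance between chord roots plus a flat penalty for a major/minor mismatch as a simple stand-in for the two-dimensional space of Fig. 4, and takes the minimum over the 12 transpositions for key invariance; the distance values are illustrative only.

import numpy as np

# Position of each pitch class on the circle of fifths (C=0, G=1, D=2, ...)
FIFTHS = {pc: (pc * 7) % 12 for pc in range(12)}

def chord_distance(c1, c2, mode_penalty=1.0):
    """c = (root_pitch_class, 'maj' or 'min'); an illustrative chord-to-chord distance."""
    d = abs(FIFTHS[c1[0]] - FIFTHS[c2[0]])
    d = min(d, 12 - d)                       # circular distance along the circle of fifths
    return d + (mode_penalty if c1[1] != c2[1] else 0.0)

def key_invariant_distance(seq1, seq2):
    """Compare two equal-length chord sequences under all 12 transpositions."""
    best = np.inf
    for shift in range(12):
        shifted = [((root + shift) % 12, mode) for root, mode in seq1]
        best = min(best, sum(chord_distance(a, b) for a, b in zip(shifted, seq2)))
    return best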

4.5 Drum Patterns and Rhythms


Here we discuss pattern matching in drum tracks that are presented as acoustic
signals and are possibly extracted from polyphonic music using the methods
described in Sec. 2.2. Applications of this include for example query-by-tapping
[www.music-ir.org/mirex/] and music retrieval based on drum track similarity.
Percussive music devoid of both harmony and melody can contain a considerable amount of musical form and structure, encoded in the timbre, loudness,
and timing relationships between the component sounds. Timbre and loud-
ness characteristics can be conveniently represented by MFCCs extracted in
successive time frames. Often, however, the absolute spectral shape and loud-
ness of the component sounds is not of interest; instead, the timbre and
loudness of sounds relative to each other defines the perceived rhythm. Paulus
and Klapuri reduced the rhythmic information into a two-dimensional signal
describing the evolution of loudness and spectral centroid over time, in or-
der to compare rhythmic patterns performed using an arbitrary set of sounds
[56]. The features were mean- and variance-normalized to allow comparison
across different sound sets, and DTW was used to align the two patterns under
comparison.
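A rough version of such a two-dimensional loudness/centroid trajectory can be computed with librosa as below; the loudness proxy and normalization follow the spirit of [56] rather than its exact definitions, and the resulting (T, 2) sequences could then be aligned with a DTW routine such as the one sketched in Sec. 4.1.

import numpy as np
import librosa

def rhythm_features(y, sr, hop=512):
    """Mean/variance-normalized loudness and spectral-centroid trajectories."""
    S = np.abs(librosa.stft(y, hop_length=hop))
    loudness = np.log1p(S.sum(axis=0))                                # crude loudness proxy
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr, hop_length=hop)[0]
    feats = np.vstack([loudness, centroid])
    feats = (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-9)
    return feats.T                                                    # (T, 2), ready for DTW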
Ellis and Arroyo projected drum patterns into a low-dimensional representa-
tion, where different rhythms could be represented as a linear sum of so-called
eigenrhythms [21]. They collected 100 drum patterns from popular music tracks
and estimated the bar line positions in these. Each pattern was normalized and
the resulting set of patterns was subjected to principal component analysis in
order to obtain a set of basis patterns (”eigenrhythms”) that were then com-
bined to approximate the original data. The low-dimensional representation of
the drum patterns was used as a space for classification and for measuring sim-
ilarity between rhythms.
Non-negative matrix factorization (NMF, see Sec. 2.2) is another technique for
obtaining a mid-level representation for drum patterns [58]. The resulting com-
ponent gain functions can be subjected to the eigenrhythm analysis described
above, or statistical measures can be calculated to characterize the spectra and
gain functions for rhythm comparison.

5 Conclusions
This paper has discussed the induction and matching of sequential patterns
in musical audio. Such patterns are neglected by the commonly used ”bag-of-
features” approach to music retrieval, where statistics over feature vectors are
calculated to collapse the time structure altogether. Processing sequential struc-
tures poses computational challenges, but also enables musically interesting re-
trieval tasks beyond those possible with the bag-of-features approach. Some of
these applications, such as query-by-humming services, are already available for
consumers.

Acknowledgments. Thanks to Jouni Paulus for the Matlab code for comput-
ing self-distance matrices. Thanks to Christian Dittmar for the idea of using
repeated patterns to improve the accuracy of source separation and analysis.

References
1. Abesser, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification us-
ing bass-related high-level features and playing styles. In: Intl. Society on Music
Information Retrieval Conference, Kobe, Japan (2009)
2. Badeau, R., Emiya, V., David, B.: Expectation-maximization algorithm for multi-
pitch estimation and separation of overlapping harmonic spectra. In: Proc. IEEE
ICASSP, Taipei, Taiwan, pp. 3073–3076 (2009)
3. Barbour, J.: Analytic listening: A case study of radio production. In: International
Conference on Auditory Display, Sydney, Australia (July 2004)
4. Barry, D., Lawlor, B., Coyle, E.: Sound source separation: Azimuth discrimination
and resynthesis. In: 7th International Conference on Digital Audio Effects, Naples,
Italy, pp. 240–244 (October 2004)
5. Bartsch, M.A., Wakefield, G.H.: To catch a chorus: Using chroma-based repre-
sentations for audio thumbnailing. In: IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New Paltz, USA, pp. 15–18 (2001)
6. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov
models. J. of Artificial Intelligence Research 22, 385–421 (2004)
7. Bertin-Mahieux, T., Weiss, R.J., Ellis, D.P.W.: Clustering beat-chroma patterns in
a large music database. In: Proc. of the Int. Society for Music Information Retrieval
Conference, Utrecht, Netherlands (2010)
8. Bever, T.G., Chiarello, R.J.: Cerebral dominance in musicians and nonmusicians.
The Journal of Neuropsychiatry and Clinical Neurosciences 21(1), 94–97 (2009)
9. Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc.
Am. 89(1), 425–434 (1991)
10. Burred, J., Röbel, A., Sikora, T.: Dynamic spectral envelope modeling for the
analysis of musical instrument sounds. IEEE Trans. Audio, Speech, and Language
Processing (2009)
11. de Cheveigné, A.: Multiple F0 estimation. In: Wang, D., Brown, G.J. (eds.) Compu-
tational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley–
IEEE Press (2006)
12. Dannenberg, R.B., Goto, M.: Music structure analysis from acoustic signals. In:
Havelock, D., Kuwano, S., Vorländer, M. (eds.) Handbook of Signal Processing in
Acoustics, pp. 305–331. Springer, Heidelberg (2009)

13. Dannenberg, R.B., Hu, N.: Pattern discovery techniques for music audio. Journal
of New Music Research 32(2), 153–163 (2003)
14. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing
scheme based on p-stable distributions. In: ACM Symposium on Computational
Geometry, pp. 253–262 (2004)
15. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: 4th International Conference on Music Information Retrieval, Balti-
more, MD, pp. 159–165 (2003)
16. Downie, J.S.: The music information retrieval evaluation exchange (2005–2007): A
window into music information retrieval research. Acoustical Science and Technol-
ogy 29(4), 247–255 (2008)
17. Dressler, K.: An auditory streaming approach on melody extraction. In: Intl. Conf.
on Music Information Retrieval, Victoria, Canada (2006); MIREX evaluation
18. Duda, A., Nürnberger, A., Stober, S.: Towards query by humming/singing on audio
databases. In: International Conference on Music Information Retrieval, Vienna,
Austria, pp. 331–334 (2007)
19. Durrieu, J.L., Ozerov, A., Févotte, C., Richard, G., David, B.: Main instrument
separation from stereophonic audio signals using a source/filter model. In: Proc.
EUSIPCO, Glasgow, Scotland (August 2009)
20. Durrieu, J.L., Richard, G., David, B., Fevotte, C.: Source/filter model for unsu-
pervised main melody extraction from polyphonic audio signals. IEEE Trans. on
Audio, Speech, and Language Processing 18(3), 564–575 (2010)
21. Ellis, D., Arroyo, J.: Eigenrhythms: Drum pattern basis sets for classification
and generation. In: International Conference on Music Information Retrieval,
Barcelona, Spain
22. Ellis, D.P.W., Poliner, G.: Classification-based melody transcription. Machine
Learning 65(2-3), 439–456 (2006)
23. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation
models for musical source separation. Computational Intelligence and Neuroscience
(2008)
24. Fujihara, H., Goto, M.: A music information retrieval system based on singing voice
timbre. In: Intl. Conf. on Music Information Retrieval, Vienna, Austria (2007)
25. Gersho, A., Gray, R.: Vector Quantization and Signal Compression. Kluwer Aca-
demic Publishers, Dordrecht (1991)
26. Ghias, A., Logan, J., Chamberlin, D.: Query by humming: Musical information
retrieval in an audio database. In: ACM Multimedia Conference 1995. Cornell
University, San Francisco (1995)
27. Goto, M.: A chorus-section detecting method for musical audio signals. In: IEEE
International Conference on Acoustics, Speech, and Signal Processing, Hong Kong,
China, vol. 5, pp. 437–440 (April 2003)
28. Goto, M.: A real-time music scene description system: Predominant-F0 estimation
for detecting melody and bass lines in real-world audio signals. Speech Communi-
cation 43(4), 311–329 (2004)
29. Guo, L., He, X., Zhang, Y., Lu, Y.: Content-based retrieval of polyphonic music
objects using pitch contour. In: IEEE International Conference on Audio, Speech
and Signal Processing, Las Vegas, USA, pp. 2205–2208 (2008)
30. Hainsworth, S.W., Macleod, M.D.: Automatic bass line transcription from poly-
phonic music. In: International Computer Music Conference, Havana, Cuba, pp.
431–434 (2001)

31. Helén, M., Virtanen, T.: Separation of drums from polyphonic music using non-
negative matrix factorization and support vector machine. In: European Signal Pro-
cessing Conference, Antalya, Turkey (2005)
32. Jang, J.S.R., Gao, M.Y.: A query-by-singing system based on dynamic program-
ming. In: International Workshop on Intelligent Systems Resolutions (2000)
33. Jang, J.S.R., Hsu, C.L., Lee, H.R.: Continuous HMM and its enhancement for
singing/humming query retrieval. In: 6th International Conference on Music Infor-
mation Retrieval, London, UK (2005)
34. Jensen, K.: Multiple scale music segmentation using rhythm, timbre, and harmony.
EURASIP Journal on Advances in Signal Processing (2007)
35. Jurafsky, D., Martin, J.H.: Speech and language processing. Prentice Hall, New
Jersey (2000)
36. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Instrogram: Prob-
abilistic representation of instrument existence for polyphonic music. IPSJ Jour-
nal 48(1), 214–226 (2007)
37. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals.
In: Intl. Society on Music Information Retrieval Conference, Kobe, Japan (2009)
38. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription.
Springer, New York (2006)
39. Klapuri, A., Eronen, A., Astola, J.: Analysis of the meter of acoustic musical sig-
nals. IEEE Trans. Speech and Audio Processing 14(1) (2006)
40. Lartillot, O., Dubnov, S., Assayag, G., Bejerano, G.: Automatic modeling of mu-
sical style. In: International Computer Music Conference (2001)
41. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401, 788–791 (1999)
42. Lemström, K.: String Matching Techniques for Music Retrieval. Ph.D. thesis, Uni-
versity of Helsinki (2000)
43. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
44. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic
atoms for mid-level music representation. IEEE Trans. Audio, Speech, and Lan-
guage Processing 16(1), 116–128 (2008)
45. Li, Y., Wang, D.L.: Separation of singing voice from music accompaniment for
monaural recordings. IEEE Trans. on Audio, Speech, and Language Process-
ing 15(4), 1475–1487 (2007)
46. Marolt, M.: Audio melody extraction based on timbral similarity of melodic frag-
ments. In: EUROCON (November 2005)
47. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic
chord transcription. In: Proc. 10th Intl. Society for Music Information Retrieval
Conference, Kobe, Japan (2009)
48. McNab, R., Smith, L., Witten, I., Henderson, C., Cunningham, S.: Towards the
digital music library: Tune retrieval from acoustic input. In: First ACM Interna-
tional Conference on Digital Libraries, pp. 11–18 (1996)
49. Meek, C., Birmingham, W.: Applications of binary classification and adaptive
boosting to the query-by-humming problem. In: Intl. Conf. on Music Information
Retrieval, Paris, France (2002)
50. Müller, M., Ewert, S., Kreuzer, S.: Making chroma features more robust to timbre
changes. In: Proceedings of IEEE International Conference on Acoustics, Speech,
and Signal Processing, Taipei, Taiwan, pp. 1869–1872 (April 2009)

51. Nishimura, T., Hashiguchi, H., Takita, J., Zhang, J.X., Goto, M., Oka, R.: Music
signal spotting retrieval by a humming query using start frame feature dependent
continuous dynamic programming. In: 2nd Annual International Symposium on
Music Information Retrieval, Bloomington, Indiana, USA, pp. 211–218 (October
2001)
52. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of
a monaural audio signal into harmonic/percussive components by complementary
diffusion on spectrogram. In: European Signal Processing Conference, Lausanne,
Switzerland, pp. 240–244 (August 2008)
53. Ozerov, A., Philippe, P., Bimbot, F., Gribonval, R.: Adaptation of Bayesian models
for single-channel source separation and its application to voice/music separation
in popular songs. IEEE Trans. on Audio, Speech, and Language Processing 15(5),
1564–1578 (2007)
54. Paiva, R.P., Mendes, T., Cardoso, A.: On the detection of melody notes in poly-
phonic audio. In: 6th International Conference on Music Information Retrieval,
London, UK, pp. 175–182
55. Paulus, J.: Signal Processing Methods for Drum Transcription and Music Structure
Analysis. Ph.D. thesis, Tampere University of Technology (2009)
56. Paulus, J., Klapuri, A.: Measuring the similarity of rhythmic patterns. In: Intl.
Conf. on Music Information Retrieval, Paris, France (2002)
57. Paulus, J., Müller, M., Klapuri, A.: Audio-based music structure analysis. In: Proc.
of the Int. Society for Music Information Retrieval Conference, Utrecht, Nether-
lands (2010)
58. Paulus, J., Virtanen, T.: Drum transcription with non-negative spectrogram fac-
torisation. In: European Signal Processing Conference, Antalya, Turkey (Septem-
ber 2005)
59. Peeters, G.: Sequence representations of music structure using higher-order similar-
ity matrix and maximum-likelihood approach. In: Intl. Conf. on Music Information
Retrieval, Vienna, Austria, pp. 35–40 (2007)
60. Peeters, G.: A large set of audio features for sound description (similarity and
classification) in the CUIDADO project. Tech. rep., IRCAM, Paris, France (April
2004)
61. Poliner, G., Ellis, D., Ehmann, A., Gómez, E., Streich, S., Ong, B.: Melody tran-
scription from music audio: Approaches and evaluation. IEEE Trans. on Audio,
Speech, and Language Processing 15(4), 1247–1256 (2007)
62. Purwins, H.: Profiles of Pitch Classes – Circularity of Relative Pitch and Key:
Experiments, Models, Music Analysis, and Perspectives. Ph.D. thesis, Berlin Uni-
versity of Technology (2005)
63. Rowe, R.: Machine musicianship. MIT Press, Cambridge (2001)
64. Ryynänen, M., Klapuri, A.: Query by humming of MIDI and audio using locality
sensitive hashing. In: IEEE International Conference on Audio, Speech and Signal
Processing, Las Vegas, USA, pp. 2249–2252
65. Ryynänen, M., Klapuri, A.: Transcription of the singing melody in polyphonic
music. In: Intl. Conf. on Music Information Retrieval, Victoria, Canada, pp. 222–
227 (2006)
66. Ryynänen, M., Klapuri, A.: Automatic bass line transcription from streaming poly-
phonic audio. In: IEEE International Conference on Audio, Speech and Signal
Processing, pp. 1437–1440 (2007)
67. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and
chords in polyphonic music. Computer Music Journal 32(3), 72–86 (2008)

68. Schörkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing.
In: 7th Sound and Music Computing Conference, Barcelona, Spain (2010)
69. Selfridge-Field, E.: Conceptual and representational issues in melodic comparison.
Computing in Musicology 11, 3–64 (1998)
70. Serra, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local
alignment applied to cover song identification. IEEE Trans. on Audio, Speech, and
Language Processing 16, 1138–1152 (2007)
71. Serra, X.: Musical sound modeling with sinusoids plus noise. In: Roads, C., Pope,
S., Picialli, A., Poli, G.D. (eds.) Musical Signal Processing, Swets & Zeitlinger
(1997)
72. Song, J., Bae, S.Y., Yoon, K.: Mid-level music melody representation of polyphonic
audio for query-by-humming system. In: Intl. Conf. on Music Information Retrieval,
Paris, France, pp. 133–139 (October 2002)
73. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S.: Mel-generalized cepstral analysis
– a unified approach to speech spectral estimation. In: IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Adelaide, Australia (1994)
74. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Universiteit
Utrecht (2007)
75. Vincent, E., Bertin, N., Badeau, R.: Harmonic and inharmonic nonnegative matrix
factorization for polyphonic pitch transcription. In: IEEE ICASSP, Las Vegas, USA
(2008)
76. Virtanen, T.: Unsupervised learning methods for source separation in monaural
music signals. In: Klapuri, A., Davy, M. (eds.) Signal Processing Methods for Music
Transcription, pp. 267–296. Springer, Heidelberg (2006)
77. Virtanen, T.: Monaural sound source separation by non-negative matrix factoriza-
tion with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech,
and Language Processing 15(3), 1066–1074 (2007)
78. Virtanen, T., Mesaros, A., Ryynänen, M.: Combining pitch-based inference and
non-negative spectrogram factorization in separating vocals from polyphonic music.
In: ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition,
Brisbane, Australia (September 2008)
79. Wang, L., Huang, S., Hu, S., Liang, J., Xu, B.: An effective and efficient method
for query by humming system based on multi-similarity measurement fusion. In:
International Conference on Audio, Language and Image Processing, pp. 471–475
(July 2008)
80. Welch, T.A.: A technique for high-performance data compression. Computer 17(6),
8–19 (1984)
81. Wu, X., Li, M., Yang, J., Yan, Y.: A top-down approach to melody match in pitch
contour for query by humming. In: International Conference of Chinese Spoken
Language Processing (2006)
82. Yeh, C.: Multiple fundamental frequency estimation of polyphonic recordings.
Ph.D. thesis, University of Paris VI (2008)
83. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Trans. on Signal Processing 52(7), 1830–1847 (2004)
Unsupervised Analysis and Generation of Audio
Percussion Sequences

Marco Marchini and Hendrik Purwins

Music Technology Group,


Department of Information and Communications Technologies,
Universitat Pompeu Fabra
Roc Boronat, 138, 08018 Barcelona, Spain
Tel.: +34-93 542-1365; Fax: +34-93 542-2455
{marco.marchini,hendrik.purwins}@upf.edu

Abstract. A system is presented that learns the structure of an audio recording of a rhythmical percussion fragment in an unsupervised
manner and that synthesizes musical variations from it. The procedure
consists of 1) segmentation, 2) symbolization (feature extraction, cluster-
ing, sequence structure analysis, temporal alignment), and 3) synthesis.
The symbolization step yields a sequence of event classes. Simultane-
ously, representations are maintained that cluster the events into few or
many classes. Based on the most regular clustering level, a tempo estima-
tion procedure is used to preserve the metrical structure in the generated
sequence. Employing variable length Markov chains, the final synthesis
is performed, recombining the audio material derived from the sample
itself. Representations with different numbers of classes are used to trade
off statistical significance (short context sequence, low clustering refine-
ment) versus specificity (long context, high clustering refinement) of the
generated sequence. For a broad variety of musical styles the musical
characteristics of the original are preserved. At the same time, consider-
able variability is introduced in the generated sequence.

Keywords: music analysis, music generation, unsupervised clustering,


Markov chains, machine listening.

1 Introduction

In the eighteenth century, composers such as C. P. E. Bach and W. A. Mozart


used the Musikalisches Würfelspiel as a game to create music. They composed
several bars of music that could be randomly recombined in various ways, cre-
ating a new “composition” [3]. In the 1950s, Hiller and Isaacson automatically
composed the Illiac Suite, and Xenakis used Markov chains and stochastic pro-
cesses in his compositions. Probably one of the most extensive works in style
imitation is the one by David Cope [3]. He let the computer compose compo-
sitions in the style of Beethoven, Prokofiev, Chopin, and Rachmaninoff. Pachet
[13] developed the Continuator, a MIDI-based system for real-time interaction


with musicians, producing jazz-style music. Another system with the same char-
acteristics as the Continuator, called OMax, was able to learn an audio stream
employing an indexing procedure explained in [5]. Hazan et al. [8] built a system
which first segments the musical stream and extracts timbre and onsets. An un-
supervised clustering process yields a sequence of symbols that is then processed
by n-grams. The method by Marxer and Purwins [12] consists of a conceptual
clustering algorithm coupled with a hierarchical N-gram. Our method presented
in this article was first described in detail in [11].
First, we define the system design and the interaction of its parts. Starting
from low-level descriptors, we translate them into a “fuzzy score representation”,
where two sounds can either be discretized yielding the same symbol or yielding
different symbols according to which level of interpretation is chosen (Section 2).
Then we perform skeleton subsequence extraction and tempo detection to align
the score to a grid. At the end, we get a homogeneous sequence in time, on which
we perform the prediction. For the generation of new sequences, we reorder the
parts of the score, respecting the statistical properties of the sequence while at
the same time maintaining the metrical structure (Section 3). In Section 4, we
discuss an example.

2 Unsupervised Sound Analysis


As represented in Figure 1, the system basically detects musical blocks in the
audio and re-shuffles them according to meter and statistical properties of the
sequence. We will now describe each step of the process in detail.

Fig. 1. General architecture of the system

2.1 Segmentation

First, the audio input signal is analyzed by an onset detector that segments
the audio file into a sequence of musical events. Each event is characterized by
its position in time (onset) and an audio segment, the audio signal starting at
the onset position and ending at the following contiguous onset. In the further
processing, these events will serve two purposes. On the one hand, the events are
stored as an indexed sequence of audio fragments which will be used for the final
re-synthesis. On the other hand, these events will be compared with each
other to generate a reduced score-like representation of the percussion patterns
to base a tempo analysis on (cf. Fig. 1 and Sec. 2.2).

We used the onset detector implemented in the MIR toolbox [9] that is based
only on the energy envelope, which proves to be sufficient for our purpose of
analyzing percussion sounds.
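As an illustration of this step, the sketch below segments a recording into (onset, audio segment) events. It uses librosa's onset detector as a stand-in for the MIR toolbox detector mentioned above; the parameters and the input path are assumptions.

```python
import librosa

def segment_events(path):
    """Split a recording into events: (onset_time, audio_segment) pairs.

    librosa's onset detector is used here as a stand-in for the MIR toolbox
    energy-based detector referred to in the text.
    """
    y, sr = librosa.load(path, sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time", backtrack=True)
    bounds = list(onsets) + [len(y) / sr]            # last segment ends at end of file
    events = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = y[int(start * sr):int(end * sr)]   # audio from this onset to the next
        events.append((start, segment))
    return sr, events

# sr, events = segment_events("percussion_fragment.wav")  # hypothetical file
```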

2.2 Symbolization

We will employ segmentation and clustering in order to transform the audio


signal into a discrete sequence of symbols (as shown in Fig. 3), thereby facilitating
statistical analysis. However, some considerations should be made.
As we are not restricting the problem to a monophonic percussion sequence,
non-trivial problems arise when one wants to translate a sequence of events into
a meaningful symbolic sequence. One would like to decide whether or not two
sounds have been played by the same percussion instrument (e.g. snare, bass
drum, open hi hat. . . ) and, more specifically, if two segments contain the same
sound in case of polyphony. With a similarity distance we can derive a value
representing the similarity between two sounds but when two sounds are played
simultaneously a different sound may be created. Thus, a sequence could exist
that allows for multiple interpretations since the system is not able to determine
whether a segment contains one or more sounds played synchronously. A way
to avoid this problem directly and to still get a useful representation is to use
a fuzzy representation of the sequence. If we listen to each segment very de-
tailedly, every segment may sound different. If we listen very coarsely, they may
all sound the same. Only listening with an intermediate level of refinement yields
a reasonable differentiation in which we recognize the reoccurrence of particular
percussive instruments and on which we can perceive meaningful musical struc-
ture. Therefore, we propose to maintain different levels of clustering refinement
simultaneously and then select the level on which we encounter the most regular
non-trivial patterns. In the sequel, we will pursue an implementation of this idea
and describe the process in more detail.

Feature Extraction. We have chosen to define the salient part of the event as
the first 200 ms after the onset position. This duration value is a compromise
between capturing enough information about the attack for representing the
sound reliably and still avoiding irrelevant parts at the end of the segment which
may be due to pauses or interfering other instruments. In the case that the
segment is shorter than 200 ms, we use the entire segment for the extraction
of the feature vector. Across the salient part of the event we calculate the Mel
Frequency Cepstral Coefficient (MFCC) vector frame-by-frame. Over all MFCCs
of the salient event part, we take the weighted mean, weighted by the RMS
energy of each frame. The frame rate is 100 frames per second, the FFT size is
512 samples, and the window size is 256 samples.
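A sketch of this feature extraction step, under the parameters stated above, could look as follows (librosa is used here instead of the toolchain of the original implementation, and frame-count bookkeeping is simplified):

```python
import numpy as np
import librosa

def event_feature(segment, sr, salient_ms=200):
    """13-dimensional event descriptor: RMS-weighted mean MFCC over the salient part."""
    n = min(len(segment), int(sr * salient_ms / 1000))
    salient = segment[:n]
    hop = max(1, sr // 100)                      # ~100 frames per second
    mfcc = librosa.feature.mfcc(y=salient, sr=sr, n_mfcc=13,
                                n_fft=512, win_length=256, hop_length=hop)
    rms = librosa.feature.rms(y=salient, frame_length=256, hop_length=hop)[0]
    k = min(mfcc.shape[1], len(rms))             # guard against off-by-one frame counts
    w = rms[:k] / (rms[:k].sum() + 1e-9)         # RMS weights, normalized to sum to one
    return mfcc[:, :k] @ w                       # weighted mean across frames
```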

Sound Clustering. At this processing stage, each event is characterized by a


13-dimensional vector (and the onset time). Events can thus be seen as points in
a 13-dimensional space in which a topology is induced by the Euclidean distance.

We used the single linkage algorithm to discover event clusters in this space
(cf. [6] for details). This algorithm recursively performs clustering in a bottom-
up manner. Points are grouped into clusters. Then clusters are merged with
additional points and clusters are merged with clusters into super clusters. The
distance between two clusters is defined as the shortest distance between two
points, each being in a different cluster, yielding a binary tree representation of
the point similarities (cf. Fig. 2). The leaf nodes correspond to single events. Each
node of the tree occurs at a certain height, representing the distance between
the two child nodes. Figure 2 (top) shows an example of a clustering tree of the
onset events of a sound sequence.

Fig. 2. A tree representation of the similarity relationship between events (top) of an audio percussion sequence (bottom). The threshold value chosen here leads to a particular cluster configuration. Each cluster with more than one instance is indicated by a colored subtree. The events in the audio sequence are marked in the colors of the clusters they belong to. The height of each node is the distance (according to the single linkage criterion) between its two child nodes. Each of the leaf nodes on the bottom of the graph corresponds to an event.

The height threshold controls the (number of) clusters. Clusters are generated
with inter-cluster distances higher than the height threshold. Two thresholds
lead to the same cluster configuration if and only if their values are both within
the range delimited by the previous lower node and the next upper node in
the tree. It is therefore evident that by changing the height threshold, we can
get as many different cluster configurations as the number of events we have
in the sequence. Each cluster configuration leads to a different symbol alphabet

size and therefore to a different symbol sequence representing the original audio
file. We will refer to those sequences as representation levels or simply levels.
These levels are implicitly ordered. On the leaf level at the bottom of the tree
we find the lowest inter-cluster distances, corresponding to a sequence with each
event being encoded by a unique symbol due to weak quantization. On the root
level on top of the tree we find the cluster configuration with the highest inter-
cluster distances, corresponding to a sequence with all events denoted by the
same symbol due to strong quantization. Given a particular level, we will refer
to the events denoted by the same symbol as the instances of that symbol. We do
not consider the implicit inheritance relationships between symbols of different
levels.
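The following sketch builds the clustering tree with SciPy's single-linkage implementation and derives one symbol sequence per threshold; it approximates the procedure described above, and the way thresholds are enumerated is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clustering_levels(features):
    """One symbol sequence per cluster-distance threshold of the single-linkage tree.

    `features` is an (n_events, 13) array of event descriptors. Each returned
    level maps the events to cluster labels (the "symbols") at that refinement.
    """
    Z = linkage(features, method="single", metric="euclidean")
    # Thresholding at 0 gives the leaf level (one symbol per event); thresholding
    # at each merge height gives the coarser configurations up to the root level.
    thresholds = np.concatenate(([0.0], np.unique(Z[:, 2])))
    return [(t, fcluster(Z, t=t, criterion="distance")) for t in thresholds]

# levels = clustering_levels(np.vstack([event_feature(seg, sr) for _, seg in events]))
```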

Fig. 3. A continuous audio signal (top) is discretized via clustering yielding a sequence
of symbols (bottom). The colors inside the colored triangles denote the cluster of the
event, related to the type of sound, i.e. bass drum, hi-hat, or snare.

2.3 Level Selection


Handling different representations of the same audio file in parallel enables the
system to make predictions based on fine or coarse context structure, depending
on the situation. As explained in the previous section, if the sequence contains
n events the number of total possible distinct levels is n. As the number of
events increases, it becomes particularly costly to use all these levels together because
the number of levels also increases linearly with the number of onsets. Moreover,
as will become clear later, this representation would lead to over-fitted predictions
of new events.
This observation makes it necessary to select only a few levels that can be
considered representative of the sequence in terms of structural regularity.
Given a particular level, let us consider a symbol σ having at least four in-
stances but not more than 60% of the total number of events and let us call
such a symbol an appropriate symbol. The instances of σ define a subsequence
of all the events that is supposedly made of more or less similar sounds accord-
ing to the degree of refinement of the level. Let us just consider the sequence of

onsets given by this subsequence. This sequence can be seen as a set of points
on a time line. We are interested in quantifying the degree of temporal regularity of
those onsets. Firstly, we compute the complete inter-onset interval histogram (CIOIH)1,
i.e. the histogram of the time differences between all possible combinations of two onsets (middle of Fig. 4). What we obtain
is a sort of harmonic series of peaks that are more or less prominent according
to the self-similarity of the sequence on different scales. Secondly, we compute
the autocorrelation ac(t) (where t is the time in seconds) of the CIOIH which, in
case of a regular sequence, has peaks at multiples of its tempo. Let $t_{usp}$ be the
positive time value corresponding to its upper side peak. Given the sequence of
$m$ onsets $x = (x_1, \ldots, x_m)$ we define the regularity of the sequence of onsets $x$
to be:
$$\mathrm{Regularity}(x) = \frac{ac(t_{usp})}{\tfrac{1}{t_{usp}}\int_0^{t_{usp}} ac(t)\,dt}\;\log(m)$$
This definition was motivated by the observation that the higher this value the
more equally the onsets are spaced in time. The logarithm of the number of
onsets was multiplied by the ratio to give more importance to symbols with
more instances.
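A sketch of the regularity computation, following the definition above but with simplifying assumptions (the upper side peak is taken as the first local maximum of the autocorrelation, and the integral term is approximated by a mean over histogram lags), is given below.

```python
import numpy as np

def regularity(onsets, bin_s=0.1):
    """Regularity of an onset subsequence, following the definition above.

    The histogram of all pairwise inter-onset intervals (100 ms bins) is
    autocorrelated; the value at the first local maximum (taken here as the
    upper side peak t_usp) is divided by the mean of the autocorrelation up to
    that lag and scaled by log(m).
    """
    onsets = np.asarray(onsets, dtype=float)
    m = len(onsets)
    if m < 2:
        return 0.0
    iois = np.abs(onsets[:, None] - onsets[None, :])[np.triu_indices(m, k=1)]
    hist, _ = np.histogram(iois, bins=np.arange(0.0, iois.max() + 2 * bin_s, bin_s))
    ac = np.correlate(hist, hist, mode="full")[len(hist) - 1:]      # lags >= 0
    peaks = [k for k in range(1, len(ac) - 1) if ac[k - 1] <= ac[k] >= ac[k + 1]]
    if not peaks:
        return 0.0
    k_usp = peaks[0]
    mean_ac = ac[: k_usp + 1].mean()
    return float(ac[k_usp] / (mean_ac + 1e-9) * np.log(m))
```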

Fig. 4. The procedure applied for computing the regularity value of an onset sequence (top) is outlined. Middle: the histogram of the complete IOI between onsets. Bottom: the autocorrelation of the histogram is shown for a subrange of IOI with relevant peaks marked.

Then we extended, for each level, the regularity concept to an overall regularity
of the level. This simply corresponds to the mean of the regularities for all the
appropriate symbols of the level. The regularity of the level is defined to be zero
in case there is no appropriate symbol.
1 We used a discretization of 100 ms for the histogram bins.

After the regularity value has been computed for each level, we select the level
where the maximum regularity is reached. The resulting level will be referred to
as the regular level.
We also decided to keep the levels where there is a local maximum, because
they generally correspond to levels where a partially regular interpretation of the
sequence is achieved. In the case where consecutive levels of a sequence share
the same regularity, only the one derived from the higher cluster distance threshold
is kept. Figure 5 shows the regularity of the sequence for different
levels.

Fig. 5. Sequence regularity for a range of cluster distance thresholds (x-axis). An ENST audio excerpt was used for the analysis. The regularity reaches its global maximum value in a central position. Towards the right, regularity increases and then remains constant. The selected peaks are marked with red crosses implying a list of cluster distance threshold values.

2.4 Beat Alignment


In order to predict future events without breaking the metrical structure we use
a tempo detection method and introduce a way to align onsets to a metrical
grid, accounting for the position of the sequence in the metrical context. For
our purpose of learning and regenerating the structure statistically, we do not
require a perfect beat detection. Even if we detect a beat that is twice, or half
as fast as the perceived beat, or that mistakes an on-beat for an off-beat, our
system could still tolerate this for the analysis of a music fragment, as long as
the inter beat interval and the beat phase are always misestimated in the same
way.
Our starting point is the regular level that has been found with the procedure
explained in the previous subsection. On this level we select the appropriate sym-
bol with the highest regularity value. The subsequence that carries this symbol

will be referred to as the skeleton subsequence since it is like an anchor structure


to which we relate our metrical interpretation of the sequence.

Tempo Alignment (Inter Beat Interval and Beat Phase). Once the skele-
ton subsequence is found, the inter beat interval is estimated with the procedure
explained in [4]. The tempo is detected considering the intervals between all
possible onset pairs of the sequence using a score voting criterion. This method
gives higher scores to the intervals that occur more often and that are related
by integer ratios to other occurring inter onset intervals.
Then the onsets of the skeleton subsequence are parsed in order to detect
a possible alignment of the grid to the sequence. We allow a tolerance of 6%
of the duration of the inter beat interval for the alignment of an onset to a
grid position. We chose the interpretation that aligns the highest number of
instances to the grid. After discarding the onsets that are not aligned, we obtain
the preliminary skeleton grid. In Fig. 6 the procedure is illustrated.
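The phase-selection step can be sketched as follows. The inter beat interval is assumed to be given (estimated with Dixon's method [4] in the actual system); the sketch only chooses the beat phase that aligns the most skeleton onsets.

```python
import numpy as np

def align_beat_phase(onsets, ibi, tol=0.06):
    """Pick the beat phase that aligns the most skeleton onsets to a beat grid.

    For every candidate phase (each onset taken as a grid anchor), count the
    onsets falling within +/- tol * ibi of a grid position, and keep the best.
    """
    onsets = np.asarray(onsets, dtype=float)
    best_phase, best_aligned = None, []
    for anchor in onsets:
        # Distance of every onset to the nearest grid line anchored at `anchor`.
        rel = (onsets - anchor) / ibi
        err = np.abs(rel - np.round(rel)) * ibi
        aligned = onsets[err <= tol * ibi]
        if len(aligned) > len(best_aligned):
            best_phase, best_aligned = anchor % ibi, aligned
    return best_phase, best_aligned          # preliminary skeleton grid onsets
```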

Fig. 6. Above, a skeleton sequence is represented on a timeline. Below, three possible alignments of the sequences are given by Dixon's method [4]. All these alignments are based on the same inter beat interval but on different beat phases. Each alignment captures a particular onset subset (represented by a particular graphic marker) of the skeleton sequence and discards the remaining onsets of the skeleton sequence. The beat phase that allows to catch the highest number of onsets (the filled red crosses) is selected and the remaining onsets are removed, thereby yielding the preliminary skeleton grid.

Creation of Skeleton Grid. The preliminary skeleton grid is a sequence of


onsets spaced at multiples of a constant time interval, the inter beat interval.
But, as shown in Fig. 6, it can still have some gaps (due to missing onsets). The
missing onsets are, thus, detected and, in a first attempt, the system tries to align
the missing onsets with any onset of the entire event sequence (not just the one
symbol forming the preliminary skeleton grid). If there is any onset within a
tolerance range of ±6% of the inter beat interval of the expected beat position,
the expected onset will be aligned to this onset. If no onset within this tolerance
range is encountered, the system creates a (virtual) grid bar in the expected beat
position.
At the end of this completion procedure, we obtain a quasi-periodic skeleton
grid, a sequence of beats (events) sharing the same metrical position (the same
metrical phase).

Fig. 7. The event sequence derived from a segmentation by onset detection is indi-
cated by triangles. The vertical lines show the division of the sequence into blocks of
homogeneous tempo. The red solid lines represent the beat position (as obtained by
the skeleton subsequence). The other black lines (either dashed if aligned to a detected
onset or dotted if no close onset is found) represent the subdivisions of the measure
into four blocks.

Because of the tolerance used for building such a grid, the effective measure
duration can sometimes turn out to be slightly longer or slightly shorter.
This implements the idea that the grid should be elastic, in the sense
that, up to a certain degree, it adapts to the (expressive) timing variations of
the actual sequence.
The skeleton grid captures only a part of the complete list of onsets, but we would
like to build a grid where most of the onsets are aligned. Therefore, starting from
the skeleton grid, the intermediate point between every two subsequent beats is
found and aligned with an onset (if one exists within a tolerance region; otherwise a
place-holding onset is added). The procedure is recursively repeated until at least
80% of the onsets are aligned to a grid position or the number of created onsets
exceeds the total number of onsets. In Fig. 7, an example is presented along
with the resulting grid, where the skeleton grid, its aligned subdivisions, and the non-aligned
subdivisions are indicated by different line markers.
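The recursive subdivision can be sketched as below; the handling of edge cases is simplified and the stopping criteria follow the description above.

```python
import numpy as np

def refine_grid(beat_grid, onsets, ibi, tol_frac=0.06, target=0.8):
    """Recursively subdivide the beat grid until most onsets are aligned.

    Midpoints between consecutive grid positions are snapped to a nearby onset
    when one lies within +/- tol_frac * ibi; otherwise a place-holding position
    is kept. The loop stops once `target` of the onsets are aligned or the grid
    has grown to the size of the onset list.
    """
    onsets = np.asarray(onsets, dtype=float)
    grid = list(beat_grid)
    if len(grid) < 2 or len(onsets) == 0:
        return np.asarray(grid)
    tol = tol_frac * ibi
    while True:
        new_grid = []
        for a, b in zip(grid[:-1], grid[1:]):
            new_grid.append(a)
            mid = 0.5 * (a + b)
            near = onsets[np.abs(onsets - mid) <= tol]
            new_grid.append(near[0] if len(near) else mid)    # snap or place-hold
        new_grid.append(grid[-1])
        grid = new_grid
        garr = np.asarray(grid)
        aligned = sum(np.min(np.abs(garr - o)) <= tol for o in onsets)
        if aligned >= target * len(onsets) or len(grid) >= len(onsets):
            return garr
```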
Note that, for the sake of simplicity, our approach assumes that the metrical
structure is binary. This can cause the sequence to be split erroneously.
However, we will see in a ternary-meter example that this is not a limiting factor
for the generation, because the statistical representation compensates for it to
some extent, even if less variable generations are achieved. A more general approach
could be implemented with minor modifications.
The final grid is made of blocks of time of almost equal duration that can
contain none, one, or more onset events. It is important that the sequence given
to the statistical model is almost homogeneous in time so that a certain number
of blocks corresponds to a defined time duration.
We used the following rules to assign a symbol to a block (cf. Fig 7):
– blocks starting on an aligned onset are denoted by the symbol of the aligned
onset,
– blocks starting on a non-aligned grid position are denoted by the symbol of
the previous block.
Finally, a metrical phase value is assigned to each block describing the number of
grid positions passed after the last beat position (corresponding to the metrical

position of the block). For each representation level the new representation of
the sequence will be the Cartesian product of the instrument symbol and the
phase.

3 Statistical Model Learning


Now we statistically analyze the structure of the symbol sequence obtained in
the last section. We employ variable length Markov chains (VLMC) for the sta-
tistical analysis of the sequences. In [2,15], a general method for inferring long
sequences is described. For faster computation, we use a simplified implementa-
tion as described in [13]. We construct a suffix tree for each level based on the
sequence of that level. Each node of the tree represents a specific context that
had occurred in the past. In addition, each node carries a list of continuation
indices corresponding to block indices matching the context.
For audio, a different approach has been applied in [5]. This method does
not require an event-wise symbolic representation as it employs the factor oracle
algorithm. VLMCs have not been applied to audio before because of the absence
of an event-wise symbolic representation such as the one presented above.
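As an illustration of the data structure involved, the sketch below stores, for every context up to a maximal length, the list of continuation indices; a dictionary of context tuples is used instead of an explicit suffix tree, so it is only an approximation of the VLMC machinery of [2,15].

```python
from collections import defaultdict

def build_context_model(symbols, max_order=4):
    """Map every context (up to max_order symbols) to the block indices following it."""
    model = defaultdict(list)
    for i in range(1, len(symbols)):
        for order in range(1, max_order + 1):
            if i - order < 0:
                break
            context = tuple(symbols[i - order:i])
            model[context].append(i)      # block i is a continuation of this context
    return model

def continuations(model, history, max_order=4):
    """Continuation indices for the longest suffix of `history` present in the model."""
    for order in range(min(max_order, len(history)), 0, -1):
        context = tuple(history[-order:])
        if context in model:
            return order, model[context]
    return 0, []
```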

3.1 Generation Strategies


If we fix a particular level the continuation indices are drawn according to a
posterior probability distribution determined by the longest context found. But
which level should be chosen? Depending on the sequence, it could be better
to do predictions based either on a coarse or a fine level but it is not clear
which one should be preferred. First, we selected the lowest level at which a
context of at least l̂ existed (for a predetermined fixed l̂, usually l̂ equal to 3 or
4). This works quite well for many examples. But in some cases a context of
that length does not exist and the system often reaches the highest level, where
too many symbols are provided, inducing overly random generations. On the other
hand, it occurs very often that the lowest level is made of singleton clusters that
have only one instance. In this case, a long context is found at the lowest level,
but since a particular symbol sequence only occurs once in the whole original
segment, the system replicates the audio in the same order as the original. This
behavior often leads to the exact reproduction of the original until reaching its
end, followed by a random jump to another block in the original sequence.
In order to increase recombination of blocks and still provide good contin-
uation we employ some heuristics taking into account multiple levels for the
prediction. We set p to be a recombination value between 0 and 1. We also need
to preprocess the block sequence to prevent arriving at the end of the sequence
without any musically meaningful continuation. For this purpose, before learn-
ing the sequence, we remove the last blocks until the remaining sequence ends
with a context of at least length two. We make use of the following heuristics to
generate the continuation in each step (a simplified code sketch follows the list):

– Set a maximal context length l̂ and compute the list of indices for each level
using the appropriate suffix tree. Store the achieved length of the context
for each level.
– Count the number of indices provided by each level. Select only the levels
that provide less than 75% of the total number of blocks.
– Among these level candidates, select only the ones that have the longest
context.
– Merge all the continuation indices across the selected levels and remove the
trivial continuation (the next onset).
– In case there is no level providing such a context and the current block is
not the last, use the next block as a continuation.
– Otherwise, decide randomly with probability p whether to select the next
block or rather to generate the actual continuation by selecting randomly
between the merged indices.
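A simplified sketch of one generation step, combining the context models of several levels as in the heuristics above (the 75% cut-off, longest-context selection, removal of the trivial continuation, and the recombination probability p), is given below; the fallback behaviour is slightly simplified with respect to the description.

```python
import random

def next_block(models, level_symbols, history_idx, n_blocks, p=0.5, max_order=4):
    """One generation step over several representation levels (simplified sketch).

    `models` holds one context dictionary per level, as built in the previous
    sketch (context tuple -> list of continuation block indices), and
    `level_symbols` the corresponding symbol sequences. `history_idx` contains
    the (non-empty) list of indices of the blocks generated so far.
    """
    candidates, best_order = [], 0
    for model, symbols in zip(models, level_symbols):
        history = [symbols[i] for i in history_idx]
        # Longest context of this level's symbols that appears in its model.
        order, idx = 0, []
        for k in range(min(max_order, len(history)), 0, -1):
            ctx = tuple(history[-k:])
            if ctx in model:
                order, idx = k, model[ctx]
                break
        if not idx or len(idx) > 0.75 * n_blocks:     # level too unspecific: skip it
            continue
        if order > best_order:
            candidates, best_order = list(idx), order
        elif order == best_order and order > 0:
            candidates.extend(idx)
    trivial = history_idx[-1] + 1                     # simply playing the next block
    candidates = [i for i in candidates if i != trivial and i < n_blocks]
    if not candidates or random.random() > p:
        return min(trivial, n_blocks - 1)
    return random.choice(candidates)
```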

4 Evaluation and Examples

We tested the system on two audio databases. The first one is the ENST
database (see [7]) that provided a collection of around forty drum recording
examples. For a descriptive evaluation, we asked two professional percussion-
ists to judge several examples of generations as if they were performances of a
student. Moreover, we asked one of them to record beat boxing excerpts trying
to push the system to the limits of complexity and to critically assess the se-
quences that the system had generated from these recordings. The evaluations
of the generations created from the ENST examples revealed that the style of
the original had been maintained and that the generations had a high degree of
interestingness [10].
Some examples are available on the website [1] along with graphical anima-
tions visualizing the analysis process. In each video, we see the original sound
fragment and the generation derived from it. The horizontal axis corresponds
to the time in seconds and the vertical axis to the clustering quantization res-
olution. Each video shows an animated graphical representation in which each
block is represented by a triangle. At each moment, the context and the currently
played block are represented by enlarged and colored triangles.
In the first part of the video, the original sound is played and the animation
shows the extracted block representation. The currently played block is repre-
sented by an enlarged colored triangle and highlighted by a vertical dashed red
line. The other colored triangles highlight all blocks from the starting point of
the bar up to the current block. In the second part of the video, only the skele-
ton subsequence is played. The sequence on top is derived from applying the
largest clustering threshold (smallest number of clusters) and the one on the
bottom corresponds to the lowest clustering threshold (highest number of clus-
ters). In the final part of the video, the generation is shown. The colored triangles

represent the current block and the current context. The size of the colored
triangles decreases monotonically from the current block backwards displaying
the past time context window considered by the system. The colored triangles
appear only on the levels selected by the generation strategy.
In Figure 8, we see an example of successive states of the generation. The
levels used by the generator to compute the continuation and the context are
highlighted showing colored triangles that decrease in size from the largest, cor-
responding to the current block, to the smallest that is the furthest past context
block considered by the system. In Frame I, the generation starts with block
no 4, belonging to the event class indicated by light blue. In the beginning, no
previous context is considered for the generation. In Frame II, a successive block
no 11 of the green event class has been selected using all five levels α–ε and a
context history of length 1 just consisting of block no 4 of the light blue event
class. Note that the context given by only one light blue block matches the con-
tinuation no 11, since the previous block (no 10) is also denoted by light blue at
all the five levels. In Frame III, the context is the bi-gram of the event classes
light blue (no 4) and green (no 11). Only level α is selected since at all other
levels the bi-gram that corresponds to the colors light blue and green appears
only once. However, at level α the system finds three matches (blocks no 6, 10
and 12) and randomly selects no 10. In Frame IV, the levels differ in the length
of the maximal past context. At level α one and only one match (no 11) is found
for the 3-gram light blue - green - light blue, and thus this level is discarded. At
levels β, γ and δ, no matches for 3-grams are found, but all these levels include
2 matches (blocks no 5 and 9) for the bi-gram (green - light blue). At level ε, no
match is found for a bi-gram either, but 3 occurrences of the light blue triangle
are found.

5 Discussion
Our system effectively generates sequences respecting the structure and the
tempo of the original sound fragment for medium to high complexity rhythmic
patterns.
A descriptive evaluation by a professional percussionist confirmed that the
metrical structure is correctly managed and that the statistical representation
generates musically meaningful sequences. He noticed explicitly that the drum
fills (short musical passages which help to sustain the listener’s attention during
a break between the phrases) were handled adequately by the system.
The percussionist's criticisms were directed at the lack of dynamics, agogics,
and musically meaningful long-term phrasing, which we did not address in our
approach.
Part of those features could be achieved in the future by extending the system
to the analysis of non-binary meter. Achieving musically sensible dynamics and
agogics (rallentando, accelerando, rubato, ...) in the generated musical continuation,
for example by extrapolation [14], remains a challenge for future work.

Fig. 8. Nine successive frames of the generation. The red vertical dashed line marks the currently played event. In each frame, the largest colored triangle denotes the last played event that influences the generation of the next event. The size of the triangles decreases going back in time. Only for the selected levels the triangles are enlarged. We can see how the length of the context as well as the number of selected levels dynamically change during the generation. Cf. Section 4 for a detailed discussion of this figure.

Acknowledgments

Many thanks to Panos Papiotis for his patience during lengthy recording sessions
and for providing us with beat boxing examples, the evaluation feedback, and
inspiring comments. Thanks a lot to Ricard Marxer for his helpful support. The
first author (MM) expresses his gratitude to Mirko Degli Esposti and Anna Rita
Addessi for their support and for motivating this work. The second author (HP)
was supported by a Juan de la Cierva scholarship of the Spanish Ministry of
Science and Innovation.

References
1. (December 2010), www.youtube.com/user/audiocontinuation
2. Bühlmann, P., Wyner, A.J.: Variable length Markov chains. Annals of Statistics 27,
480–513 (1999)
3. Cope, D.: Virtual Music: Computer Synthesis of Musical Style. MIT Press, Cam-
bridge (2004)
4. Dixon, S.: Automatic extraction of tempo and beat from expressive performances.
Journal of New Music Research 30(1), 39–58 (2001)
5. Dubnov, S., Assayag, G., Cont, A.: Audio oracle: A new algorithm for fast learning
of audio structures. In: Proceedings of International Computer Music Conference
(ICMC), pp. 224–228 (2007)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. Wiley, Chichester
(2001)
7. Gillet, O., Richard, G.: Enst-drums: an extensive audio-visual database for drum
signals processing. In: ISMIR, pp. 156–159 (2006)
8. Hazan, A., Marxer, R., Brossier, P., Purwins, H., Herrera, P., Serra, X.: What/when
causal expectation modelling applied to audio signals. Connection Science 21, 119–
143 (2009)
9. Lartillot, O., Toiviainen, P., Eerola, T.: A matlab toolbox for music information
retrieval. In: Annual Conference of the German Classification Society (2007)
10. Marchini, M.: Unsupervised Generation of Percussion Sequences from a Sound
Example. Master’s thesis (2010)
11. Marchini, M., Purwins, H.: Unsupervised generation of percussion sound sequences
from a sound example. In: Sound and Music Computing Conference (2010)
12. Marxer, R., Purwins, H.: Unsupervised incremental learning and prediction of au-
dio signals. In: Proceedings of 20th International Symposium on Music Acoustics
(2010)
13. Pachet, F.: The continuator: Musical interaction with style. In: Proceedings of
ICMC, pp. 211–218. ICMA (2002)
14. Purwins, H., Holonowicz, P., Herrera, P.: Polynomial extrapolation for prediction of
surprise based on loudness - a preliminary study. In: Sound and Music Computing
Conference, Porto (2009)
15. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic
automata with variable memory length. Mach. Learn. 25(2-3), 117–149 (1996)
Identifying Attack Articulations
in Classical Guitar

Tan Hakan Özaslan, Enric Guaus, Eric Palacios, and Josep Lluis Arcos

IIIA, Artificial Intelligence Research Institute


CSIC, Spanish National Research Council
Campus UAB, 08193 Bellaterra, Spain
{tan,eguaus,epalacios,arcos}@iiia.csic.es

Abstract. The study of musical expressivity is an active field in sound


and music computing. The research interest comes from different mo-
tivations: to understand or model music expressivity; to identify the
expressive resources that characterize an instrument, musical genre, or
performer; or to build synthesis systems able to play expressively. Our re-
search is focused on the study of classical guitar and deals with modeling
the use of the expressive resources in the guitar. In this paper, we present
a system that combines several state of the art analysis algorithms to
identify guitar left hand articulations such as legatos and glissandos.
After describing the components of our system, we report some experi-
ments with recordings containing single articulations and short melodies
performed by a professional guitarist.

1 Introduction

Musical expressivity can be studied by analyzing the differences (deviations) be-


tween a musical score and its execution. These deviations are mainly motivated
by two purposes: to clarify the musical structure [26,10,23] and as a way to com-
municate affective content [16,19,11]. Moreover, these expressive deviations vary
depending on the musical genre, the instrument, and the performer. Specifically,
each performer has his/her own unique way to add expressivity by using the
instrument.
Our research on musical expressivity aims at developing a system able to
model the use of the expressive resources of a classical guitar. In guitar playing,
both hands are used: one hand is used to press the strings on the fretboard and
the other to pluck the strings. Strings can be plucked using a single plectrum
called a flatpick or by directly using the tips of the fingers. The hand that presses
the frets is mainly determining the notes while the hand that plucks the strings
is mainly determining the note onsets and timbral properties. However, the left hand
is also involved in the creation of a note onset or different expressive articulations
like legatos, glissandos, and vibratos.
Some guitarists use the right hand to pluck the strings whereas others use the
left hand. For the sake of simplicity, in the rest of the document we consider the


Fig. 1. Main diagram model of our system

hand that plucks the strings as the right hand and the hand that presses the
frets as the left hand.
As a first stage of our research, we are developing a tool able to automatically
identify, from a recording, the use of guitar articulations. According to Norton
[22], guitar articulations can be divided into three main groups related to the
place of the sound where they act: attack, sustain, and release articulations.
In this research we are focusing on the identification of attack articulations
such as legatos and glissandos. Specifically, we present an automatic detection
and classification system that uses as input audio recordings. We can divide
our system into two main modules (Figure 1): extraction and classification. The
extraction module determines the expressive articulation regions of a classical
guitar recording whereas the classification module analyzes these regions and
determines the kind of articulation (legato or glissando).
In both legato and glissando, the left hand is involved in the creation of the note
onset. In the case of ascending legato, after plucking the string with the right
hand, one of the fingers of the left hand (not already used for pressing one of the
frets) presses a fret, causing another note onset. Descending legato is performed
by plucking the string with a left-hand finger that was previously used to play
a note (i.e. pressing a fret).
The case of glissando is similar, but this time, after plucking one of the strings
with the right hand, the left-hand finger that is pressing the string slides to
another fret, also generating another note onset.
When playing legato or glissando on guitar, it is common for the performer to
play more notes within a beat than the notated timing indicates, enriching the music that
is played. A pronounced legato and glissando can easily be differentiated from each
other by ear. However, in a musical phrase context where the legato and
glissando are not isolated, it is hard to differentiate between these two expressive
articulations.
The structure of the paper is as follows: Section 2 briefly describes the current
state of the art of guitar analysis studies. Section 3 describes our methodology
for articulation determination and classification. Section 4 focuses on the exper-
iments conducted to evaluate our approach. Last section, Section 5, summarizes
current results and presents the next research steps.

2 Related Work
Guitar is one of the most popular instruments in Western music, and most
music genres include the guitar. Although plucked instruments and guitar
synthesis have been studied extensively (see [9,22]), the analysis of expressive
articulations from real guitar recordings has not been fully tackled. This analysis
is complex because the guitar is an instrument with a rich repertoire of expressive
articulations and because, when playing guitar melodies, several strings may be
vibrating at the same time. Moreover, even the synthesis of a
single tone is a complex subject [9].
Expressive studies go back to the early twentieth century. In 1913, Johnstone
[15] analyzed piano performers. Johnstone’s analysis can be considered as one
of the first studies focusing on musical expressivity. Advances in audio processing
techniques have provided the opportunity to analyze audio recordings at a finer level
(see [12] for an overview). Up to now, there have been several studies focused on the
analysis of expressivity of different instruments. Although the instruments ana-
lyzed differ, most of them focus on analyzing monophonic or single instrument
recordings.
For instance, Mantaras et al. [20] presented a survey of computer music sys-
tems based on Artificial Intelligence techniques. Examples of AI-based systems
are SaxEx [1] and TempoExpress [13]. SaxEx is a case-based reasoning system
that generates expressive jazz saxophone melodies from recorded examples of
human performances. More recently, TempoExpress performs tempo transfor-
mations of audio recordings taking into account the expressive characteristics of
a performance and using a CBR approach.
Regarding guitar analysis, an interesting line of research comes from Stanford Univer-
sity. Traube [28] estimated the plucking point on a guitar string by using a
frequency-domain technique applied to acoustically recorded signals. The pluck-
ing point of a guitar string affects the sound envelope and influences the timbral
characteristics of notes. For instance, plucking close to the sound hole produces
more mellow and sustained sounds, whereas plucking near the bridge (end of the
guitar body) produces sharper and less sustained sounds. Traube also proposed
an original method to detect the fingering point, based on the plucking point
information.
In another interesting paper, Lee [17] proposes a new method for extracting
the excitation point of an acoustic guitar signal. Before explaining the method,
three state-of-the-art techniques are examined for comparison with the new
one. The techniques analyzed are matrix-pencil inverse-filtering, sinusoids-plus-noise
inverse-filtering, and magnitude spectrum smoothing. After describing and
comparing these three techniques, the author proposes a new method, statistical
spectral interpolation, for excitation signal extraction.
Although fingering studies are not directly related to expressivity, their results
may contribute to clarify and/or constrain the use of left-hand expressive
articulations. Heijink and Meulenbroek [14] performed a behavioral
study of the complexity of left-hand fingering in classical guitar.
Different audio and camera recordings of six professional guitarists playing the
same song were used to find optimal places and fingerings for the notes. Several
constraints were introduced to calculate cost functions, such as minimization of
jerk, torque change, muscle-tension change, work, energy and neuromotor variance.
As a result of the study, they found a significant effect on timing.
In another interesting study, Radisavljevic and Driessen [25] investigate the optimal fingering
position for a given set of notes. Their method, path difference learning, uses tablatures
and AI techniques to obtain fingering positions and transitions. Radicioni
and Lombardo [24] also worked on finding the proper fingering positions and transitions.
Specifically, they calculated the weights of the finger transitions between finger
positions by using the weights of Heijink [14]. Burns and Wanderley [4] proposed
a method to visually detect and recognize fingering gestures of the left hand of
a guitarist by using an affordable camera.
Unlike the general trend in the literature, Trajano et al. [27] investigated right-hand
fingering. Although they analyzed the right hand, their approach has similarities
with left-hand studies. In their article, they use their own definitions and
cost functions to calculate the optimal selection of right-hand fingers.
The first step when analyzing guitar expressivity is to identify and characterize
the way notes are played, i.e. guitar articulations. The analysis of expressive
articulations has previously been performed with image analysis techniques. Last
but not least, one of the few studies focusing on guitar expressivity is
the PhD thesis of Norton [22]. In his dissertation, Norton proposed the use of a
motion capture system by PhaseSpace Inc. to analyze guitar articulations.

3 Methodology
Articulation refers to how the pieces of something are joined together. In music,
these pieces are the notes, and the different ways of executing them are called
articulations. In this paper we propose a new system that is able to determine
and classify two expressive articulations from audio files. For this purpose we
have two main modules: the extraction module and the classification module (see
Figure 1). In the extraction module, we determine the sound segments where
expressive articulations are present. The purpose of this module is to classify
audio regions as containing expressive articulations or not. Next, the classification module
analyzes the regions that were identified as candidates of expressive articulations
by the extraction module, and labels them as legato or glissando.
3.1 Extraction
The goal of the extraction module is to find the places where a performer played
expressive articulations. To that purpose, we analyzed a recording using several
audio analysis algorithms, and combined the information obtained from them to
make a decision.
Our approach is based on first determining the note onsets caused by plucking
the strings. Next, a more fine-grained analysis is performed inside the regions
delimited by two plucking onsets to determine whether an articulation may be
present. A simple diagram of the extraction module is shown in
Figure 2.
Fig. 2. Extraction module diagram

For the analysis we used Aubio [2]. Aubio is a library designed for the annotation
of audio signals. The Aubio library includes four main applications: aubioonset,
aubionotes, aubiocut, and aubiopitch. Each application gives us the chance to
try different algorithms and also to tune several other parameters. In the current
prototype we use aubioonset for our plucking detection sub-module
and aubionotes for our pitch detection sub-module.
At the end we combine the outputs from both sub-modules and decide whether
there is an expressive articulation or not. In the next two sections, the plucking
detection sub-module and the pitch detection sub-module are described. Finally,
we explain how we combine the information provided by these two sub-modules
to determine the existence of expressive articulations.
Plucking Detection. Our first task is to determine the onsets caused by the
plucking hand. As we stated before, guitar performers can apply different articulations
with both of their hands. However, the kind of articulations that
we are investigating (legatos and glissandos) are performed by the left hand.
Although they can cause onsets, these onsets are not as powerful in terms of
both energy and harmonicity [28]. Therefore, we need an onset determination
algorithm suited to this specific characteristic.
The High Frequency Content (HFC) measure is a measure taken across a signal
spectrum, and can be used to characterize the amount of high-frequency content
in the signal. The magnitudes of the spectral bins are added together, weighting
each magnitude by its bin position [21]. As Brossier stated, HFC
is effective with percussive onsets but less successful at determining non-percussive
and legato phrases [3]. As right-hand onsets are more percussive than left-hand
onsets, HFC was the strongest candidate detection algorithm for right-hand
onsets. HFC is sensitive to abrupt onsets but not overly sensitive to the
changes of fundamental frequency caused by the left hand. This is the main
reason why we chose HFC to measure the changes in the harmonic content of
the signal.
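As an illustration only, a minimal sketch of the HFC measure for a single frame (not the Aubio implementation; the frame length and FFT size are arbitrary assumptions) could look as follows:

```python
import numpy as np

def hfc(frame, n_fft=2048):
    """High Frequency Content of one audio frame: the sum of the
    spectral magnitudes, each weighted by its bin index."""
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft))
    bins = np.arange(len(spectrum))
    return float(np.sum(bins * spectrum))
```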
The aubioonset application gave us the opportunity to tune the peak-picking threshold,
which we tested with a set of hand-labeled recordings including both articulated
and non-articulated notes. We used 1.7 for the peak-picking threshold and
Fig. 3. HFC onsets

Fig. 4. Features of the portion between two onsets

−95 dB for the silence threshold. We used this set as our ground truth and tuned our
values according to it.
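The paper uses the aubioonset command-line tool; purely as an illustration, a roughly equivalent configuration with the aubio Python bindings could be sketched as below (the bindings are assumed to be installed, and the file name is hypothetical):

```python
import aubio

hop = 512
src = aubio.source("recording.wav", 0, hop)            # 0 = keep the file's sample rate
onset = aubio.onset("hfc", 2048, hop, src.samplerate)  # HFC onset detection
onset.set_threshold(1.7)                               # peak-picking threshold
onset.set_silence(-95.)                                # silence threshold in dB

plucks = []
while True:
    samples, read = src()
    if onset(samples):
        plucks.append(onset.get_last_s())              # onset time in seconds
    if read < hop:
        break
```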
An example of the resulting onsets proposed by HFC is shown in Figure 3.
Specifically, in the exemplified recording 5 plucking onsets are detected, i.e. onsets
caused by the plucking hand, which are shown with vertical lines. Between some
pairs of detected onsets expressive articulations are present. However, as shown
in the figure, HFC succeeds as it only detects the onsets caused by the right
hand.
Next, each portion between two plucking onsets is analyzed individually.
Specifically, we are interested in determining two points: the end of the attack
and the start of the release. From experimental measures, the attack end position is
considered to be 10 ms after the amplitude reaches its local maximum. The release start
Fig. 5. Note Extraction without chroma feature

position is considered to be the final point where the local amplitude is equal to or greater
than 3 percent of the local maximum. For example, in Figure 4 the first portion
of Figure 3 is zoomed in. The first and the last lines are the plucking onsets
identified by the HFC algorithm. The first dashed line is the place where the attack
finishes. The second dashed line is the place where the release starts.
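These two heuristics can be sketched on the amplitude envelope of one inter-onset segment as follows; the function and parameter names are hypothetical, while the 10 ms offset and 3 percent ratio follow the values given above:

```python
import numpy as np

def attack_release_bounds(env, env_rate, attack_offset_ms=10, release_ratio=0.03):
    """Locate the attack end and the release start inside one segment.

    env: amplitude envelope samples of the segment,
    env_rate: number of envelope samples per second.
    """
    peak_idx = int(np.argmax(env))                      # local amplitude maximum
    attack_end = peak_idx + int(env_rate * attack_offset_ms / 1000)
    above = np.nonzero(env >= release_ratio * env[peak_idx])[0]
    release_start = int(above[-1])                      # last point above 3% of the maximum
    return attack_end, release_start
```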

Pitch Detection. Our second task was to analyze the sound fragment between
two onsets. Since we know the onsets of the plucking hand, what we require is
another peak detection algorithm with a lower threshold in order to capture the
changes in fundamental frequency. Specifically, if the fundamental frequency is not
constant between two onsets, we consider the possibility of the existence of
an expressive articulation to be high.
In the pitch detection module, i.e. to extract onsets and their corresponding
fundamental frequencies, we used aubionotes. In the Aubio library, both the onset
detection and the fundamental frequency estimation algorithms can be chosen from
several alternatives. For onset detection, this time we need a more sensitive
algorithm than the one we used for right-hand onset detection. Thus,
we used the complex domain algorithm [8] to determine the peaks and YIN [6] for
the fundamental frequency estimation. Complex domain onset detection is based
on a combination of a phase-based and an energy-based approach.
We used 2048 bins as our window size, 512 bins as our hop size, 1 as our peak-picking
threshold and −95 dB as our silence threshold. With these parameters
we obtained an output like the one shown in Figure 5. As shown in the figure, the first
results were not as we expected: they were noisier than anticipated.
There were noisy parts, especially at the beginnings of the notes, which generated
Fig. 6. Note Extraction with chroma feature

false-positive peaks. For instance, in Figure 5 many false-positive note onsets are
detected in the interval from 0 to 0.2 seconds.
A careful analysis of the results showed that the false-positive peaks
were located in the regions around the notes' frequency borders. Therefore, we propose
a lightweight solution to the problem: applying a chroma filtering to the regions
that are at the borders of the complex domain peaks. As shown in Figure 6, after
applying the chroma conversion, the results are drastically improved.
Next, we analyzed the fragments between two onsets based on the segments
provided by the plucking detection module. Specifically, we analyzed the sound
fragment between the attack ending point and the release starting point, because the
noisiest part of a signal is the attack, and the release part of a signal contains
unnecessary information for pitch detection [7]. Therefore, for our analysis we
take the fragment between the attack and release parts, where the pitch information is
relatively constant.
Figure 7 shows the fundamental frequency values and the right-hand onsets. The x-axis
represents the time-domain bins and the y-axis represents the frequency. In Figure 7,
the vertical lines depict the attack and release parts, respectively. In the middle
there is a change in frequency, which was not detected as an onset by the
first module. Although it might seem like an error, it is in fact a successful result for our model.
Specifically, in this phrase there is a glissando, which is a left-hand articulation, and
it was not identified as an onset by the plucking detection module (HFC algorithm),
but it was identified by the pitch detection module (complex domain algorithm). The
output of the pitch detection module for this recording is shown in Table 1.

Analysis and Annotation. After obtaining the results from plucking detec-
tion and pitch detection modules, the goal of the analysis and annotation module
is to determine the candidates of expressive articulations. Specifically, from the
results of the pitch detection module, we analyze the differences of fundamental
Fig. 7. Example of a glissando articulation

Table 1. Output of the pitch detection module

Note Start (s)   Fundamental Frequency (Hz)

0.02             130
0.19             130
0.37             130
0.46             146
0.66             146
0.76             146
0.99             146
1.10             146
1.41             174
1.48             116

frequencies in the segments between attack and release parts (provided by the
plucking detection module). For instance, in Table 1 the light gray values rep-
resent the attack and release parts, which we did not take into account while
applying our decision algorithm.
The differences of fundamental frequencies are calculated by subtracting from
each bin the value of its preceding bin. Thus, when the fragment we are examining is a non-articulated
fragment, this operation returns 0 for all bins. In contrast,
in expressively articulated fragments some peaks will arise (see Figure 8 for an
example).
In Figure 8 there is only one peak, but in other recordings several consecutive
peaks may arise. The explanation is that the left hand also causes an onset, i.e.
it also generates a transient part. As a result of this transient, more than one
change in fundamental frequency may be present. If those changes or peaks are
close to each other, we consider them as a single peak.
We define this closeness with a pre-determined consecutiveness threshold.
Specifically, if the maximum distance between these peaks is 5 bins, we
Fig. 8. Difference vector of the fundamental frequency values

consider them as an expressive articulation candidate peak. However, if the
peaks are separated from each other by more than the consecutiveness threshold, the
fragment is not considered an articulation candidate. Our interpretation is that
such a pattern corresponds to a probable noisy part of the signal, a crackle in the recording, or
a digital conversion error.
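This decision rule can be sketched as follows; the function name, the peak-picking tolerance and the way peaks are extracted from the difference vector are assumptions, while the 5-bin consecutiveness threshold follows the text:

```python
import numpy as np

def is_articulation_candidate(f0, consecutiveness=5, tol=1.0):
    """Decide whether a fragment (attack and release bins already removed)
    is an expressive-articulation candidate.

    f0: fundamental frequency value per bin; tol: assumed minimal change
    (in Hz) for a difference to count as a peak.
    """
    diff = np.abs(np.diff(f0))
    peaks = np.nonzero(diff > tol)[0]        # bins where the pitch changes
    if peaks.size == 0:
        return False                          # constant pitch: no articulation
    # all pitch-change peaks must lie within the consecutiveness threshold
    return bool(np.all(np.diff(peaks) <= consecutiveness))
```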

3.2 Classification
The classification module analyzes the regions identified by the extraction module
and labels them as legato or glissando. A diagram of the classification module
is shown in Figure 9. In this section we first describe how we selected an
appropriate descriptor to analyze the behavior of legato and glissando. Then, we
explain the two new components, the Models Builder and Detection.
Selecting a Descriptor. After extracting the regions containing candidates
of expressive articulations, the next step was to analyze them. Because different
expressive articulations (legato vs. glissando) should present different characteristics
in terms of changes in amplitude, aperiodicity, or pitch [22], we focused
the analysis on comparing these deviations.
Specifically, we built representations of these three features (amplitude, aperiodicity,
and pitch). The representations helped us to compare data with
different lengths and densities. As we stated above, we are mostly interested in
changes: changes in high frequency content, changes in fundamental frequency,
changes in amplitude, etc. Therefore, we explored the peaks in the examined
data, because peaks are the points where changes occur.
As an example, Figure 10 shows, from top to bottom, the amplitude evolution,
pitch evolution, and changes in aperiodicity for both a legato and a glissando example. As
both examples show, the changes in pitch are similar. However, the changes in amplitude and aperiodicity present a characteristic
slope.
Thus, as a first step we concentrated on determining which descriptor could
be used. To make this decision, we built models for both aperiodicity and
Fig. 9. Classification module diagram

(a) Features of a legato example (b) features of a glissando example

Fig. 10. From top to bottom, representations of amplitude, pitch and aperiodicity of
the examined regions

amplitude by using a set of training data. As a result, we obtained two models
(for amplitude and aperiodicity) for both legato and glissando, as shown
in Figure 11a and Figure 11b. Analyzing the results, amplitude is not a good
candidate because the models behave similarly. In contrast, the aperiodicity models
present a different behavior. Therefore, we selected aperiodicity as the descriptor.
The details of the model construction are explained in the Building the
Models section.

Preprocessing. Before analyzing and testing our recordings, we applied two
different preprocessing techniques to the data in order to make them smoother
and ready for comparison: smoothing and envelope approximation.

1. Smoothing. As expected, the aperiodicity portion of the audio file that we
are examining includes noise. Our first concern was to remove this noise and
obtain a cleaner representation. In order to do that, we first applied a 50-step
running median smoothing. Running median smoothing is also known as
median filtering. Median filtering is widely used in digital image processing
(a) Amplitude models (b) Aperiodicity models

Fig. 11. Models for Legato and Glissando

(a) Aperiodicity (b) Smoothed Aperiodicity

Fig. 12. Features of aperiodicity

because under certain conditions, it preserves edges whilst removing noise.


In our situation, since we are interested in the edges and in removing noise,
this approach fits our purposes. With smoothing, the peak locations of the
aperiodicity curves become easier to extract. In Figure 12, the comparison
of the aperiodicity and smoothed aperiodicity graphs exemplifies the smoothing
process and shows the results we pursued.
2. Envelope Approximation. After obtaining smoother data, an envelope
approximation algorithm was applied. The core idea of the envelope approximation
is to obtain a fixed-length representation of the data, especially considering
the peaks, while avoiding small deviations by connecting these peak
approximations linearly. The envelope approximation algorithm has three
parts: peak picking, scaling of the peak positions to a fixed length, and
linearly connecting the peaks. After the envelope approximation, all data regions
we are investigating have the same length, i.e. regions are compressed
or enlarged depending on their initial size.
We collect all the peaks above a pre-determined threshold. Next, we scale
all these peak positions. For instance, imagine that our data includes 10000
bins and we want to scale it to 1000, and let us say our peak positions
are 1460, 1465, 1470, 1500 and 1501. What our algorithm does is to scale
these peak locations by dividing all of them by 10 (since we want to scale
Fig. 13. Envelope approximation of a legato portion

10000 to 1000) and round them, so they become 146, 146, 147, 150 and 150.
As shown, we have two duplicate peaks, at 146 and at 150. In order to resolve this duplicity,
we keep the ones with the highest peak value. After collecting and scaling the
peak positions, the peaks are linearly connected. As shown in Figure 13,
the obtained graph is an approximation of the graph shown in Figure 12b.
The linear approximation helps the system to avoid consecutive small tips and
dips (a code sketch of this procedure follows this list).
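A minimal sketch of this envelope approximation is given below, under the assumption that peak picking means simple local maxima above the threshold; the function and parameter names are hypothetical:

```python
import numpy as np

def envelope_approximation(data, out_len=1000, threshold=0.0):
    """Fixed-length envelope approximation: pick peaks, scale their positions
    to `out_len` bins (keeping the highest value on collisions), and connect
    the surviving peaks linearly."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # 1. peak picking: local maxima above the threshold
    peaks = [i for i in range(1, n - 1)
             if data[i] > threshold and data[i] >= data[i - 1] and data[i] >= data[i + 1]]
    # 2. scale the peak positions, keeping the largest value per scaled bin
    scaled = {}
    for i in peaks:
        pos = min(int(round(i * out_len / n)), out_len - 1)
        scaled[pos] = max(scaled.get(pos, 0.0), data[i])
    if not scaled:
        return np.zeros(out_len)
    # 3. connect the kept peaks linearly over the fixed-length axis
    xs = np.array(sorted(scaled))
    ys = np.array([scaled[x] for x in xs])
    return np.interp(np.arange(out_len), xs, ys)
```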
In our case, all the recordings were performed at 60 bpm and all the notes in the
recordings are 8th notes. That is, each note lasts half a second, and each legato or
glissando portion lasts 1 second. We recorded with a sampling rate of 44100 Hz, and
we did our analysis using a hop size of 32 samples, i.e. 44100/32 ≈ 1378 bins per second.
We knew that this was our upper limit. For the sake of simplicity, we scaled
our x-axis to 1000 bins.

Building the Models. After applying the preprocessing techniques, we obtained
equal-length aperiodicity representations of all our expressive articulation
portions. The next step was to construct models for both legato and glissando by using
these data. In this section we describe how we constructed the models shown in
Figure 11a and Figure 11b. The following steps were used to construct the models:
histogram calculation, smoothing and envelope approximation (explained
in the Preprocessing section), and finally, the SAX representation. In this section we
present the histogram calculation and the SAX representation techniques.

1. Histogram Calculation. Another method that we use is histogram
envelope calculation. We use this technique to calculate the peak density
of a set of data. Specifically, a set of recordings containing 36 legato and 36
glissando examples (recorded by a professional classical guitarist) was used as
the training set. First, for each legato and glissando example, we determined the
peaks. Since we want to model the places where condensed peaks occur, this
(a) Legato Histogram (b) Glissando Histogram

Fig. 14. Peak histograms of legato and glissando training sets

(a) Legato Final Envelope (b) Glissando Final Envelope

Fig. 15. Final envelope approximation of peak histograms of legato and glissando
training sets

time we used a threshold of 30 percent and collected the peaks with amplitude
values above this threshold. Notice that this threshold is different from the one
used in the envelope approximation. Then, we used histograms to compute the
density of the peak locations. Figure 14 shows the resulting histograms.
After constructing the histograms, as shown in Figure 14, we used our envelope
approximation method to construct the envelopes of the legato and glissando
histogram models (see Figure 15).
2. SAX: Symbolic Aggregate Approximation. Although the histogram
envelope approximations of legato and glissando in Figure 15 are close to our
purposes, they still include noisy sections. Rather than these abrupt changes
(noise), we are interested in a more general representation reflecting the
changes more smoothly. SAX (Symbolic Aggregate Approximation) [18] is
a symbolic representation used in time series analysis that provides a dimensionality
reduction while preserving the properties of the curves. Moreover,
the SAX representation makes distance measurements easier. We therefore
applied the SAX representation to the histogram envelope approximations.
(a) Legato SAX Representation (b) Glissando SAX Representation

Fig. 16. SAX representation of legato and glissando final models

As we mentioned in the Envelope Approximation section, we scaled the x-axis to
1000 bins. We made tests with step sizes of 10 and 5. As we report in the Experiments
section, a step size of 5 gave better results. We also tested step sizes lower than 5, but the performance clearly decreased. Since we are
using a step size of 5, each step becomes 100 bins in length. After obtaining
the SAX representation of each expressive articulation, we used our distance
calculation algorithm, which we explain in the next section.
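For reference, a generic SAX sketch is shown below (z-normalization, piecewise aggregate approximation, then discretization against normal-distribution breakpoints); the alphabet size and the normalization are assumptions, since they are not specified in the text:

```python
import numpy as np

# breakpoints for a 4-symbol alphabet under a standard-normal assumption
BREAKPOINTS = np.array([-0.67, 0.0, 0.67])

def sax_word(series, n_segments, breakpoints=BREAKPOINTS):
    """Minimal SAX: z-normalize, average per segment (PAA), then discretize."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)              # z-normalization
    segments = np.array_split(x, n_segments)            # PAA segments
    paa = np.array([seg.mean() for seg in segments])    # one value per segment
    return np.digitize(paa, breakpoints)                # integer symbols 0..3
```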

Detection. After obtaining the SAX representations of the glissando and legato
models, we divided them into 2 regions: a first region between bins 400 and 600,
and a second region between bins 600 and 800 (see Figure 17). For the expressive
articulation excerpt, we have an envelope approximation representation with
the same length as the SAX representation of the final models, so we can compare
the regions. For the final expressive articulation models (see Figure 16), we took
the value of each region and computed the deviation (slope) between these two
regions. We performed this computation for both the legato and glissando models
separately.
We also computed the same deviation for each expressive articulation envelope
approximation (see Figure 18). But this time, since we do not have a SAX
representation, we do not have a single value for each region. Therefore, for each
region we computed the local maximum and took the deviation (slope) between these
two local maxima. After obtaining this value, we can compare this deviation
value with the numbers that we obtained from the final models of legato and
glissando. If the deviation value is closer to the legato model, the expressive
articulation is labeled as a legato, and vice versa.
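A sketch of this decision rule is given below; the bin ranges follow the text, while the function signature and the pre-computed model slopes passed in as arguments are assumptions:

```python
def classify_articulation(excerpt_env, legato_slope, glissando_slope):
    """Label a candidate region by comparing its region-to-region slope
    with the slopes of the legato and glissando models.

    excerpt_env: fixed-length (1000-bin) envelope approximation of the
    candidate region; *_slope: deviations computed from the SAX models.
    """
    region_a = max(excerpt_env[400:600])      # local maximum of bins 400-600
    region_b = max(excerpt_env[600:800])      # local maximum of bins 600-800
    slope = region_b - region_a
    if abs(slope - legato_slope) <= abs(slope - glissando_slope):
        return "legato"
    return "glissando"
```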

4 Experiments

The goal of the experiments was to test the performance of our model.
Since different modules have been designed, and they work independently of each
other, we tested the extraction and classification modules separately. After these
separate studies, we combined the results to assess the overall performance of
the proposed system.
(a) Legato (b) Glissando

Fig. 17. Peak occurrence deviation

Fig. 18. Expressive articulation difference

As explained in Section 1, legato and glissando can be played on ascending
or descending intervals. Thus, we were interested in studying the results
distinguishing between these two movements. Additionally, since a guitar has
three nylon strings and three metallic strings, we also studied the results
taking these two sets of strings into account.

4.1 Recordings

Borrowing from Carlevaro’s guitar exercises [5], we recorded a collection of ascending
and descending chromatic scales. The legato and glissando examples were
recorded by a professional classical guitar performer. The performer was asked
to play chromatic scales in three different regions of the guitar fretboard. Specifically,
we recorded notes from the first 12 frets of the fretboard, where each
recording concentrated on 4 specific frets. The basic exercise for the first fretboard
region is shown in Figure 19.
Fig. 19. Legato Score in first position

(a) Phrase 1 (b) Phrase 2 (c) Phrase 3

(d) Phrase 4 (e) Phrase 5

Fig. 20. Short melodies

Each scale contains 24 ascending and 24 descending notes. Each exercise contains
12 expressive articulations (the ones connected with an arch). Since we
repeated the exercise at three different positions, we obtained 36 legato and 36
glissando examples. Notice that we also performed recordings with a neutral articulation
(neither legatos nor glissandos). We presented all 72 examples to
our system.
As a preliminary test with more realistic recordings, we also recorded a small
set of 5-6 note phrases. They include different articulations in random places (see
Figure 20). As shown in Table 3, each phrase includes a different number of
expressive articulations, varying from 0 to 2. For instance, Phrase 3 (see Figure 20c)
does not have any expressive articulation, whereas Phrase 4 (see Figure 20d) contains
the same notes as Phrase 3 but includes two expressive articulations: first a
legato and then an appoggiatura.

4.2 Experiments with Extraction Module

First, we analyzed the accuracy of the extraction module in identifying regions with
legatos. The hypothesis was that legatos are the easiest articulations to detect
because they are composed of two long notes. Next, we analyzed the accuracy
in identifying regions with glissandos. Because in this situation the first note (the
glissando) has a short duration, it may be confused with the attack.

Scales. We first applied our system to single expressive and non-expressive
articulations. All the recordings were hand labeled; they were also our ground
Table 2. Performance of extraction module applied to single articulations


Recordings Nylon Strings Metallic Strings
Non-expressive 90% 90%
Ascending Legatos 80% 90%
Descending Legatos 90% 70%
Ascending Glissandos 70% 70%
Descending Glissandos 70% 70%

Table 3. Results of extraction module applied to short phrases

Excerpt Name Ground Truth Detected


Phrase 1 1 2
Phrase 2 2 2
Phrase 3 0 0
Phrase 4 2 3
Phrase 5 1 1

truth. We compared the output results with the annotations. The output was the
number of detected expressive articulations in the sound fragment.
Analyzing the experiments (see Table 2), different conclusions can be extracted.
First, as expected, legatos are easier to detect than glissandos. Second,
on nylon strings the melodic direction does not cause a different performance.
Regarding steel strings, descending legatos are more difficult to detect than ascending
legatos (90% versus 70%). This result is not surprising because the
plucking action of the left-hand fingers in descending legatos is somewhat similar to
a right-hand plucking. However, this difference does not appear in glissandos
because the finger movement is the same.

Short melodies. We tested the performance of the extraction module on
the recordings of short melodies with the same settings used for the scales, except
for the release threshold. Specifically, since in the short phrase recordings the transition
parts between two notes contain more noise, the average value of the amplitude
between two onsets was higher. Because of this, the release threshold has to be
increased in a more realistic scenario. Specifically, after some experiments, we
fixed the release threshold at 30%.
Analyzing the results, the performance of our model was similar to the previous
experiments, i.e. when we analyzed single articulations. However, in two
phrases where a note was played with a soft right-hand plucking, these notes
were proposed as legato candidates (Phrase 1 and Phrase 4).
The final step of the extraction module is to annotate the sound fragments where
a possible attack articulation (legato or glissando) is detected. Specifically, to help
the system’s validation, the whole recording is presented to the user and the candidate
fragments for expressive articulations are colored. As an example, Figure 21
shows the annotation of Phrase 2 (see the score in Figure 20b). Phrase 2 has two expressive
articulations, which correspond to the portions colored in black.
Fig. 21. Annotated output of Phrase 2

4.3 Experiments with Classification Module

After testing the extraction module, we used the same audio files (this time
using only the legato and glissando examples) to test our classification module.
As explained in Section 3.2, we performed experiments applying different
step sizes for the SAX representation. Specifically (see the results reported in Table 4),
we observe that a step size of 5 is the most appropriate setting. This
result corroborates that a higher resolution when discretizing is not required
and demonstrates that the SAX representation provides a powerful technique to
summarize the information about changes.
The overall performance for legato identification was 83.3% and the overall
performance for glissando identification was 80.5%. Notice that the identification of
ascending legato reached 100% accuracy, whereas descending legato achieved
only 66.6%. Regarding glissando, there was no significant difference between
ascending and descending accuracy (83.3% versus 77.7%). Finally, analyzing the
results when considering the string type, the results presented a similar accuracy
on both nylon and metallic strings.

4.4 Experiments with the Whole System

After testing the main modules separately, we studied the performance of the
whole system using the same recordings. From our previous experiments, a step
size of 5 gave the best analysis results; therefore we ran these experiments with
a step size of 5 only.
Since we had errors in both the extraction module and the classification module,
the combined results present a lower accuracy (see the results in Table 5).
Table 4. Performance of classification module applied to test set

Recordings              Step size 5   Step size 10
Ascending Legato 100.0 % 100.0 %
Descending Legato 66.6 % 72.2 %
Ascending Glissando 83.3 % 61.1 %
Descending Glissando 77.7 % 77.7 %
Legato Nylon Strings 80.0 % 86.6 %
Legato Metallic Strings 86.6 % 85.6 %
Glissando Nylon Strings 83.3 % 61.1 %
Glissando Metallic Strings 77.7 % 77.7 %

Table 5. Performance of our whole model applied to test set

Recordings Accuracy
Ascending Legato 85.0 %
Descending Legato 53.6 %
Ascending Glissando 58.3 %
Descending Glissando 54.4 %
Legato Nylon Strings 68.0 %
Legato Metallic Strings 69.3 %
Glissando Nylon Strings 58.3 %
Glissando Metallic Strings 54.4 %

For ascending legatos we had 100% accuracy in the classification module
experiments (see Table 4), but since there was a 15% total error in detecting
ascending legato candidates in the extraction module (see Table 2), the
overall accuracy decreases to 85% (see Table 5).
Also regarding the ascending glissandos, although we reached a high accuracy
in the classification module (83.3%), because of the 70% accuracy in the
extraction module, the overall result decreased to 58.3%. Similar conclusions
can be extracted for the rest of the accuracy results.

5 Conclusions

In this paper we presented a system that combines several state-of-the-art analysis
algorithms to identify left-hand articulations such as legatos and glissandos.
Specifically, our proposal uses HFC for plucking detection and the complex domain
and YIN algorithms for pitch detection. Then, combining the data coming from
these different sources, we developed a first decision mechanism, the extraction
module, to identify regions where attack articulations may be present. Next, the
classification module analyzes the regions annotated by the extraction module
and tries to determine the articulation type. Our proposal is to use aperiodicity
information to identify the articulation and a SAX representation to characterize
the articulation models. Finally, applying a distance measure to the trained models,
articulation candidates are classified as legato or glissando.
We reported experiments to validate our proposal by analyzing a collection
of chromatic exercises and short melodies recorded by a professional guitarist.
Although we are aware that our current system may be improved, the results
showed that it is able to successfully identify and classify these two attack-based
articulations. As expected, legatos are easier to identify than glissandos.
Specifically, the short duration of a glissando is sometimes confused with a single
note attack.
As a next step, we plan to incorporate more analysis and decision components
into our system with the aim of covering all the main expressive articulations
used in guitar playing. We are currently working on improving the performance
of both modules and also on adding additional expressive resources such as vibrato
analysis. Additionally, we are exploring the possibility of dynamically changing
the parameters of the analysis algorithms, for instance, using different
parameters depending on the string where the notes are played.

Acknowledgments
This work was partially funded by NEXT-CBR (TIN2009-13692-C03-01), IL4LTS
(CSIC-200450E557) and by the Generalitat de Catalunya under the grant 2009-
SGR-1434. Tan Hakan Özaslan is a PhD student of the Doctoral Program in Information,
Communication, and Audiovisual Technologies of the Universitat Pompeu
Fabra. We also want to thank the professional guitarist Mehmet Ali Yıldırım
for his contribution with the recordings.

References
1. Arcos, J.L., López de Mántaras, R., Serra, X.: Saxex: a case-based reasoning sys-
tem for generating expressive musical performances. Journal of New Music Re-
search 27(3), 194–210 (1998)
2. Brossier, P.: Automatic annotation of musical audio for interactive systems. Ph.D.
thesis, Centre for Digital music, Queen Mary University of London (2006)
3. Brossier, P., Bello, J.P., Plumbley, M.D.: Real-time temporal segmentation of note
objects in music signals. In: Proceedings of the International Computer Music
Conference, ICMC 2004 (November 2004)
4. Burns, A., Wanderley, M.: Visual methods for the retrieval of guitarist fingering.
In: NIME 2006: Proceedings of the 2006 conference on New interfaces for musical
expression, Paris, pp. 196–199 (June 2006)
5. Carlevaro, A.: Serie didactica para guitarra. vol. 4. Barry Editorial (1974)
6. de Cheveigné, A., Kawahara, H.: Yin, a fundamental frequency estimator for speech
and music. The Journal of the Acoustical Society of America 111(4), 1917–1930
(2002)
7. Dodge, C., Jerse, T.A.: Computer Music: Synthesis, Composition, and Perfor-
mance. Macmillan Library Reference (1985)
8. Duxbury, C., Bello, J.P., Davies, M., Sandler, M.: Complex domain onset detection
for musical signals. In: Proceedings of the Digital Audio Effects Workshop
(2003)
9. Erkut, C., Valimaki, V., Karjalainen, M., Laurson, M.: Extraction of physical and
expressive parameters for model-based sound synthesis of the classical guitar. In:
108th AES Convention, pp. 19–22 (February 2000)
10. Gabrielsson, A.: Once again: The theme from Mozart’s piano sonata in A major
(K. 331). A comparison of five performances. In: Gabrielsson, A. (ed.) Action and
perception in rhythm and music, pp. 81–103. Royal Swedish Academy of Music,
Stockholm (1987)
11. Gabrielsson, A.: Expressive intention and performance. In: Steinberg, R. (ed.) Mu-
sic and the Mind Machine, pp. 35–47. Springer, Berlin (1995)
12. Gouyon, F., Herrera, P., Gómez, E., Cano, P., Bonada, J., Loscos, A., Amatriain,
X., Serra, X.: Content Processing of Music Audio Signals, pp. 83–160. Logos
Verlag, Berlin (2008), http://smcnetwork.org/public/S2S2BOOK1.pdf
13. Grachten, M., Arcos, J., de Mántaras, R.L.: A case based approach to expressivity-
aware tempo transformation. Machine Learning 65(2-3), 411–437 (2006)
14. Heijink, H., Meulenbroek, R.G.J.: On the complexity of classical guitar playing:
functional adaptations to task constraints. Journal of Motor Behavior 34(4),
339–351 (2002)
15. Johnstone, J.A.: Phrasing in piano playing. Withmark New York (1913)
16. Juslin, P.: Communicating emotion in music performance: a review and a theoret-
ical framework. In: Juslin, P., Sloboda, J. (eds.) Music and emotion: theory and
research, pp. 309–337. Oxford University Press, New York (2001)
17. Lee, N., Zhiyao, D., Smith, J.O.: Excitation signal extraction for guitar tones. In:
International Computer Music Conference, ICMC 2007 (2007)
18. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing sax: a novel symbolic rep-
resentation of time series. Data Mining and Knowledge Discovery 15(2), 107–144
(2007)
19. Lindström, E.: 5 x “Oh, my darling Clementine”: the influence of expressive intention
on music performance. Department of Psychology, Uppsala University (1992)
20. de Mantaras, R.L., Arcos, J.L.: AI and music: from composition to expressive performance.
AI Mag. 23(3), 43–57 (2002)
21. Masri, P.: Computer modeling of Sound for Transformation and Synthesis of Mu-
sical Signal. Ph.D. thesis, University of Bristol (1996)
22. Norton, J.: Motion capture to build a foundation for a computer-controlled instru-
ment by study of classical guitar performance. Ph.D. thesis, Stanford University
(September 2008)
23. Palmer, C.: Anatomy of a performance: Sources of musical expression. Music Per-
ception 13(3), 433–453 (1996)
24. Radicioni, D.P., Lombardo, V.: A constraint-based approach for annotating music
scores with gestural information. Constraints 12(4), 405–428 (2007)
25. Radisavljevic, A., Driessen, P.: Path difference learning for guitar fingering prob-
lem. In: International Computer Music Conference (ICMC 2004) (2004)
26. Sloboda, J.A.: The communication of musical metre in piano performance. Quar-
terly Journal of Experimental Psychology 35A, 377–396 (1983)
27. Trajano, E., Dahia, M., Santana, H., Ramalho, G.: Automatic discovery of right
hand fingering in guitar accompaniment. In: Proceedings of the International Com-
puter Music Conference (ICMC 2004), pp. 722–725 (2004)
28. Traube, C., Depalle, P.: Extraction of the excitation point location on a string
using weighted least-square estimation of a comb filter delay. In: Procs. of the 6th
International Conference on Digital Audio Effects, DAFx 2003 (2003)
Comparing Approaches to the Similarity of
Musical Chord Sequences

W.B. de Haas¹, Matthias Robine², Pierre Hanna²,
Remco C. Veltkamp¹, and Frans Wiering¹

¹ Utrecht University, Department of Information and Computing Sciences
PO Box 80.089, 3508 TB Utrecht, The Netherlands
{bas.dehaas,remco.veltkamp,frans.wiering}@cs.uu.nl
² LaBRI - Université de Bordeaux
F-33405 Talence cedex, France
{pierre.hanna,matthias.robine}@labri.fr

Abstract. We present a comparison between two recent approaches to
the harmonic similarity of musical chord sequences. In contrast to earlier
work that mainly focuses on the similarity of musical notation or musical
audio, in this paper we specifically use the symbolic chord description
as the primary musical representation. For an experiment, a large chord
sequence corpus was created. In this experiment we compare a geometrical
and an alignment approach to harmonic similarity, and measure
the effects of chord description detail and a priori key information on
retrieval performance. The results show that an alignment approach significantly
outperforms a geometrical approach in most cases, but that the
geometrical approach is computationally more efficient than the alignment
approach. Furthermore, the results demonstrate that a priori key
information boosts retrieval performance, and that using a triadic chord
representation yields significantly better results than a simpler or more
complex chord representation.

Keywords: Music Information Retrieval, Musical Harmony, Similarity,
Chord Description, Evaluation, Ground-truth Data.

1 Introduction

In the last decades Music Information Retrieval (MIR) has evolved into a broad
research area that aims at making large repositories of digital music maintain-
able and accessible. Within MIR research two main directions can be discerned:
symbolic music retrieval and the retrieval of musical audio. The first direction
traditionally uses score-based representations to research typical retrieval prob-
lems. One of the most important and most intensively studied of these is prob-
ably the problem of determining the similarity of a specific musical feature, e.g.
melody, rhythm, etc. The second direction–musical audio retrieval–extracts fea-
tures from the audio signal and uses these features for estimating whether two
pieces of music share certain musical properties. In this paper we focus on a

musical representation that is symbolic but can be automatically derived from
musical audio with reasonable effectiveness: chord descriptions.
Only recently, partly motivated by the growing interest in audio chord finding,
have MIR researchers started using chord descriptions as the principal representation
for modeling music similarity. Naturally, these representations are specifically
suitable for capturing the harmonic similarity of a musical piece. However,
determining the harmonic similarity of sequences of chord descriptions gives rise
to three questions. First, what is harmonic similarity? Second, why do we need
harmonic similarity? Last, do sequences of chord descriptions provide a valid
and useful abstraction of the musical data for determining music similarity? The
first two questions we address in this introduction; the third question we
answer empirically in a retrieval experiment. In this experiment we compare
a geometrical and an alignment-based harmonic similarity measure.
The first question, what is harmonic similarity, is difficult to answer. We strongly
believe that if we want to model what makes two pieces of music similar, we must
not only look at the musical data, but especially at the human listener. It is important
to realize that music only becomes music in the mind of the listener, and
probably not all information needed for a good similarity judgment can be found in
the data alone. Human listeners, musicians or non-musicians, have extensive culture-dependent
knowledge about music that needs to be taken into account when judging
music similarity.
In this light we consider the harmonic similarity of two chord sequences to
be the degree of agreement between structures of simultaneously sounding notes
(i.e. chords) and the agreement between global as well as local relations between
these structures in both sequences, as perceived by the human listener. By the
agreement between structures of simultaneously sounding notes we denote the
similarity that a listener perceives when comparing two chords in isolation and
without the surrounding musical context. However, chords are rarely compared
in isolation, and the relations to the global context, the key of a piece, and
to the local context play a very important role in the perception
of tonal harmony. The local relations can be considered the relations between the
functions of chords within a limited time frame, for instance the preparation of
a chord with a dominant function by a sub-dominant. All these factors play a
role in the perception of tonal harmony and should be shared by two compared
pieces to a certain extent if they are to be considered similar.
The second question about the usefulness of harmonic similarity is easier to
answer, since music retrieval based on harmony sequences offers various benefits.
It allows for finding different versions of the same song even when melodies vary.
This is often the case in cover songs or live performances, especially when these
performances contain improvisations. Moreover, playing the same harmony with
different melodies is an essential part of musical styles like jazz and blues. Also,
variations over standard basses in baroque instrumental music can be harmoni-
cally very related.
The application of harmony matching methods is broadened further by the
extensive work on chord description extraction from musical audio data within
the MIR community, e.g. [20,5]. Chord labeling algorithms extract symbolic
chord labels from musical audio: these labels can be matched directly using the
algorithms covered in this paper.
If you were to ask a jazz musician the third question, whether sequences
of chord descriptions are useful, they would probably agree that they are,
since working with chord descriptions is everyday practice in jazz. However, we
will show in this paper that they are also useful for retrieving pieces with a similar
but not identical chord sequence by performing a large experiment. In this
experiment we compare two harmonic similarity measures, the Tonal Pitch Step
Distance (TPSD) [11] and the Chord Sequence Alignment System (CSAS) [12],
and test the influence of different degrees of detail in the chord description and
of knowledge of the global key of a piece on retrieval performance.
The next section gives a brief overview of the current achievements in chord
sequence similarity matching and harmonic similarity in general, Section 3 describes
the data used in the experiment, and Section 4 presents the results.

Contribution. This paper presents an overview of chord sequence based harmonic
similarity. Two harmonic similarity approaches are compared in an experiment.
For this experiment a new large corpus of 5028 chord sequences was
assembled. Six retrieval tasks are defined for this corpus, to which both algorithms
are subjected. All tasks use the same dataset, but differ in the amount of
chord description detail and in the use of a priori key information. The results
show that a computationally costly alignment approach significantly outperforms
a much faster geometrical approach in most cases, that a priori key information
boosts retrieval performance, and that using a triadic chord representation yields
significantly better results than a simpler or more complex chord representation.

2 Background: Similarity Measures for Chord Sequences

The harmonic similarity of musical information has been investigated by many
authors, but the number of systems that focus solely on the similarity of chord
sequences is much smaller. Of course it is always possible to convert notes into
chords and vice versa, but this is not a trivial task. Several algorithms can correctly
segment and label approximately 80 percent of a symbolic dataset (see [26]
for a review). Within the audio domain, hidden Markov models are frequently
used for chord label assignment, e.g. [20,5]. The algorithms considered in this
paper abstract from these labeling tasks and focus on the similarity between
chord progressions only. As a consequence, we assume that we have a sequence
of symbolic chord labels describing the chord progression in a piece of music.
The systems currently known to us that are designed to match these sequences
of symbolic chord descriptions are the TPSD [11], the CSAS [12] and a harmony
grammar approach [10]. The first two are quantitatively compared in this paper
and are introduced in the next two subsections, respectively. They have been
compared before, but all previous evaluations of TPSD and CSAS were done
with relatively small datasets (<600 songs) and the dataset used in [12] was
[Figure 1 plot: TPS Score (y-axis) per Beat (x-axis) for two versions of All The Things You Are]

Fig. 1. A plot demonstrating the comparison of two similar versions of All the Things
You Are using the TPSD. The total area between the two step functions, normalized
by the duration of the shortest song, represents the distance between both songs. A
minimal area is obtained by shifting one of the step functions cyclically.

different from the one used in [11]. The harmony grammar approach could, at
the time of writing, not compete in this experiment because in its current state
it is as yet unable to parse all the songs in the dataset used.
The next section introduces the TPSD and the improvements of the implementation
used for the experiment here over the implementation in [11]. Section
2.2 highlights the different variants of the CSAS. The main focus of this
paper is on the similarity of sequences of chord labels, but there exist other
relevant harmony-based retrieval methods: some of these are briefly reviewed in
Section 2.3.

2.1 Tonal Pitch Step Distance

The TPSD uses Lerdahl’s [17] Tonal Pitch Space (TPS) as its main musical
model. TPS is a model of tonality that fits musicological intuitions, correlates
well with empirical findings from music cognition [16] and can be used to calculate
a distance between two arbitrary chords. The TPS model can be seen as a
scoring mechanism that takes into account the number of steps on the circle of
fifths between the roots of the chords, and the amount of overlap between the
chord structures of the two chords and their relation to the global key.
The general idea behind the TPSD is to use the TPS to compare the change
of chordal distance to the tonic over time. For every chord, the TPS distance
between the chord and the key of the sequence is calculated, which results in
a step function (see Figure 1). As a consequence, information about the key
of the piece is essential. Next, the distance between two chord sequences is defined
as the minimal area between the two step functions over all possible horizontal
circular shifts. To prevent longer sequences from yielding larger distances,
the score is normalized by dividing it by the duration of the shortest song.
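A naive sketch of such a distance is given below (it is not the optimized matching algorithm described next); comparing only the overlapping part of the two step functions is an assumption made for brevity:

```python
import numpy as np

def tpsd_distance(steps_a, steps_b):
    """Minimal area between two per-beat TPS step functions over all cyclic
    shifts of the shorter one, normalized by its duration (in beats)."""
    a, b = np.asarray(steps_a, float), np.asarray(steps_b, float)
    if len(a) > len(b):
        a, b = b, a                               # let `a` be the shorter sequence
    best = np.inf
    for shift in range(len(a)):
        area = np.abs(np.roll(a, shift) - b[:len(a)]).sum()
        best = min(best, area)
    return best / len(a)
```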
The TPS is an elaborate model that allows one to compare any chord in
an arbitrary key to every other possible chord in any key. The TPSD does not
use the complete model and only utilizes the parts that facilitate the comparison
of two chords within the same key. In the current implementation of the TPSD
time is represented in beats, but generally any discrete representation could be
used.
The TPSD version used in this paper contains a few improvements compared
to the version used in [11]: by applying a different step-function matching algorithm
from [4], and by exploiting the fact that we use discrete time units,
which enables us to sort in linear time using counting sort [6], a running time of
O(nm) is achieved, where n and m are the number of chord symbols in both
songs. Furthermore, to be able to use the TPSD in situations where a priori key
information is not available, the TPSD is extended with a key-finding algorithm.

Key finding. The problem of finding the global key of a piece of music is called
key finding. For this study, this is done on the basis of chord information only.
The rationale behind the key-finding algorithm that we present here is the following:
we consider the key that minimizes the total TPS distance and best
matches the starting and ending chord to be the key of the piece.
For minimizing the total TPS distance, the TPSD key finding uses TPS-based
step functions as well. We assume that when a song matches a particular key, the
TPS distances between the chords and the tonic of that key are relatively small.
The general idea is to calculate 24 step functions for a single chord sequence, one
for each major and minor key. Subsequently, all these keys are ranked by sorting
them on the area between their TPS step function and the x-axis; the smaller
the total area, the better the key fits the piece, and the higher the rank. Often,
the key at the top rank is the correct key. However, among the false positives at
rank one, unsurprisingly, the IV, V and VI relative to the ground-truth key¹
are found regularly. This makes sense because, when the total of the TPS distances
of the chords to C is small, the distances to F, G and Am might be small as well.
Therefore, to increase performance, an additional scoring mechanism is designed
that takes into account the IV, V and VI relative to the candidate key. Of
all 24 keys, the candidate key that minimizes the following sum S is considered
the key of the piece.


S = α·r(I) + r(IV) + r(V) + r(VI) + {β if the first chord matches the key} + {β if the last chord matches the key}    (1)

Here r(·) denotes the rank of the candidate key, the parameter α determines
how important the tonic is compared to other frequently occurring scale degrees,
and β controls the importance of the key matching the first and last chord. The
parameters α and β were tuned by hand; an α of 2 and a β of 4 were
found to give good results. Clearly, this simple key-finding algorithm is biased
¹ The Roman numerals here represent the diatonic interval between the key in the
ground truth and the predicted key.
towards western diatonic music, but for the corpus used in this paper it performs
quite well. The algorithm scores 88.8 percent correct on a subset of 500 songs
of the corpus used in the experiment below for which we manually checked the
correctness of the ground-truth key. The above algorithm takes O(n) time, where
n is the number of chord symbols, because the number of keys is constant.
Root interval step functions. For the tasks where only the chord root is
used, we use a different step function representation (see Section 4). In these
tasks the interval between the chord root and the root note of the key defines
the step height, and the duration of the chord again defines the step length. This
matching method is very similar to the melody matching approach by Aloupis
et al. [2]. Note that the latter was never tested in practice. The matching and
key-finding methods are not different from the other variants of the TPSD. Note
that in all TPSD variants, chord inversions are ignored.

2.2 Chord Sequence Alignment System


The CSAS algorithm is based on local alignment and computes similarity scores
between sequences of symbols representing chords or distances between chords
and key. String matching techniques can be used to quantify the differences between
two such sequences. Among several existing methods, Smith and Waterman’s
approach [25] detects similar areas in two sequences of arbitrary symbols.
This local alignment or local similarity algorithm locates and extracts a pair of
regions, one from each of the two given strings, that exhibit high similarity. A
similarity score is calculated by performing elementary operations transforming
one string into the other. The operations used to transform the sequences are the
deletion, insertion or substitution of a symbol. The total transformation of one
string into the other can be computed with dynamic programming in quadratic
time.
The following example illustrates local alignment by computing a distance
between the first chords of two variants of the song All The Things You Are
considering only the root notes of the chords. The I, S and D denote Insertion,
Substitution, and Deletion of a symbol, respectively. An M represents a matching
symbol:

string 1:   −   F   B♭  E♭  A♭  −   D♭  D   G   C   C
string 2:   F   F   B♭  A♭  A♭  A♭  D♭  D   −   C   C
operation:  I   M   M   S   M   I   M   M   D   M   M
score:      −1  +2  +2  −2  +2  −1  +2  +2  −1  +2  +2

Algorithms based on local alignment have been successfully adapted for melodic
similarity [21,13,15], and recently they have been used to determine harmonic similarity
[12] as well. Two steps are necessary to apply the alignment technique to the
comparison of chord progressions: the choice of the representation of a chord sequence,
and the scores of the elementary operations between symbols. To take the
durations of the chords into account, we represent the chords at every beat. The
algorithm therefore has a complexity of O(nm), where n and m are the sizes of
the compared songs in beats. The score function can either be adapted to the cho-
sen representation or can simply be binary, i.e. the score is positive (+2) if the
two chords described are identical, and negative (−2) otherwise. The insertion or
deletion score is set to −1.
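The following sketch illustrates the local alignment computation with exactly these scores (+2 for a match, −2 for a substitution, −1 for an insertion or deletion); it is a plain Smith-Waterman-style dynamic program for illustration, not the optimized CSAS implementation.

```python
# Smith-Waterman-style local alignment with the binary scores used above:
# +2 for identical chord symbols, -2 for a substitution, -1 for insertion/deletion.
# Illustrative sketch, not the CSAS implementation.

def local_alignment_score(seq_a, seq_b, match=2, mismatch=-2, gap=-1):
    n, m = len(seq_a), len(seq_b)
    # H[i][j] holds the best score of a local alignment ending at positions i and j.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment here
                          H[i - 1][j - 1] + sub,  # match or substitution
                          H[i - 1][j] + gap,      # deletion from seq_a
                          H[i][j - 1] + gap)      # insertion into seq_a
            best = max(best, H[i][j])
    return best

# Root sequences of the two variants from the example above:
variant_1 = ['F', 'Bb', 'Eb', 'Ab', 'Db', 'D', 'G', 'C', 'C']
variant_2 = ['F', 'F', 'Bb', 'A', 'Ab', 'A', 'Db', 'D', 'C', 'C']
print(local_alignment_score(variant_1, variant_2))  # -> 10
```

Note that a strictly local alignment may leave leading or trailing symbols unaligned without penalty, so the optimal local score can be slightly higher than the score of the full transformation tabulated above.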

Absolute representation. One way of representing a chord sequence is to


simply represent the chord progression as a sequence of absolute root notes and in
that case prior knowledge of the key is not required. An absolute representation
of the chord progression of the 8 first bars of the song All The Things You Are
is then:
F, Bb, Eb, Ab, Db, D, G, C, C
In this case, the substitution scores may be determined by considering the differ-
ence in semitones, the number of steps on the circle of fifths between the roots, or
by the consonance of the interval between the roots, as described in [13]. For in-
stance, the cost of substituting a C with a G (fifth) is lower than the substitution
of a C with a D (second). Taking into account the mode in the representation
can affect the score function as well: a substitution of a C for a Dm is different
from a substitution of a C for a D, for example. If the two modes are identical,
one may slightly increase the similarity score, and decrease it otherwise. Another
possible representation of the chord progression is a sequence of absolute pitch
sets. In that case one can use musical distances between chords, like Lerdahl’s
TPS model [17] or the distance introduced by Paiement et al. [22], as a score
function for substitution.
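Two of the substitution-cost options mentioned above can be made concrete as follows; the exact cost functions used in the CSAS are not spelled out here, so the functions below are only an illustrative sketch.

```python
# Assumed forms, for illustration only: minimal semitone difference between two roots,
# and the number of steps between them on the circle of fifths.

PITCH_CLASS = {'C': 0, 'Db': 1, 'D': 2, 'Eb': 3, 'E': 4, 'F': 5,
               'Gb': 6, 'G': 7, 'Ab': 8, 'A': 9, 'Bb': 10, 'B': 11}

def semitone_distance(root_a, root_b):
    d = abs(PITCH_CLASS[root_a] - PITCH_CLASS[root_b]) % 12
    return min(d, 12 - d)          # 0 (same root) .. 6 (tritone)

def fifths_distance(root_a, root_b):
    # Multiplying a semitone interval by 7 (mod 12) converts it into fifth steps,
    # because one fifth equals 7 semitones.
    steps = (PITCH_CLASS[root_b] - PITCH_CLASS[root_a]) * 7 % 12
    return min(steps, 12 - steps)  # e.g. C-G -> 1, C-D -> 2, C-Gb -> 6
```

Under the circle-of-fifths measure, substituting C with G is cheaper than substituting C with D (distances 1 and 2), in line with the example given above.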

Key-relative representation. If key information is known beforehand, a chord


can be represented as a distance to this key. The distance can be expressed in
various ways: in semitones, or as the number of steps on the circle of fifths
between the root of the chord and the tonic of the key of the song, or with more
complex musical models, such as TPS. If in this case the key is Ab and the chord
is represented by the difference in semitones, the representation of the chord
progression of the first eight bars of the song All The Things You Are will be:

3, 2, 5, 0, 5, 6, 1, 4, 4

If all the notes of the chords are taken into account, the TPS or Paiement
distances can be used between the chords and the triad of the key to construct
the representation. The representation is then a sequence of distances, and we use
an alignment between these distances instead of between the chords themselves.
This representation is very similar to the representation used in the TPSD. The
score functions used to compare the resulting sequences can then be binary, or
linear in similarity regarding the difference observed in the values.
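The semitone variant of this representation can be read as the minimal pitch-class distance between each chord root and the tonic; the small sketch below (an assumed but consistent reading) reproduces the sequence given above for the Ab-major opening of All The Things You Are.

```python
# Sketch of the key-relative representation in semitones: each chord root is replaced
# by its minimal pitch-class distance to the tonic of the key (Ab in the example).

PITCH_CLASS = {'C': 0, 'Db': 1, 'D': 2, 'Eb': 3, 'E': 4, 'F': 5,
               'Gb': 6, 'G': 7, 'Ab': 8, 'A': 9, 'Bb': 10, 'B': 11}

def key_relative(chord_roots, key_root):
    out = []
    for root in chord_roots:
        d = abs(PITCH_CLASS[root] - PITCH_CLASS[key_root]) % 12
        out.append(min(d, 12 - d))
    return out

roots = ['F', 'Bb', 'Eb', 'Ab', 'Db', 'D', 'G', 'C', 'C']
print(key_relative(roots, 'Ab'))   # -> [3, 2, 5, 0, 5, 6, 1, 4, 4]
```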

Transposition invariance. In order to be robust to key changes, two iden-


tical chord progressions transposed in different keys have to be considered as
similar. The usual way to deal with this issue [27] is to choose a chord represen-
tation which is transposition invariant. A first option is to represent transitions
between successive chords, but this has been proven to be less accurate when
applied to alignment algorithms [13]. Another option is to consider a key rela-
tive representation, like the representation described above which is by definition
transposition invariant. However, this approach is not robust against local key
changes. With an absolute representation of chords, we use an adaptation of
the local alignment algorithm proposed in [1], which can take into account an
unlimited number of local transpositions and can be applied to representations
of chord progressions to account for modulations.
Depending on the choice of the representation and the score function, several
variants are possible in order to build an algorithm for harmonic similarity. In
Section 4 we explain the different representations and scoring functions used in
the different tasks of the experiment and their effects on retrieval performance.

2.3 Other Methods for Harmonic Similarity


The third harmonic similarity measure using chord descriptions is a generative
grammar approach [10]. The authors use a generative grammar of tonal har-
mony to parse the chord sequences, which results in parse trees that represent
harmonic analyses of these sequences. Subsequently, a tree that contains all the
information shared by the two parse trees of two compared songs is constructed
and several properties of this tree can be analyzed yielding several similarity
measures. Currently, the parser can reject a sequence of chords as being ungrammatical.
Another interesting retrieval system based on harmonic similarity is the one
developed by Pickens and Crawford [23]. Instead of describing a musical segment
with one chord, they represent a musical segment as a vector describing the ‘fit’
between the segment and every major and minor triad. This system then uses
a Markov model to model the transition distributions between these vectors for
every piece. Subsequently, these Markov models are ranked using the Kullback-
Leibler divergence. It would be interesting to compare the performance of these
systems to the algorithms tested here in the future.
Other interesting work has been done by Paiement et al. [22]. They define a
similarity measure for chords rather than for chord sequences. Their similarity
measure is based on the sum of the perceived strengths of the harmonics of the
pitch classes in a chord, resulting in a vector of twelve pitch classes for each
musical segment. Paiement et al. subsequently define the distance between two
chords as the Euclidean distance between two of these vectors that correspond
to the chords. Next, they use a graphical model to model the hierarchical depen-
dencies within a chord progression. In this model they used their chord similarity
measure for the calculation of the substitution probabilities between chords.

3 A Chord Sequence Corpus


The Chord Sequence Corpus used in the experiment consists of 5,028 unique
human-generated Band-in-a-Box files that are collected from the Internet. Band-
in-a-Box is a commercial software package [9] that is used to generate musical
Table 1. A leadsheet of the song All The Things You Are. A dot represents a beat, a
vertical bar represents a bar line, and the chord labels are presented as written in the Band-
in-a-Box file.

|Fm7 . . . |Bbm7 . . . |Eb7 . . . |AbMaj7 . . . |


|DbMaj7 . . . |Dm7b5 . G7b9 . |CMaj7 . . . |CMaj7 . . . |
|Cm7 . . . |Fm7 . . . |Bb7 . . . |Eb7 . . . |
|AbMaj7 . . . |Am7b5 . D7b9 . |GMaj7 . . . |GMaj7 . . . |
|A7 . . . |D7 . . . |GMaj7 . . . |GMaj7 . . . |
|Gbm7 . . . |B7 . . . |EMaj7 . . . |C+ . . . |
|Fm7 . . . |Bbm7 . . . |Eb7 . . . |AbMaj7 . . . |
|DbMaj7 . . . |Dbm7 . Gb7 . |Cm7 . . . |Bdim . . . |
|Bbm7 . . . |Eb7 . . . |AbMaj7 . . . |. . . . |

accompaniment based on a lead sheet. A Band-in-a-Box file stores a sequence


of chords and a certain style, whereupon the program synthesizes and plays a
MIDI-based accompaniment. A Band-in-a-Box file therefore contains a sequence
of chords, a melody, a style description, a key description, and some information
about the form of the piece, i.e. the number of repetitions, intro, outro etc.
For extracting the chord label information from the Band-in-a-Box files we have
extended software developed by Simon Dixon and Matthias Mauch [19].
Throughout this paper we have been referring to chord labels or chord de-
scriptions. To rule out any possible vagueness, we adopt the following definition
of a chord: a chord always consists of a root, a chord type and an optional in-
version. The root note is the fundamental note upon which the chord is built,
usually as a series of ascending thirds. The chord type (or quality) is the set of
intervals relative to the root that make up the chord and the inversion is defined
as the degree of the chord that is played as bass note. One of the most distinctive
features of the chord type is its mode, which can either be major or minor.
Although a chord label always describes these three properties, root, chord
type and inversion, musicians and researchers use different syntactical systems
to describe them, and also Band-in-a-Box uses its own syntax to represent the
chords. Harte et al. [14] give an in-depth overview of the problems related to
representing chords and suggest an unambiguous syntax for chord labels. An
example of a chord sequence as found in a Band-in-a-Box file describing the
chord sequence of All the Things You Are is given in Table 1.
All songs of the chord sequence corpus were collected from various Internet
sources. These songs were labeled and automatically checked for having a unique
chord sequence. All chord sequences describe complete songs and songs with
fewer than 3 chords or shorter than 16 beats were removed from the corpus in
an earlier stage. The titles of the songs, which function as a ground-truth, as well
as the correctness of the key assignments, were checked and corrected manually.
The style of the songs is mainly jazz, latin and pop.
Within the collection, 1775 songs contain two or more similar versions, forming
691 classes of songs. Within a song class, songs have the same title and share
a similar melody, but may differ in a number of ways. They may, for instance,
differ in key and form, they may differ in the number of repetitions, or have a

Table 2. The distribution of the song class sizes in the Chord Sequence Corpus

Class Size   Frequency   Percent
1              3,253       82.50
2                452       11.46
3                137        3.47
4                 67        1.70
5                 25         .63
6                  7         .18
7                  1         .03
8                  1         .03
10                 1         .03
Total          5,028      100

special introduction or ending. The richness of the chord descriptions may also
diverge, i.e. a C7(9,13) may be written instead of a C7, and common substitutions
frequently occur. Examples of the latter are relative substitution, i.e. Am instead
of C, or tritone substitution, e.g. F#7 instead of C7 . Having multiple chord
sequences describing the same song allows for setting up a cover-song finding
experiment. The title of the song is used as ground-truth and the retrieval
challenge is to find the other chord sequences representing the same song.
The distribution of the song class sizes is displayed in Table 2 and gives an
impression of the difficulty of the retrieval task. Generally, Table 2 shows that
the song classes are relatively small and that for the majority of the queries there
is only one relevant document to be found. It furthermore shows that 82.5% of
the song classes consist of a single song, which is in the corpus for distraction only. The chord sequence corpus is
available to the research community on request.

4 Experiment: Comparing Retrieval Performance

We compared the TPSD and the CSAS in six retrieval tasks. For this experiment
we used the chord sequence corpus described above, which contains sequences
that clearly describe the same song. For each of these tasks the experimental
setup was identical: all songs that have two or more similar versions were used as
a query, yielding 1775 queries. For each query a ranking was created by sorting
the other songs on their TPSD and CSAS scores and these rankings and the
runtimes of the compared algorithms were analyzed.

4.1 Tasks

The tasks, summarized in Table 3, differed in the level of chord information used
by the algorithms and in the usage of a priori global key information. In tasks 1-3
no key information was presented to the algorithms and in the remaining three
tasks we used the key information, which was manually checked for correctness,
as stored in the Band-in-a-Box files. The tasks 1-3 and 4-6 furthermore differed
in the amount of chord detail that was presented to the algorithms: in tasks 1

Table 3. The TPSD and CSAS are compared in six different retrieval tasks

Task nr. Chord Structure Key Information


1 Roots Key inferred
2 Roots + triad Key inferred
3 Complete Chord Key inferred
4 Roots Key as stored in the Band-in-a-Box file
5 Roots + triad Key as stored in the Band-in-a-Box file
6 Complete Chord Key as stored in the Band-in-a-Box file

and 4 only the root note of the chord was available to the algorithms, in tasks
2 and 5 the root and the triad were available and in tasks 3 and 6 the complete
chord as stored in the Band-in-a-Box file was presented to the algorithms.
The different tasks required specific variants of the tested algorithms. For tasks
1-3 the TPSD used the TPS key finding algorithm as described in Section 2.1. For
the tasks 1 and 4, involving only chord roots, a simplified variant of the TPSD
was used, for the tasks 2, 3, 5 and 6 we used the regular TPSD, as described in
Section 2.1 and [11].
To measure the impact of the chord representation and substitution functions
on retrieval performance, different variants of the CSAS were built also. In some
cases the choices made did not yield the best possible results, but they allow the
reader to understand the effects of the parameters used on retrieval performance.
The CSAS algorithms in tasks 1-3 all used an absolute representation and the
algorithms in tasks 4-6 used a key relative representation. In tasks 4 and 5 the
chords were represented as the difference in semitones to the root of the key of
the piece and in task 6 as the Lerdahl’s TPS distance between the chord and the
triad from the key (as in the TPSD). The CSAS variants in tasks 1 and 2 used
a consonance-based substitution function, whereas the algorithms in tasks 4-6 used a binary
substitution function. In tasks 2 and 5 a binary substitution function
for the mode was used as well: if the mode of the substituted chords matched,
no penalty was given, if they did not match, a penalty was given.
A last parameter that was varied was the use of local transpositions. The
CSAS variants applied in tasks 1 and 3 did not consider local transpositions, but
the CSAS algorithm used in task 2 did allow local transpositions (see Section 2.2
for details).
The TPSD was implemented in Java and the CSAS was implemented in C++,
but a small Java program was used to parallelize the matching process. All runs
were done on an Intel Xeon quad-core CPU at 1.86 GHz with 4 GB
of RAM running 32-bit Linux. Both algorithms were parallelized to optimally
use the multiple cores of the CPUs.

4.2 Results
For each task and for each algorithm we analyzed the rankings of all 1775 queries
with 11-point precision recall curves and Mean Average Precision (MAP). Figure 2
displays the interpolated average precision and recall chart for the TPSD and
the CSAS for all tasks listed in Table 3.

[Figure 2: two panels, key inferred (left) and key relative (right), plotting interpolated average precision against recall from 0.0 to 1.0; separate curves for the TPSD and the CSAS using roots only (tasks 1 and 4), roots + triad (tasks 2 and 5), and complete chords (tasks 3 and 6).]

Fig. 2. The 11-point interpolated precision and recall charts for the TPSD and the CSAS for tasks 1–3, on the left, and 4–6 on the right

We calculated the interpolated average precision as in [18] and probed it at 11 different recall levels. In all evaluations
the query was excluded from the analyzed rankings. In tasks 2 and 4-6 the CSAS
outperforms the TPSD and in tasks 1 and 3 the TPSD outperforms the CSAS.
The curves all have a very similar shape; this is probably due to the specific
sizes of the song classes and the fairly limited amount of large song classes (see
Table 2).
In Figure 3 we present the MAP and the runtimes of the algorithms on two
different axes. The MAP is displayed on the left axis and the runtimes are shown
on right axis that has an exponential scale doubling the amount of time at every
tick. The MAP is a single-figure measure, which measures the precision at all re-
call levels and approximates the area under the (uninterpolated) precision recall
graph [18]. Having a single measure of retrieval quality makes it easier to evalu-
ate the significance of the differences between results. We tested whether there
were significant differences in MAP by performing a non-parametric Friedman
test, with a significance level of α = .05. We chose the Friedman test because the
underlying distribution of the data is unknown and in contrast to an ANOVA
the Friedman test does not assume a specific distribution of variance. There were significant differences between the 12 runs², χ²(11, N = 1775) = 2,618, p < .0001.
To determine which of the pairs of measurements differed significantly we con-
ducted a post hoc Tukey HSD test³. Unlike a regular t-test, the Tukey HSD
test can be safely used for comparing multiple means [7]. A summary of the an-
alyzed confidence intervals is given in Table 4. Significant and non-significant
differences are denoted with +’s and –’s, respectively.
2 The 12 different runs introduce 11 degrees of freedom and 1775 individual queries were examined per run.
3 All statistical tests were performed in Matlab 2009a.
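For reference, the two retrieval measures used in this evaluation follow the standard definitions in [18]; the sketch below computes, for a single query, the 11-point interpolated precision values and the (uninterpolated) average precision whose mean over all queries gives the MAP.

```python
# Sketch of the evaluation measures, following the standard definitions in [18],
# for a single query; the MAP is the mean of average_precision over all queries.

def precision_recall_points(ranking, relevant):
    """(recall, precision) after each retrieved document, for one query."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

def eleven_point_interpolated(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```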

[Figure 3: grouped bar chart of Mean Average Precision (left axis, 0 to 1) and run time in hours:minutes (right axis, doubling at every tick from 0:05 to 341:20) for the TPSD and CSAS variants of tasks 1-6; the key inferred tasks are shown on the left and the key relative tasks on the right.]

Fig. 3. The MAP and Runtimes of the TPSD and the CSAS. The MAP is displayed on
the left axis and the runtimes are displayed on an exponential scale on the right axis.
On the left side of the chart the key inferred tasks are displayed and the key relative
tasks are displayed on the right side.

The overall retrieval performance of all algorithms on all tasks can be con-
sidered good, but there are some large differences between tasks and between
algorithms, both in performance and in runtime. With a MAP of .70 the over-
all best performing setup was the CSAS using triadic chord descriptions and
a key relative representation (task 5). The TPSD also performs best on task
5 with an MAP of .58. In tasks 2 and 4-6 the CSAS significantly outperforms
the TPSD. On tasks 1 and 3 the TPSD outperforms the CSAS in runtime as
well as performance. For these two tasks, the results obtained by the CSAS are
significantly lower because local transpositions are not considered. These results
show that taking into account transpositions has a high impact on the quality
of the retrieval system, but also on the runtime.
The retrieval performance of the CSAS is good, but comes at a price. On
average over six of the twelve runs, the CSAS runs need about 136 times as
much time to complete as the TPSD. The TPSD takes about 30 minutes to 1.5
hours to match all 5028 pieces, while the CSAS takes about 2 to 9 days. Due to
the fact that the CSAS run in task 2 takes 206 hours to complete, there was not
enough time to perform a run on tasks 1 and 3 with the CSAS variant that takes
local transpositions into account.

Table 4. This table shows for each pair of runs if the mean average precision, as
displayed in Figure 3 differed significantly (+) or not (–)

                            Key inferred                          Key information available
                task1  task2  task3  task1  task2  task3   task4  task5  task6  task4  task5
                TPSD   TPSD   TPSD   CSAS   CSAS   CSAS    TPSD   TPSD   TPSD   CSAS   CSAS
Key         task2 TPSD   +
inferred    task3 TPSD   –      +
            task1 CSAS   +      +      +
            task2 CSAS   +      +      +      +
            task3 CSAS   +      +      +      +      +
Key         task4 TPSD   +      –      +      +      +      +
information task5 TPSD   +      +      +      +      +      +       +
available   task6 TPSD   +      –      +      +      +      +       –      +
            task4 CSAS   +      +      +      +      +      +       +      –      –
            task5 CSAS   +      +      +      +      +      +       +      +      +      +
            task6 CSAS   +      +      +      +      –      +       +      +      +      +      –

In task 6 both algorithms represent the chord sequences as TPS distances to


the triad of the key. Nevertheless, the TPSD is outperformed by the CSAS. This
difference as well as other differences in performance might well be explained by
the insertion and deletion operations in the CSAS algorithm: if one takes two
identical pieces and inserts one arbitrary extra chord somewhere in the middle of
the piece, an asynchrony is created between the two step functions which has a
large effect on the estimated distance, while the CSAS distance only gains one
extra deletion score.
For the CSAS algorithm we did a few additional runs that are not reported
here. These runs showed that the difference in retrieval performance using dif-
ferent substitution costs (binary, consonance or semitones) is limited.
The runs in which a priori key information was available performed better,
regardless of the task or algorithm (compare tasks 1 and 4, 2 and 5, and 3 and
6 for both algorithms in Table 4). This was to be expected because there are
always errors in the key finding, which hampers the retrieval performance.
The amount of detail in the chord description has a significant effect on the
retrieval performance of all algorithms. In almost all cases, using only the triadic
chord description for retrieval yields better results than using only the root or the
complex chord descriptions. Only the difference in CSAS performance between
using complex chords or triads is not significant in tasks 5 and 6. The differences
between using only the root or using the complete chord are smaller and not
always significant.
Thus, although colorful additions to chords may sound pleasant to the hu-
man ear, they are not always beneficial for determining the similarity between
the harmonic progressions they represent. There might be a simple explanation
for these differences in performance. Using only the root of a chord already
leads to good retrieval results, but by removing the information about the
mode one loses information that can aid in boosting the retrieval performance.

On the other hand keeping all rich chord information seems to distract the
evaluated retrieval systems. Pruning the chord structure down to the triad might
be seen as a form of syntactical noise-reduction, since the chord additions, if they
do not have a voice leading function, have a rather arbitrary character and just
add some harmonic spice.

5 Concluding Remarks
We performed a comparison of two different chord sequence similarity measures,
the Tonal Pitch Space Distance (TPSD) and the Chord Sequence Alignment
System (CSAS), on a large newly assembled corpus of 5028 symbolic chord se-
quences. The comparison consisted of six different tasks, in which we varied the
amount of detail in the chord description and the availability of a priori key in-
formation. The CSAS variants outperform the TPSD significantly in most cases,
but are in all cases far more costly to use. The use of a priori key information
improves performance and using only the triad of a chord for similarity matching
gives the best results for the tested algorithms. Nevertheless, we can positively
answer the third question that we asked in the introduction, namely whether
chord descriptions provide a useful and valid abstraction, because the experi-
ment presented in the previous section clearly shows that chord descriptions can
be used for retrieving harmonically related pieces.
The retrieval performance of both algorithms is good, especially if one con-
siders the size of the corpus and the relatively small class sizes (see Table 2),
but there is still room for improvement. Neither algorithm can deal with large
structural changes, e.g. added repetitions, a bridge, etc. A prior analysis of
the structure of the piece combined with partial matching could improve the
retrieval performance. Another important issue is that the compared systems
treat all chords as equally important. This is musicologically not plausible. Con-
sidering the musical function in the local as well as global structure of the chord
progression, as is done in [10] or with sequences of notes in [24], might still
improve the retrieval results.
With runtimes that are measured in days, the CSAS is a costly system. The
runtimes might be improved by using GPU programming [8], or with filtering
steps using algorithms such as BLAST [3].
The harmonic retrieval systems and experiments presented in this paper con-
sider a specific form of symbolic music representations only. Nevertheless, the
application of the methods presented here is not limited to symbolic music, and
audio applications are currently being investigated. Especially the recent developments
in chord label extraction are very promising because the output of these meth-
ods could be matched directly with the systems presented here. The good per-
formance of the proposed algorithms leads us to believe that, both in the audio
and symbolic domain, retrieval systems will benefit from chord sequence based
matching in the near future.

References

1. Allali, J., Ferraro, P., Hanna, P., Iliopoulos, C.S.: Local transpositions in alignment
of polyphonic musical sequences. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE
2007. LNCS, vol. 4726, pp. 26–38. Springer, Heidelberg (2007)
2. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Nuñez, Y., Rappa-
port, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic
Similarity. Computer Music Journal 30(3), 67–76 (2004)
3. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic Local Alignment
Search Tool. Journal of Molecular Biology 215, 403–410 (1990)
4. Arkin, E., Chew, L., Huttenlocher, D., Kedem, K., Mitchell, J.: An Efficiently Com-
putable Metric for Comparing Polygonal Shapes. IEEE Transactions on Pattern
Analysis and Machine Intelligence 13(3), 209–216 (1991)
5. Bello, J., Pickens, J.: A Robust Mid-Level Representation for Harmonic Content
in Music Signals. In: Proceedings of the International Symposium on Music Infor-
mation Retrieval, pp. 304–311 (2005)
6. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2001)
7. Downie, J.S.: The Music Information Retrieval Evaluation Exchange (2005–2007):
A Window into Music Information Retrieval Research. Acoustical Science and
Technology 29(4), 247–255 (2008)
8. Ferraro, P., Hanna, P., Imbert, L., Izard, T.: Accelerating Query-by-Humming on
GPU. In: Proceedings of the Tenth International Society for Music Information
Retrieval Conference (ISMIR), pp. 279–284 (2009)
9. Gannon, P.: Band-in-a-Box. PG Music (1990), http://www.pgmusic.com/ (last
viewed February 2011)
10. de Haas, W.B., Rohrmeier, M., Veltkamp, R.C., Wiering, F.: Modeling Harmonic
Similarity Using a Generative Grammar of Tonal Harmony. In: Proceedings of the
Tenth International Society for Music Information Retrieval Conference (ISMIR),
pp. 549–554 (2009)
11. de Haas, W.B., Veltkamp, R.C., Wiering, F.: Tonal Pitch Step Distance: A Simi-
larity Measure for Chord Progressions. In: Proceedings of the Ninth International
Society for Music Information Retrieval Conference (ISMIR), pp. 51–56 (2008)
12. Hanna, P., Robine, M., Rocher, T.: An Alignment Based System for Chord Se-
quence Retrieval. In: Proceedings of the 2009 Joint International Conference on
Digital Libraries, pp. 101–104. ACM, New York (2009)
13. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for
Evaluating Similarity between Monophonic Musical Sequences. Journal of New
Music Research 36(4), 267–279 (2007)
14. Harte, C., Sandler, M., Abdallah, S., Gómez, E.: Symbolic Representation of Mu-
sical Chords: A Proposed Syntax for Text Annotations. In: Proceedings of the
Sixth International Society for Music Information Retrieval Conference (ISMIR),
pp. 66–71 (2005)
15. van Kranenburg, P., Volk, A., Wiering, F., Veltkamp, R.C.: Musical Models for
Folk-Song Melody Alignment. In: Proceedings of the Tenth International Society
for Music Information Retrieval Conference (ISMIR), pp. 507–512 (2009)
16. Krumhansl, C.: Cognitive Foundations of Musical Pitch. Oxford University Press,
USA (2001)
17. Lerdahl, F.: Tonal Pitch Space. Oxford University Press, Oxford (2001)

18. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
19. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering Chord Idioms
through Beatles and Real Book Songs. In: Proceedings of the Eighth International
Society for Music Information Retrieval Conference (ISMIR), pp. 255–258 (2007)
20. Mauch, M., Noland, K., Dixon, S.: Using Musical Structure to Enhance Automatic
Chord Transcription. In: Proceedings of the Tenth International Society for Music
Information Retrieval Conference (ISMIR), pp. 231–236 (2009)
21. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the
Humanities 24(3), 161–175 (1990)
22. Paiement, J.F., Eck, D., Bengio, S.: A Probabilistic Model for Chord Progressions.
In: Proceedings of the Sixth International Conference on Music Information Re-
trieval (ISMIR), London, UK, pp. 312–319 (2005)
23. Pickens, J., Crawford, T.: Harmonic Models for Polyphonic Music Retrieval. In:
Proceedings of the Eleventh International Conference on Information and Knowl-
edge Management, pp. 430–437. ACM, New York (2002)
24. Robine, M., Hanna, P., Ferraro, P.: Music Similarity: Improvements of Edit-based
Algorithms by Considering Music Theory. In: Proceedings of the ACM SIGMM
International Workshop on Multimedia Information Retrieval (MIR), Augsburg,
Germany, pp. 135–141 (2007)
25. Smith, T., Waterman, M.: Identification of Common Molecular Subsequences.
Journal of Molecular Biology 147, 195–197 (1981)
26. Temperley, D.: The Cognition of Basic Musical Structures. MIT Press, Cambridge
(2001)
27. Uitdenbogerd, A.L.: Music Information Retrieval Technology. Ph.D. thesis, RMIT
University, Melbourne, Australia (July 2002)
Songs2See and GlobalMusic2One:
Two Applied Research Projects in Music
Information Retrieval at Fraunhofer IDMT

Christian Dittmar, Holger Großmann, Estefanía Cano, Sascha Grollmisch,


Hanna Lukashevich, and Jakob Abeßer

Fraunhofer IDMT
Ehrenbergstr. 31, 98693 Ilmenau, Germany
{dmr,grn,cano,goh,lkh,abr}@idmt.fraunhofer.de
http://www.idmt.fraunhofer.de

Abstract. At the Fraunhofer Institute for Digital Media Technology


(IDMT) in Ilmenau, Germany, two current research projects are directed
towards core problems of Music Information Retrieval. The Songs2See
project is supported by the Thuringian Ministry of Economy, Employ-
ment and Technology through granting funds of the European Fund for
Regional Development. The target outcome of this project is a web-based
application that assists music students with their instrumental exercises.
The unique advantage over existing e-learning solutions is the opportu-
nity to create personalized exercise content using the favorite songs of
the music student. GlobalMusic2one is a research project supported by
the German Ministry of Education and Research. It is set out to develop
a new generation of hybrid music search and recommendation engines.
The target outcomes are novel adaptive methods of Music Information
Retrieval in combination with Web 2.0 technologies for better quality
in the automated recommendation and online marketing of world music
collections.

Keywords: music information retrieval, automatic music transcription,


music source separation, automatic music annotation, music similarity
search, music education games.

1 Introduction
Successful exploitation of results from basic research is the indicator for the
practical relevance of a research field. During recent years, the scientific and
commercial interest in the comparatively young research discipline called Music
Information Retrieval (MIR) has grown considerably. Stimulated by the ever-
growing availability and size of digital music catalogs and mobile media players,
MIR techniques become increasingly important to aid convenient exploration
of large music collections (e.g., through recommendation engines) and to enable
entirely new forms of music consumption (e.g., through music games). Evidently,
commercial entities like online music shops, record labels and content aggregators


have realized that these aspects can make them stand out among their competi-
tors and foster customer loyalty. However, the industry’s willingness to fund basic
research in MIR is comparatively low. Thus, only well described methods have
found successful application in the real world. For music recommendation and
retrieval, these are doubtlessly services based on collaborative filtering¹ (CF).
For music transcription and interaction, these are successful video game titles
using monophonic pitch detection². The two research projects in the scope of
this paper provide the opportunity to progress in core areas of MIR, but always
with a clear focus on suitability for real-world applications.
This paper is organized as follows. Each of the two projects is described in
more detail in Sec. 2 and Sec. 3. Results from the research as well as the de-
velopment perspective are reported. Finally, conclusions are given and future
directions are sketched.

2 Songs2See
Musical education of children and adolescents is an important factor in their
personal self-development, regardless of whether it is about learning a musical instrument
or about music courses at school. Children, adolescents and adults must be con-
stantly motivated to practice and complete learning units. Traditional forms of
teaching and even current e-learning systems are often unable to provide this
motivation. On the other hand, music-based games are immensely popular [7],
[17], but they fail to develop skills which are transferable to musical instruments
[19]. Songs2See sets out to develop educational software for music learning
which provides the motivation of game playing and at the same time develops
real musical skills. Using music signal analysis as the key technology, we want
to enable students to use popular musical instruments as game controllers for
games which teach the students to play music of their own choice. This should
be possible regardless of the representation of music they possess (audio, score,
tab, chords, etc.). As a reward, the users receive immediate feedback from the
automated analysis of their rendition. The game application will provide the
students with visual and audio feedback regarding fine-grained details of their
performance with regard to timing (rhythm), intonation (pitch, vibrato), and
articulation (dynamics). Central to the analysis is automatic music transcrip-
tion, i.e., the extraction of a scalable symbolic representation from real-world
music recordings using specialized computer algorithms [24], [11]. Such symbolic
representation allows to render simultaneous visual and audible playbacks for
the students, i.e., it can be translated to traditional notation, a piano-roll view
or a dynamical animation showing the fingering on the actual instrument. The
biggest additional advantage is the possibility to let the students have their fa-
vorite song transcribed into a symbolic representation by the software. Thus, the
students can play along to actual music they like, instead of specifically produced
and edited learning pieces. In order to broaden the possibilities when creating
1 See for example http://last.fm
2 See for example http://www.singstargame.com/

exercises, state-of-the-art methods for audio source separation are exploited.


Application of source separation techniques allows to attenuate accompanying
instruments that obscure the instrument of interest or alternatively to cancel
out the original instrument in order to create a play-along backing track [9].
It should be noted that we are not striving for hi-fi audio quality with regard
to the separation. It is more important that the students can use the above
described functionality to their advantage when practicing, a scenario in which
a certain amount of audible artifacts is acceptable without being disturbing for
the user.

2.1 Research Results


A glimpse of the main research directions shall be given here as an update to
the overview given in [12]. Further details about the different components of the
system are to be published in [17]. Past research activities allowed us to use
well-proven methods for timbre-based music segmentation [26], key detection [6]
and beat grid extraction [40] out of the box. In addition, existing methods for
automatic music transcription and audio source separation are advanced in the
project.
Automatic music transcription. We were able to integrate already avail-
able transcription algorithms that allow us to transcribe the drums, bass, main
melody and chords in real-world music segments [11], [33]. In addition, we en-
able a manual error correction by the user that is helpful when dealing with
difficult signals (high degree of polyphony, overlapping percussive instruments
etc.). With regard to the project goals and requirements of potential users it
became clear that it is necessary to also transcribe monotimbral, polyphonic
instruments like the piano and the guitar. Therefore, we conducted a review
of the most promising methods for multi-pitch transcription. These comprise
the iterative pitch-salience estimation in [23], the maximum likelihood approach
in [13] and a combination of the Specmurt principle [34] with shift-invariant
non-negative factorization [35]. Results show that it is necessary to combine
any of the aforementioned multi-pitch estimators with a chromagram-based [37]
pre-processing to save computation time. That is especially true for real-time
polyphonic pitch detection. For the monophonic real-time case, we could suc-
cessfully exploit the method pointed out in [18].
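As a rough illustration of what a real-time monophonic pitch tracker has to do, the following sketch estimates a fundamental frequency per frame with a plain autocorrelation method; it is a generic baseline for illustration, not the method of [18] or [11].

```python
# Generic autocorrelation-based f0 estimate for a single audio frame.
# Illustrative baseline only, not the pitch detection method used in the project.

import numpy as np

def estimate_f0(frame, sample_rate, fmin=60.0, fmax=1000.0):
    frame = frame - np.mean(frame)                     # remove DC offset
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)                  # shortest period of interest
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    if lag_max <= lag_min:
        return None
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sample_rate / lag if ac[lag] > 0 else None  # None if no clear periodicity
```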
Audio source separation. We focused on assessing different algorithms for
audio source separation that have been reported in the literature. A thorough
review, implementation and evaluation of the methods described in [14] was
conducted. Inspection of the achievable results led to the conclusion, that sound
separation based on tensor factorization is powerful but at the moment compu-
tationally too demanding to be applied in our project. Instead, we focused on
investigating the more straightforward method described in [31]. This approach
separates polyphonic music into percussive and harmonic parts via spectrogram
diffusion. It has received much interest in the MIR community, presumably for
its simplicity. An alternative approach for percussive vs. harmonic separation


has been published in [10]. In that paper, further use of phase information in
sound separation problems has been proposed. In all cases, phase information is
complementary to the use of magnitude information. Phase contours for musical
instruments exhibit similar micromodulations in frequency for certain instru-
ments and can be an alternative of spectral instrument templates or instrument
models. For the case of overlapped harmonics, phase coupling properties can be
exploited. Although a multitude of further source separation algorithms has been
proposed in the literature, only few of them make use of user interaction. In [36],
a promising concept for using an approximate user input for extracting sound
sources is described. Using a combination of the separation methods described
in [9] and [20] in conjunction with user approved note transcriptions, we are able
to separate melodic instruments from the background accompaniment in good
quality. In addition, we exploit the principle described in [41] in order to allow
the user to focus on certain instrument tracks that are well localized within the
stereo panorama.
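To give an impression of how a percussive versus harmonic split can be obtained, the sketch below uses median filtering of the magnitude spectrogram, a widely used relative of the spectrogram-diffusion idea, rather than a reimplementation of the method in [31]; the window size and filter length are arbitrary example values.

```python
# Harmonic/percussive separation by median filtering of the magnitude spectrogram.
# Illustrative sketch only (related to, but not identical with, the method in [31]).

import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import median_filter

def hpss(x, sr, kernel=17):
    f, t, X = stft(x, fs=sr, nperseg=2048)
    mag = np.abs(X)
    harm = median_filter(mag, size=(1, kernel))   # harmonic energy: horizontal ridges
    perc = median_filter(mag, size=(kernel, 1))   # percussive energy: vertical ridges
    mask_h = harm / (harm + perc + 1e-10)         # soft mask for the harmonic part
    _, x_h = istft(X * mask_h, fs=sr, nperseg=2048)
    _, x_p = istft(X * (1.0 - mask_h), fs=sr, nperseg=2048)
    return x_h, x_p
```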

Fig. 1. Screenshot of Songs2See web application



2.2 Development Results


In order to have “tangible” results early on, the software development started in
parallel with the research tasks. Therefore, interfaces have been kept as generic
as possible in order to enable adaptation to alternative or extended core algorithms
later on.
Songs2See web application. In order to achieve easy accessibility to the
exercise application, we decided to go for a web-based approach using Flex³.
Originally, the Flash application built with this tool did not allow direct pro-
cessing of the microphone input. Thus, we had to implement a streaming server
solution. We used the open source Red5⁴ in conjunction with the transcoder
library Xuggler⁵. This way, we conducted real-time pitch detection [11] in the
server application and returned the detected pitches to the web interface. Fur-
ther details about the implementation are to be published in [18]. Only in their
latest release in July 2010, i.e., Adobe Flash Player 10.1, has Flash incorporated
the possibility of handling audio streams from a microphone input directly on
the client side. A screenshot of the prototype interface is shown in Fig. 1. It can
be seen that the user interface shows further assistance than just the plain score
sheet. The fingering on the respective instrument of the student is shown as an
animation. The relative tones are displayed as well as the relative position of
the pitch produced by the players’ instrument. This principle is well known from
music games and has been adapted here for the educational purposes. Further
helpful functions, such as transpose, tempo change and stylistic modification will
be implemented in the future.
Songs2See editor. For the creation of music exercises, we developed an ap-
plication with the working title Songs2See Editor. It is a stand-alone graphical
user interface based on Qt⁶ that allows the average or expert user the creation
of musical exercises. The editor already allows the user to go through the prototypical
work-flow. During import of a song, timbre segmentation is conducted and the
beat grid and key candidates per segment are estimated. The user can choose
the segment of interest, start the automatic transcription or use the source sep-
aration functionality. For immediate visual and audible feedback of the results,
a piano-roll editor is combined with a simple synthesizer as well as sound sepa-
ration controls. Thus, the user can grab any notes he suspects to be erroneously
transcribed, move or delete them. In addition the user is able to seamlessly
mix the ratio between the separated melody instrument and the background
accompaniment. In Fig. 2, the interface can be seen. We expect that the users
of the editor will creatively combine the different processing methods in order to
analyze and manipulate the audio tracks to their liking. In the current stage of
development, export of MIDI and MusicXML is already possible. In a later stage
support for other popular formats, such as TuxGuitar, will be implemented.
3 See http://www.adobe.com/products/flex/
4 See http://osflash.org/red5
5 See http://www.xuggle.com/
6 See http://qt.nokia.com/products

Fig. 2. Screenshot of Songs2See editor prototype

3 GlobalMusic2One
GlobalMusic2one is developing a new generation of adaptive music search engines
combining state-of-the-art methods of MIR with Web 2.0 technologies. It aims
at reaching better quality in automated music recommendation and browsing
inside global music collections. Recently, there has been a growing research in-
terest in music outside the mainstream popular music from the so-called western
culture group [39],[16]. For well-known mainstream music, large amounts of user
generated browsing traces, reviews, play-lists and recommendations available in
different online communities can be analyzed through CF methods in order to
reveal similarities between artists, songs and albums. For novel or niche content
one obvious solution to derive such data is content-based similarity search. Since
the early days of MIR, the search for music items related to a specific query song
or a set of those (Query by Example) has been a consistent focus of scientific
interest. Thus, a multitude of different approaches with varying degree of com-
plexity has been proposed [32]. Another challenge is the automatic annotation
(a.k.a. “auto-tagging” [8]) of world music content. It is obvious that the broad
term “World Music” is one of the most ill-defined tags when being used to lump
all “exotic genres” together. It lacks justification because this category comprises
such a huge variety of different regional styles, influences, and a mutual mix up
thereof. On the one hand, retaining the strict classification paradigm for such a
high variety of musical styles inevitably limits the precision and expressiveness
of a classification system that shall be applied to a world-wide genre taxonomy.
With GlobalMusic2One, the user may create new categories allowing the system
to flexibly adapt to new musical forms of expression and regional contexts. These

categories can, for example, be regional sub-genres which are defined through
exemplary songs or song snippets. This self-learning MIR framework will be
continuously expanded with precise content-based descriptors.

3.1 Research Results

With automatic annotation of world music content, songs often cannot be as-
signed to one single genre label. Instead, various rhythmic, melodic and harmonic
influences conflate into multi-layered mixtures. Common classifier approaches
fail due to their inherent assumption that for all song segments, one dominant
genre exists and thus is retrievable.

Multi-domain labeling. To overcome these problems, we introduced the


“multi-domain labeling” approach [28] that breaks down multi-label annotations
into single-label annotations within different musical domains, namely
timbre, rhythm, and tonality. In addition, a separate annotation of each tempo-
ral segment of the overall song is enabled. This leads to a more meaningful and
realistic two-dimensional description of multi-layered musical content. Related
to that topic, classification of singing vs. rapping in urban music has been de-
scribed in [15]. In another paper [27] we applied the recently proposed Multiple
Kernel Learning (MKL) technique that has been successfully used for real-world
applications in the fields of computational biology, image information retrieval
etc. In contrast to classic Support Vector Machines (SVM), MKL provides a
possibility of weighting over different kernels depending on a feature set.
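A multi-domain, segment-wise annotation can be pictured as one label per domain for every temporal segment; the structure below is purely illustrative, with made-up segment boundaries and label vocabularies.

```python
# Hypothetical example of a multi-domain, per-segment annotation: one (single) label
# per musical domain for each temporal segment. Segment times and labels are made up.

annotation = {
    'segment_1': {'span_sec': (0.0, 42.3),
                  'timbre': 'acoustic guitar + voice',
                  'rhythm': 'bossa nova',
                  'tonality': 'major, diatonic'},
    'segment_2': {'span_sec': (42.3, 95.0),
                  'timbre': 'brass section + percussion',
                  'rhythm': 'samba',
                  'tonality': 'major, chromatic passages'},
}
```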

Clustering with constraints. Inspired by the work in [38], we investigated


clustering with constraints with application to active exploration of music collec-
tions. Constrained clustering has been developed to improve clustering methods
through pairwise constraints. Although these constraints are received as queries
from a noiseless oracle, most of the methods involve a random procedure stage
to decide which elements are presented to the oracle. In [29] we applied spectral
clustering with constraints to a music dataset, where the queries for constraints
were selected in a deterministic way through outlier identification perspective.
We simulated the constraints through the ground-truth music genre labels. The
results showed that constrained clustering with the deterministic outlier iden-
tification method achieved reasonable and stable results as the number of
constraint queries increased. Although the constraints were enhancing
the similarity relations between the items, the clustering was conducted in the
static feature space. In [30] we embedded the information about the constraints
to a feature selection procedure, that adapted the feature space regarding the
constraints. We proposed two methods for the constrained feature selection:
similarity-based and constraint-based. We applied the constrained clustering
with embedded feature selection for the active exploration of music collections.
Our experiments showed that the proposed feature selection methods improved
the results of the constrained clustering.

Rule-based classification. The second important research direction was rule-


based classification with high-level features. In general, high-level features can
again be categorized according to different musical domains like rhythm, har-
mony, melody or instrumentation. In contrast to low-level and mid-level audio
features, they are designed with respect to music theory and are thus inter-
pretable by human observers. Often, high-level features are derived from au-
tomatic music transcription or classification into semantic categories. Different
approaches for the extraction of rhythm-related high-level features have been re-
ported in [21], [25] and [1]. Although automatic extraction of high-level features
is still quite error-prone, we proved in [4] that they can be used in a rule-based
classification scheme with a quality comparable to state-of-the-art pattern recog-
nition using SVM. The concept of rule-based classification was inspected in detail
in [3] using a fine granular manual annotation of high-level features referring to
rhythm, instrumentation etc. In this paper, we tested rule-based classification
on a restricted dataset of 24 manually annotated audio tracks and achieved an
accuracy rate of over 80%.
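The flavour of such rule-based classification can be sketched as follows; the feature names, thresholds and genre labels are invented for illustration and are not the rules evaluated in [3] or [4].

```python
# Purely illustrative rule-based classifier over high-level features; all feature names,
# thresholds and labels are hypothetical, not the actual rule set from [3] or [4].

def classify(features):
    """`features` is a dict of high-level descriptors for a song or segment."""
    if features.get('dominant_percussion') == 'tabla' and features.get('drone'):
        return 'hindustani'
    if features.get('clave_pattern') and features.get('montuno_piano'):
        return 'salsa'
    if features.get('swing_ratio', 1.0) > 1.5 and features.get('bass_style') == 'walking':
        return 'jazz'
    return 'unknown'
```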

Novel audio features. Adding the fourth domain instrumentation to the


multi-domain approach described in Sec. 3.1 required the design and implemen-
tation of novel audio features tailored towards instrument recognition in poly-
phonic recordings. Promising results even with instruments from non-European
cultural areas are reported in [22]. In addition, we investigated the automatic
classification of rhythmic patterns in global music styles in [42]. In this work,
special measures have been taken to make the features and distance measures
tempo independent. This is done implicitly, without the need for a preceding
beat grid extraction that is commonly recommended in the literature to derive
beat synchronous feature vectors. In conjunction with the approach to rule-based
classification described in Sec. 3.1, novel features for the classification of bass-
playing styles have been published in [5] and [2]. In this paper, we compared
an approach based on high-level features and another one based on similarity
measures between bass patterns. For both approaches, we assessed two different
strategies: classification of patterns as a whole and classification of all measures
of a pattern with a subsequent accumulation of the classification results. Fur-
thermore, we investigated the influence of potential transcription errors on the
classification accuracy. Given a taxonomy consisting of 8 different bass playing
styles, best classification accuracy values of 60.8% were achieved for the feature-
based classification and 68.5% for the pattern similarity approach.

3.2 Development Results

As with the Songs2See project, the development phase in GlobalMusic2One


started in parallel with the research activities and is now near its completion.
It was mandatory to have early prototypes of the required software for certain
tasks in the project.

Annotation Tool. We developed a Qt-based Annotation Tool that facilitates


the gathering of conceptualized annotations for any kind of audio content. The
program was designed for expert users to enable them to manually describe audio
files efficiently on a very detailed level. The Annotation Tool can be configured
to different and extensible annotation schemes making it flexible for multiple ap-
plication fields. The tool supports single labeling and multi-labeling, as well as
the new approach of multi-domain labeling. However, strong labeling is not en-
forced by the tool, but remains under the control of the user. As a unique feature
the audio annotation tool comes with an automated, timbre-based audio segmen-
tation algorithm integrated that helps the user to intuitively navigate through
the audio file during the annotation process and select the right segments. Of
course, the granularity of the segmentation can be adjusted and every single
segment border can be manually corrected if necessary. There are now approx.
100 observables that can be chosen from while annotating. This list is under
steady development. In Fig. 3, a screenshot of the Annotation Tool configured
to the Globalmusic2one description scheme can be seen.

Fig. 3. Screenshot of the Annotation Tool configured to the Globalmusic2one descrip-


tion scheme, showing the annotation of a stylistically and structurally complex song

PAMIR framework. The Personalized Adaptive MIR framework (PAMIR)


is a server system written in Python that loosely strings together various func-
tional blocks. These blocks are e.g., a relational database, a server for comput-
ing content-based similarities between music items and various machine learning
servers. PAMIR is also the instance that enables the adaptivity with respect to
the user preferences. Basically, it allows conducting feature selection and automatically
picking the most suitable classification strategy for a given classification

problem. Additionally, it enables content-based similarity search both on song


and segment level. The latter method can be used to retrieve parts of songs
that are similar to the query, whereas the complete song may exhibit different
properties. Visualizations of the classification and similarity-search results are
delivered to the users via a web interface. In Fig. 4, the prototype web inter-
face to the GlobalMusic2One portal is shown. It can be seen that we feature an
intuitive similarity map for explorative browsing inside the catalog as well as
target-oriented search masks for specific music properties. The same interface
allows the instantiation of new concepts and manual annotation of reference
songs. In the upper left corner, a segment player is shown that allows to jump
directly to different parts in the songs.

Fig. 4. Screenshot of the GlobalMusic2One prototype web client

4 Conclusions and Outlook


In this paper we presented an overview of the applied MIR projects Songs2See
and GlobalMusic2One. Both address core challenges of MIR with a strong focus on
real-world applications. With Songs2See, development activities will be strongly


directed towards implementation of more advanced features in the editor as
well as the web application. The main efforts inside GlobalMusic2One will be
concentrated on consolidating the framework and the web client.

Acknowledgments

The Thuringian Ministry of Economy, Employment and Technology supported


this research by granting funds of the European Fund for Regional Develop-
ment to the project Songs2See⁷, enabling transnational cooperation between
Thuringian companies and their partners from other European regions. Addi-
tionally, this work has been partly supported by the German research project
GlobalMusic2One⁸ funded by the Federal Ministry of Education and Research
(BMBF-FKZ: 01/S08039B).

7 See http://www.songs2see.eu
8 See http://www.globalmusic2one.net
MusicGalaxy:
A Multi-focus Zoomable Interface for
Multi-facet Exploration of Music Collections

Sebastian Stober and Andreas Nürnberger

Data & Knowledge Engineering Group


Faculty of Computer Science
Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany
{sebastian.stober,andreas.nuernberger}@ovgu.de
http://www.dke-research.de

Abstract. A common way to support exploratory music retrieval sce-


narios is to give an overview using a neighborhood-preserving projection
of the collection onto two dimensions. However, neighborhood cannot
always be preserved in the projection because of the inherent dimen-
sionality reduction. Furthermore, there is usually more than one way
to look at a music collection and therefore different projections might
be required depending on the current task and the user’s interests. We
describe an adaptive zoomable interface for exploration that addresses
both problems: It makes use of a complex non-linear multi-focal zoom
lens that exploits the distorted neighborhood relations introduced by the
projection. We further introduce the concept of facet distances represent-
ing different aspects of music similarity. User-specific weightings of these
aspects allow an adaptation according to the user’s way of exploring the
collection. Following a user-centered design approach with focus on us-
ability, a prototype system has been created by iteratively alternating
between development and evaluation phases. The results of an exten-
sive user study including gaze analysis using an eye-tracker prove that
the proposed interface is helpful while at the same time being easy and
intuitive to use.

Keywords: exploration, interface, multi-facet, multi-focus.

1 Introduction

There is a lot of ongoing research in the field of music retrieval aiming to improve
retrieval results for queries posed as text, sung, hummed or by example as well
as to automatically tag and categorize songs. All these efforts facilitate scenarios
where the user is able to somehow formulate a query – either by describing the
song or by giving examples. But what if the user cannot pose a query because
the search goal is not clearly defined? E.g., he might look for background music
for a photo slide show but does not know where to start. All he knows is that he
can tell if it is the right music the moment he hears it. In such a case, exploratory

retrieval systems can help by providing an overview of the collection and letting the user decide which regions to explore further.

Fig. 1. Possible problems caused by projecting objects represented in a high-dimensional feature space (left) onto a low-dimensional space for display (right)
When it comes to getting an overview of a music collection, neighborhood-preser-
ving projection techniques have become increasingly popular. Beforehand, the
objects to be projected – depending on the approach, this may be artists, albums,
tracks or any combination thereof – are analyzed to extract a set of descriptive
features. (Alternatively, feature information may also be annotated manually or
collected from external sources.) Based on these features, the objects can be
compared – or more specifically: appropriate distance- or similarity measures
can be defined. The general objective of the projection can then be paraphrased
as follows: Arrange the objects in two or three dimensions (on the display) in
such a way that neighboring objects are very similar and the similarity decreases
with increasing object distance (on the display). As the feature space of the ob-
jects to be projected usually has far more dimensions than the display space,
the projection inevitably causes some loss of information – irrespective of which
dimensionality reduction technique is applied. Consequently, this leads to a dis-
torted display of the neighborhoods such that some objects will appear closer
than they actually are (type I error), and on the other hand some objects that
are distant in the projection may in fact be neighbors in feature space (type
II error). Such neighborhood distortions are depicted in Figure 1. These “pro-
jection errors” cannot be fixed on a global scale without introducing new ones
elsewhere as the projection is already optimal w.r.t. some criteria (depending
on the technique used). In this sense, they should not be considered errors made by the projection technique but rather errors of the resulting (displayed) arrangement.
When a user explores a projected collection, type I errors increase the number
of dissimilar (i.e. irrelevant) objects displayed in a region of interest. While this
might become annoying, it is much less problematic than type II errors. They
result in similar (i.e. relevant) objects to be displayed away from the region of
interest – the neighborhood they actually belong to. In the worst case they could
even be off-screen if the display is limited to the currently explored region. This
way, a user could miss objects he is actually looking for.

The interactive visualization technique described in this paper exploits these


distorted neighborhood relations during user-interaction. Instead of trying to
globally repair errors in the projection, the general idea is to temporarily fix
the neighborhood in focus. The approach is based on a multi-focus fish-eye
lens that allows a user to enlarge and explore a region of interest while at the
same time adaptively distorting the remaining collection to reveal distant regions
with similar tracks. It can therefore be considered a focus-adaptive distortion
technique.
Another problem that arises when working with similarity-based neighbor-
hoods is that music similarity is highly subjective and may depend on a person’s
background. Consequently, there is more than one way to look at a music col-
lection – or more specifically to compare two tracks based on their features.
The user-interface presented in this paper therefore allows the user to modify
the underlying distance measure by adapting weights for different aspects of
(dis-)similarity.
The remainder of this paper is structured as follows: Section 2 gives an overview
of related approaches that aim to visualize a music collection. Subsequently,
Section 3 outlines the approach developed in this work. The underlying tech-
niques are addressed in Section 4 and Section 5 explains how a user can interact
with the proposed visualization. In order to evaluate the approach, a user study
has been conducted which is described in Section 6. Finally, Section 7 concludes
with a brief summary.

2 Related Work
There exists a variety of approaches that in some way give an overview of a music
collection. For the task of music discovery which is closely related to collection
exploration, a very broad survey of approaches is given in [7]. Generally, there are
several possible levels of granularity that can be supported, the most common
being: track, album, artist and genre. Though a system may cover more than
one granularity level (e.g. in [51] visualized as disc or TreeMap [41]), usually a
single one is chosen. The user-interface presented in this paper focuses on the
track level as do most of the related approaches. (However, like most of the other
techniques, it may as well be applied on other levels such as albums or artists.
All that is required is an appropriate feature representation of the objects of
interest.) Those approaches focusing on a single level can roughly be categorized
into graph-based and similarity-based overviews.
Graphs facilitate a natural navigation along relationship-edges. They are espe-
cially well-suited for the artist level as social relations can be directly visualized
(as, e.g., in the Last.fm Artist Map 1 or the Relational Artist Map RAMA [39]).
However, building a graph requires relations between the objects – either from
domain knowledge or artificially introduced. E.g., there are some graphs that use
similarity-relations obtained from external sources (such as the APIs of Last.fm 2
1 http://sixdegrees.hu/last.fm/interactive map.html
2 http://www.last.fm/api

or EchoNest 3 ) and not from an analysis of the objects themselves. Either way,
this results in a very strong dependency and may quickly become problematic
for less main-stream music where such information might not be available. This
is why a similarity-based approach is chosen here instead.
Similarity-based approaches require the objects to be represented by one or
more features. They are in general better suited for track-level overviews due to the vast variety of content-based features that can be extracted from tracks.
For albums and artists, either some means for aggregating the features of the
individual tracks are needed or non-content-based features, e.g. extracted from
knowledge resources like MusicBrainz 4 and Wikipedia 5 or cultural meta-data
[54], have to be used. In most cases the overview is then generated using some
metric defined on these features which leads to proximity of similar objects in the
feature space. This neighborhood should be preserved in the collection overview
which usually has only two dimensions. Popular approaches for dimensionality
reduction are Self-Organizing Maps (SOMs) [17], Principal Component Analysis
(PCA) [14] and Multidimensional Scaling (MDS) techniques [18].
In the field of music information retrieval, SOMs are widely used. SOM-based
systems comprise the SOM-enhanced Jukebox (SOMeJB) [37], the Islands of
Music [35,34] and nepTune [16], the MusicMiner [29], the PlaySOM - and Pock-
etSOM -Player [30] (the latter being a special interface for mobile devices), the
BeatlesExplorer [46] (the predecessor prototype of the system presented here),
the SoniXplorer [23,24], the Globe of Music [20] and the tabletop applications
MUSICtable [44], MarGrid [12], SongExplorer [15] and [6]. SOMs are prototype-
based and thus there has to be a way to initially generate random prototypes
and to modify them gradually when objects are assigned. This poses special
requirements regarding the underlying feature space and distance metric. More-
over, the result depends on the random initialization and the neural network
gradient descend algorithm may get stuck in a local minimum and thus not
produce an optimal result. Further, there are several parameters that need to
be tweaked according to the data set such as the learning rate, the termination
criterion for iteration, the initial network structure, and (if applicable) the rules
by which the structure should grow. However, there are also some advantages
of SOMs: Growing versions of SOMs can adapt incrementally to changes in the
data collection whereas other approaches may always need to generate a new
overview from scratch. Section 4.2 will address this point more specifically for
the approach taken here. For the interactive task at hand, which requires a real-
time response, the disadvantages of SOMs outweigh their advantages. Therefore,
the approach taken here is based on MDS.
Given a set of data points, MDS finds an embedding in the target space
that maintains their distances (or dissimilarities) as far as possible – without
having to know their actual values. This way, it is also well suited to compute a
layout for spring- or force-based approaches. PCA identifies the axes of highest

3 http://developer.echonest.com
4 http://musicbrainz.org
5 http://www.wikipedia.org
variance termed principal components for a set of data points in high-dimensional space. To obtain a dimensionality reduction to two-dimensional space, the data points are simply projected onto the two principal component axes with the highest variance. PCA and MDS are closely related [55]. In contrast to SOMs, both are non-parametric approaches that compute an optimal solution (with respect to data variance maximization and distance preservation respectively) in fixed polynomial time. Systems that apply PCA, MDS or similar force-based approaches comprise [4], [11], [10], the fm4 Soundpark [8], MusicBox [21], and SoundBite [22].

Fig. 2. In SoundBite [22], a seed song and its nearest neighbors are connected by lines
All of the above approaches use some kind of projection technique to visualize
the collection but only a small number tries to additionally visualize properties
of the projection itself: The MusicMiner [29] draws mountain ranges between
songs that are displayed close to each other but dissimilar. The SoniXplorer
[23,24] uses the same geographical metaphor but in a 3D virtual environment
that the user can navigate with a game pad. The Islands of Music [35,34] and its
related approaches [16,30,8] use the third dimension the other way around: Here,
islands or mountains refer to regions of similar songs (with high density). Both
ways, local properties of the projection are visualized – neighborhoods of either
dissimilar or similar songs. In contrast (and possibly as a supplementation) to
this, the technique proposed in this paper aims to visualize properties of the
projection that are not locally confined: As visualized in [32], there may be
distant regions in a projection that contain very similar objects. This is much
like a “wormhole” connecting both regions through the high-dimensional feature
space. To our knowledge, the only attempt so far to visualize such distortions
caused by the projection is described in [22]. The approach is to draw lines that
connect a selected seed track (highlighted with a circle) with its neighbors as
shown in Figure 2.

Additionally, our goal is to support user-adaptation during the exploration


process by means of weighting aspects of music similarity. Of the above ap-
proaches, only the revised SoniXplorer [24], MusicBox [21] and the BeatlesEx-
plorer [46] – our original SOM-based prototype – allow automatic adaptation
of the view on the collection through interaction. Apart from this, there exist
systems that also adapt a similarity measure but not to change the way the col-
lection is presented in an overview but to directly generate playlists: MPeer [3]
allows the user to navigate the similarity space defined by the audio content, the lyrics
and cultural meta-data collected from the web through an intuitive joystick in-
terface. In the E-Mu Jukebox [53], weights for five similarity components (sound,
tempo, mood, genre and year – visually represented by adapters) can be changed
by dragging them on a bull’s eye. PATS (Personalized Automatic Track Selec-
tion) [36] and the system described in [56] do not require manual adjustment of
the underlying similarity measure but learn from the user as he selects songs that
in his opinion do not fit the current context-of-use. PAPA (Physiology and
Purpose-Aware Automatic Playlist Generation) [33] as well as the already com-
mercially available BODiBEAT music player6 uses sensors that measure several
bio-signals (such as the pulse) of the user as immediate feedback for the music
currently played. In this case not even a direct interaction of the user with the
system is required to continuously adapt playlist-models for different purposes.
However, as we have shown in a recent survey [50], users do not particularly like
the idea of having their bio-signals logged – especially if they cannot control the
impact of this information on the song recommendation process. In contrast to
these systems that purely focus on the task of playlist generation, we pursue a
more general goal in providing an adaptive overview of the collection that can
then be used to easily generate playlists as e.g. already shown in [16] or [21].

3 Outline
The goal of our work is to provide a user with an interactive way of exploring
a music collection that takes into account the above described inevitable lim-
itations of a low-dimensional projection of a collection. Further, it should be
applicable for realistic music collections containing several thousands of tracks.
The approach taken can be outlined as follows:
– An overview of the collection is given, where all tracks are displayed as points
at any time. For a limited number of tracks that are chosen to be spatially
well distributed and representative, an album cover thumbnail is shown for
orientation.
– The view on the collection is generated by a neighborhood-preserving pro-
jection (e.g. MDS, SOM, PCA) from some high-dimensional feature space
onto two dimensions. I.e., in general tracks that are close in feature space
will likely appear as neighbors in the projection.

6 http://www.yamaha.com/bodibeat

– Users can adapt the projection by choosing weights for several aspects of
music (dis-)similarity. This gives them the possibility to look at a collection
from different perspectives. (This adaptation is purely manual, i.e. the visu-
alization as described in this paper is only adaptable w.r.t. music similarity.
Techniques to further enable adaptive music similarity are, e.g., discussed in
[46,49].)
– In order to allow immediate visual feedback in case of similarity adaptation,
the projection technique needs to guarantee near real-time performance –
even for large music collections. The quality of the produced projection is
only secondary – a perfect projection that correctly preserves all distances
between all tracks is extremely unlikely anyway.
– The projection will inevitably contain distortions of the actual distances of
the tracks. Instead of trying to improve the quality of the projection method
and trying to fix heavily distorted distances, they are exploited during in-
teraction with the projection:
The user can zoom into a region of interest. The space for this region is
increased, thus allowing more details to be displayed. At the same time the surrounding space is compacted but not hidden from view. This way, there remains some context for orientation. To accomplish such behavior, the zoom is based on a non-linear distortion similar to so-called “fish-eye” lenses.
At this point the original (type II) projection errors come into play: Instead
of putting a single lens focus on the region of interest, additional focuses are
introduced in regions that contain tracks similar to those in primary focus.
The resulting distortion brings original neighbors back closer to each other.
This gives the user another option for interactive exploration.
Figure 3 depicts the outline of the approach. The following sections cover the
underlying techniques (Section 4) and the user-interaction (Section 5) in detail.

4 Underlying Techniques
4.1 Features and Facets
The prototype system described here uses collections of music tracks. As a pre-
requisite, it is assumed that the tracks are represented by some descriptive fea-
tures that can, e.g., be extracted, manually annotated or obtained form external
sources. In the current implementation, content-based features are extracted
utilizing the capabilities of the frameworks CoMIRVA [40] and JAudio [28].
Specifically, Gaussian Mixture Models of the Mel Frequency Cepstral Coeffi-
cients (MFCCs) according to [2] and [26] and “fluctuation patterns” describing
how strong and fast beats are played within specific frequency bands [35] are
computed with CoMIRVA. JAudio is used to extract a global audio descriptor
“MARSYAS07” as described in [52]. Further, lyrics for all songs were obtained
through the web service of LyricWiki7 , filtered for stop words, stemmed and de-
scribed by document vectors with TFxIDF term weights [38]. Additional features
7 http://lyricwiki.org
that are currently only used for the visualization are ID3 tags (artist, album, title, track number and year) extracted from the audio files, track play counts obtained from a Last.fm profile, and album covers gathered through web search.

Fig. 3. Outline of the approach showing the important processing steps and data structures. Top: preprocessing. Bottom: interaction with the user with screenshots of the graphical user interface.

Distance Facets. Based on the features associated with the tracks, facets are
defined (on subspaces of the feature space) that refer to different aspects of music
(dis-)similarity. This is depicted in Figure 3 (top).
Definition 1. Given a set of features F , let S be the space determined by the
feature values for a set of tracks T . A facet f is defined by a facet distance
measure δf on a subspace Sf ⊆ S of the feature space, where δf satisfies the
following conditions for any x, y ∈ T :
– δ(x, y) ≥ 0 and δ(x, y) = 0 if and only if x = y
– δ(x, y) = δ(y, x) (symmetry)
Optionally, δ is a distance metric if it additionally obeys the triangle inequality
for any x, y, z ∈ T :
– δ(x, z) ≤ δ(x, y) + δ(y, z) (triangle inequality)
E.g., a facet “timbre” could be defined on the MFCC-based feature described in
[26] whereas a facet “text” could compare the combined information from the
features “title” and “lyrics”.
It is important to stress the difference to common faceted browsing and search
approaches that rely on a faceted classification of objects to support users in
exploration by filtering available information. Here, no such filtering by value is
applied. Instead, we employ the concept of facet distances to express different
aspects of (dis-)similarity that can be used for filtering.

Facet Distance Normalization. In order to avoid a bias when aggregating


several facet distance measures, the values should be normalized. The following
normalization truncates very high facet distance values δf (x, y) of a facet f and
results in a value range of [0, 1]

\[
\delta_f'(a, b) = \min\left(1,\; \frac{\delta_f(a, b)}{\mu + \sigma}\right) \tag{1}
\]
where $\mu$ is the mean
\[
\mu = \frac{1}{|\{(x, y) \in T^2\}|} \sum_{(x, y) \in T^2} \delta_f(x, y) \tag{2}
\]
and $\sigma$ is the standard deviation
\[
\sigma = \sqrt{\frac{1}{|\{(x, y) \in T^2\}|} \sum_{(x, y) \in T^2} \bigl(\delta_f(x, y) - \mu\bigr)^2} \tag{3}
\]
of all distance values with respect to $\delta_f$.
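As an illustration, the truncated normalization of Eqs. (1)–(3) can be written in a few lines of NumPy. This is only a sketch of the formulas above, not the system's implementation; D is assumed to be a square matrix of raw pairwise facet distances over the track set T.

```python
import numpy as np

def normalize_facet_distances(D):
    """Truncated normalization of a pairwise facet distance matrix (Eqs. 1-3).

    D : (|T|, |T|) array of raw facet distances delta_f(x, y)
    Returns values in [0, 1]; distances above mu + sigma are clipped to 1.
    """
    mu = D.mean()        # mean over all pairs (x, y) in T^2
    sigma = D.std()      # standard deviation over all pairs
    return np.minimum(1.0, D / (mu + sigma))
```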



Table 1. Facets defined for the current implementation

facet name   feature                        distance metric
timbre       GMM of MFCCs                   Kullback-Leibler divergence
rhythm       fluctuation patterns           Euclidean distance
dynamics     MARSYAS07                      Euclidean distance
lyrics       TFxIDF weighted term vectors   cosine distance

Facet Distance Aggregation. The actual distance between tracks x, y ∈ T


w.r.t. the facets f1, . . . , fl can be computed by aggregating the individual
facet distances δf1 (x, y), . . . , δfl (x, y). For the aggregation, basically any function
could be used. Common parametrized aggregation functions are:

– $d = \sqrt{\sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)^2}$ (weighted Euclidean distance)
– $d = \sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)^2$ (squared weighted Euclidean distance)
– $d = \sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)$ (weighted sum)
– $d = \max_{i=1..l} \{w_i\, \delta_{f_i}(x, y)\}$ (maximum)
– $d = \min_{i=1..l} \{w_i\, \delta_{f_i}(x, y)\}$ (minimum)
These aggregation functions allow the importance of the facet distances d1, . . . , dl to be controlled through their associated weights w1, . . . , wl. Default settings
for the facet weights and the aggregation function are defined by an expert (who
also defined the facets themselves) and can later be adapted by the user dur-
ing interaction with the interface. Table 1 lists the facets used in the current
implementation.
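The weighted aggregation of normalized facet distances can be sketched as follows. This is an illustrative sketch rather than the system's actual code; the per-facet distances are assumed to be stacked along the last axis of a NumPy array.

```python
import numpy as np

def aggregate(facet_dists, weights, mode="weighted_sum"):
    """Aggregate per-facet distances (..., l) into a single distance value.

    facet_dists : array whose last axis holds the l normalized facet distances
    weights     : sequence of l facet weights w_1, ..., w_l
    """
    w = np.asarray(weights)
    if mode == "weighted_euclidean":
        return np.sqrt(np.sum(w * facet_dists ** 2, axis=-1))
    if mode == "squared_weighted_euclidean":
        return np.sum(w * facet_dists ** 2, axis=-1)
    if mode == "weighted_sum":
        return np.sum(w * facet_dists, axis=-1)
    if mode == "maximum":
        return np.max(w * facet_dists, axis=-1)
    if mode == "minimum":
        return np.min(w * facet_dists, axis=-1)
    raise ValueError(f"unknown aggregation mode: {mode}")
```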

4.2 Projection
In the projection step shown in Figure 3 (bottom), the position of all tracks on
the display is computed according to their (aggregated) distances in the high-
dimensional feature space. Naturally, this projection should be neighborhood-
preserving such that tracks close to each other in feature space are also close in
the projection. We propose to use a landmark- or pivot-based Multidimensional
Scaling approach (LMDS) for the projection as described in detail in [42,43].
This is a computationally efficient approximation to classical MDS. The general
idea of this approach is as follows: A representative sample of objects – called
“landmarks” – is drawn randomly from the whole collection.8 For this landmark
sample, an embedding into low-dimensional space is computed using classical
MDS. The remaining objects can then be located within this space according to
their distances to the landmarks.
8 Alternatively, the MaxMin heuristic (greedily seeking out extreme, well-separated
landmarks) could be used – with the optional modification to replace landmarks
with a predefined probability by randomly chosen objects (similar to a mutation op-
erator in genetic programming). Neither alternative seems to produce less distorted
projections while having much higher computational complexity. However, there is
possibly some room for improvement here but this is out of the scope of this paper.

Complexity. Classical MDS has a computational complexity of O(N³) for the projection, where N is the number of objects in the data set. Additionally, the N × N distance matrix needed as input requires O(N²) space and is computed in O(CN²), where C is the cost of computing the distance between two objects. By limiting the number of landmark objects to m ≪ N, an LMDS projection can be computed in O(m³ + kmN), where k is the dimension of the visualization space, which is fixed here to 2. The first part refers to the computation of the classical MDS for the m landmarks and the second to the projection of the remaining objects with respect to the landmarks. Further, LMDS requires only the distances of each data point to the landmarks, i.e. only an m × N distance matrix has to be computed, resulting in O(mN) space and O(CmN) computational complexity. This way, LMDS becomes feasible for application on large data sets as it scales linearly with the size of the data set.
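To make the landmark idea concrete, a minimal NumPy sketch of landmark MDS is given below. It follows the standard LMDS recipe (classical MDS on the landmarks, then distance-based triangulation of the remaining objects) and is not the implementation used in the prototype; the matrix names are illustrative.

```python
import numpy as np

def landmark_mds(D_ll, D_al, k=2):
    """Sketch of landmark MDS (LMDS).

    D_ll : (m, m) distances between the m landmark tracks
    D_al : (N, m) distances from all N tracks to the landmarks
    k    : target dimensionality (2 for the screen)
    Returns an (N, k) array of projected coordinates.
    """
    m = D_ll.shape[0]
    D2 = D_ll ** 2
    J = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    B = -0.5 * J @ D2 @ J                    # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:k]        # k largest eigenpairs
    lam = np.maximum(evals[top], 1e-12)
    V = evecs[:, top]
    L_pinv = V / np.sqrt(lam)                # (m, k) pseudo-inverse transform
    mu = D2.mean(axis=0)                     # mean squared landmark distances
    return -0.5 * (D_al ** 2 - mu) @ L_pinv  # distance-based triangulation
```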

Facet Distance Caching. The computation of the distance matrix that is


required for LMDS can be very time consuming – not only depending on the
size of the collection and landmark sample but also on the number of facets and
the complexity of the respective facet distance measures. Caching can reduce the
amount of information that has to be recomputed. Assuming a fixed collection,
the distance matrix only needs to be recomputed if the facet weights or the
facet aggregation function change. Moreover, even a change of the aggregation
parameters has no impact on the facet distances. This allows to pre-compute
for each track the distance values to all landmarks for all facets offline and store
them in the 3-dimensional data structure depicted in Figure 3 (top) called “facet
distance cuboid”. It is necessary to store the facet distance values separately as
it is not clear at indexing time how these values are to be aggregated. During
interaction with the user, when near real-time response is required, only the
computational lightweight facet distance aggregation that produces the distance
matrix from the cuboid and the actual projection need to be done.
If N is the number of tracks, m the number of landmarks and l the number
of facets, the cuboid has the dimension N × m × l and holds as many distance
values. Note that m and l are fixed small values of O(100) and O(10) respectively.
Thus, the space requirement effectively scales linearly with N and even for large
N the data structure should fit into memory. To further reduce the memory
requirements of this data structure, the distance values are discretized to the
byte range ([0 . . . 255]) after normalization to [0, 1] as described in Section 4.1.
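The facet distance cuboid and the lightweight online aggregation it enables can be sketched as follows. This is an illustrative sketch with assumed array names, not the system's actual data structure code; the aggregate() function refers to the sketch given in Section 4.1.

```python
import numpy as np

def build_cuboid(facet_dists):
    """Offline step: discretize normalized track-to-landmark facet distances.

    facet_dists : (N, m, l) array of facet distances in [0, 1]
    Returns an (N, m, l) uint8 cuboid with values in 0..255.
    """
    return np.round(facet_dists * 255).astype(np.uint8)

def track_landmark_distances(cuboid, weights, aggregate):
    """Online step: aggregate the cuboid into an (N, m) track-landmark matrix."""
    facet_dists = cuboid.astype(np.float32) / 255.0   # back to [0, 1]
    return aggregate(facet_dists, weights)            # e.g. the aggregate() sketch above
```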

Incremental Collection Updates. Re-computation becomes also necessary


once the collection changes. In previous work [32,46] we used a Growing Self-
Organizing Map approach (GSOM) for the projection. While both approaches,
LMDS and GSOM, are neighborhood-preserving, GSOMs have the advantage of
being inherently incremental, i.e. adding or removing objects from the data set
only gradually changes the way the data is projected. This is a nice characteristic
because too abrupt changes in the projection caused by adding or removing some
tracks might irritate the user if he has gotten used to a specific projection. On
the contrary, LMDS does not allow for incremental changes of the projection.

Fig. 4. The SpringLens particle mesh is distorted by changing the rest-length of


selected springs

However, it still allows objects to be added or removed from the data set to
some extent without the need to compute a new projection: If a new track
is added to the collection, an additional “layer” has to be appended to the
facet distance cuboid containing the facet distances of the new track with all
landmarks. The new track can then be projected according to these distances. If
a track is removed, the respective “layer” of the cuboid can be deleted. Neither
operation further alters the projection.9 Adding or removing many tracks
may however alter the distribution of the data (and thus the covariances) in such
a way that the landmark sample may no longer be representative. In this case,
a new projection based on a modified landmark sample should be computed.
However, for the scope of this paper, a stable landmark set is assumed and this
point is left for further work.

4.3 Lens Distortion


Once the 2-D-positions of all tracks are computed by the projection technique,
the collection could already be displayed. However, an intermediate distortion
step is introduced as depicted in Figure 3 (bottom). It serves as the basis for the
interaction techniques described later.

Lens Modeling. The distortion technique is based on an approach originally


developed to model complex nonlinear distortions of images called “SpringLens”
[9]. A SpringLens consists of a mesh of mass particles and interconnecting springs
that form a rectangular grid with fixed resolution. Through the springs, forces
are exerted between neighboring particles affecting their motion. By changing the
rest-length of selected springs, the mesh can be distorted as depicted in Figure 4.
(Further, Figure 3 (bottom) and Figure 7 shows larger meshes simulating lenses.)
The deformation is calculated by a simple iterative physical simulation over time
using an Euler integration [9].
In the context of this work, the SpringLens technique is applied to simulate
a complex superimposition of multiple fish-eye lenses. A moderate resolution is
9 In case a landmark track is removed from the collection, its feature representation
has to be kept to be able to compute facet distances for new tracks. However, the
corresponding “layer” in the cuboid can be removed as for any ordinary track.

chosen with a maximum of 50 cells in each dimension for the overlay mesh which
yields sufficient distortion accuracy while real-time capability is maintained. The
distorted position of the projection points is obtained by barycentric coordinate
transformation with respect to the particle points of the mesh. Additionally,
z-values are derived from the rest-lengths that are used in the visualization to
decide whether an object has to be drawn below or above another one.
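The mass-spring simulation behind the lens can be illustrated with a minimal explicit Euler step. This is not the SpringLens implementation from [9], just a compact illustration of the idea; all parameter values are arbitrary assumptions.

```python
import numpy as np

def springlens_step(pos, vel, edges, rest_len, k=5.0, damping=0.9, dt=0.05):
    """One explicit Euler step of a SpringLens-style particle mesh (sketch).

    pos, vel : (P, 2) particle positions and velocities
    edges    : (E, 2) integer index pairs of spring-connected particles
    rest_len : (E,) spring rest-lengths; shortening them around a focus
               point magnifies that region of the mesh
    """
    d = pos[edges[:, 1]] - pos[edges[:, 0]]              # spring vectors
    length = np.linalg.norm(d, axis=1, keepdims=True)
    # Hooke's law: force proportional to the deviation from the rest-length
    f = k * (length - rest_len[:, None]) * d / np.maximum(length, 1e-9)
    force = np.zeros_like(pos)
    np.add.at(force, edges[:, 0], f)                     # pull particle i towards j
    np.add.at(force, edges[:, 1], -f)                    # pull particle j towards i
    vel = damping * (vel + dt * force)                   # damped Euler update
    return pos + dt * vel, vel
```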

Nearest Neighbor Indexing. For the adaptation of the lens distortion, the
nearest neighbors of a track need to be retrieved. Here, the two major challenges
are:

1. The facet weights are not known at indexing time and thus the index can
only be built using the facet distances.
2. The choice of an appropriate indexing method for each facet depends on the
respective distance measure and the nature of the underlying features.

As the focus lies here on the visualization and not the indexing, only a very basic
approach is taken and further developments are left for future work: A limited list
of nearest neighbors is pre-computed for each track. This way, nearest neighbors can be retrieved by simple lookup in constant time (O(1)). However, updating the lists after a change of the facet weights is computationally expensive. While the resulting delay of the display update is still acceptable for collections with a few thousand tracks, it becomes infeasible for larger N.
For more efficient index structures, it may be possible to apply generic mul-
timedia indexing techniques such as space partition trees [5] or approximate ap-
proaches based on locality sensitive hashing [13] that may even be kernelized [19]
to allow for more complex distance metrics. Another option is to generate mul-
tiple nearest neighbor indexes – each for a different setting of the facet weights
– and interpolate the retrieved result lists w.r.t. the actual facet weights.
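A sketch of the basic pre-computed neighbor lists described above (illustrative only; array and function names are assumptions):

```python
import numpy as np

def build_neighbor_lists(pairwise_facet_dists, weights, aggregate, k=10):
    """Pre-compute the k nearest neighbors of every track (basic approach).

    pairwise_facet_dists : (N, N, l) normalized facet distances between all tracks
    aggregate            : a facet aggregation function (see Section 4.1)
    Returns an (N, k) array of neighbor indices for O(1) lookup at query time.
    """
    dist = aggregate(pairwise_facet_dists, weights)   # (N, N) aggregated distances
    np.fill_diagonal(dist, np.inf)                    # a track is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]            # indices of the k closest tracks
```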

4.4 Visualization Metaphor

The music collection is visualized as a galaxy. Each track is displayed as a star or


as its album cover. The brightness and (to some extent) the hue of stars depend
on a predefined importance measure. The currently used measure of importance
is the track play count obtained from the Last.fm API and normalized to [0, 1]
(by dividing by the maximum value). However, this could also be substituted
by a more sophisticated measure, e.g. based on (user) ratings, chart positions
or general popularity. The size and the z-order (i.e. the order of objects along
the z-axis) of the objects depend on their distortion z-values. Optionally, the
SpringLens mesh overlay can be displayed. The visualization then resembles the
space-time distortions well known from gravitational and relativistic physics.

4.5 Filtering

In order to reduce the amount of information displayed at a time, an addi-


tional filtering step is introduced as depicted in Figure 3 (bottom). The user

Fig. 5. Available filter modes: collapse all (top left), focus (top right), sparse (bottom
left), expand all (bottom right). The SpringLens mesh overlay is hidden.

can choose between different filters that decide whether a track is displayed col-
lapsed or expanded – i.e. as a star or album cover respectively. While album
covers help for orientation, the displayed stars give information about the data
distribution. Trivial filters are those displaying no album covers (collapseAll) or
all (expandAll). Apart from collapsing or expanding all tracks, it is possible to
expand only those tracks in magnified regions (i.e. with a z-level above a pre-
defined threshold) or to apply a sparser filter. The results of using these filter
modes are shown in Figure 5.
A sparser filter selects only a subset of the collection to be expanded that
is both sparse (well distributed) and representative. Representative tracks are
those with a high importance (described in Section 4.4). The first sparser version
used a Delaunay triangulation and was later substituted by a raster-based ap-
proach that produces more appealing results in terms of the spatial distribution
of displayed covers.

Originally, the set of expanded tracks was updated after any position changes
caused by the distortion overlay. However, this was considered irritating during
early user tests and the sparser strategy was changed to update only if the
projection or the displayed region changes.
Delaunay Sparser Filter. This sparser filter constructs a Delaunay triangula-
tion incrementally top-down starting with the track with the highest importance
and some virtual points at the corners of the display area. Next, the size of all
resulting triangles given by the radius of their circumcircle is compared with
a predefined threshold sizemin . If the size of a triangle exceeds this threshold,
the most important track within this triangle is chosen for display and added
as a point for the triangulation. This process continues recursively until no tri-
angle that exceeds sizemin contains any more tracks that could be added. All
tracks belonging to the triangulation are then expanded (i.e. displayed as album
thumbnail).
The Delaunay triangulation can be computed in O(n log n) and the number
of triangles is at most O(n) with n ≪ N being the number of actually displayed
album cover thumbnails. To reduce lookup time, projected points are stored
in a quadtree data structure [5] and sorted by importance within the tree’s
quadrants. A triangle’s size may change through distortion caused by the multi-
focal zoom. This change may trigger an expansion of the triangle or a removal
of the point that caused its creation originally. Both operations are propagated
recursively until all triangles meet the size condition again. Figure 3 (bottom)
shows a triangulation and the resulting display for a (distorted) projection of a
collection.
Raster Sparser Filter. The raster sparser filter divides the display into a grid
of quadratic cells. The size of the cells depends on the screen resolution and
the minimal display size of the album covers. Further, it maintains a list of the
tracks ranked by importance that is precomputed and only needs to be updated
when the importance values change. On an update, the sparser runs through
its ranked list. For each track it determines the respective grid cell. If the cell
and the surrounding cells are empty, the track is expanded and its cell blocked.
(Checking surrounding cells avoids image overlap. The necessary radius for the
surrounding can be derived from the cell and cover sizes.)
The computational complexity of this sparser approach is linear in the number
of objects to be considered but also depends on the radius of the surrounding
that needs to be checked. The latter can be reduced by using a data structure
for the raster that has O(1) look-up complexity but higher costs for insertions
which happen far less frequently. This approach further has the nice property that it handles the most important objects first and thus returns a useful result even if interrupted.
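A compact sketch of the raster sparser logic described above (illustrative; the helper names and the blocking radius are assumptions):

```python
def raster_sparser(positions, ranked_tracks, cell_size, radius=1):
    """Select tracks to expand: most important first, one per free grid cell.

    positions     : dict mapping track id -> (x, y) display coordinates
    ranked_tracks : track ids sorted by descending importance (precomputed)
    """
    blocked, expanded = set(), []
    for track in ranked_tracks:
        x, y = positions[track]
        cx, cy = int(x // cell_size), int(y // cell_size)
        # check the cell and its surrounding cells to avoid thumbnail overlap
        neighborhood = [(cx + dx, cy + dy)
                        for dx in range(-radius, radius + 1)
                        for dy in range(-radius, radius + 1)]
        if not any(cell in blocked for cell in neighborhood):
            expanded.append(track)
            blocked.add((cx, cy))
        # processing by importance means an interrupted run is still useful
    return expanded
```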

5 Interaction
While the previous section covered the underlying techniques, this section de-
scribes how users can interact with the user-interface that is built on top of

them. Figure 6 shows a screenshot of the MusicGalaxy prototype.10 It allows sev-


eral ways of interacting with the visualization: Users can explore the collection
through common panning & zooming (Section 5.1). Alternatively, they can use
the adaptive multi-focus technique introduced with this prototype (Section 5.2).
Further, they can change the facet aggregation function parameters and this way
adapt the view on the collection according to their preferences (Section 5.3).
Hovering over a track displays its title and a double-click starts the playback
that can be controlled by the player widget at the bottom of the interface.
Apart from this, several display parameters can be changed such as the filtering
mode (Section 4.5), the size of the displayed album covers or the visibility of the
SpringLens overlay mesh.

5.1 Panning and Zooming

These are very common interaction techniques that can e.g. be found in programs
for geo-data visualization or image editing that make use of the map metaphor.
Panning shifts the displayed region whereas zooming decreases or increases it.
(This does not affect the size of the thumbnails which can be controlled sepa-
rately using the PageUp and PageDn keys.) Using the keyboard, the user can pan
with the cursor keys and zoom in and out with + and – respectively. Alterna-
tively, the mouse can be used: Clicking and holding the left button while moving
the mouse pans the display. The mouse wheel controls the zoom level. If not the
whole collection can be displayed, an overview window indicating the current
section is shown in the top left corner, otherwise it is hidden. Clicking into the
overview window centers the display around the respective point. Further, the
user can drag the section indicator around which also results in panning.

5.2 Focusing

This interaction technique makes it possible to visualize – and to some extent alleviate –


the neighborhood distortions introduced by the dimensionality reduction during
the projection. The approach is based on a multi-focus fish-eye lens that is
implemented using the SpringLens distortion technique (Section 4.3). It consists
of a user-controlled primary focus and a neighborhood-driven secondary focus.
The primary focus is a common fish-eye lens. By moving this lens around
(holding the right mouse button), the user can zoom into regions of interest. In
contrast to the basic linear zooming function described in Section 5.1,
this leads to a nonlinear distortion of the projection. As a result, the region of
interest is enlarged making more space to display details. At the same time,
less interesting regions are compacted. This way, the user can closely inspect
the region of interest without losing the overview as the field of view is not
narrowed (as opposed to the linear zoom). The magnification factor of the lens
can be changed using the mouse wheel while holding the right mouse button. The
10 A demo video is available at: http://www.dke-research.de/aucoma

Fig. 6. Screenshot of the MusicGalaxy prototype with visible overview window (top left), player (bottom) and SpringLens mesh overlay (blue). In this example, a strong album effect can be observed, as for the track in primary focus four tracks of the same album are nearest neighbors in secondary focus.

Fig. 7. SpringLens distortion with only primary focus (left) and additional secondary
focus (right)

visual effect produced by the primary zoom resembles a 2-dimensional version


of the popular “cover flow” effect.
The secondary focus consists of multiple such fish-eye lenses. These lenses are
smaller and cannot be controlled by the user but are automatically adapted
depending on the primary focus. When the primary focus changes, the neighbor
index (Section 4.3) is queried with the track closest to the center of focus. If
nearest neighbors are returned that are not in the primary focus, secondary
lenses are added at the respective positions. As a result, the overall distortion
of the projection brings the distant nearest neighbors back closer to the focused
region of interest. Figure 7 shows the primary and secondary focus with visible
SpringLens mesh overlay.
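The secondary-focus update just described can be summarized in a short sketch. The helper names (lens object, neighbor index, primary radius) are hypothetical and do not reflect the prototype's actual code.

```python
def update_secondary_focus(lens, focus_track, neighbor_index, positions,
                           primary_radius):
    """Place secondary lenses at distant nearest neighbors of the focused track."""
    lens.clear_secondary_foci()
    fx, fy = positions[focus_track]
    for neighbor in neighbor_index[focus_track]:
        nx, ny = positions[neighbor]
        # only neighbors outside the primary focus region get their own lens
        if ((nx - fx) ** 2 + (ny - fy) ** 2) ** 0.5 > primary_radius:
            lens.add_secondary_focus(nx, ny)
```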
As it can become tiring to hold the right mouse button while moving the
focus around, the latest prototype introduces a focus lock mode (toggled with the
return key). In this mode, the user clicks once to start a focus change and a second
time to freeze the focus. To indicate that the focus is currently being changed
(i.e. mouse movement will affect the focus), an icon showing a magnifying glass
is displayed in the lower left corner. The secondary focus is by default always
updated instantly when the primary focus changes. This behavior can be disabled
resulting only in an update of the secondary focus once the primary focus does
not change anymore.

5.3 Adapting the Aggregation Functions


Two facet control panels allow the user to adapt two facet distance aggregation functions
by choosing one of the function types listed in Section 4.1 (from a drop-down
menu) and adjusting weights for the individual facets (through sliders). The con-
trol panels are hidden in the screenshot (Figure 6) but shown in Figure 3 (bottom)
that depicts the user-interaction. The first facet distance aggregation function

is applied to derive the track-landmark distances from the facet distance cuboid
(Section 4.2). These distances are then used to compute the projection of the
collection. The second facet distance aggregation function is applied to identify
the nearest neighbors of a track and thus indirectly controls the secondary focus.
Changing the aggregation parameters results in a near real-time update of
the display so that the impact of the change becomes immediately visible: In
case of the parameters for the nearest neighbor search, some secondary focus
region may disappear while somewhere else a new one appears with tracks now
considered more similar. Here, the transitions are visualized smoothly due to
the underlying physical simulation of the SpringLens grid. In contrast to this, a
change of the projection similarity parameters has a more drastic impact on the
visualization possibly resulting in a complete re-arrangement of all tracks. This
is because the LMDS projection technique produces solutions that are unique
only up to translation, rotation, and reflection and thus, even a small parameter
change may, e.g., flip the visualization. As this may confuse users, one direction
of future research is to investigate how the position of the landmarks can be
constrained during the projection to produce more gradual changes.
The two facet distance aggregation functions are linked by default as it is most
natural to use the same distance measure for projection and neighbor retrieval.
However, unlinking them and using e.g. orthogonal distance measures can lead
to interesting effects: For instance, one may choose to compute the collection projection
based solely on acoustic facets and find nearest neighbors for the secondary
focus through lyrics similarity. Such a setting would help to uncover tracks with
a similar topic that (most likely) sound very different.

6 Evaluation
The development of MusicGalaxy followed a user-driven design approach [31] by
iteratively alternating between development and evaluation phases. The first pro-
totype [47] was presented at CeBIT 2010 11 , a German trade fair specializing in information technology, in early March 2010. During the fair, feedback was
collected from a total of 112 visitors aged between 16 and 63 years. The general
reception was very positive. The projection-based visualization was generally
welcomed as an alternative to common list views. However, some remarked that
additional semantics of the two display axes would greatly improve orientation.
Young visitors particularly liked the interactivity of the visualization whereas
older ones tended to have problems with this. They stated that the reason lay
in the amount of information displayed which could still be overwhelming. To
address the problem, they proposed to expand only tracks in focus, increase the
size of objects in focus (compared to the others) and hide the mesh overlay as
the focus would be already visualized by the expanded and enlarged objects. All
of these proposals have been integrated into the second prototype.
The second prototype was tested thoroughly by three testers. During these
tests, the eye movements of the users were recorded with a Tobii T60
11 http://www.cebit.de

eye-tracker that captures where and for how long the gaze of the participants rests (such resting points are referred to as “fixation points”). Using the adaptive SpringLens focus, the mouse generally followed the gaze, which scanned the border of the focus in order to decide on the direction to explore further. This resulted in a much smoother gaze trajectory than the one observed during usage of panning and zooming, where the gaze frequently switched between the overview window and the objects of interest – so as not to lose orientation. This indicates that the
proposed approach is less tiring for the eyes. However, the testers criticized the
controls used to change the focus – especially having to hold the right mouse
button all the time. This lead to the introduction of the focus lock mode and
several minor interface improvements in the third version of the prototype [48]
that are not explicitly covered here.
The remainder of this section describes the evaluation of the third Music-
Galaxy prototype in a user study [45] with the aim to prove that the user-interface indeed helps during exploration. Screencasts of 30 participants
solving an exploratory retrieval task were recorded together with eye-tracking
data (again using a Tobii T60 eye-tracker) and web cam video streams. This
data was used to identify emerging search strategies among all users and to
analyze to what extent the primary and secondary focus was used. Moreover,
first-hand impressions of the usability of the interface were gathered by letting
the participants say aloud whatever they think, feel or remark as they go about
their task (think-aloud protocol).
In order to ease the evaluation, the study was not conducted with the origi-
nal MusicGalaxy user-interface prototype but with a modified version, depicted in Figure 8, that can handle photo collections. It relies on widely used MPEG-7
visual descriptors (EdgeHistogram, ScalableColor and ColorLayout ) [27,25] to
compute the visual similarity (see [49] for further details) – replacing the origi-
nally used music features and respective similarity facets. Using photo collections
for evaluation instead of music has several advantages: It can be assured that none of the participants knows any of the photos in advance, which could otherwise introduce some bias. With music, this would be much harder to
realize. Furthermore, similarity and relevance of photos can be assessed in an
instant. This is much harder for music tracks and requires additional time for
listening – especially if the tracks are previously unknown.
The following four questions were addressed in the user study:

1. How does the lens-based user-interface compare in terms of usability to common panning & zooming techniques that are very popular in interfaces using a map metaphor (such as Google Maps, https://maps.google.com)?
2. How much do users actually use the secondary focus or would a common
fish-eye distortion (i.e. only the primary focus) be sufficient?
3. What interaction patterns do emerge?
4. What can be improved to further support the user and increase user
satisfaction?

Fig. 8. PhotoGalaxy – a modified version of MusicGalaxy for browsing photo collections that was used during the evaluation (color scheme inverted)

To answer the first question, participants compared a purely SpringLens-based user-interface with a common panning & zooming interface and additionally a combina-
tion of both. For questions 2 and 3, the recorded interaction of the participants
with the system was analyzed in detail. Answers to question 4 were collected
by asking the users directly for missing functionality. Section 6.1 addresses the
experimental setup in detail and Section 6.2 discusses the results.

6.1 Experimental Setup


At the beginning of the experiment, the participants were asked several questions
to gather general information about their background. Afterwards, they were
presented four image collections (described below) in fixed order. On the first
collection, a survey supervisor gave a guided introduction to the interface and the
possible user actions. Each participant could spend as much time as needed to get used to the interface. Once the participant was familiar with the controls, she
or he continued with the other collections for which a retrieval task (described
below) had to be solved without the help of the supervisor. At this point, the
participants were divided into two groups. The first group used only panning
& zooming (P&Z) as described in Section 5.1 on the second collection and only
the SpringLens functionality (SL) described in Section 5.2 on the third one. The
other group started with SL and then used P&Z. The order of the datasets stayed
the same for both groups. (This way, effects caused by the order of the approaches
and slightly varying difficulties among the collections are avoided.) The fourth
collection could then be explored using both P&Z and SL. (The functionality

Table 2. Photo collections and topics used during the user study

Collection            Topics (number of images)
Melbourne & Victoria  –
Barcelona             Tibidabo (12), Sagrada Família (31), Stone Hallway in Park Güell (13), Beach & Sea (29), Casa Milà (16)
Japan                 Owls (10), Torii (8), Paintings (8), Osaka Aquarium (19), Traditional Clothing (35)
Western Australia     Lizards (17), Aboriginal Art (9), Plants (Macro) (17), Birds (21), Ningaloo Reef (19)

for adapting the facet distance aggregation functions described in Section 5.3 was
deactivated for the whole experiment.) After the completion of the last task,
the participants were asked to assess the usability of the different approaches.
Furthermore, feedback was collected pointing out, e.g., missing functionality.

Test Collections. Four image collections were used during the study. They were drawn from a personal photo collection of the authors. (The collections and topic annotations are publicly available under the Creative Commons Attribution-Noncommercial-Share Alike license, https://creativecommons.org/licenses/by-nc-sa/3.0/; please contact [email protected].) Each collection comprises 350 images – except the first collection (used for the introduction of the user-interface), which only contains 250 images. All images were scaled down
to fit 600x600 pixels. For each of the collections 2 to 4, five non-overlapping topics
were chosen and the images annotated accordingly. These annotations served as
ground truth and were not shown to the participants. Table 2 shows the topics
for each collection. In total, 264 of the 1050 images belong to one of the 15 topics.

Retrieval Task. For the collections 2 to 4, the participants had to find five (or
more) representative images for each of the topics listed in Table 2. As guidance,
handouts were prepared that showed the topics – each one printed in a different color – an optional brief description, and two or three sample images giving an impression of what to look for. Images representing a topic had to be marked with the topic's color. This was done by double clicking on the thumbnail, which opened a floating dialog window presenting the image at a larger scale and allowing the participant to assign the image to a predefined topic by clicking a corresponding
button. As a result, the image was marked with the color representing the topic.
Further, the complete collection could be filtered by highlighting all thumbnails
classified to one topic. This was done by pressing the numeric key (1 to 5) for the
respective topic number. Highlighting was done by focusing a fish-eye lens on
every marked topic member and thus enlarging the corresponding thumbnails.
It was pointed out that the decision whether an image was representative for
a group was solely up to the participant and not judged otherwise. There was
no time limit for the task. However, the participants were encouraged to skip to


the next collection after approximately five minutes, as by then enough information would already have been collected.

Tweaking the Nearest Neighbor Index. In the original implementation, at most five nearest neighbors are retrieved with the additional constraint that their
distance to the query object has to be in the 1-percentile of all distances in the
collection. (This avoids returning nearest neighbors that are not really close.) 264
of the 1050 images belonging to collections 2 to 4 have a ground truth topic label.
For only 61 of these images, one or more of the five nearest neighbors belonged to
the same topic, and only in these cases would the secondary focus have displayed something helpful for the given retrieval task. This led us to conclude that the feature
descriptors used were not sophisticated enough to capture the visual intra-topic
similarity. A lot more work would have been involved to improve the features –
but this would have been beyond the scope of the study that aimed to evaluate
the user-interface and most specifically the secondary focus which differentiates
our approach from the common fish-eye techniques. In order not to have the user
evaluate the underlying feature representation and the respective similarity met-
ric, we modified the index for the experiment: Every time the index was queried with an image that had a ground truth annotation, the two most similar images from the respective topic were injected into the returned list of nearest neighbors. This ensured that the secondary focus would contain some relevant images.
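A minimal sketch of this tweak (the helper names knn_query, topic_of and same_topic_ranking are hypothetical stand-ins, not the actual implementation):

```python
def tweaked_knn(query_img, knn_query, topic_of, same_topic_ranking, k=5):
    """Query the index, then inject the two most similar same-topic images.

    knn_query(img, k)        -- plain nearest-neighbor search on the visual features
    topic_of(img)            -- ground-truth topic label of img, or None
    same_topic_ranking(img)  -- images of topic_of(img), most similar to img first
    """
    neighbors = knn_query(query_img, k)
    if topic_of(query_img) is None:          # no ground truth: return the plain result
        return neighbors
    injected = [img for img in same_topic_ranking(query_img) if img != query_img][:2]
    rest = [img for img in neighbors if img not in injected]
    return (injected + rest)[:k]             # at most k results, injected images guaranteed
```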

6.2 Results
The user study was conducted with 30 participants – all of them graduate or
post-graduate students. Their age was between 19 and 32 years (mean 25.5)
and 40% were female. Most of the test persons (70%) were computer science
students, with half of them having a background in computer vision or user
interface design. 43% of the participants stated that they take photos on a regular
basis and 30% use software for archiving and sorting their photo collection. The
majority (77%) declared that they are open to new user interface concepts.

Usability Comparison. Figure 9 shows the results from the questionnaire comparing the usability and helpfulness of the SL approach with the baseline P&Z.
What becomes immediately evident is that half of the participants rated the SL interface as significantly more helpful than the simple P&Z interface while being equally complicated in use. The intuitiveness of SL was surprisingly rated slightly better than that of the P&Z interface, which is an interesting outcome since we expected users to be more familiar with P&Z as it is more common in today's user interfaces (e.g. Google Maps). This, however, suggests that a fish-eye lens can be regarded as an intuitive means of interacting with large collections. The combination of both got even better ratings but has to be considered noncompetitive here, as it could have had an advantage by always being the last interface used. Participants had already had time to get used to handling the two complementary interfaces. Moreover, since the collection did not change between groups as it did for P&Z and SL, the combined interface might have had the advantage of being applied to a possibly easier collection – with topics being better distributed or a slightly better working similarity measure, so that images of the same topic are found more easily.

Fig. 9. Usability comparison of common panning & zooming (P&Z), adaptive SpringLens (SL) and the combination of both, rated for helpfulness, simplicity and intuitiveness. Ratings were on a 7-point scale where 7 is best. The box plots show minimum, maximum, median and quartiles for N = 30.

Table 3. Percentage of marked images (N = 914) categorized by focus region and topic of the image in primary focus at the time of marking

focus region   primary   ext. primary   secondary   none
same topic       37.75           4.27       30.74   4.38
other topic          –           4.49       13.24   2.08
no focus             –              –           –   3.06
total            37.75           8.75       43.98   9.52


Usage of Secondary Focus. For this part, we restrict ourselves to the interaction with the last photo collection, where both P&Z and the lens could be used and the participants had had plenty of time (approximately 15 to 30 minutes depending on the user) for practice. The question to be answered is how much the users actually made use of the secondary focus, which always contained some relevant images if the image in primary focus had a ground truth annotation (the ground truth annotations were never visible to the users).
For each image marked by a participant, the location of the image at the time
of marking was determined. There are four possible regions: primary focus (only
the central image), extended primary focus (region covered by primary lens ex-
cept primary focus image), secondary focus and the remaining region. Further,
there are up to three cases for each region with respect to the (user-annotated or
ground truth) topic of the image in primary focus. Table 3 shows the frequencies
of the resulting eight possible cases. (Some combinations are impossible. E.g.,
the existence of a secondary focus implies some image in primary focus.) The
most interesting number is the one referring to images in secondary focus that
belong to the same topic as the primary because this is what the secondary focus

is supposed to bring up. It comes close to the percentage of the primary focus
that – not surprisingly – is the highest. Ignoring the topic, (extended) primary
and secondary focus contribute almost equally, and less than 10% of the marked images were not in focus – i.e., discovered only through P&Z.

Emerging Search Strategies. For this part we again analyze only interac-
tion with the combined interface. A small group of participants excessively used
P&Z. They increased the initial thumbnail size in order to better perceive the
depicted contents and chose to display all images as thumbnails. To reduce the
overlap of thumbnails, they operated on a deeper zoom level and therefore had
to pan a lot. The gaze data shows a tendency towards systematic sequential scans, which were however difficult due to the scattered and irregular arrangement of the thumbnails. Further, some participants occasionally marked images not in focus because they were attracted by dominant colors (e.g. for the aquarium topic). Another typical strategy was to quickly scan through the collection by moving the primary focus – typically with a small thumbnail size and at a zoom level that showed most of the collection except the outer regions. In this case, attention was mostly on the (extended) primary focus region, with the gaze scanning for the direction to explore further, and little to moderate attention on the
secondary focus. Occasionally, participants would freeze the focus or slow down
for some time to scan the whole display. In contrast to this rather continuous
change of the primary focus, there was a group of participants who browsed the collection mostly by moving (in a single click) the primary focus to some secondary focus region – much like navigating an invisible neighborhood graph. Here, attention was concentrated on the secondary focus regions.

User Feedback. Many participants had problems with an overcrowded primary fish-eye in dense regions. This was alleviated by temporarily zooming into the region, which lets the images drift further apart. However, there are possibilities that require less interaction, such as automatically spreading the thumbnails in focus with force-based layout techniques. When working on deeper zoom levels where only a small part of the collection is visible, the secondary focus was considered mostly useless as it was usually out of view. Further work could therefore investigate off-screen visualization techniques to facilitate awareness of and quick navigation to secondary focus regions out of view and better integrate P&Z and
SL. The increasing “empty space” at deep zoom levels should be avoided – e.g.
by automatically increasing the thumbnail size as soon as all thumbnails can be
displayed without overlap. An optional re-arrangement of the images in view into
a grid layout may ease sequential scanning as preferred by some users. Another
proposal was to visualize which regions have already been explored similar to the
(optionally time-restricted) “fog of war” used in strategy computer games. Some
participants would welcome advanced filtering options such as a prominent color
filter. An undo function or reverse playback of focus movement would be desir-
able and could easily be implemented by maintaining a list of the last images
in primary focus. Finally, some participants remarked that it would be nice to
generate the secondary focus for a set of images (belonging to the same topic).

In fact, it is even possible to adapt the similarity metric used for the nearest
neighbor queries automatically to the task of finding more images of the same topic, as shown in recent experiments [49]. This opens an interesting research
direction for future work.

7 Conclusion
A common approach for exploratory retrieval scenarios is to start with an over-
view from where the user can decide which regions to explore further. The focus-
adaptive SpringLens visualization technique described in this paper addresses
the following three major problems that arise in this context:
1. Approaches that rely on dimensionality reduction techniques to project the
collection from high-dimensional feature space onto two dimensions inevitably
face projection errors: Some tracks will appear closer than they actually are and, conversely, some tracks that are distant in the projection may in
fact be neighbors in the original space.
2. Displaying all tracks at once becomes infeasible for large collections because
of limited display space and the risk of overwhelming the user with the
amount of information displayed.
3. There is more than one way to look at a music collection – or more specifically
to compare two music pieces based on their features. Each user may have a
different way and a retrieval system should account for this.
The first problem is addressed by introducing a complex distortion of the vi-
sualization that adapts to the user’s current region of interest and temporarily
alleviates possible projection errors in the focused neighborhood. The amount
of displayed information can be adapted by the application of several sparser
filters. Concerning the third problem, the proposed user-interface allows users
to (manually) adapt the underlying similarity measure used to compute the ar-
rangement of the tracks in the projection of the collection. To this end, weights
can be specified that control the importance of different facets of music similarity
and further an aggregation function can be chosen to combine the facets.
Following a user-centered design approach with focus on usability, a prototype
system has been created by iteratively alternating between development and
evaluation phases. For the final evaluation, an extensive user study including
gaze analysis using an eye-tracker has been conducted with 30 participants. The
results prove that the proposed interface is helpful while at the same time being
easy and intuitive to use.

Acknowledgments
This work was supported in part by the German National Merit Foundation,
the German Research Foundation (DFG) under the project AUCOMA, and the
European Commission under FP7-ICT-2007-C FET-Open, contract no. BISON-
211898. The user study was conducted in collaboration with Christian Hentschel

who also took care of the image feature extraction. The authors would further like
to thank all testers and the participants of the study for their time and valuable
feedback for further development, Tobias Germer for sharing his ideas and code
of the original SpringLens approach [9], Sebastian Loose who has put a lot of
work into the development of the filter and zoom components, the developers of
CoMIRVA [40] and JAudio [28] for providing their feature extractor code and
George Tzanetakis for providing insight into his MIREX ’07 submission [52].
The Landmark MDS algorithm has been partly implemented using the MDSJ
library [1].

References
1. Algorithmics Group: MDSJ: Java library for multidimensional scaling (version 0.2),
University of Konstanz (2009)
2. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky?
Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
3. Baumann, S., Halloran, J.: An ecological approach to multimodal subjective music
similarity perception. In: Proc. of 1st Conf. on Interdisciplinary Musicology (CIM
2004), Graz, Austria (April 2004)
4. Cano, P., Kaltenbrunner, M., Gouyon, F., Batlle, E.: On the use of fastmap for
audio retrieval and browsing. In: Proc. of the 3rd Int. Conf. on Music Information
Retrieval (ISMIR 2002) (2002)
5. De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational geom-
etry: algorithms and applications. Springer, New York (2008)
6. Diakopoulos, D., Vallis, O., Hochenbaum, J., Murphy, J., Kapur, A.: 21st century
electronica: Mir techniques for classification and performance. In: Proc. of the 10th
Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 465–469 (2009)
7. Donaldson, J., Lamere, P.: Using visualizations for music discovery. Tutorial at the
10th Int. Conf. on Music Information Retrieval (ISMIR 2009) (October 2009)
8. Gasser, M., Flexer, A.: Fm4 soundpark: Audio-based music recommendation in
everyday use. In: Proc. of the 6th Sound and Music Computing Conference (SMC
2009), Porto, Portugal (2009)
9. Germer, T., Götzelmann, T., Spindler, M., Strothotte, T.: Springlens: Distributed
nonlinear magnifications. In: Eurographics 2006 - Short Papers, pp. 123–126. Eu-
rographics Association, Aire-la-Ville (2006)
10. Gleich, M.R.D., Zhukov, L., Lang, K.: The World of Music: SDP layout of high
dimensional data. In: Info Vis 2005 (2005)
11. van Gulik, R., Vignoli, F.: Visual playlist generation on the artist map. In: Proc.
of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 520–523
(2005)
12. Hitchner, S., Murdoch, J., Tzanetakis, G.: Music browsing using a tabletop display.
In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp.
175–176 (2007)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the
curse of dimensionality. In: Proc. of the 13th ACM Symposium on Theory of Com-
puting (STOC 1998), pp. 604–613. ACM, New York (1998)
14. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)

15. Julia, C.F., Jorda, S.: SongExplorer: a tabletop application for exploring large
collections of songs. In: Proc. of the 10th Int. Conf. on Music Information Retrieval
(ISMIR 2009), pp. 675–680 (2009)
16. Knees, P., Pohle, T., Schedl, M., Widmer, G.: Exploring Music Collections in Vir-
tual Landscapes. IEEE MultiMedia 14(3), 46–54 (2007)
17. Kohonen, T.: Self-organized formation of topologically correct feature maps. Bio-
logical Cybernetics 43(1), 59–69 (1982)
18. Kruskal, J., Wish, M.: Multidimensional Scaling. Sage, Thousand Oaks (1986)
19. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image
search. In: Proc. 12th Int. Conf. on Computer Vision (ICCV 2009) (2009)
20. Leitich, S., Topf, M.: Globe of music - music library visualization using geosom.
In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp.
167–170 (2007)
21. Lillie, A.S.: MusicBox: Navigating the space of your music. Master’s thesis, MIT
(2008)
22. Lloyd, S.: Automatic Playlist Generation and Music Library Visualisation with
Timbral Similarity Measures. Master’s thesis, Queen Mary University of London
(2009)
23. Lübbers, D.: SoniXplorer: Combining visualization and auralization for content-
based exploration of music collections. In: Proc. of the 6th Int. Conf. on Music
Information Retrieval (ISMIR 2005), pp. 590–593 (2005)
24. Lübbers, D., Jarke, M.: Adaptive multimodal exploration of music collections. In:
Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp.
195–200 (2009)
25. Lux, M.: Caliph & emir: Mpeg-7 photo annotation and retrieval. In: Proc. of the
17th ACM Int. Conf. on Multimedia (MM 2009), pp. 925–926. ACM, New York
(2009)
26. Mandel, M., Ellis, D.: Song-level features and support vector machines for music
classification. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR
2005), pp. 594–599 (2005)
27. Martinez, J., Koenen, R., Pereira, F.: MPEG-7: The generic multimedia content
description standard, part 1. IEEE MultiMedia 9(2), 78–87 (2002)
28. McEnnis, D., McKay, C., Fujinaga, I., Depalle, P.: jAudio: A feature extraction
library. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR
2005), pp. 600–603 (2005)
29. Mörchen, F., Ultsch, A., Nöcker, M., Stamm, C.: Databionic visualization of music
collections according to perceptual distance. In: Proc. of the 6th Int. Conf. on
Music Information Retrieval (ISMIR 2005), pp. 396–403 (2005)
30. Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and PocketSOMPlayer, al-
ternative interfaces to large music collections. In: Proc. of the 6th Int. Conf. on
Music Information Retrieval (ISMIR 2005), pp. 618–623 (2005)
31. Nielsen, J.: Usability engineering. In: Tucker, A.B. (ed.) The Computer Science
and Engineering Handbook, pp. 1440–1460. CRC Press, Boca Raton (1997)
32. Nürnberger, A., Klose, A.: Improving clustering and visualization of multimedia
data using interactive user feedback. In: Proc. of the 9th Int. Conf. on Information
Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU
2002), pp. 993–999 (2002)
33. Oliver, N., Kreger-Stickles, L.: PAPA: Physiology and purpose-aware automatic
playlist generation. In: Proc. of the 7th Int. Conf. on Music Information Retrieval
(ISMIR 2006) (2006)

34. Pampalk, E., Dixon, S., Widmer, G.: Exploring music collections by browsing dif-
ferent views. In: Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR
2003), pp. 201–208 (2003)
35. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization
of music archives. In: Proc. of the 10th ACM Int. Conf. on Multimedia (MULTI-
MEDIA 2002), pp. 570–579. ACM Press, New York (2002)
36. Pauws, S., Eggen, B.: PATS: Realization and user evaluation of an automatic
playlist generator. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval
(ISMIR 2002) (2002)
37. Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic models and self-
organizing maps to create a hierarchical structuring of music by musical styles. In:
Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
38. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval.
Information Processing & Management 24(5), 513–523 (1988)
39. Sarmento, L., Gouyon, F., Costa, B., Oliveira, E.: Visualizing networks of music
artists with RAMA. In: Proc. of the Int. Conf. on Web Information Systems and
Technologies, Lisbon (2009)
40. Schedl, M.: The CoMIRVA Toolkit for Visualizing Music-Related Data. Technical
report, Johannes Kepler University Linz (June 2006)
41. Shneiderman, B.: Tree visualization with tree-maps: 2-d space-filling approach.
ACM Trans. Graph 11(1), 92–99 (1992)
42. de Silva, V., Tenenbaum, J.: Sparse multidimensional scaling using landmark
points. Tech. rep., Stanford University (2004)
43. de Silva, V., Tenenbaum, J.B.: Global versus local methods in nonlinear dimensionality reduction. In: Advances in Neural Information Processing Systems (NIPS 2002), pp. 705–712 (2002)
44. Stavness, I., Gluck, J., Vilhan, L., Fels, S.S.: The mUSICtable: A map-based ubiq-
uitous system for social interaction with a digital music collection. In: Kishino,
F., Kitamura, Y., Kato, H., Nagata, N. (eds.) ICEC 2005. LNCS, vol. 3711, pp.
291–302. Springer, Heidelberg (2005)
45. Stober, S., Hentschel, C., Nürnberger, A.: Evaluation of adaptive springlens - a
multi-focus interface for exploring multimedia collections. In: Proc. of the 6th
Nordic Conference on Human-Computer Interaction (NordiCHI 2010), Reykjavik,
Iceland (October 2010)
46. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of
music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 53–65. Springer, Heidelberg (2010)
47. Stober, S., Nürnberger, A.: A multi-focus zoomable interface for multi-facet ex-
ploration of music collections. In: Proc. of the 7th Int. Symposium on Computer
Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 339–354 (June
2010)
48. Stober, S., Nürnberger, A.: MusicGalaxy - an adaptive user-interface for ex-
ploratory music retrieval. In: Proc. of the 7th Sound and Music Computing Con-
ference (SMC 2010), Barcelona, Spain, pp. 382–389 (July 2010)
49. Stober, S., Nürnberger, A.: Similarity adaptation in an exploratory retrieval sce-
nario. In: Detyniecki, M., Knees, P., Nürnberger, A., Schedl, M., Stober, S. (eds.)
Post-Proceedings of the 8th International Workshop on Adaptive Multimedia Re-
trieval (AMR 2010), Linz, Austria (2010)

50. Stober, S., Steinbrecher, M., Nürnberger, A.: A survey on the acceptance of listen-
ing context logging for mir applications. In: Baumann, S., Burred, J.J., Nürnberger,
A., Stober, S. (eds.) Proc. of the 3rd Int. Workshop on Learning the Semantics of
Audio Signals (LSAS), Graz, Austria, pp. 45–57 (December 2009)
51. Torrens, M., Hertzog, P., Arcos, J.L.: Visualizing and exploring personal music
libraries. In: Proc. of the 5th Int. Conf. on Music Information Retrieval (ISMIR
2004) (2004)
52. Tzanetakis, G.: Marsyas submission to MIREX 2007. In: Proc. of the 8th Int. Conf.
on Music Information Retrieval (ISMIR 2007) (2007)
53. Vignoli, F., Pauws, S.: A music retrieval system based on user driven similarity
and its evaluation. In: Proc. of the 6th Int. Conf. on Music Information Retrieval
(ISMIR 2005), pp. 272–279 (2005)
54. Whitman, B., Ellis, D.: Automatic record reviews. In: Proc. of the 5th Int. Conf.
on Music Information Retrieval (ISMIR 2004) (2004)
55. Williams, C.K.I.: On a connection between kernel pca and metric multidimensional
scaling. Machine Learning 46(1-3), 11–19 (2002)
56. Wolter, K., Bastuck, C., Gärtner, D.: Adaptive user modeling for content-based
music retrieval. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 40–52. Springer, Heidelberg (2010)
A Database Approach to Symbolic Music
Content Management

Philippe Rigaux¹ and Zoe Faget²

¹ Cnam, Paris – [email protected]
² Lamsade, Univ. Paris-Dauphine – [email protected]

Abstract. The paper addresses the problem of content-based access to large repositories of digitized music scores. We propose a data model and
query language that allow an in-depth management of musical content.
In order to cope with the flexibility of music material, the language is
designed to easily incorporate user-defined functions at early steps of the
query evaluation process. We describe our architectural vision, develop a
formal description of the language, and illustrate a user-friendly syntax
with several classical examples of symbolic music information retrieval.

Keywords: Digital Libraries, Data Model, Time Series, Musicological Information Management.

1 Introduction
The presence of music on the web has grown exponentially over the past decade.
Music comes in multiple representations (audio files, MIDI files, printable music scores...) and is easily accessible through numerous platforms. Given the availability of several compact formats, the main representation is by means of audio files, which provide immediate access to music content and are easily spread, sampled and listened to. However, extracting structured information from an audio file is a difficult (if not impossible) task, since it is subject to the subjectivity of in-
terpretation. On the other hand, symbolic music representation, usually derived
from musical scores, enables exploitation scenarios different from what audio files
may offer. The very detailed and unambiguous description of the music content
is of high interest for communities of music professionals, such as musicologists,
music publishers, or professional musicians. New online communities of users
arise, with an interest in a more in-depth study of music than what average
music lovers may look for.
The interpretation of the music content information (structure of musical
pieces, tonality, harmonic progressions...) combined with meta-data (historic and geographic context, author and composer names...) is a matter of human expertise. Specific music analysis tools have been developed by music professionals for centuries, and should now be scaled to a larger level in order to provide
scientific and efficient analysis of large collections of scores.


Another fundamental need of such online communities, this time shared with
more traditional platforms, is the ability to share content and knowledge, as
well as annotate, compare and correct all this available data. This Web 2.0
space with user-generated content helps improve and accelerate research, with the added bonus of making available to a larger audience sources which would otherwise remain confidential. Questions of copyright, security and controlled contributions raise the usual issues of social networks.
To summarize, a platform designed to be used mainly, but not only, by music
professionals, should offer classic features such as browsing and rendering, but
also the ability to upload new content, annotate scores, search by content (exact and similarity search), and several tools for music content manipulation
and analysis.
Most of the proposals devoted so far to analysis methods or similarity searches
on symbolic music focus on the accuracy and/or relevancy of the result, and
implicitly assume that these procedures apply to a small collection [2,3,13,16].
While useful, this approach gives rise to several issues when the collection consists
of thousands of scores, with heterogeneous descriptions.
A first issue is related to software engineering and architectural concerns. A
large score digital library provides several services to many different users or
applications. Consistency, reliability, and security concerns call for the definition
of a single consistent data management interface for these services. In particu-
lar, one can hardly envisage the publication of ad-hoc search procedures that merely expose the collection of methods and algorithms developed for each spe-
cific retrieval task. The multiplication of these services would quickly overwhelm
external users. Worse, the combination of these functions, which is typically a
difficult matter, would be left to external applications. In complex systems, the
ability to compose fluently the data manipulation operators is a key to both
expressive power and computational efficiency.
Therefore, on-line communities dealing with scores and/or musical content are
often limited either by the size of their corpus or the range of possible operations,
with only one publicized strong feature. Examples of on-line communities include
Mutopia [26], MelodicMatch [27] or Musipedia [28]. Wikifonia [29] offers a wider
range of services, allowing registered users to publish and edit sheet music. One
can also cite the OMRSYS platform described in [8].
A second issue pertains to scalability. With the ongoing progress in digiti-
zation, optical recognition and user content generation, we must be ready to
face a significant growth in the volume of music content that must be handled
by institutions, libraries, or publishers. Optimizing accesses to large datasets is
a delicate matter which involves many technical aspects that embrace physical
storage, indexing and algorithmic strategies. Such techniques are usually sup-
ported by a specialized data management system which relieves applications
from the burden of low-level and intricate implementation concerns.
To our knowledge, no system able to handle large heterogeneous Music Digi-
tal Libraries while smoothly combining data manipulation operators exists at this
moment. The HumDrum toolkit is a widely used automated musicological

analysis tool [14,22], but its representation remains at a low level. A HumDrum-based system will lack flexibility and will depend too much on how files are stored. This makes the development of indexing or optimization techniques difficult. Another
possible approach would be a system based on MusicXML, an XML based file
format [12,24]. It has been suggested recently that XQuery may be used over Mu-
sicXML for music queries [11], but XQuery is a general-purpose query language
which hardly adapts to the specifics of symbolic music manipulation.
Our objective in this paper is to lay the ground for a score management sys-
tem with all the features of a Digital Scores Library combined with content
manipulation operators. Among other things, a crucial component of such a sys-
tem is a logical data model specifically designed for symbolic music management
and its associated query language. Our approach is based on the idea that the
management of structured scores corresponds, at the core level, to a limited set
of fundamental operations that can be defined and implemented once and for all. We
also take into account the fact that the wide range of user needs calls for the
ability to associate these operations with user-defined functions at early steps of
the query evaluation process. Modeling the invariant operators and combining
them with user-defined operations is the main goal of our design effort. Among
numerous advantages, this allows the definition of a stable and robust query()
service which does not need ad-hoc extensions as new requirements arrive.
We do not claim (yet) that our model and its implementation will scale easily,
but a high-level representation like our model is a prerequisite in order to allow the necessary flexibility for such future optimization.
Section 3.2 describes in further detail the Neuma platform, a Digital Score
Library [30] devoted to large collections of monodic and polyphonic music from
the French Modern Era (16th – 18th centuries). One of the central pieces of
the architecture is the data model that we present in this paper. The language
described in section 5 offers a generic mechanism to search and transform music
notation.
The rest of this paper first discusses related work (Section 2). Section 3
presents the motivation and the context of our work. Section 4 then exposes the formal foundations of our model. Section 6 concludes the paper.

2 Related Work

The past decade has witnessed a growing interest in techniques for representing,
indexing and searching (by content) music documents. The domain is commonly
termed “Music Information Retrieval” (MIR) although it covers many aspects
beyond the mere process of retrieving documents. We refer the reader to [19]
for an introduction. Systems can manipulate music either as audio files or in
symbolic form. The symbolic representation offers a structured representation
which is well suited for content-based accesses, sophisticated manipulations, and
analysis [13].
An early attempt to represent scores as structured files and to develop search
and analysis functions is the HumDrum format. Both the representation and

the procedures are low-level (text files, Unix commands), which makes them difficult to integrate in complex applications. Recent works try to overcome these
limitations [22,16]. Musipedia proposes several kinds of interfaces to search the
database by content. MelodicMatch is similar software that analyses music through
pattern recognition, enabling search for musical phrases in one or more pieces.
MelodicMatch can search for melodies, rhythms and lyrics in MusicXML files.
The computation of similarity between music fragments is a central issue in
MIR systems [10]. Most proposals focus on comparisons of melodic profiles. Because music is subject to many small variations, approximate search is in order, and the problem is actually that of finding nearest neighbors to a given pattern. Many techniques have been tried, varying in the melodic encoding and the similarity measure. See [9,4,1,7] for some recent pro-
posals. The Dynamic Time Warping (DTW) distance is a well-known popular
measure in speech recognition [21,20]. It allows the non-linear mapping of one
signal to another by minimizing the distance between the two. The DTW dis-
tance is usually chosen over the less flexible Euclidean distance for time series
alignment [5]. The DTW computation is rather slow but recent works show that
it can be efficiently indexed [25,15].
We are not aware of any general approach to model and query music nota-
tion. A possible approach would be to use XQuery over MusicXML documents
as suggested in [11]. XQuery is a general-purpose query language, and its use for
music scores yields complicated expressions, and hardly adapts to the specifics
of the object representation (e.g., temporal sequences). We believe that a ded-
icated language is both more natural and more efficient. The temporal function
approach outlined here can be related to time series management [17].

3 Architecture
3.1 Approach Overview
Figure 1 outlines the main components of a Score Management System built around our data model.

Fig. 1. Approach overview (layers, from the symbolic music application issuing queries and receiving results, down through the symbolic music model and user query language, the query algebra with user functions, the time series manipulation primitives, and the storage and indexing layer of the large-scale score management system)



Basically, the architecture is that of a standard DBMS, and we actually position our design as an extension of the relational model, making it possible to incorporate the new features as components of an extensible relational system. The model consists of a logical view of score content, along with a user query language and an algebra that operates in closed form (i.e., each operator consumes and produces instances of the model). The algebra can be closely associated with a library of user-defined functions, which must be provided by the application and allow the query language to be tailored to a specific domain. Functions in such a library must comply with constraints that will be developed in the next section.
This approach brings, to the design and implementation of applications that
deal with symbolic music, the standard and well-known advantages of special-
ized data management systems. Let us just mention the few most important: (i)
ability to rely on a stable, well-defined and expressive data model, (ii) indepen-
dence between logical modeling and physical design, saving the need to confront
programmers with intricate optimization issues at the application level and (iii)
efficiency of set-based operators and indexes provided by the data system.
Regarding the design of our model, the basic idea is to extend the relational
approach with a new type of attribute: the time series type. Each such attribute
represents a (peculiar) temporal function that maps a discrete temporal space
to values in some domain.
The model supports the representation of polyphonic pieces composed of
“voices”, each voice being a sequence of “events” in some music-related domain
(notes, rests, chords) such that only one event occurs at a given instant. Adding
new domains makes it possible to extend the concept to any sequence of symbols taken
from a finite alphabet. This covers monodies, text (where the alphabet consists
of syllables) as well as, potentially, any sequence of music-related information
(e.g., fingerings, performance indications, etc.).
We model a musical piece as Synchronized Time Series (STS). Generally
speaking, a time series is an ordered sequence of values taken at equal time
intervals. Some of the traditional domains that explicitly produce and exploit
time series are, among others, sales forecasting, stock market analysis, process
and quality control, etc. [6]. In the case of music notation, “time” is not to be
understood as a classic calendar (i.e., days or timestamps) as in the previously
mentioned examples but rather, at a more abstract level, as a sequence of events
where the time interval is the smallest note duration in the musical piece.
Consider now a digital library that stores music information and provides
query services on music collections. A music piece consists of one or several parts
which can be modelled as time series, each represented by a score. Fig. 2 is a
simple example of a monodic piece, whereas polyphonic music pieces (Fig. 3)
exhibit a synchronization of several parts.
The temporal domain of interest here is an abstract representation of the temporal features of the time series. Essentially, these features are (i) the order
of the musical “events” (notes, rests), (ii) the relative duration of the events, and
(iii) the required synchronization of the different parts (here lyrics and notes).

Fig. 2. A monodic score

Fig. 3. A polyphonic score

Here are a few examples of queries:

1. Get the scores whose lyrics contain the word 'conseil' (selection).
2. Get the melodic part that corresponds to the word 'conseil' (selection and temporal join).
3. Find a melodic pattern (search by similarity).

When a collection of musical scores is to be studied, queries regarding common features are also of interest. For example:

1. Find all musical pieces in the collection starting with the note C and getting to G in 5 time units.
2. Find whether minor chords occur more often than major chords synchronized with the word 'tot' in Bach's work.
We provide an algebra that operates in closed form over collections of scores.
We show that this algebra expresses usual operations (e.g., similarity search),
achieves high expressiveness through unbounded composition of operators and
natively incorporates the ability to introduce user-defined functions in query ex-
pressions, in order to meet the needs of a specific approach to symbolic music
manipulation. Finally, we introduce a user-friendly query language to express
algebraic operations. We believe that our approach defines a sound basis for the implementation of a specialized query language devoted to score collection management at large scale. In the next section, we describe the Neuma platform, a score management system built around these concepts.

3.2 Description of the Neuma Platform


The Neuma platform is a digital library devoted to symbolic music content.
It consists of a repository dedicated to the storage of large collections of dig-
ital scores, where users/applications can upload their documents. It also pro-
poses a family of services to interact with those scores (query, publish, annotate,
transform and analyze).

The Neuma platform is meant to interact with remote web applications with local databases that store corpus-specific information. The purpose of Neuma is to manage all music content information and leave contextual data to the client application (author, composer, date of publication, ...).
To send a new document, an application calls the register() service. So far, only MusicXML documents can be used to exchange musical descriptions, but any format could be used provided the corresponding mapping function is in place. Since MusicXML is widely used, it is sufficient for now. The mapping function extracts a representation of the music content of the document which complies with our data model.
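For illustration, a minimal sketch of such a mapping step (assuming an uncompressed, single-part MusicXML file and ignoring chords, ties, voices and divisions; the actual Neuma mapping function is of course richer):

```python
import xml.etree.ElementTree as ET

def extract_events(musicxml_path):
    """Extract a crude (pitch, duration) event list from a MusicXML document."""
    root = ET.parse(musicxml_path).getroot()
    events = []
    for note in root.iter("note"):
        duration = note.findtext("duration")
        if duration is None:
            continue
        pitch = note.find("pitch")
        if pitch is not None:
            step = pitch.findtext("step")              # e.g. 'C'
            octave = int(pitch.findtext("octave"))     # e.g. 4
            alter = int(pitch.findtext("alter") or 0)  # accidental, in semitones
            events.append(((step, octave, alter), int(duration)))
        elif note.find("rest") is not None:
            events.append((None, int(duration)))       # a rest event
    return events
```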
To publish a score – whether it is a score featured in the database or a modified one – the render() service is called. The render() service is based on the Lilypond package. The generator takes an instance of our model as input and converts it into a Lilypond file. The importance of a unified data model appears clearly in such an example: the render service is based on the model, making it rather easy to visualize a transformed score, whereas it would be a lot more difficult to do so if it was solely based on the document format.
A large collection of scores would be useless if there was no appropriate query()
service allowing reliable search by content. As explained before, the Neuma dig-
ital library stores all music content (originating from different collections, potentially with heterogeneous descriptions) in its repository and leaves descriptive contextual data specific to collections in local databases. Regardless of their original collection, music content complies with our data model so that it can be queried accordingly. Several query types are offered: exact, transposed, with or without rhythm, or contour, which only takes into account the shape of the input melody. The query() service combines content search with descriptive data. A virtual keyboard is provided to enter music content, and search fields can be filled in to address the local databases.
The Neuma platform also provides an annotate() service. The annotations are a great way to enrich the digital library and make sure it keeps growing and improving. In order to use the annotate() service, one first selects part of a score (a set of elements of the score) and enters information about this portion. There are different kinds of annotations: free text (useful for performance indications), or pre-selected terms from an ontology (for identifying music fragments). Annotations can be queried alongside the other criteria previously mentioned.

4 The Data Model


4.1 Preliminaries
A musical domain $dom_{music}$ is a product domain combining heterogeneous musical information. For example, the domain of a simple monodic score is $dom_{music} = dom_{pitch} \times dom_{rhythm}$. Any type of information that can be extracted from symbolic music (such as measures, alterations, lyrics...) can be added to the domain. Each domain contains two distinguished values: the neutral value $\top$ and the null value $\bot$. The

Boolean operations $\wedge$ (conjunction) and $\vee$ (disjunction) verify, for any $a \in dom$, $a \wedge \bot = \bot$, $a \wedge \top = a$ and $a \vee \bot = a \vee \top = a$. In some cases, $\bot$ and $\top$ can also be viewed as false and true.
With a given musical domain comes a set of operations provided by the user
and related to this specific domain. When managing a collection of choir parts, a function such as max(), which computes the highest pitch, is meaningful, but it becomes irrelevant when managing a collection of Led Zeppelin tablatures. The
operators designed to single out each part of a complex music domain are pre-
sented in the extended relational algebra.
We subdivide the time domain $T$ into a defined, regular, repeated pattern. The subdivision of time is the smallest interval between two musical events. The time domain $T$ is then a countable ordered set isomorphic to $\mathbb{N}$. We introduce a set of internal time functions, designed to operate on the time domain $T$. We define $L$ as the class of functions from $T$ to $T$, otherwise known as internal time functions (ITF). Any function in $L$ can be used to operate on the time domain as long as the user finds this operation meaningful. An important sub-class of $L$ is the set of linear functions of the form $t \mapsto nt + m$. We call them temporal scaling functions in what follows, and further distinguish the families of warping functions of the form $warp_m : T \to T, t \mapsto mt$ and shifting functions $shift_n : T \to T, t \mapsto t + n$. Shifting functions are used to ignore the first $n$ events of a time series, while warping functions single out one event out of $m$.
A musical time series (or voice) is a mapping from a time domain $T$ into a musical domain $dom_{music}$. When dealing with a collection of scores sharing a number of common properties, we introduce the schema of a relation. The schema distinguishes atomic attribute names (score_id, author, year...) and time series (or voice) names. We denote by TS([dom]) the type of these attributes, where [dom] is the domain of interest. Here is, for example, the schema of a music score:

$Score(Id : int, Composer : string, Voice : TS(vocal), Piano : TS(polyMusic))$.

Note that the time domain is shared by the vocal part and the piano part. For the same score, one could have made a different choice of schema where vocal and piano are represented in the same time series of type $TS([vocal \times polyMusic])$. The domain vocal adds the lyrics domain to the classic music domain:

$dom_{vocals} = dom_{pitch} \times dom_{rhythm} \times dom_{lyrics}$.

The domain polyMusic is the product

$dom_{polyMusic} = (dom_{pitch} \times dom_{rhythm})^{\mathbb{N}}$.

We will now define two sets of operators, gathered into two algebras: the (extended) relational algebra and the time series algebra.

4.2 The Relational Algebra Alg(R)


The Alg(R) algebra consists of the usual operators selection σ, product ×, union
∪ and difference −, along with an extended projection Π. We present simple
examples of those operators.

Selection, σ. Select all scores composed by Louis Couperin:

$\sigma_{author = \text{'Louis Couperin'}}(Score)$

Select all scores containing the lyrics ’Heureux’:

$\sigma_{lyrics(voice) \supset \text{'Heureux Seigneur'}}(Score)$

Projection, π. We want the vocal parts from the Score schema without the
piano part. We project the piano out:

$\pi_{vocals}(Score)$

Product, ×. Consider a collection of duets, split into the individual vocal parts
of male and female singers, with the following schemas

$Male\_Part(Id : int, Voice : TS(vocals))$,

$Female\_Part(Id : int, Voice : TS(vocals))$.


To get the duet scores, we cross product a female part and a male part. We get
the relation

$Duet(Id : int, Male\_V : TS(vocals), Female\_V : TS(vocals))$.

Note that the time domain is implicitly shared. In itself, the product does not have much interest, but together with the selection operator it becomes the join operator $\bowtie$. In the previous example, we shouldn't blindly associate any male
and female vocal parts, but only the ones sharing the same Id.

$\sigma_{M.Id = F.Id}(Male \times Female) \equiv Male \bowtie_{Id = Id} Female$.

Union, ∪. We want to join two consecutive movements of a piece which have two separate score instances:

$Score = Score_{pt1} \cup Score_{pt2}$.

The time series equivalent of the null attribute is the empty time series, for
which each event is ⊥. Beyond classical relational operators, we introduce an
emptiness test $\emptyset?$ that operates on voices and is modeled as: $\emptyset?(s) = false$ if $\forall t, s(t) = \bot$, else $\emptyset?(s) = true$. The emptiness test can be introduced in selection
formulas of the σ operator.

Consider once more the relation Score. We want to select all scores featuring
the word 'Ave' in the lyrics part of V. We need a user function $m : (lyrics) \to (\bot, \top)$ such that $m(\text{'Ave'}) = \top$, else $\bot$. Lyrics not containing 'Ave' are transformed into the "empty" time series $t \mapsto \bot, \forall t$. The algebraic expression is:

$\sigma_{\emptyset?(W)}(\Pi_{[Id, V, W : m(lyrics(V))]}(Score))$.
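As an illustration only (Python stand-ins for the algebra; the record layout, the syllable encoding and the function m are assumptions), the combination of a user function with the emptiness test could look like this:

```python
# A voice is modeled here as a list of events indexed by instants 0..n-1,
# with None standing for the null value ⊥.

def m(lyrics_voice):
    """User function: keep only the events whose syllable contains 'Ave'."""
    return [ev if ev is not None and "Ave" in ev else None for ev in lyrics_voice]

def non_empty(voice):
    """Emptiness test ∅?: false if every event is ⊥, true otherwise."""
    return any(ev is not None for ev in voice)

scores = [
    {"Id": 1, "V": ["Ave", "Ma-", "ri-", "a"]},
    {"Id": 2, "V": ["Heu-", "reux", "Sei-", "gneur"]},
]

# sigma_{emptiness?(W)}( Pi_{[Id, V, W:m(lyrics(V))]}(Score) )
result = [s["Id"] for s in scores if non_empty(m(s["V"]))]
print(result)   # -> [1]
```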

4.3 The Time Series Algebra Alg(TS)

We now present the operators of the time series algebra Alg(TS) (◦, ⊕, A). Each
operator takes one or more time series as input and produces a time series.
This way, operating in closed form, operators can be composed. They allow, in
particular: alteration of the time domain in order to focus on specific instants (external composition); application of a user function to one or more time series to form a new one (addition operator); windowing of time series fragments for matching
purposes (aggregation).
In what follows, we take an instance s of the Score schema to run several
examples.
The external composition ◦ composes a time series s with an internal temporal
function l. Assume our score s has two movements, and we only want the second
one. Let shif tn be an element of the shift family functions parametrized by a
constant n ∈ N. For any t ∈ T , s◦shif tn (t) = s(t+n). In other words, s◦shif tn is
the time series extracted from s where the first n events are ignored. We compose
s with the shif tL function, where L is the length of the first movement, and the
resulting time series s ◦ shif tL is our result.
Imagine now that we want only the first note of every measure. Assuming
we are in 4/4 and that the time unit is one fourth of a beat, we compose s with
warp_16. The time series s ◦ warp_16 retains only one out of sixteen events,
hence the first note of every measure.
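As a small illustration, the external composition can be sketched in Python, assuming a time series is modelled as a dictionary from instants to values (None for ⊥); the shift and warp helpers below are re-created for illustration only.

# s o f: the series whose value at t is s(f(t)); missing instants map to None.

def compose(s, f, horizon):
    return {t: s.get(f(t)) for t in range(horizon)}

def shift(n):
    return lambda t: t + n

def warp(k):
    return lambda t: k * t

# A toy voice: one event per time unit.
s = {t: f"note{t}" for t in range(32)}

second_half = compose(s, shift(16), horizon=16)   # drop the first 16 events
downbeats = compose(s, warp(16), horizon=2)       # keep one event out of sixteen
print(second_half[0], downbeats[1])               # -> note16 note16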
We now give an example of the addition operator ⊕. Let dom_pitch be the
domain of all musical notes and dom_int the domain of all musical intervals. We
can define an operation from dom_pitch × dom_pitch to dom_int, called the harm
operator, which takes two notes as input and computes the interval between them.
Given two time series each representing a vocal part, for instance V1 = soprano
and V2 = alto, we can define the time series

V1 ⊕_harm V2

of the harmonic progression (i.e., the sequence of intervals realized by the
juxtaposition of the two voices).
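A pointwise sketch of this addition operator, assuming two synchronized voices stored as instant-to-MIDI-pitch mappings; the plus helper and the harm function are illustrative stand-ins, not the paper's implementation.

# s1 (+)_op s2: apply op at every instant of s1; None propagates bottom.

def plus(op, s1, s2):
    return {t: (op(s1[t], s2[t])
                if s1.get(t) is not None and s2.get(t) is not None
                else None)
            for t in s1}

def harm(p1, p2):
    """Interval (in semitones) between two MIDI pitches."""
    return abs(p1 - p2)

soprano = {0: 72, 1: 74, 2: 76}   # C5, D5, E5
alto    = {0: 64, 1: 65, 2: 67}   # E4, F4, G4

print(plus(harm, soprano, alto))  # -> {0: 8, 1: 9, 2: 9}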
Last, we present the aggregation mechanism A. A typical operation that we
cannot yet express is the "windowing" which, at each instant, considers a local
part of a voice s and derives a value from this restricted view. A canonical
example is pattern matching: so far we have no generic way to compare a pattern
P with all subsequences of a time series s. The intuitive way to do pattern
matching is to build all subsequences of s and compare P with each of them,
using an appropriate distance. This is what the aggregation mechanism does, in
two steps:

1. First, we take a family of internal time functions λ such that, for each instant
τ, λ(τ) is an internal time function. At each instant τ, a TS s_τ = d_λ(s)(τ) =
s ◦ λ(τ) is derived from s thanks to a derivation operator d_λ.
2. Then a user aggregation function γ ∈ Γ is applied to the TS s_τ, yielding an
element from dom.

Fig. 4. The derivation/aggregation mechanism: (a) a function s(t); (b) the sequence of
derived functions d_λ(s), each aggregated by γ into an element of dom.
Fig. 4 illustrates this two-step process. At each instant τ a new function is
derived. At this point, we obtain a sequence of time series, each locally defined
with respect to an instant of the time axis. This sequence corresponds to local
views of s, possibly warped by temporal functions. Note that this intermediate
structure is not covered by our data model.
The second step applies an aggregation function γ. It takes one (or several,
depending on the arity of γ) derived series and produces an element from dom.
The combination of the derivation and aggregation steps results in a time series
that complies with the data model.
To illustrate the derivation step, we derive our score s with respect to the
shift function:

d_shift(s) = (s ◦ shift_0, s ◦ shift_1, . . . , s ◦ shift_n).

The aggregation step takes the family of time series obtained with the derivation
step and applies a user function to all of them. For our ongoing example,
this translates to applying the user function DTW_P (which computes the DTW
distance between the input time series and the pattern P) to d_shift(s). We
denote this two-step procedure by the following expression:

A_{DTW_P, shift}(s).

4.4 Example: Pattern Matching Using the DTW Distance


To end this section, we give an extended example using operators from Alg(R)
and Alg(TS). We want to compute, given a pattern P and a score, the instant

t where the Dynamic Time Warping (DTW) distance between P and a voice V is less
than 5. First, we compute the DTW distance between P and the voice V1 of a
score, at each instant, thanks to the following Alg(TS) expression:

e = A_{dtw_P, shift}(V1)

where dtw_P is the function computing the DTW distance between a given time
series and the pattern P. Expression e defines a time series that gives, at each
instant α, the DTW distance between P and the sub-series of V1 that begins at
α. A selection (from Alg(TS)) keeps the values of e below 5, all others being set
to ⊥. Let ψ be the formula that expresses this condition. Then:

e′ = σ_ψ(e).

Finally, expression e′ is applied to all the scores in the Score relation with the Π
operator. An emptiness test can be used to eliminate those for which the DTW
distance is always higher than 5 (hence, e′ is empty):

σ_{∅?(e′)}(Π_{[composer, V1, V2, e′]}(Score)).
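The whole example can be sketched in Python under a few assumptions of ours: a voice is a plain list of pitches, the shift family amounts to taking fixed-length windows (truncated to the pattern length for the DTW comparison), and a textbook DTW replaces the paper's DTW_P user function. None of this is the authors' actual implementation.

# Derivation with the shift family, aggregation with a DTW distance to the
# pattern P, then a selection keeping the instants where the distance is < 5.

def dtw(a, b):
    """Classical O(len(a)*len(b)) dynamic-time-warping distance."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def aggregate_shift(voice, gamma, window):
    """A_{gamma,shift}: at each instant, apply gamma to the shifted sub-series."""
    return {t: gamma(voice[t:t + window]) for t in range(len(voice) - window + 1)}

pattern = [60, 62, 64, 62]                          # the pattern P, as pitches
v1 = [55, 60, 62, 65, 62, 57, 60, 62, 64, 62, 55]   # a toy voice V1

e = aggregate_shift(v1, lambda sub: dtw(sub, pattern), window=len(pattern))
e_prime = {t: d for t, d in e.items() if d < 5}     # sigma_psi(e)
print(e_prime)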

5 User Query Language


The language should have precise semantics so that the intent of a query is
unambiguous. It should specify what is to be done, not how to do it. The language
should accept different kinds of expressions: queries as well as definitions of
user functions. Finally, the query syntax should be easily understandable by a
human reader.

5.1 Overview
The language is implemented in OCaml. Time series are represented by lists of
arrays of elements, where elements can be integers, floats, strings, booleans, or any
previously defined type. This way we allow ourselves to synchronize voices of
different types; the only restriction is that we do not allow an element to be a
time series. We define the time series type ts_t as a couple (string * element),
where the string is the name of the voice, allowing fast access when searching for a
specific voice.

5.2 Query Structure


The general structure of a query is:

from Table [alias]
let NewAttribute := map (function, Attribute)
construct Attribute | NewAttribute
where Attribute | NewAttribute = criteria

The from clause should list at least one table from the database; the alias
is optional.
The let clause is optional, and there can be as many let clauses as desired. Voices
modified in a let clause come from attributes of tables listed in the from
clause.
The construct clause should list at least one attribute, either from one of the
tables listed in the from clause or a modified attribute from a let clause.
The where clause is optional. It consists of a list of predicates, connected by
logical operators (And, Or, Not).

5.3 Query Syntax


The query has four main clauses: from, let, construct, where. In all that
follows, by attribute we mean both classical attributes and time series.
Attributes which appear in the construct, let or where clauses are written
either ColumnName, if there is no ambiguity, or TableName.ColumnName
otherwise. If the attribute is a time series and we want to refer to a specific voice,
we project by using the symbol ->. Precisely, a voice is written
ColumnName->VoiceName or TableName.ColumnName->VoiceName.
The from clause enumerates a list of tables from the database, with an optional
alias. The table names should refer to actual tables of the database. Aliases
should not be duplicated, nor should they be the name of an existing table.
The optional let clause applies a user function to an attribute. The attribute
should belong to one of the tables listed in the from clause. When the attribute
is a time series, this is done by using map; to apply a binary operator on two
time series, we use map2.
The construct clause lists the names of the attributes (or modified attributes)
which should appear in the query result.
The where clause evaluates a condition on an attribute or on a modified attribute
introduced by the let clause. If the condition evaluates to true, the line it refers
to is part of the query result. A missing where clause is evaluated as always true.
The where clause supports the usual arithmetic (+, −, ∗, /), logical (And,
Or, Not) and comparison operators (=, <>, >, <, >=, <=, contains), which are
used inside predicates.

5.4 Query Evaluation


A query goes through several steps so that the instructions it contains can be
processed. A query can be simple (retrieving a record from the database) or
complex, with content manipulation and transformation. The results (if any) are then
returned to the user.

Query analysis. The expression entered by the user (either a query or a user-defined
function) is analyzed by the system. The system verifies the query's syntax,
and the query is turned into an abstract syntax tree in which each node is an
algebraic operation (from bottom to top: set of tables (from), selection (where),
content manipulation (let), projection (construct)).
Query evaluation. The abstract syntax tree is then transformed into an evaluation
tree. This step is a first, simple optimization, where table names become
pointers to the actual tables and column names are turned into the columns'
indices. This step also verifies the validity of the query: existence of the
tables in the from clause, unambiguity of the columns in the where clause,
consistency of the attributes in the construct clause, absence of duplicate
aliases, etc.
Query execution. In this last step, the nodes of the evaluation tree call the
actual algorithms, so that the program computes the result of the query.
The bottom node retrieves the first line of the table in the from clause. The
second node accesses the column featured in the where clause and evaluates
the selection condition. If the condition is true, the next node is called,
which prints the corresponding result on the output channel; control then
returns to the first node and the process iterates on the next line. If the
condition evaluates to false, the first node is called again and the process is
iterated on the next line.
The algorithm ends when there are no more lines to evaluate.
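As an illustration of this control flow only (not the paper's OCaml engine), the evaluation could be sketched as chained generators in Python; the table, column names and query are invented for the example.

# from -> where -> let -> construct, evaluated one line at a time.

def scan(table):                      # bottom node: iterate over the lines
    yield from table

def where(rows, predicate):           # selection node
    return (row for row in rows if predicate(row))

def let(rows, name, fn):              # content-manipulation node
    return ({**row, name: fn(row)} for row in rows)

def construct(rows, columns):         # projection node: emit the result lines
    for row in rows:
        print({c: row[c] for c in columns})

psalms = [
    {"id": 1, "composer": "Faure", "voice": [60, 62, 64]},
    {"id": 2, "composer": "Bach",  "voice": [67, 65, 64]},
]

# from Psalms / where composer='Faure' / let $transpose := ... / construct id, $transpose
plan = let(where(scan(psalms), lambda r: r["composer"] == "Faure"),
           "$transpose", lambda r: [p + 1 for p in r["voice"]])
construct(plan, ["id", "$transpose"])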

5.5 Algebraic Equivalences


In this section, we give the syntactic equivalent of each algebraic operator
introduced in Section 4, along with some examples.
Relational operators
– Selection
algebraic notation: σ_F(R), where F is a formula and R a relation.
syntactic equivalent: where . . . =

– Projection
algebraic notation: Π_A(R), where A is a set of attributes and R a relation.
syntactic equivalent: construct
Example: the expression
Π_{id,voice}(σ_{composer='Faure'}(Psalms))
is equivalent to
from Psalms
construct id, voice
where composer='Faure'

Extension to time series


– Projection
algebraic notation: Π_V(S), where V is a set of voices and S a time series.
syntactic equivalent: S->(voice_1, . . . , voice_n)

– Selection
algebraic notation: σ_F(V), where F is a formula and V is a set of voices of a
time series S.
syntactic equivalent: where S->V . . . contains
Example: the expression
Π_{id, Π_{pitch,rythm}(voice)}(σ_{Π_{lyrics}(voice) ⊃ 'Heureux les hommes', composer='Faure'}(Psalms))
is equivalent to
from Psalms
construct id, voice->(pitch,rythm)
where voice->lyrics contains 'Heureux les hommes'
and composer = 'Faure'

Remark: if the time series consists of several synchronized voices, contains
should list as many conditions as there are voices.
Example: where voice->(lyrics,pitch) contains ('Heureux les hommes', 'A3,B3,B3').
– Product
algebraic notation: s × t, where s and t are two time series.
syntactic equivalent: synch
Example: the expression

Π_{M.Voice × F.Voice}(σ_{M.Id=F.Id}(Male × Female))

is equivalent to

from Male M, Female F
let $duet := synch(M.Voice, F.Voice)
construct $duet
where M.id=F.id

Time series operators

– Addition
algebraic notation: ⊕_op, where op is a user function.
syntactic equivalent: map and map2
Examples:
• the expression
Π_{pitch(voice) ⊕_transpose(1)}(Psalms)
is equivalent to
from Psalms
let $transpose := map(transpose(1), voice->pitch)
construct $transpose

• the expression
Π_{trumpet(voice) ⊕_harm clarinet(voice)}(Duets)
is equivalent to
from Duets
let $harmonic_progression := map2(harm, trumpet->pitch, clarinet->pitch)
construct $harmonic_progression
– Composition
algebraic notation: S ◦ γ, where γ is an internal temporal function and S a
time series.
syntactic equivalent: comp(S, γ)

– Aggregation/derivation
algebraic notation: A_{λ,Γ}(S), where S is a time series, Γ is a family of internal
time functions and λ is an aggregation function.
syntactic equivalent: derive(S, Γ, λ)
The family of internal time functions Γ is a mapping from the time domain
into the set of internal time functions; precisely, for each instant n, Γ(n) = γ,
an internal time function. The two most commonly used families of time functions,
Shift and Warp, are provided.
Example: the expression

Π_{id, A_{dtw(P),shift}}(Psalm)

is equivalent to
from Psalm
let $dtwVal := derive(voice, Shift, dtw(P))
construct id, $dtwVal

6 Conclusion and Ongoing Work


By adopting from the beginning an algebraic approach to the management of
time series data sets, we directly enable an expressive and stable language that
avoids a case-by-case definition of a query language based on the introduction of
ad-hoc functions subject to constant evolution. We believe that this constitutes
a sound basis for the development of applications that can rely on expressive
and efficient data management.
Current efforts are devoted to the language implementation in order to optimize
query evaluation. Our short-term roadmap also includes an investigation
of indexing structures suited to retrieving patterns in large collections.
Acknowledgments. This work is partially supported by the French ANR
Neuma project, http://neuma.irpmf-cnrs.fr. The authors would like to thank
Virginie Thion-Goasdoue and David Gross-Amblard.
Error-Tolerant Content-Based Music-Retrieval
with Mathematical Morphology

Mikko Karvonen, Mika Laitinen, Kjell Lemström, and Juho Vikman

University of Helsinki
Department of Computer Science
[email protected]
{mika.laitinen,kjell.lemstrom,juho.vikman}@helsinki.fi
http://www.cs.helsinki.fi

Abstract. In this paper, we show how to apply the framework of mathematical
morphology (MM) in order to improve error-tolerance in content-based
music retrieval (CBMR) when dealing with approximate retrieval
of polyphonic, symbolically encoded music. To this end, we introduce two
algorithms based on the MM framework and carry out experiments to
compare their performance against well-known algorithms earlier developed
for CBMR problems. Although, according to our experiments, the
new algorithms do not perform quite as well as the rivaling algorithms in
a typical query setting, they make it easy to adjust the desired error
tolerance. Moreover, in certain settings the new algorithms become even
faster than their existing counterparts.

Keywords: MIR, music information retrieval, mathematical morphology,
geometric music retrieval, digital image processing.

1 Introduction
The rapidly growing amount of multimedia data and the number of databases publicly
available for anyone to explore and query have made the conventional text-based
query approach insufficient. To effectively query these databases in the digital era,
content-based methods tailored to the specific media have to be available.
In this paper we study the applicability of a mathematical framework for
retrieving music in symbolic, polyphonic music databases in a content-based
fashion. More specifically, we harness the mathematical morphology methodol-
ogy for locating approximate occurrences of a given musical query pattern in
a larger music database. To this end, we represent music symbolically using
the well-known piano-roll representation (see Fig. 1(b)) and cast it into a two-
dimensional binary image. The representation used resembles that of a previ-
ously used technique based on point-pattern matching [14,11,12,10]; the applied
methods themselves, however, are very different. The advantage of using our
novel approach is that it enables more flexible matching for polyphonic music,
allowing local jittering on both time and pitch values of the notes. This has been
problematic to achieve with the polyphonic methods based on point-pattern
matching. Moreover, our approach provides the user with an intuitive, visual way
of defining the allowed approximations for the query at hand. In [8], Karvonen and
Lemström suggested the use of this framework for music retrieval purposes. We
extend and complement their ideas, introduce and implement new algorithms,
and carry out experiments to show their efficiency and effectiveness.

Fig. 1. (a) The first two measures of Bach's Invention 1. (b) The same polyphonic
melody cast into a 2-D binary image. (c) A query pattern image with one extra note
and various time and pitch displacements. (d) The resulting image after a blur rank
order filtering operation, showing us the potential matches.
The motivation to use symbolic methods is twofold. Firstly, there is a mul-
titude of symbolic music databases where audio methods are naturally not of
use. In addition, the symbolic methods allow for distributed matching, i.e., oc-
currences of a query pattern are allowed to be distributed across the instruments
(voices) or to be hidden in some other way in the matching fragments of the poly-
phonic database. The corresponding symbolic and audio files may be aligned by
using mapping tools [7] in order to be able to play back the matching part in an
audio form.

1.1 Representation and Problem Specifications


In this paper we deal with symbolically encoded, polyphonic music for which we
use the pointset representation (the pitch-against-time representation of note-
on information), as suggested in [15], or the extended version of the former, the
horizontal-line-segment representation [14], where note durations are also explic-
itly given. The latter representation is equivalent to the well-known piano-roll
representation (see e.g. Fig. 1(b)), while the former omits the duration informa-
tion of the line segments and uses only the onset information of the notes (the
starting points of the horizontal line segments). As opposed to the algorithms
based on point-pattern matching where the piano-roll representation is a mere
visualization of the underlying representation, here the visualization IS the rep-
resentation: the algorithms to be given operate on binary images of the onset
points or the horizontal line segments that correspond to the notes of the given
query pattern and the database.
Let us denote by P the pattern to be searched for in a database, denoted
by T. We will consider the three problems, P1-P3, specified in [14], and their
generalizations, AP1-AP3, to approximative matching where local jittering is
allowed. The problems are as follows:
– (Approximative) complete subset matching: Given P and T in the
pointset representation, find translations of P such that all its points match
(P1) / match approximatively (AP1) with some points in T .
– (Approximative) maximal partial subset matching: Given P and T
in the pointset representation, find all translations of P that give a maximal
(P2) / an approximative and maximal (AP2) partial match with points in
T.
– (Approximative) longest common shared time matching: Given P
and T in the horizontal line segment representation, find translations of P
that give the longest common (P3) / the approximative longest common
(AP3) shared time with T , i.e., the longest total length of the (approxima-
tively) intersected line segments of T and those of translated P .
Above we have deliberately been vague in the meaning of an approximative
match: the applied techniques enable the user to steer the approximation in
the desired direction by means of shaping the structuring element used, as will
be shown later in this paper. Naturally, an algorithm capable of solving prob-
lem AP1, AP2 or AP3 would also be able to solve the related, original non-
approximative problem P1, P2 or P3, respectively, because an exact match can
be identified with zero approximation.

2 Background
2.1 Related Work
Let us denote by P + f a translation of P by vector f, i.e., vector f is added
to each of the m components of P separately: P + f = (p1 + f, p2 + f, . . . , pm + f).
Problem AP1 can then be expressed as the search for a subset I of T such that
P + f ≈ I for some f and some similarity relation ≈; in the original P1 setting
the ≈ relation is to be replaced by the equality relation =. It is noteworthy that
the mathematical translation operation corresponds to two musically distinct
phenomena: a vertical move corresponds to transposition while a horizontal move
corresponds to aligning the pattern and the database time-wise.
In [15], Wiggins et al. showed how to solve P1 and P2 in O(mn log(mn)) time.
First, translations that map the maximal number of the m points of P to some
points of T (of n points) are to be collected. Then the set of such translation
vectors is to be sorted based on the lexicographic order, and finally the transla-
tion vector that is the most frequent is to be reported. If the reported vector f
appears m times, it is also an occurrence for P1. With careful implementation
of the sorting routine, the running time can be improved to O(mn log m) [14].
For P1, one can use a faster algorithm working in O(n) expected time and O(m)
space [14].
In [5], Clifford et al. showed that problem P2 is 3SUM-hard, which means that it
is unlikely that one could find an algorithm for the problem with a subquadratic
running time. Interestingly enough, Minkowski addition and subtraction, which
are the underlying basic operations used by our algorithms, are also known to
be 3SUM-hard [1]. Clifford et al. also gave an approximation algorithm for P2
working in time O(n log n).
In order to be able to query large music databases in real time, several in-
dexing schemes have been suggested. Clausen et al. used an inverted file index
for a P2-related problem [4] that achieves sublinear query times in the length of
the database. In their approach, efficiency is achieved at the cost of robustness:
the information extraction of their method makes the approach non-applicable
to problems P1 and P2 as exact solutions. Another very general indexing ap-
proach was recently proposed in [13]: Typke et al.’s use of a metric index has
the advantage that it works under robust geometric similarity measures. How-
ever, it is difficult to adapt it to support translations and partial matching. More
recently, Lemström et al. [10] introduced an approach that combines indexing
and filtering achieving output sensitive running times for P1 and P2: O(sm)
and O(sm log m), respectively, where s is the number of candidates, given by a
filter, that are to be checked. Typically their algorithms perform 1-3 orders of
magnitude faster than the original algorithms by Ukkonen et al. [14].
Romming and Selfridge-Field [12] introduced an algorithm based on geometric
hashing. Their solution that combines the capability of dealing with polyphonic
music, transposition invariance and time-scale invariance, works in O(n3 ) space
and O(n2 m3 ) time, but by applying windowing on the database, the complexities
can be restated as O(w2 n) and O(wnm3 ), respectively, where w is the maximum
number of events that occur in any window. Most recently, Lemström [9] gen-
eralized Ukkonen et al.’s P1 and P2 algorithms [14] to be time-scale invariant.
With windowing the algorithms work in O(mΣ log Σ) time and O(mΣ) space,
where Σ = O(wn) when searching for exact occurrences and Σ = O(nw2 ) when
searching for partial occurrences; without windowing the respective complexities
are O(ρn2 log n) and O(ρn2 ); ρ = O(m) for the exact case, ρ = O(m2 ) for the
partial case.
With all the above algorithms, however, their applicability to real-world prob-
lems is reduced due to the fact that, beyond the considered invariances, matches
have to be mathematically exact, and thus, for instance, performance expression
and error is difficult to account for. We bridge this gap by introducing new algo-
rithms based on the mathematical morphology framework where allowed error
tolerance can be elegantly embedded in a query.

2.2 Mathematical Morphology


Mathematical morphology (MM) is a theoretically well-defined framework and
the foundation of morphological image processing. Originally developed for bi-
nary images in the 1960s, it was subsequently extended to grey-scale images and
finally generalized to complete lattices. MM is used for quantitative analysis and
processing of the shape and form of spatial structures in images. It finds many ap-
plications in computer vision, template matching and pattern recognition prob-
lems. Morphological image processing is used for pre- and post-processing of
images in a very similar way to conventional image filters. However, the focus
in MM-based methods is often on extracting attributes and geometrically meaningful
data from images, as opposed to generating filtered versions of images.
In MM, sets are used to represent objects in an image. In binary images, the
sets are members of the 2-D integer space Z2 . The two fundamental morpholog-
ical operations, dilation and erosion, are non-linear neighbourhood operations
on two sets. They are based on the Minkowski addition and subtraction [6]. Out
of the two sets, the typically smaller one is called the structuring element (SE).
Dilation performs a maximum on the SE, which has a growing effect on the
target set, while erosion performs a minimum on the SE and causes the target set
to shrink. Dilation can be used to fill gaps in an image, for instance, connecting
the breaks in letters in a badly scanned image of a book page. Erosion can be
used, for example, for removing salt-and-pepper type noise. One way to define
dilation is

A ⊕ B = {f ∈ Z² | (B̂ + f) ∩ A ≠ ∅},     (1)

where A is the target image, B is the SE, and B̂ its reflection (or rotation by
180 degrees). Accordingly, erosion can be written

A ⊖ B = {f ∈ Z² | (B + f) ⊆ A}.     (2)

Erosion itself can be used for pattern matching. Foreground pixels in the
resulting image mark the locations of the matches. Any shape, however, can be
found in an image filled with foreground. If the background also needs to match,
erosion has to be used separately also for the negations of the image and the
structuring element. Intersecting these two erosions leads to the desired result.
This procedure is commonly known as the hit-or-miss transform or hit-miss
transform (HMT):

HMT(A, B) = (A ⊖ B) ∩ (A^C ⊖ B^C).     (3)

HMT is guaranteed to give us a match only if our SE perfectly matches some
object(s) in the image. The requirement for a perfect match is that the background
must also match (i.e., it cannot contain additional pixels) and that each
object has at least a one-pixel-thick background around it, separating it from
other objects (in this case B^C actually becomes W − B, where W is a window of
"on" pixels slightly larger than B). In cases where we are interested in partially
detecting patterns within a set, we can ignore the background and reduce HMT
to simple erosion. This is clearly the case when we represent polyphonic music
as binary 2-D images. We use this simplified pattern detection scheme in one of
the algorithms developed in Section 3.
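As a rough illustration of how erosion and the hit-or-miss transform behave on a binary piano-roll image, here is a sketch using SciPy's binary morphology routines; the toy score and pattern below are invented for the example and are not the paper's data.

# Erosion marks every translation at which the pattern fits entirely inside
# the score foreground (problem P1); hit-or-miss additionally requires the
# local background to match.
import numpy as np
from scipy import ndimage

# 8 pitches x 12 time steps; True marks a note onset at (pitch, time).
score = np.zeros((8, 12), dtype=bool)
for t, p in [(0, 2), (2, 4), (4, 5), (6, 2), (8, 4), (10, 5)]:
    score[p, t] = True

pattern = np.zeros((4, 5), dtype=bool)   # the query: a three-note contour
for t, p in [(0, 0), (2, 2), (4, 3)]:
    pattern[p, t] = True

# The SE is centred by default, so match positions refer to its centre.
hits = ndimage.binary_erosion(score, structure=pattern)
print(np.transpose(np.nonzero(hits)))

hmt = ndimage.binary_hit_or_miss(score, structure1=pattern)
print(np.transpose(np.nonzero(hmt)))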

3 Algorithms

In [8] Karvonen and Lemström introduced four algorithms based on the mathe-
matical morphology framework and gave their MATLAB implementations. Our
closer examination revealed common principles behind the four algorithms; three
of them were virtually identical to each other.
The principles on which our two algorithms rely are explained
by Bloomberg and Maragos [2]. Having HMT as the main means of generalizing
erosion, they present three more, which can be combined in various ways, and they
also name a few of the combinations. Although we can find some use for HMT,
its benefit is not significant in our case; two of the other tricks, however, proved to be
handy, and particularly their combination, which is not mentioned by Bloomberg
and Maragos.
We start with erosion as the basic pattern matching operation. The problem
with erosion is its lack of flexibility: every note must match and no jittering is
tolerated. Performing the plain erosion solves problem P1. We present two ways
to gain flexibility.
– Allow partial matches. This is achieved by moving from P1 to P2.
– Handle jittering. This is achieved by moving from P1 to AP1.
Out of the pointset problems, only AP2 now remains unconsidered. It can be
solved, however, by combining the two tricks above. We will next explain how
these improvements can be implemented. First, we concentrate on the pointset
representation, then we will deal with line segments.

3.1 Allowing Partial Matches


For a match to be found with plain erosion, having applied a translation, the
whole foreground area of the query needs to be covered by the database fore-
ground. In pointset representation this means that there needs to be a corre-
sponding database note for each note in the query. To allow missing notes, a
coverage of only some specified portion of the query foreground suffices. This is
achieved by replacing erosion by a more general filter. For such generalization,
Bloomberg and Maragos propose a binary rank order filter (ROF) and threshold
convolution (TC). In addition to them, one of the algorithms in [8] was based
on correlation. These three methods are connected to each other, as discussed
next.
For every possible translation f, the binary rank order filter counts the ratio
|(P + f) ∩ T| / |P|,
where |P| is the number of foreground pixels in the query. If the ratio is greater
than or equal to a specified threshold value, it leaves a mark in the resulting
image, representing a match. This ratio can be seen as a confidence score (i.e.,
a probability) that the query foreground occurs in the database foreground at
some point. It is noteworthy that plain erosion is a special case of binary ROF,
where the threshold ratio is set to 1. By lowering the threshold we impose looser
conditions than plain erosion on detecting the query.
Correlation and convolution operate on greyscale images. Although we deal
with binary images, these operations are useful because ROF can be implemented
using correlation and thresholding. When using convolution as a pattern match-


ing operation, it can be seen as a way to implement correlation: rotating the
query pattern 180 degrees and then performing convolution on real-valued data
has almost the same effect as performing correlation, the only difference being
that the resulting marks appear in the top-left corner instead of the bottom-
right corner of the match region. Both can be effectively implemented using the
Fast Fourier Transform (FFT). Because of this relation between correlation and
convolution, ROF and TC are actually theoretically equivalent.
When solving P2, one may want to search for maximal partial matches (in-
stead of threshold matches). This is straightforwardly achieved by implementing
ROF by using correlation. Although our ROF implementation is based on cor-
relation, we will call the method ROF since it offers all the needed functionality.
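The following sketch shows one way to realize such a correlation-based ROF with SciPy: cross-correlating the score with the pattern counts, for every translation, how many pattern notes fall on score notes, and thresholding that count yields the partial matches. The threshold ratio, toy score and pattern are illustrative choices of ours.

import numpy as np
from scipy.signal import fftconvolve

def rank_order_matches(score, pattern, ratio):
    """Boolean map of translations covering >= ratio of the pattern notes."""
    # Correlation = convolution with the pattern rotated by 180 degrees;
    # 'valid' mode indexes matches by the pattern's top-left corner.
    counts = fftconvolve(score.astype(float),
                         pattern[::-1, ::-1].astype(float), mode="valid")
    counts = np.rint(counts)                 # remove FFT round-off noise
    return counts >= ratio * pattern.sum()

score = np.zeros((8, 12), dtype=bool)
for t, p in [(0, 2), (2, 4), (4, 5), (6, 2), (8, 4), (10, 6)]:
    score[p, t] = True
pattern = np.zeros((4, 5), dtype=bool)
for t, p in [(0, 0), (2, 2), (4, 3)]:
    pattern[p, t] = True

# ratio = 1.0 corresponds to plain erosion; 2/3 tolerates one missing note.
print(np.transpose(np.nonzero(rank_order_matches(score, pattern, 2 / 3))))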

3.2 Tolerating Jittering


Let us next explain the main asset of our algorithms as compared to the previous
algorithms. In order to tolerate jittering, the algorithms should be able to find
corresponding database elements not only in the exact positions of the translated
query elements, but also in their near proximity.
Bloomberg and Vincent [3] introduced a technique for adding such toleration
into HMT. They call it blur hit-miss transform (BHMT). The trick is to di-
late the database images (both the original and the complement) by a smaller,
disc-shaped structuring element before the erosions are performed. This can be
written

BHMT(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [(A^C ⊕ R2) ⊖ B2],     (4)

where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. The technique is
also applicable to plain erosion. We choose this method for jitter toleration
and call it blur erosion:

A ⊖_b (B, R) = (A ⊕ R) ⊖ B.     (5)

The shape of the preprocessive dilation SE does not have to be a disc. In our
case, where the dimensions under consideration are time and pitch, a natural
setting comprises user-specified thresholds for the dimensions. This leads us to
rectangular SEs with efficient implementations. In practice, dilation is useful
in the time dimension, but applying it in the pitch dimension often results
in false (positive) matches; a blur of just one semitone, however, is very useful
because queries often contain pitch quantization errors.
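Blur erosion of eq. (5) can be sketched with SciPy as follows, assuming a rectangular blur SE of ±1 semitone and ±2 time columns; the toy arrays are invented for the example.

# The score is first dilated with a small rectangular SE and then eroded with
# the query pattern, so slightly jittered occurrences still match (AP1).
import numpy as np
from scipy import ndimage

def blur_erosion(score, pattern, pitch_tol=1, time_tol=2):
    blur_se = np.ones((2 * pitch_tol + 1, 2 * time_tol + 1), dtype=bool)
    blurred = ndimage.binary_dilation(score, structure=blur_se)
    return ndimage.binary_erosion(blurred, structure=pattern)

score = np.zeros((8, 12), dtype=bool)
for t, p in [(0, 2), (2, 4), (5, 5), (6, 2)]:    # the third note arrives late
    score[p, t] = True
pattern = np.zeros((4, 5), dtype=bool)
for t, p in [(0, 0), (2, 2), (4, 3)]:
    pattern[p, t] = True

# Plain erosion misses the jittered occurrence; blur erosion recovers it.
print(np.transpose(np.nonzero(ndimage.binary_erosion(score, structure=pattern))))
print(np.transpose(np.nonzero(blur_erosion(score, pattern))))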

3.3 Combining the Two


By applying ROF we allow missing notes, thus being able to solve problem P2.
The jitter toleration is achieved by using blurring, thus solving AP1. In order to
be able to solve AP2, we combine these two. In order to correctly solve AP2, the
dilation has to be applied to the database image. With blurred ROF a speed-up
can be obtained (at the cost of false positive matches) by dilating the query
pattern instead of the database image. If there is no need to adjust the dilation
SE, the blur can be applied to the database in a preprocessing phase. Note also
that if both the query and the database were dilated, it would grow the distance
between the query elements and the corresponding database elements, which
would gradually decrease the overlapping area.

Fig. 2. Generalizations of erosion: the hit-miss transform, the rank order filter and
blur erosion, and their combinations (hit-miss rank order filter, blur hit-miss transform,
blur rank order filter, blur hit-miss rank order filter).
Figure 2 illustrates the relations between the discussed methods. Our interest
is on blur erosion and blur ROF (underlined in the Figure), because they can be
used to solve the approximate problems AP1 and AP2.

3.4 Line Segment Representation


Blur ROF is applicable also for solving AP3. In this case, however, the blur is
not as essential: in a case of an approximate occurrence, even if there was some
jittering in the time dimension, a crucial portion of the line segments would
typically still overlap. Indeed, ROF without any blur solves exactly problem P3.
By using blur erosion without ROF on line segment data, we get an algorithm
that does not have an existing counterpart: plain erosion is like P3 with the
extra requirement of full matches only, and the blur then adds error toleration to the
process.

3.5 Applying Hit-Miss Transform


Thus far we have not been interested in what happens in the background of an
occurrence; we have just searched for occurrences of the query pattern that are
intermingled in the polyphonic texture of the database. If, however, no extra
notes were allowed in the time span of an occurrence, we would have to consider
the background as well. This is where we need the hit-miss transform. Naturally,
HMT is applicable also in decreasing the number of false positives in cases where
the query is assumed to be comprehensive. With HMT as the third way of
generalizing erosion, we complement the classification in Figure 2.
Combining HMT with the blur operation, we have to slightly modify the
original BHMT to meet the special requirements of the domain. In the original
form, when matching the foreground, tiny background dots are ignored because
they are considered to be noise. In our case, with notes represented as single
pixels or thin line segments, all the events would be ignored during the back-
ground matching; the background would always match. To achieve the desired
effect, instead of dilating the complemented database image, we need to erode
the complemented query image by the same SE:

BHMT*(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [A^C ⊖ (B2 ⊖ R2)],     (6)

where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. If B2 is the complement
of B1, we can write B = B1 and use the form

BHMT*(A, B, R1, R2) = [(A ⊕ R1) ⊖ B] ∩ [A^C ⊖ (B ⊕ R2)^C].     (7)
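A possible sketch of BHMT* with SciPy is shown below, under the assumption (ours) that the query background is taken within the query's bounding window, in the spirit of the W − B remark above; the helper names and toy data are illustrative.

import numpy as np
from scipy import ndimage

def bhmt_star(score, query_fg, r1, r2):
    # Foreground part: erode the blurred score by the query foreground.
    fg = ndimage.binary_erosion(ndimage.binary_dilation(score, structure=r1),
                                structure=query_fg)
    # Background part: erode the complemented score by the complement (within
    # the query window) of the dilated query foreground.
    bg_se = ~ndimage.binary_dilation(query_fg, structure=r2)
    bg = ndimage.binary_erosion(~score, structure=bg_se)
    return fg & bg

r1 = np.ones((3, 5), dtype=bool)    # blur SE for the foreground
r2 = np.ones((1, 3), dtype=bool)    # smaller blur SE for the background
score = np.zeros((8, 12), dtype=bool)
for t, p in [(0, 2), (2, 4), (4, 5)]:
    score[p, t] = True
query = np.zeros((4, 5), dtype=bool)
for t, p in [(0, 0), (2, 2), (4, 3)]:
    query[p, t] = True

print(np.transpose(np.nonzero(bhmt_star(score, query, r1, r2))))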

Another example where background matching would be needed is with the line
segment representation of long notes. In an extreme case, a tone cluster with
a long duration forms a rectangle that can be matched with anything. Even
long sounding chords can result in many false positives. This problem can be
alleviated by using HMT with a tiny local background next to the ends of the
line segments to separate the notes.

4 Experiments
The algorithms presented in this paper address the new problem of finding approximative
occurrences of a query pattern in a given database. There are no rivaling
algorithms in this sense, so we are not able to fairly compare the performance
of our algorithms to any existing algorithm. However, to give the reader a sense
of the real-life performance of these approximative algorithms, we compare their
running times to those of the existing, non-approximative algorithms. Essentially
this means that we are comparing the performance of the algorithms able
to solve AP1-AP3 to the ones that can solve only P1, P2 and P3 [14].
Fig. 3. The effect of changing the time resolution (a) on blur erosion (left) and (b) on
blur correlation (right). Both panels plot execution time (ms) against time resolution
(pixel columns per second); the blur erosion panel distinguishes pointset and line
segment inputs.

In this paper we have sketched eight algorithms based on mathematical morphology.
In our experiments we will focus on two of them: blur erosion and blur
ROF, which can be applied to solve problems AP1-AP3. The special cases of
these algorithms, where blur is not applied, are plain erosion and ROF. As our
implementation of blur ROF is based on correlation, we will call it blur correlation
from now on. As there are no competitors that solve the problems AP1-AP3,
we set our new algorithms against the original geometric algorithms named after
the problem specifications P1, P2 and P3 [14] to get an idea of their practical
performance.
For the performance of our algorithms, the implementation of dilation, erosion
and correlation are crucial. For dilation and erosion, we rely on Bloomberg’s
Leptonica library. Leptonica offers an optimized implementation for rectangle-
shaped SEs, which we can utilize in the case of dilation. On the other hand,
our erosion SEs tend to be much more complex and larger in size (we erode the
databases with the whole query patterns). For correlation we use the Fast Fourier
Transform implemented in the FFTW library. This operation is quite heavy
calculation-wise compared to erosion, since it has to operate with floating point
complex numbers.
The performance of the reference algorithms, used to solve the original, non-
approximative problems, depends mostly on the number of notes in the database
and in the query. We experiment on how the algorithms scale up as a function
of the database and query pattern lengths. It is also noteworthy that the note
density can make a significant difference in the performance, as the time con-
sumption of our algorithms mostly grows along with the size of the corresponding
images.
The database we used in our experiments consists of MIDI files from the Mutopia
collection (http://www.mutopiaproject.org/), which contains over 1.4 million notes.
These files were converted to various other formats required by the algorithms,
such as binary images of
pointset and line segment types. For the experiments on the effects of varying
pattern sizes, we randomly selected 16 pieces out of the whole database, each
containing 16,000 notes. Five distinct queries were randomly chosen, and the
median of their execution times was reported. When experimenting with varying
database sizes, we chose a pattern size of 128 notes.

Fig. 4. Execution time on pointset data plotted on a logarithmic scale: the left panel
plots time against pattern size (notes), the right panel against database size (thousands
of notes), for P1, P2, MSM, blur erosion (Blur Er.) and blur correlation (Blur Corr.).
The size of the images is also a major concern for the performance of our
algorithms. We represent the pitch dimension as 128 pixels, since the MIDI
pitch value range consists of 128 possible values. The time dimension, however,
poses additional problems: it is not intuitively clear what would make a good
time resolution. If we use too many pixel columns per second, the performance
of our algorithms will suffer significantly. On the other hand, not using enough
pixels per second would result in a loss of information, as we
would not be able to distinguish separate notes in rapid passages anymore. Before
running the actual experiments, we decided to experiment on finding a suitable
time resolution efficiency-wise.

4.1 Adjusting the Time Resolution

We tested the effect of increasing time resolution on both blur erosion and blur
correlation, and the results can be seen in Figure 3. With blur erosion, we can
see a clear difference between the pointset representation and the line segment
representation: in the line segment case, the running time of blur erosion seems
to grow quadratically in relation to the growing time resolution, while in the
pointset case, the growth rate seems to be clearly slower. This can be explained
by the fact that the execution time of erosion depends on the size of the query
foreground. In the pointset case, we still only mark the beginning point of the
notes, so only the SEs require extra space; in the line segment case, however, the
growth is clearly linear.

In the case of blur correlation, it seems to make hardly any difference whether the
input is in pointset or line segment form. The pointset and line segment curves
for blur correlation coincide, so we depict only one of them in this case.
Looking at the results, one can note that the time usage of blur erosion begins
to grow quickly between 12 and 16 pixel columns per second. Considering that we
do not want the performance of our algorithms to suffer too much, and the fact
that we are deliberately getting rid of some nuances of information by blurring,
we were encouraged to set the time resolution as low as 12 pixel columns per
second. This time resolution was used in the further experiments.

4.2 Pointset Representation

Both P1 and P2 are problems where we aim to find occurrences of a query
pattern in a database, with both the database and the query represented
as pointsets. Our new algorithms add support for approximation: blur erosion
solves problem AP1, finding complete approximative matches, whereas blur
correlation also finds partial matches, thus solving AP2.
We compared the efficiency of the non-approximative algorithms to our new, ap-
proximative algorithms with varying query and database sizes. As an additional
comparison point, we also included Clifford et al.’s approximation algorithm [5],
called the maximal subset matching (MSM) algorithm, in the comparison. MSM
is based on FFT and its execution time does not depend on the query size.
Analyzing the results seen in Figure 4, we note that the exact matching algorithm
P1 is the fastest algorithm in all settings. This was to be expected due
to the linear behaviour of P1 in the length of the database. P1 also clearly outperforms
its approximative counterpart, blur erosion. On the other hand, the
performance difference between the partial matching algorithms, blur correlation,
MSM and P2, is less radical. P2 is clearly the fastest of those with small query
sizes, but as its time consumption grows with longer queries, it becomes the
slowest with very large query sizes. Nevertheless, even with small query sizes,
we believe that the enhanced error toleration is worth the extra time it requires.

Fig. 5. Execution time on line segment data plotted on a logarithmic scale: the left
panel plots time against pattern size (notes), the right panel against database size
(thousands of notes), for P3, blur erosion (Blur Er.) and blur correlation (Blur Corr.).

4.3 Line Segment Representation

When experimenting with the line segment representation, we used P3 as a
reference algorithm for blur correlation. For blur erosion, we were not able to
find a suitable reference algorithm; its running times nevertheless give the reader
a general sense of the efficiency of the algorithms working with the line segment
representation.
The time consumption behaviour of P3, blur erosion and blur correlation is
depicted in Figure 5. The slight meandering seen in some of the curves is the
result of an uneven distribution of notes in the database. Analyzing the graphs
further, we notice that blur correlation is more competitive here than
in the pointset representation case. Again, we note that its independence of the
length of the pattern makes blur correlation faster than P3 with larger pattern
sizes: blur correlation outperforms P3 once the pattern size exceeds 256 notes.
Analyzing the results of the experiments with differing database sizes and a query
pattern size of 128, the more restrictive blur erosion algorithm was the fastest of the
three; however, the three algorithms' time consumptions were roughly of the
same magnitude.

Fig. 6. (a) An excerpt of the database in a piano-roll representation with a jittering
window around each note. (b) Query pattern. (c) The query pattern inserted into the
jittering windows of the excerpt of the database.

Fig. 7. The subject used as a search pattern and the first approximate match.

Fig. 8. A match found by blur erosion (a). An exact match found by both P1 and blur
erosion (b). This entry has too much variation even for blur erosion (c).

Our experiments also confirmed our claim that blur correlation handles jittering
better than P3. Figure 6 illustrates an idiomatic case where P3 will not find a
match, but blur correlation will. In this case we have a query pattern that is an
excerpt of the database, with the distinction that some of the notes have been
displaced either time-wise or pitch-wise; additionally, one note has been split into
two. Blur correlation finds a perfect match in this case, whereas P3 cannot, unless
the threshold for the total common length is exceeded. We were expecting this kind
of result, since intuitively P3 cannot handle this kind of jittering as well as the
morphological algorithms do.

4.4 Finding Fugue Theme Entries

To further demonstrate the assets of the blur technique, we compared P1 and
blur erosion in the task of finding the theme entries in J. S. Bach's Fugue no. 16
in G minor, BWV 861, from the Well-Tempered Clavier, Book 1. The imitations
of a fugue theme often exhibit slight variation. The theme in our case is shown in
Figure 7. In the following imitation, there is a little difference: the first interval
is a minor third instead of a minor second. This prevents P1 from finding a match
here, but with a vertical dilation of two pixels blur erosion managed to find the
match.
Figure 8 shows three entries of the theme. The first one has some variation
at the end and was only found by blur erosion. The second is an exact match.
Finally, the last entry could not be found by either of the algorithms, because
it has too much variation. In total, blur erosion found 16 occurrences, while P1
found only six.
If all the entries had differed only in one or two notes, it would have been
easy to find them using P2. For some of the imitations, however, less than
half of the notes match exactly the original form of the theme (see Figure 9).
Nevertheless, these imitations are fairly easily recognized visually and audibly.
Our blur erosion algorithm found them all.

Fig. 9. Some more developed imitations of the theme with their proportions of exactly
matching notes (6/11, 6/11, 4/11, 7/11 and 11/11).

5 Conclusions
In this paper, we have combined existing image processing methods based on
mathematical morphology to construct a collection of new pattern matching
algorithms for symbolic music represented as binary images. Our aim was to gain
an improved error tolerance over the existing pointset-based and line-segment-
based algorithms introduced for related problems.
Our algorithms solve three existing music retrieval problems, P1, P2 and P3.
Our basic algorithm based on erosion solves the exact matching problem P1. To
successfully solve the other two, we needed to relax the requirement of exact
matches, which we did by applying a rank order filtering technique. Using this
relaxation technique, we can solve both the partial pointset matching problem
P2, and also the line segment matching problem P3. By introducing blurring in
the form of preprocessive dilation, the error tolerance of these morphological al-
gorithms can be improved. That way the algorithms are able to tolerate jittering
in both the time and the pitch dimension.
Compared to the solutions of the non-approximative problems, our new
algorithms tend to be somewhat slower. However, they are still comparable
performance-wise, and actually even faster in some cases. As the most impor-
tant novelty of our algorithms is the added error tolerance given by blurring, we

think that the slowdown is rather restrained compared to the added usability
of the algorithms. We expect our error-tolerant methods to give better results
in real-world applications when compared to the rivaling algorithms. As future
work, we plan on researching and setting up a relevant ground truth, as without
a ground truth, we cannot adequately measure the precision and recall of the
algorithms. Other future work could include investigating the use of greyscale
morphology for introducing more fine-grained control over approximation.

Acknowledgements

This work was partially supported by the Academy of Finland, grants #108547,
#118653, #129909 and #218156.

Melodic Similarity through Shape Similarity

Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado

University Carlos III of Madrid


Department of Computer Science
Avda. Universidad, 30
28911 Leganés, Madrid, Spain
{jurbano,llorens}@inf.uc3m.es,
{jorge,ssanchec}@ie.inf.uc3m.es

Abstract. We present a new geometric model to compute the melodic similarity
of symbolic musical pieces. Melodies are represented as splines in the pitch-
time plane, and their similarity is computed as the similarity of their shape. The
model is very intuitive and it is transposition and time scale invariant. We have
implemented it with a local alignment algorithm over sequences of n-grams that
define spline spans. An evaluation with the MIREX 2005 collections shows that
the model performs very well, obtaining the best effectiveness scores ever
reported for these collections. Three systems based on this new model were
evaluated in MIREX 2010, and the three systems obtained the best results.

Keywords: Music information retrieval, melodic similarity, interpolation.

1 Introduction
The problem of Symbolic Melodic Similarity, where musical pieces similar to a query
should be retrieved, has been approached from very different points of view [24][6].
Some techniques are based on string representations of music and editing distance
algorithms to measure the similarity between two pieces [17]. Later work has extended
this approach with other dynamic programming algorithms to compute global- or
local-alignments between the two musical pieces [19][11][12]. Other methods rely on
music representations based on n-grams [25][8][2], and other methods represent
music pieces as geometric objects, using different techniques to calculate the melodic
similarity based on the geometric similarity of the two objects. Some of these
geometric methods represent music pieces as sets of points in the pitch-time plane,
and then compute geometric similarities between these sets [26][23][7]. Others
represent music pieces as orthogonal polynomial chains crossing the set of pitch-time
points, and then measure the similarity as the minimum area between the two chains
[30][1][15].
In this paper we present a new model to compare melodic pieces. We adapted the
local alignment approach to work with n-grams instead of with single notes, and the
corresponding substitution score function between n-grams was also adapted to take
into consideration a new geometric representation of musical sequences. In this
geometric representation, we model music pieces as curves in the pitch-time plane,
and compare them in terms of their shape similarity.


In the next section we outline several problems that a symbolic music retrieval
system should address, and then we discuss the general solutions given in the
literature to these requirements. Next, we introduce our geometric representation
model, which compares two musical pieces by their shape, and see how this model
addresses the requirements discussed. In section 5 we describe how we have
implemented our model, and in section 6 we evaluate it with the training and
evaluation test collections used in the MIREX 2005 Symbolic Melodic Similarity task
(for short, we will refer to these collections as Train05 and Eval05) [10][21][28].
Finally, we finish with conclusions and lines for further research. An appendix reports
more evaluation results at the end.

2 Melodic Similarity Requirements


Due to the nature of the information treated in Symbolic Melodic Similarity [18], there
are some requirements that have to be considered from the very beginning when
devising a retrieval system. Byrd and Crawford identified some requirements that
they consider every MIR system should meet, such as the need of cross-voice
matching, polyphonic queries or the clear necessity of taking into account both the
horizontal and vertical dimensions of music [5]. Selfridge-Field identified three
elements that may confound both the users when they specify the query and the actual
retrieval systems at the time of computing the similarity between two music pieces:
rests, repeated notes and grace notes [18]. In terms of cross-voice and polyphonic
material, she found five types of melody considered difficult to handle: compound,
self-accompanying, submerged, roving and distributed melodies. Mongeau and
Sankoff addressed repeated notes and refer to these situations as fragmentation and
consolidation [17].
We list here some more general requirements that should be common to any
Symbolic Melodic Similarity system, as we consider them basic for the general user
needs. These requirements are divided in two categories: vertical (i.e. pitch) and
horizontal (i.e. time).

2.1 Vertical Requirements

Vertical requirements regard the pitch dimension of music: octave equivalence,
degree equality, note equality, pitch variation, harmonic similarity and voice
separation. A retrieval model that meets the first three requirements is usually
regarded as transposition invariant.

2.1.1 Octave Equivalence


When two pieces differ only in the octave they are written in, they should be
considered the same one in terms of melodic similarity. Such a case is shown in
Fig. 1, with simple versions of the main riff in Layla, by Derek and the Dominos.
It has been pointed out that faculty or music students may want to retrieve pieces
within some certain pitch range such as C5 up to F#3, or every work above A5
[13]. However, this type of information need should be easily handled with
metadata or a simple traverse through the sequence. We argue that users without such
a strong musical background will be interested in the recognition of a certain pitch
contour, and such cases are much more troublesome because some measure of
melodic similarity has to be calculated. This is the case of query by humming
applications.

Fig. 1. Octave equivalence

2.1.2 Degree Equality


The score at the top of Fig. 1 shows a melody in the F major tonality, as well as the
corresponding pitch and tonality-degree for each note. Below, Fig. 2 shows exactly
the same melody shifted 7 semitones downwards to the Bb major tonality.

Fig. 2. Degree equality

The tonality-degrees used in both cases are the same, but the resultant notes are
not. Nonetheless, one would consider the second melody a version of the first one,
because they are the same in terms of pitch contour. Therefore, they should be
considered the same one by a retrieval system, which should also consider possible
modulations where the key changes somewhere throughout the song.

2.1.3 Note Equality


We could also consider the case where exactly the same melodies, with exactly the
same notes, are written in different tonalities and, therefore, each note corresponds to
a different tonality-degree in each case. Fig. 3 shows such a case, with the same
melody as in Fig. 1, but in the C major tonality.
Although the degrees do not correspond one to each other, the actual notes do, so
both pieces should be considered the same one in terms of melodic similarity.

Fig. 3. Note equality

2.1.4 Pitch Variation


Sometimes, a melody is altered by changing only the pitch of a few particular notes.
For instance, the first melody in Fig. 1 might be changed by shifting the 12th note
from D7 to A6 (which actually happens in the original song). Such a change should
not make a retrieval system disregard that result, but simply rank it lower, after the
exactly-equal ones. Thus, the retrieval process should not consider only exact
matching, where the query is part of a piece in the repository (or the other way
around). Approximate matching, where documents can be considered similar to a
query to some degree, should be the way to go. This is of particular interest for
scenarios like query by humming, where slight variations in pitch are expected in the
melody hummed by the user.

2.1.5 Harmonic Similaritty


Another desired feature would be to match harmonic pieces, both with harmonic and
melodic counterparts. For instance, in a triad chord (made up of the root note and its
major third and perfect fifth intervals), one might recognize only two notes (typically
the root and the perfect fifth). However, someone else might recognize the root and the
major third, or just the root, or even consider them as part of a 4-note chord such as a
major seventh chord (which adds a major seventh interval). Fig. 4 shows the same
piece as at the top of Fig. 1, but with some intervals added to make the song more
harmonic. These two pieces have basically the same pitch progression, but with some
ornamentation, and they should be regarded as very similar by a retrieval system.

Fig. 4. Harmonic similarity

Thus, a system should be able to compare harmony wholly and partially,
considering again the Pitch Variation problem as a basis to establish differences
between songs.

2.1.6 Voice Separation


Fig. 5 below depicts a piano piece with 3 voices, which work together as a whole, but
could also be treated individually.

Fig. 5. Voice separation

Indeed, if this piece were played with a flute, only one voice could be performed,
even if some streaming effect were produced by changing tempo and timbre for two
voices to be perceived by a listener [16]. Therefore, a query containing only one voice
should match with this piece in case that voice is similar enough to any of the three
marked in the figure.

2.2 Horizontal Requirements

Horizontal requirements regard the time dimension of music: time signature
equivalence, tempo equivalence, duration equality and duration variation. A retrieval
model that meets the second and third requirements is usually regarded as time scale
invariant.

2.2.1 Time Signature Equivalence

The top of Fig. 6 depicts a simplified version of the beginning of op. 81 no. 10 by S.
Heller, with its original 2/4 time signature. If a 4/4 time signature were used, like in
the bottom of Fig. 6, the piece would be split into bars of duration 4 crotchets each.

Fig. 6. Time signature equivalence

The only difference between these two pieces is actually how intense some notes
should be played. However, they are in essence the same piece, and no regular listener
would tell the difference. Therefore, we believe the time signature should not be
considered when comparing musical performances in terms of melodic similarity.

2.2.2 Tempo Equivalence

For most people, the piece at the top of Fig. 6, with a tempo of 112 crotchets per
minute, would sound like the one in Fig. 7, where notes have twice the length but the
whole score is played twice as fast, at 224 crotchets per minute. These two changes
result in exactly the same actual time.

Fig. 7. Tempo equivalence

On the other hand, it might also be considered a tempo of 56 crotchets per minute
and notes with half the duration. Moreover, the tempo can change somewhere in the
middle of the melody, and therefore change the actual time of each note afterwards.
Therefore, actual note lengths cannot be considered as the only horizontal measure,
because these three pieces would sound the same to any listener.

2.2.3 Duration Equality


If the melody at the top of Fig. 6 were played slower or quicker by means of a tempo
mpo
variation, but maintaining the rhythm, an example of the result would be like the
score in Fig. 8.

Fig. 8. Duration equality

Even though the melodiic perception does actually change, the rhythm does nnot,
and neither does the pitch contour. Therefore, they should be considered as virtuaally
the same, maybe with somee degree of dissimilarity based on the tempo variation.

2.2.4 Duration Variation

As with the Pitch Variation problem, sometimes a melody is altered by changing only
the rhythm of a few notes. For instance, the melody in Fig. 9 maintains the same pitch
contour as in Fig. 6, but changes the duration of some notes.

Fig. 9. Duration variation

Variations like these are common and they should be considered as well, just
like the Pitch Variation problem, allowing approximate matches instead of just exact
ones.

3 General Solutions to the Requirements


Most of these problems have already been addressed in the literature. Next, we
describe and evaluate the most widely used and accepted solutions.

3.1 Vertical Requirements

The immediate solution to the Octave Equivalence problem is to consider octave


numbers with their relative variation within the piece. Surely, a progression from G5
to C6 is not the same as a progression from G5 to C5. For the Degree Equality
problem it seems to be clear that tonality degrees must be used, rather than actual
pitch values, in order to compare two melodies. However, the Note Equality problem
suggests the opposite.
The accepted solution for these three vertical problems seems to be the use of
relative pitch differences as the units for the comparison, instead of the actual pitch or
degree values. Some approaches consider pitch intervals between two successive
notes [11][8][15], between each note and the tonic (assuming the key is known and
failing to meet the Note Equality problem) [17], or a mixture of both [11]. Others
compute similarities without pitch intervals, but allowing vertical translations in the
time dimension [1][19][30]. The Voice Separation problem is usually assumed to be
solved in a previous stage, as the input to these systems tends to be a single melodic
sequence, although there are approaches to solve this problem [25][14].

3.2 Horizontal Requirements

Although the time signature of a performance is useful for other purposes such as
pattern search or score alignment, it seems to us that it should not be considered at all
when comparing two pieces melodically.
According to the Tempo Equivalence problem, actual time should be considered
rather than score time, since it would be probably easier for a regular user to provide
actual rhythm information. On the other hand, the Duration Equality problem requires
the score time to be used instead. Thus, it seems that both measures have to be taken
into account. The actual time is valuable for most users without a musical
background, while the score time might be more valuable for people who do have it.
However, when facing the Duration Variation problem it seems necessary to use
some sort of timeless model. The solution could be to compare both actual and score
time [11], or to use relative differences between notes, in this case with the ratio
between two notes’ durations [8]. Other approaches use a rhythmical framework to
represent note durations as multiples of a base score duration [2][19][23], which does
not meet the Tempo Equivalence problem and hence is not time scale invariant.

4 A Model Based on Interpolation


We developed a new geometric model that represents musical pieces with curves in
the pitch-time plane, extending the model with orthogonal polynomial chains
[30][1][15]. Notes are represented as points in the pitch-time plane, with positions

relative to their pitch and duration differences. Then, we define the curve C(t) as the
interpolating curve passing through each point (see Fig. 10). Should the song have
multiple voices, each one would be placed in a different pitch-time plane, sharing the
same time dimension, but with a different curve Ci(t) (where the subscript i indicates
the voice number). Note that we thus assume the voices are already separated.
With this representation, the similarity between two songs could be thought of as
the similarity in shape between the two curves they define. Every vertical requirement
identified in section 2.1 would be met with this representation: a song with an octave
shift would keep the same shape; if the tonality changed, the shape of the curve would
not be affected either; and if the notes remained the same after a tonality change, so
would the curve. The Pitch Variation problem can be addressed analytically by
measuring the curvature difference, and different voices can be compared individually
in the same way because they are in different planes.

Fig. 10. Melody represented as a curve in a pitch-time plane

The same thing happens with the horizontal requirements: the Tempo Equivalence and
Duration Equality problems can be solved analytically, because they imply just a
linear transformation in the time dimension. For example, if the melody at the top of
Fig. 6 is defined with curve C(t) and the one in Fig. 7 is denoted with curve D(t), it
can be easily proved that C(2t)=D(t). Moreover, the Duration Variation problem
could be addressed analytically like the Pitch Variation problem, and the Time
Signature Equivalence problem is not an issue because the shape of the curve is
independent of the time signature.

4.1 Measuring Dissimilarity with the Change in Shape

Having musical pieces represented with curves, each one of them could be defined
with a polynomial of the form C(t) = a_n·t^n + a_{n-1}·t^{n-1} + … + a_1·t + a_0. The first derivative of
this polynomial measures how much the shape of the curve is changing at a particular
point in time (i.e. how the song changes). To measure the change of one curve with
respect to another, the area between the first derivatives could be used.
Note that a shift in pitch would mean just a shift in the a_0 term. As it turns out,
when calculating the first derivative of the curves this term is canceled, which is why
the vertical requirements are met: shifts in pitch are not reflected in the shape of the
curve, so they are not reflected in the first derivative either. Therefore, this
representation is transposition invariant.
The song is actually defined by the first derivative of its interpolating curve, C'(t).
The dissimilarity between two songs, say C(t) and D(t), would be defined as the area
between their first derivatives (measured with the integral over the absolute value of
their difference):

  diff(C, D) = ∫ |C'(t) - D'(t)| dt        (1)

The representation with orthogonal polynomial chains also led to the measurement
of dissimilarity as the area between the curves [30][1]. However, such a representation
is not directly transposition invariant unless it uses pitch intervals instead of absolute
pitch values, and a more complex algorithm is needed to overcome this problem [15].
As orthogonal chains are not differentiable, this would be the indirect equivalent to
calculating the first derivative as we do.
This dissimilarity measurement based on the area between curves turns out to be a
metric function, because it has the following properties:
• Non-negativity, diff(C, D) ≥ 0: because the absolute value is never negative.
• Identity of indiscernibles, diff(C, D) = 0 ⇔ C = D: because of the absolute value, the
only way to have no difference is with the same exact curve1.
• Symmetry, diff(C, D) = diff(D, C): again, because the integral is over the
absolute value of the difference.
• Triangle inequality, diff(C, E) ≤ diff(C, D) + diff(D, E):

∫ |C'(t) - E'(t)| dt ≤ ∫ |C'(t) - D'(t)| dt + ∫ |D'(t) - E'(t)| dt

holds because ∫ |C'(t) - D'(t)| dt + ∫ |D'(t) - E'(t)| dt = ∫ (|C'(t) - D'(t)| + |D'(t) - E'(t)|) dt

and ∫ (|C'(t) - D'(t)| + |D'(t) - E'(t)|) dt ≥ ∫ |C'(t) - E'(t)| dt, by the triangle inequality of the absolute value.

Therefore, many indexing and retrieval techniques, like vantage objects [4], could be
exploited when using this metric.
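As an illustration of this metric (our own sketch, not the authors' implementation), the following code builds interpolating splines through two pitch-time point sets and approximates the area between their first derivatives numerically; the helper names, the cubic splines and the sampling grid are assumptions.

import numpy as np
from scipy.interpolate import make_interp_spline

def curve(onsets, pitches):
    # Interpolating cubic spline through the (onset, pitch) points of a melody.
    return make_interp_spline(onsets, pitches, k=3)

def diff(C, D, t0, t1, n=2000):
    # Approximate the integral of |C'(t) - D'(t)| over [t0, t1] by sampling.
    t = np.linspace(t0, t1, n)
    area = np.abs(C.derivative()(t) - D.derivative()(t))
    return area.sum() * (t1 - t0) / n

onsets = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
melody = np.array([60.0, 64.0, 67.0, 64.0, 60.0])
C = curve(onsets, melody)
D = curve(onsets, melody + 12.0)     # same melody one octave up
print(diff(C, D, 0.0, 2.0))          # ~0: a transposition does not change the shape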

4.2 Interpolation with Splines

The next issue to address is the interpolation method to use. The standard Lagrange
interpolation method, though simple, is known to suffer from Runge's Phenomenon [3].
As the number of points increases, the interpolating curve wiggles a lot, especially at
the beginning and the end of the curve. As such, one curve could be very different
from another one having just one more point at the end: the shape would be different,
and so the dissimilarity metric would report a difference even when the two curves are
practically identical. Moreover, a very small difference in one of the points could
translate into an extreme variation in the overall curve, which would make it virtually
impossible to handle the Pitch and Duration Variation problems properly (see top of
Fig. 11).

1
Actually, this means that the first derivatives are the same, the actual curves could still be
shifted. Nonetheless, this is the behavior we want.

Fig. 11. Runge’s Phenomenon

A way around Runge's Phenomenon is the use of splines (see bottom of Fig. 11).
Besides, splines are also easy to calculate and they are defined as piece-wise
functions, which comes in handy when addressing the horizontal requirements. We
saw above that the horizontal problems could be solved, as they implied just a linear
transformation of the form D(t) ⇒ D(kt) in one of the curves. However, the
calculation of the term k is anything but straightforward, and the transformation
would apply to the whole curve, complicating the measurement of differences for the
Duration Variation problem. The solution would be to split the curve into spans, and
define it as

  C_i(t) = { c_{i,1}(t)          for t_{i,1} ≤ t ≤ t_{i,kn}
             c_{i,2}(t)          for t_{i,2} ≤ t ≤ t_{i,kn+1}
             ...
             c_{i,m_i-kn+1}(t)   for t_{i,m_i-kn+1} ≤ t ≤ t_{i,m_i} }        (2)

where t_{i,j} denotes the onset time of the j-th note in the i-th voice, m_i is the length of the
i-th voice, and kn is the span length. With this representation, linear transformations
would be applied only to a single span without affecting the whole curve. Moreover,
the duration of the spans could be normalized from 0 to 1, making it easy to calculate
the term k and comply with the time scale invariance requirements.
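A minimal sketch of this splitting step (ours, with assumed function and variable names): overlapping spans of kn consecutive notes, each with its onsets rescaled to the normalized interval [0, 1].

def spans(onsets, pitches, kn=4):
    # Overlapping spans of kn notes, as in Eq. (2); each span's time axis is
    # normalized to [0, 1] so that a tempo change only affects that span linearly.
    result = []
    for j in range(len(onsets) - kn + 1):
        t, p = onsets[j:j + kn], pitches[j:j + kn]
        span_length = float(t[-1] - t[0])
        norm_t = [(x - t[0]) / span_length for x in t]
        result.append((norm_t, list(p)))
    return result

print(spans([0, 240, 720, 960, 1680], [74, 81, 72, 76, 74], kn=4))
# two spans: notes 1-4 and notes 2-5, each with onsets scaled to [0, 1]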
Most spline interpolation methods define the curve in parametric form (i.e. with
one function per dimension). In this case, it results in one function for the pitch and
one function for the time. This means that the two musical dimensions could be
compared separately, giving more weight to one or the other. Therefore, the
dissimilarity between two spans c(t) and d(t) would be the sum of the pitch and time
dissimilarities as measured by (1):

  diff(c, d) = kp·diffp(c, d) + kt·difft(c, d)        (3)

where diffp and difft are functions as in (1) that consider only the pitch and time
dimensions, respectively, and kp and kt are fine-tuning constants. Different works
suggest that pitch is much more important than time for comparing melodic similarity,
so more weight should be given to kp [19][5][8][23][11].

5 Implementation

Geometric representations of music pieces are very intuitive, but they are not
necessarily easy to implement. We could follow the approach of moving one curve
towards the other looking for the minimum area between them [1][15]. However, this
approach is very sensitive to small differences in the middle of a song, such as
repeated notes: if a single note were added or removed from a melody, it would be
impossible to fully match the original melody from that note to the end. Instead, we
follow a dynamic programming approach to find an alignment between the two
melodies [19].
Various approaches for melodic similarity have applied editing distance algorithms
upon textual representations of musical sequences that assign one character to each
interval or each n-gram [8]. This dissimilarity measure has been improved in recent
years, and sequence alignment algorithms have proved to perform better than simple
editing distance algorithms [11][12]. Next, we describe the representation and
alignment method we use.

5.1 Melody Representation

To practically apply our model, we followed a basic n-gram approach, where each n-
gram represents one span of the spline. The pitch of each note was represented as the
relative difference to the pitch of the first note in the n-gram, and the duration was
represented as the ratio to the duration of the whole n-gram. For example, an n-gram
of length 4 with absolute pitches 〈74, 81, 72, 76〉 and absolute durations 〈240, 480,
240, 720〉, would be modeled as 〈81-74, 72-74, 76-74〉 = 〈7, -2, 2〉 in terms of pitch
and 〈240, 480, 240, 720〉⁄1680 = 〈0.1429, 0.2857, 0.1429, 0.4286〉 in terms of
duration. Note that the first note is omitted in the pitch representation as it is always 0.
This representation is transposition invariant because a melody shifted in the pitch
dimension maintains the same relative pitch intervals. It is also time scale invariant
because the durations are expressed as their relative duration within the span, and so
they remain the same in the face of tempo and actual or score duration changes. This
is of particular interest for query by humming applications and unquantized pieces, as
small variations in duration would have negligible effects on the ratios.
We used Uniform B-Splines as interpolation method [3]. This results in a
parametric polynomial function for each n-gram. In particular, an n-gram of length kn
results in a polynomial of degree kn-1 for the pitch dimension and a polynomial of
degree kn-1 for the time dimension. Because the actual representation uses the first
derivatives, each polynomial is actually of degree kn-2.
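The worked example above can be reproduced with a few lines (our own sketch; the function name is an assumption):

def ngram_representation(pitches, durations):
    # Pitch: intervals relative to the first note of the n-gram (the first note
    # is omitted, as it is always 0). Duration: ratio to the whole n-gram length.
    rel_pitch = [p - pitches[0] for p in pitches[1:]]
    total = float(sum(durations))
    rel_duration = [d / total for d in durations]
    return rel_pitch, rel_duration

print(ngram_representation([74, 81, 72, 76], [240, 480, 240, 720]))
# ([7, -2, 2], [0.1428..., 0.2857..., 0.1428..., 0.4285...])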

5.2 Melody Alignment

We used the Smith-Waterman local alignment algorithm [20], with the two sequences
of overlapping spans as input, defined as in (2). Therefore, the input symbols to the
alignment algorithm are actually the parametric pitch and time functions of a span,

based on the above representation of n-grams. The edit operations we define for the
Smith-Waterman algorithm are as follows:
• Insertion: s(-, c). Adding a span c is penalized with the score –diff(c, ɸ(c)).
• Deletion: s(c, -). Deleting a span c is penalized with the score –diff(c, ɸ(c)).
• Substitution: s(c, d). Substituting a span c with d is penalized with –diff(c, d).
• Match: s(c, c). Matching a span c is rewarded with the score 2(kpμp+ktμt).
where ɸ(•) returns the null n-gram of • (i.e. an n-gram equal to • but with all pitch
intervals set to 0), and μp and μt are the mean differences calculated by diffp and difft
respectively over a random sample of 100,000 pairs of n-grams sampled from the set
of incipits in the Train05 collection.
We also normalized the dissimilarity scores returned by difft. From the results in
Table 1 it can be seen that pitch dissimilarity scores are between 5 and 7 times larger
than time dissimilarity scores. Therefore, the choice of kp and kt does not intuitively
reflect the actual weight given to the pitch and time dimensions. For instance, the
selection of kt=0.25, chosen in studies like [11], would result in an actual weight
between 0.05 and 0.0357. To avoid this effect, we normalized every time dissimilarity
score multiplying it by a factor λ = μp / μt. As such, the score of the match operation
is actually defined as s(c, c) = 2μp(kp+kt), and the dissimilarity function defined in (3)
is actually calculated as diff(c, d) = kp diffp(c, d) + λktdifft(c, d).
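A compact sketch of the alignment step (ours, not the authors' code): span_diff stands for the normalized, weighted dissimilarity just described, null for the null n-gram ɸ(·), and match_score for the match reward defined above; all three are assumed helpers, and the exact-equality match test is a simplification.

def smith_waterman(A, B, span_diff, null, match_score):
    # A and B are sequences of overlapping spans (n-gram representations).
    n, m = len(A), len(B)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, d = A[i - 1], B[j - 1]
            sub = match_score if c == d else -span_diff(c, d)
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sub,               # match / substitution
                          H[i - 1][j] - span_diff(c, null(c)), # deletion of span c
                          H[i][j - 1] - span_diff(d, null(d))) # insertion of span d
            best = max(best, H[i][j])
    return best  # local alignment score: the higher, the more similar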

6 Experimental Results2
We evaluated the model proposed with the Train05 and Eval05 test collections used
in the MIREX 2005 Symbolic Melodic Similarity Task [21][10], measuring the mean
Average Dynamic Recall score across queries [22]. Both collections consist of about
580 incipits and 11 queries each, with their corresponding ground truths. Each ground
truth is a list of all incipits similar to each query, according to a panel of experts, and
with groups of incipits considered equally similar to the query.
However, we have recently shown that these lists have inconsistencies whereby
incipits judged as equally similar by the experts are not in the same similarity group
and vice versa [28]. All these inconsistencies result in a very permissive evaluation
where a system could return incipits not similar to the query and still be rewarded for
it. Thus, results reported with these lists are actually overestimated, by as much as
12% in the case of the MIREX 2005 evaluation. We have proposed alternatives to
arrange the similarity groups for each query, proving that the new arrangements are
significantly more consistent than the original one, leading to a more robust
evaluation. The most consistent ground truth lists were those called Any-1 [28].
Therefore, we will use these Any-1 ground truth lists from this point on to evaluate
our model, as they offer more reliable results. Nonetheless, all results are reported in
an appendix as if using the original ground truths employed in MIREX 2005, called
All-2, for the sake of comparison with previous results.

2
All system outputs and ground truth lists used in this paper can be downloaded from
http://julian-urbano.info/publications/

To determine the value of the kn and kt parameters, we used a full factorial


experimental design. We tested our model with n-gram lengths in the range kn∈{3, 4,
5, 6, 7}, which result in Uniform B-Spline polynomials of degrees 1 to 5. The value of
kp was kept to 1, and kt was converted to nominal with levels kt∈{0, 0.1, 0.2, …, 1}.

6.1 Normalization Factor λ

First, we calculated the mean dissimilarity scores μp and μt for each n-gram length kn,
according to diffp and difft over a random sample of 100,000 pairs of n-grams. Table 1
lists the results. As mentioned, the pitch dissimilarity scores are between 5 and 7
times larger than the time dissimilarity scores, suggesting the use of the normalization
factor λ defined above.

Table 1. Mean and standard deviation of the diffp and difft functions applied upon a random
sample of 100,000 pairs of n-grams of different sizes

kn μp σp μt σt λ = μp / μt
3 2.8082 1.6406 0.5653 0.6074 4.9676
4 2.5019 1.6873 0.494 0.5417 5.0646
5 2.2901 1.4568 0.4325 0.458 5.2950
6 2.1347 1.4278 0.3799 0.3897 5.6191
7 2.0223 1.3303 0.2863 0.2908 7.0636

There also appears to be a negative correlation between the n-gram length and the
dissimilarity scores. This is caused by the degree of the polynomials defining the
splines: high-degree polynomials fit the points more smoothly than low-degree ones.
Polynomials of low degree tend to wiggle more, and so their derivatives are more
pronounced and lead to larger areas between curves.

6.2 Evaluation with the Train05 Test Collection, Any-1 Ground Truth Lists

The experimental design results in 55 trials for the 5 different levels of kn and the 11
different levels of kt. All these trials were performed with the Train05 test collection,
ground truths aggregated with the Any-1 function [28]. Table 2 shows the results.
In general, large n-grams tend to perform worse. This could probably be explained
by the fact that large n-grams define the splines with smoother functions, and the
differences in shape may be too small to discriminate musically perceptual
differences. However, kn=3 seems to be the exception (see Fig. 12). This is probably
caused by the extremely low degree of the derivative polynomials. N-grams of length
kn=3 result in splines defined with polynomials of degree 2, which are then
differentiated and result in polynomials of degree 1. That is, they are just straight
lines, and so a small difference in shape can turn into a relatively large dissimilarity
score when measuring the area.
Overall, kn=4 and kn=5 seem to perform the best, although kn=4 is more stable
across levels of kt. In fact, kn=4 and kt=0.6 obtain the best score, 0.7215. This result
agrees with other studies where n-grams of length 4 and 5 were also found to perform better [8].

Table 2. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.

kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.6961 0.7067 0.7107 0.7106 0.7102 0.7109 0.7148 0.711 0.7089 0.7045 0.6962
4 0.7046 0.7126 0.7153 0.7147 0.7133 0.72 0.7215 0.7202 0.7128 0.7136 0.709
5 0.7093 0.7125 0.7191 0.72 0.7173 0.7108 0.704 0.6978 0.6963 0.6973 0.6866
6 0.714 0.7132 0.7115 0.7088 0.7008 0.693 0.6915 0.6874 0.682 0.6765 0.6763
7 0.6823 0.6867 0.6806 0.6747 0.6538 0.6544 0.6529 0.6517 0.6484 0.6465 0.6432


Fig. 12. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1

Moreover, this combination of parameters obtains a mean ADR score of
0.8039 when evaluated with the original All-2 ground truths (see Appendix). This is
the best score ever reported for this collection.

6.3 Evaluation with the Eval05 Test Collection, Any-1 Ground Truth Lists

In a fair evaluation scenario, we would use the previous experiment to train our
system and choose the values of kn and kt that seem to perform the best (in particular,
kn=4 and kt=0.6). Then, the system would be run and evaluated with a different
collection to assess the external validity of the results and try to avoid overfitting to
the training collection. For the sake of completeness, here we show the results for all
55 combinations of the parameters with the Eval05 test collection used in MIREX
2005, again aggregated with the Any-1 function [28]. Table 3 shows the results.
Unlike the previous experiment with the Train05 test collection, in this case the
variation across levels of kt is smaller (the mean standard deviation is twice as much
in Train05), indicating that the use of the time dimension does not provide better
results overall (see Fig. 13). This is probably caused by the particular queries in each
collection. Seven of the eleven queries in Train05 start with long rests, while this happens for only three of the eleven queries in Eval05.

Table 3. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.

kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.6522 0.6601 0.6646 0.6612 0.664 0.6539 0.6566 0.6576 0.6591 0.6606 0.662
4 0.653 0.653 0.6567 0.6616 0.6629 0.6633 0.6617 0.6569 0.65 0.663 0.6531
5 0.6413 0.6367 0.6327 0.6303 0.6284 0.6328 0.6478 0.6461 0.6419 0.6414 0.6478
6 0.6269 0.6251 0.6225 0.6168 0.6216 0.6284 0.6255 0.6192 0.6173 0.6144 0.6243
7 0.5958 0.623 0.6189 0.6163 0.6162 0.6192 0.6215 0.6174 0.6148 0.6112 0.6106

Fig. 13. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1

In our model, rests are ignored, and so the effect of the time dimension is larger when
the queries themselves contain rests, as their duration is added to the next note's.
Likewise, large n-grams tend to perform worse. In this case though, n-grams of
length kn=3 and kn=4 perform the best. The most effective combination is kn=3 and
kt=0.2, with a mean ADR score of 0.6646. However, kn=4 and kt=0.5 is very close,
with a mean ADR score of 0.6633. Therefore, based on the results of the previous
experiment and the results in this one, we believe that kn=4 and kt∈[0.5, 0.6] are the
best parameters overall.
It is also important to note that none of the 55 combinations run resulted in a mean
ADR score below 0.594, which was the highest score achieved in the actual
MIREX 2005 evaluation with the Any-1 ground truths [28]. Therefore, our systems
would have ranked first had they participated.

7 Conclusions and Future Work


We have proposed a new transposition and time scale invariant model to represent
musical pieces and compute their melodic similarity. Songs are considered as curves
in the pitch-time plane, allowing us to compute their melodic similarity in terms of the
shape similarity of the curves they define. We have implemented it with a local

alignment algorithm over sequences of spline spans, each of which is represented by


one polynomial for the pitch dimension and another polynomial for the time
dimension. This parametric representation of melodies permits the application of a
weight scheme between pitch and time dissimilarities.
The MIREX 2005 test collections have been used to evaluate the model for several
span lengths and weight schemes. Overall, spans 4 notes long seem to perform the
best, with longer spans performing gradually worse. The optimal weight scheme we
found gives about twice as much importance to the pitch dimension as to the time
dimension. However, time dissimilarities need to be normalized, as they are shown to
be about five times smaller than pitch dissimilarities.
This model obtains the best mean ADR score ever reported for the MIREX 2005
training collection, and every span length and weight scheme evaluated would have
ranked first in the actual evaluation of that edition. However, the use of the time
dimension did not improve the results significantly for the evaluation collection. On
the other hand, three systems derived from this model were submitted to the MIREX
2010 edition: PitchDeriv, ParamDeriv and Shape [27]. These systems obtained the
best results, and they ranked the top three in this edition. Again, the use of the time
dimension was not shown to improve the results.
A rough analysis of the MIREX 2005 and 2010 collections shows that the queries
used in the 2005 training collection have significantly more rests than in the
evaluation collection, and they are virtually absent in the 2010 collection. Because our
model ignores rests, simply adding their durations to the next note’s duration, the use
of the time dimension is shown to improve the results only in the 2005 training
collection. This evidences the need for larger and more heterogeneous test collections
for the Symbolic Melodic Similarity task, for researchers to train and tune their
systems properly and reduce overfitting to particular collections [9][29].
The results indicate that this line of work is certainly promising. Further research
should address the interpolation method to use, different ways of splitting the curve
into spans, extend the model to consider rests and polyphonic material, and evaluate
on more heterogeneous collections.

References
1. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Nuñez, Y., Rappaport, D.,
Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic Similarity.
Computer Music Journal 30(3), 67–76 (2006)
2. Bainbridge, D., Dewsnip, M., Witten, I.H.: Searching Digital Music Libraries. Information
Processing and Management 41(1), 41–56 (2005)
3. de Boor, C.: A Practical guide to Splines. Springer, Heidelberg (2001)
4. Bozkaya, T., Ozsoyoglu, M.: Indexing Large Metric Spaces for Similarity Search Queries.
ACM Transactions on Database Systems 24(3), 361–404 (1999)
5. Byrd, D., Crawford, T.: Problems of Music Information Retrieval in the Real World.
Information Processing and Management 38(2), 249–272 (2002)
6. Casey, M.A., Veltkamp, R.C., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-
Based Music Information Retrieval: Current Directions and Future Challenges.
Proceedings of the IEEE 96(4), 668–695 (2008)

7. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A Fast,
Randomised, Maximal Subset Matching Algorithm for Document-Level Music Retrieval.
In: International Conference on Music Information Retrieval, pp. 150–155 (2006)
8. Doraisamy, S., Rüger, S.: Robust Polyphonic Music Retrieval with N-grams. Journal of
Intelligent Systems 21(1), 53–70 (2003)
9. Downie, J.S.: The Scientific Evaluation of Music Information Retrieval Systems:
Foundations and Future. Computer Music Journal 28(2), 12–23 (2004)
10. Downie, J.S., West, K., Ehmann, A.F., Vincent, E.: The 2005 Music Information Retrieval
Evaluation Exchange (MIREX 2005): Preliminary Overview. In: International Conference
on Music Information Retrieval, pp. 320–323 (2005)
11. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for Evaluating
Similarity Between Monophonic Musical Sequences. Journal of New Music
Research 36(4), 267–279 (2007)
12. Hanna, P., Robine, M., Ferraro, P., Allali, J.: Improvements of Alignment Algorithms for
Polyphonic Music Retrieval. In: International Symposium on Computer Music Modeling
and Retrieval, pp. 244–251 (2008)
13. Isaacson, E.U.: Music IR for Music Theory. In: The MIR/MDL Evaluation Project White
paper Collection, 2nd edn., pp. 23–26 (2002)
14. Kilian, J., Hoos, H.H.: Voice Separation — A Local Optimisation Approach. In:
International Symposium on Music Information Retrieval, pp. 39–46 (2002)
15. Lin, H.-J., Wu, H.-H.: Efficient Geometric Measure of Music Similarity. Information
Processing Letters 109(2), 116–120 (2008)
16. McAdams, S., Bregman, A.S.: Hearing Musical Streams. In: Roads, C., Strawn, J. (eds.)
Foundations of Computer Music, pp. 658–598. The MIT Press, Cambridge (1985)
17. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the
Humanities 24(3), 161–175 (1990)
18. Selfridge-Field, E.: Conceptual and Representational Issues in Melodic Comparison.
Computing in Musicology 11, 3–64 (1998)
19. Smith, L.A., McNab, R.J., Witten, I.H.: Sequence-Based Melodic Comparison: A
Dynamic Programming Approach. Computing in Musicology 11, 101–117 (1998)
20. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal
of Molecular Biology 147(1), 195–197 (1981)
21. Typke, R., den Hoed, M., de Nooijer, J., Wiering, F., Veltkamp, R.C.: A Ground Truth for
Half a Million Musical Incipits. Journal of Digital Information Management 3(1), 34–39
(2005)
22. Typke, R., Veltkamp, R.C., Wiering, F.: A Measure for Evaluating Retrieval Techniques
based on Partially Ordered Ground Truth Lists. In: IEEE International Conference on
Multimedia and Expo., pp. 1793–1796 (2006)
23. Typke, R., Veltkamp, R.C., Wiering, F.: Searching Notated Polyphonic Music Using
Transportation Distances. In: ACM International Conference on Multimedia, pp. 128–135
(2004)
24. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey of Music Information Retrieval
Systems. In: International Conference on Music Information Retrieval, pp. 153–160 (2005)
25. Uitdenbogerd, A., Zobel, J.: Melodic Matching Techniques for Large Music Databases. In:
ACM International Conference on Multimedia, pp. 57–66 (1999)
26. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric Algorithms for Transposition
Invariant Content-Based Music Retrieval. In: International Conference on Music
Information Retrieval, pp. 193–199 (2003)

27. Urbano, J., Lloréns, J., Morato, J., Sánchez-Cuadrado, S.: MIREX 2010 Symbolic Melodic
Similarity: Local Alignment with Geometric Representations. Music Information Retrieval
Evaluation eXchange (2010)
28. Urbano, J., Marrero, M., Martín, D., Lloréns, J.: Improving the Generation of Ground
Truths based on Partially Ordered Lists. In: International Society for Music Information
Retrieval Conference, pp. 285–290 (2010)
29. Urbano, J., Morato, J., Marrero, M., Martín, D.: Crowdsourcing Preference Judgments for
Evaluation of Music Similarity Tasks. In: ACM SIGIR Workshop on Crowdsourcing for
Search Evaluation, pp. 9–16 (2010)
30. Ó Maidín, D.: A Geometrical Algorithm for Melodic Difference. Computing in
Musicology 11, 65–72 (1998)

Appendix: Results with the Original All-2 Ground Truth Lists


Here we list the results of all 55 combinations of kn and kt evaluated with the very
original Train05 (see Table 4) and Eval05 (see Table 5) test collections, ground truth
lists aggregated with the All-2 function [21][28]. These numbers permit a direct
comparison with previous studies that used these ground truth lists as well.
The qualitative results remain the same: kn=4 seems to perform the best, and the
effect of the time dimension is much larger in the Train05 collection. Remarkably, in
Eval05 kn=4 outperforms all other n-gram lengths for all but two levels of kt.

Table 4. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.

kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.7743 0.7793 0.788 0.7899 0.7893 0.791 0.7936 0.7864 0.7824 0.777 0.7686
4 0.7836 0.7899 0.7913 0.7955 0.7946 0.8012 0.8039 0.8007 0.791 0.7919 0.7841
5 0.7844 0.7867 0.7937 0.7951 0.7944 0.7872 0.7799 0.7736 0.7692 0.7716 0.7605
6 0.7885 0.7842 0.7891 0.7851 0.7784 0.7682 0.7658 0.762 0.7572 0.7439 0.7388
7 0.7598 0.7573 0.7466 0.7409 0.7186 0.7205 0.7184 0.7168 0.711 0.7075 0.6997

Table 5. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.

kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.7185 0.714 0.7147 0.7116 0.712 0.7024 0.7056 0.7067 0.708 0.7078 0.7048
4 0.7242 0.7268 0.7291 0.7316 0.7279 0.7282 0.7263 0.7215 0.7002 0.7108 0.7032
5 0.7114 0.7108 0.6988 0.6958 0.6942 0.6986 0.7109 0.7054 0.6959 0.6886 0.6914
6 0.708 0.7025 0.6887 0.6693 0.6701 0.6743 0.6727 0.6652 0.6612 0.6561 0.6636
7 0.6548 0.6832 0.6818 0.6735 0.6614 0.6594 0.6604 0.6552 0.6525 0.6484 0.6499

It can also be observed that the results would again be overestimated by as much as
11% in the case of Train05 and as much as 13% in Eval05, in contrast with the
maximum 12% observed with the systems that participated in the actual MIREX 2005
evaluation.
Content-Based Music Discovery

Dirk Schönfuß

mufin GmbH, August-Bebel-Straße 36, 01219 Dresden, Germany
[email protected]

Abstract. Music recommendation systems have become a valuable aid for


managing large music collections and discovering new music. Our content-
based recommendation system employs signal-based features and semantic
music attributes generated using machine learning algorithms. In addition
to playlist generation and music recommendation, we are exploring new
usability concepts made possible by the analysis results. Functionality such as
the mufin vision sound universe enables the user to discover his own music
collection or even unknown catalogues in a new, more intuitive way.

Keywords: music, visualization, recommendation, cloud, clustering, semantic


attributes, auto-tagging.

1 Introduction
The way music is consumed today has been changed dramatically by its increasing
availability in digital form. Online music shops have replaced traditional stores and
music collections are increasingly kept on electronic storage systems and mobile
devices instead of physical media on a shelf. It has become much faster and more
comfortable to find and acquire music of a known artist. At the same time it has
become more difficult to find one’s way in the enormous range of music that is
offered commercially, find music according to one’s taste or even manage one’s own
collection.
Young people today have a music collection with an average size of 8,159 tracks [1],
and the iTunes music store today offers more than 10 million tracks for sale. Long-tail
sales are low, which is illustrated by the fact that only 1% of the catalog tracks generate
80% of sales [2][3]. A similar effect can also be seen in the usage of private music
collections. According to our own studies, only a few people actively work with
manual playlists because they consider this too time-consuming or they have simply
forgotten which music they actually possess.

1.1 Automatic Music Recommendation

This is where music recommendation technology comes in: The similarity of two
songs can be mathematically calculated based on their musical attributes. Thus, for
each song a ranked list of similar songs from the catalogue can be generated.
Editorial, user-generated or content-based data derived directly from the audio signal
can be used.


Editorial data allows a very thorough description of the musical content but this
manual process is very expensive and time-consuming and will only ever cover a
small percentage of the available music. User-data based recommendations have
become very popular through vendors such as Amazon (“People who bought this item
also bought …”) or Last.FM. However, this approach suffers from a cold-start
problem and its strong focus on popular content.
Signal-based recommenders are not affected by popularity, sales rank or user
activity. They extract low-level features directly from the audio signal. This also
offers additional advantages such as being able to work without a server connection
and being able to process any music file even if it has not been published anywhere.
However, signal-based technology alone misses the socio-cultural aspect which is not
present in the audio signal and it also cannot address current trends or lyrics content.

1.2 Mufin’s Hybrid Approach

Mufin’s technology is signal-based but it combines signal features with semantic


musical attributes and metadata from other data sources thus forming a hybrid
recommendation approach. First of all, the technology analyzes the audio signal and
extracts signal features. mufin then employs state-of-the-art machine learning
technology to extract semantic musical attributes including mood descriptions such as
happy, sad, calm or aggressive but also other descriptive tags such as synthetic,
acoustic, presence of electronic beats, distorted guitars, etc.
This information can for instance be used to narrow down search results, offer
browsing capabilities or contextually describe content. By combining these attributes
with information from other sources such as editorial metadata the user can for
instance search for "aggressive rock songs from the 1970s with a track-length of more
than 8 minutes".
Lists of similar songs can then be generated by combining all available information
including signal features, musical attributes (auto-tags) and metadata from other data
sources using a music ontology. This ensures a high quality and enables the steering
of the recommendation system. The results can be used for instance for playlist
generation or as an aid to an editor who needs an alternative to a piece of music he is
not allowed to use in a certain context.
As the recommendation system features a module based on digital signal-
processing algorithms, it can generate musical similarity for all songs within a music
catalogue. Because it makes use of mathematical analysis of the audio signals, it is
completely deterministic and, if desired, can work independently of any “human
factor" like cultural background, listening habits, etc. Depending on the database
used, it can also work way off the mainstream and thus give the user the opportunity
to discover music he may never have found otherwise.
In contrast to other technologies such as collaborative filtering, mufin's technology
can provide recommendations for any track, even if there are no tags or social data.
Recommendations are not limited by genre boundaries, target groups or biased by
popularity. Instead, it equally covers all songs in a music catalogue, and if genre
boundaries or the influence of popularity is indeed desired, this can be addressed by
leveraging additional data sources.
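As a generic illustration of such a hybrid combination (our own sketch, not mufin's actual system or API; the field names and weights are assumptions), a similarity score could blend a signal-feature distance with agreement on semantic attributes:

import math

def hybrid_similarity(a, b, w_signal=0.7, w_tags=0.3):
    # Signal part: Euclidean distance between feature vectors, mapped to (0, 1].
    signal_sim = 1.0 / (1.0 + math.dist(a["features"], b["features"]))
    # Semantic part: Jaccard overlap of auto-tags such as "calm" or "acoustic".
    union = a["tags"] | b["tags"]
    tag_sim = len(a["tags"] & b["tags"]) / len(union) if union else 0.0
    return w_signal * signal_sim + w_tags * tag_sim

song1 = {"features": [0.20, 0.80, 0.10], "tags": {"calm", "acoustic"}}
song2 = {"features": [0.25, 0.70, 0.20], "tags": {"calm", "electronic beats"}}
print(hybrid_similarity(song1, song2))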

Fig. 1. The mufin music recommender combines audio features inside a song model and
semantic musical attributes using a music ontology. Additionally, visualization coordinates for
the mufin vision sound galaxy are generated during the music analysis process.

Mufin’s complete music analysis process is fully automated. The technology has
already proven its practical application in usage scenarios with more than 9 million
tracks. Mufin's technology is available for different platforms including Linux,
MacOS X, Windows and mobile platforms. Additionally, it can also be used via web
services.

2 Mufin Vision

Common, text-based attributes such as title or artist are not suitable to keep track of a
large music collection, especially if the user is not familiar with every item in the
collection. Songs which belong together from a sound perspective may appear very
far apart when using lists sorted by metadata. Additionally, only a limited number of
songs will fit onto the screen preventing the user from actually getting an overview of
his collection.
mufin vision has been developed with the goal to offer easy access to music
collections. Even if there are thousands of songs, the user can easily find his way
around the collection since he can learn where to find music with a certain
characteristic. By looking at the concentration of dots in an area he can immediately
assess the distribution of the collection and zoom into a section to get a closer look.
The mufin vision 3D sound galaxy displays each song as a dot in a coordinate
system. X, y and z axis as well as size and color of the dots can be assigned to
different musical criteria such as tempo, mood, instrumentation or type of singing
voice; even metadata such as release date or song duration can be used. Using the axis
configuration, the user can explore his collection the way he wants and make relations
between different songs visible. As a result, it becomes much easier to find music
fitting a mood or occasion.
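As a rough illustration of this axis assignment (our own sketch, unrelated to mufin vision's real implementation; the attribute names and ranges are assumptions), each song's analysis results could be mapped to a dot as follows:

def song_to_dot(song, x="tempo", y="mood", z="year"):
    # Map configurable musical criteria to coordinates, size and colour.
    return {
        "x": song[x],
        "y": song[y],
        "z": song[z],
        "size": song["duration"] / 60.0,                      # e.g. minutes of playing time
        "color": "warm" if song.get("energy", 0.0) > 0.5 else "cool",
    }

print(song_to_dot({"tempo": 112, "mood": 0.8, "year": 1985, "duration": 351, "energy": 0.7}))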
Mufin vision premiered in the mufin player PC application but it can also be used
on the web and even on mobile devices. The latest version of the mufin player 1.5
allows the user to control mufin vision using a multi-touch display.

Fig. 2. Both songs are by the same artist. However, “Brothers in arms” is a very calm ballad
with sparse instrumentation while “Sultans of swing” is a rather powerful song with a fuller
sound spectrum. The mufin vision sound galaxy reflects that difference since it works on song
level instead of an artist or genre level.

Fig. 3. The figure displays a playlist in which the entries are connected by lines. One can see
that although the songs may be similar as a whole, their musical attributes vary over the course
of the playlist.

3 Further Work
The mufin player PC application offers a database view of the user’s music collection
including filtering, searching and sorting mechanisms. However, instead of only using
metadata such as artist or title for sorting, the mufin player can also sort any list by
similarity to a selected seed song.

Additionally, the mufin player offers online storage space for a user's music
collection. This protects the user from data loss and allows him to stream his music
online and listen to it from anywhere in the world.
Furthermore, mufin works together with the German National Library in order to
establish a workflow for the protection of our cultural heritage. The main contribution
of mufin is the fully automatic annotation of the music content and the provision of
descriptive tags for the library’s ontology. Based on technology by mufin and its
partners, a semantic multimedia search demonstration was presented at IBC 2009 in
Amsterdam.

References
1. Bahanovich, D., Collopy, D.: Music Experience and Behaviour in Young People.
University of Hertfordshire, UK (2009)
2. Celma, O.: Music Recommendation and Discovery in the Long Tail. PhD-Thesis,
Universitat Pompeu Fabra, Spain (2008)
3. Nielsen Soundscan: State of the Industry (2007), http://www.narm.com/2008Conv/StateoftheIndustry.pdf (July 22, 2009)
Author Index

Abeßer, Jakob 259
Alvaro, Jesús L. 163
Anglade, Amélie 1
Aramaki, Mitsuko 176
Arcos, Josep Lluis 219
Barbancho, Ana M. 116
Barbancho, Isabel 116
Barros, Beatriz 163
Barthet, Mathieu 138
Bunch, Pete 76
Cano, Estefanía 259
de Haas, W. Bas 242
de la Bandera, Cristina 116
Dittmar, Christian 259
Dixon, Simon 1
Faget, Zoe 303
Févotte, Cédric 102
Girin, Laurent 31
Godsill, Simon 76
Grollmisch, Sascha 259
Großmann, Holger 259
Guaus, Enric 219
Hanna, Pierre 242
Hargreaves, Steven 138
Jensen, Kristoffer 51
Karvonen, Mikko 321
Klapuri, Anssi 188
Kronland-Martinet, Richard 176
Kudumakis, Panos 20
Laitinen, Mika 321
Lemström, Kjell 321
Liuni, Marco 60
Lloréns, Juan 338
Lukashevich, Hanna 259
Mansencal, Boris 31
Marchand, Sylvain 31
Marchini, Marco 205
Mauch, Matthias 1
Merer, Adrien 176
Morato, Jorge 338
Mustafa, Hafiz 84
Nürnberger, Andreas 273
Ortiz, Andrés 116
Özaslan, Tan Hakan 219
Ozerov, Alexey 102
Palacios, Eric 219
Purwins, Hendrik 205
Rigaux, Philippe 303
Röbel, Axel 60
Robine, Matthias 242
Rodet, Xavier 60
Romito, Marco 60
Sammartino, Simone 116
Sánchez-Cuadrado, Sonia 338
Sandler, Mark 20, 138
Schönfuß, Dirk 356
Stewart, Rebecca 20
Stober, Sebastian 273
Tardón, Lorenzo J. 116
Urbano, Julián 338
Veltkamp, Remco C. 242
Vikman, Juho 321
Wang, Wenwu 84
Wiering, Frans 242
Ystad, Sølvi 176
