Exploring Music Contents
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Sølvi Ystad, Mitsuko Aramaki,
Richard Kronland-Martinet, Kristoffer Jensen (Eds.)
Exploring
Music Contents
7th International Symposium, CMMR 2010
Málaga, Spain, June 21-24, 2010
Revised Papers
Volume Editors
Sølvi Ystad
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]
Mitsuko Aramaki
CNRS-INCM, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]
Richard Kronland-Martinet
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: [email protected]
Kristoffer Jensen
Aalborg University Esbjerg, Niels Bohr Vej 8, 6700 Esbjerg, Denmark
E-mail: [email protected]
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Computer Music Modeling and Retrieval (CMMR) 2010 was the seventh event
of this international conference series that was initiated in 2003. Since its start,
the conference has been co-organized by the University of Aalborg, Esbjerg, Den-
mark (http://www.aaue.dk) and the Laboratoire de Mécanique et d’Acoustique
in Marseille, France (http://www.lma.cnrs-mrs.fr) and has taken place in France,
Italy and Denmark. The six previous editions of CMMR offered a varied overview
of recent music information retrieval (MIR) and sound modeling activities in ad-
dition to alternative fields related to human interaction, perception and cognition.
This year’s CMMR took place in Málaga, Spain, June 21–24, 2010. The
conference was organized by the Application of Information and Communica-
tions Technologies Group (ATIC) of the University of Málaga (Spain), together
with LMA and INCM (CNRS, France) and AAUE (Denmark). The conference
featured three prominent keynote speakers working in the MIR area, and the
program of CMMR 2010 included in addition paper sessions, panel discussions,
posters and demos.
The proceedings of the previous CMMR conferences were published in the
Lecture Notes in Computer Science series (LNCS 2771, LNCS 3310, LNCS 3902,
LNCS 4969, LNCS 5493 and LNCS 5954), and the present edition follows the
lineage of the previous ones, including a collection of 22 papers within the topics
of CMMR. These articles were specially reviewed and corrected for this proceed-
ings volume.
The current book is divided into five main chapters that reflect the present
challenges within the field of computer music modeling and retrieval. The chap-
ters span topics from music interaction, composition tools and sound source
separation to data mining and music libraries. One chapter is also dedicated
to perceptual and cognitive aspects that are currently the subject of increased
interest in the MIR community. We are confident that CMMR 2010 brought
forward the research in these important areas.
We would like to thank Isabel Barbancho and her team at the Application of
Information and Communications Technologies Group (ATIC) of the University
of Málaga (Spain) for hosting the 7th CMMR conference and for ensuring a
successful organization of both scientific and social matters. We would also like
to thank the Program Committee members for their valuable paper reports and
thank all the participants who made CMMR 2010 a fruitful and convivial event.
Finally, we would like to thank Springer for agreeing to publish the CMMR
2010 proceedings in their LNCS series.
Symposium Chair
Isabel Barbancho University of Málaga, Spain
Symposium Co-chairs
Kristoffer Jensen AAUE, Denmark
Sølvi Ystad CNRS-LMA, France
Program Committee
Paper and Program Chairs
Mitsuko Aramaki CNRS-INCM, France
Richard Kronland-Martinet CNRS-LMA, France
Probabilistic and Logic-Based Modelling of Harmony

S. Dixon, M. Mauch, and A. Anglade

1 Introduction
Two examples are the bag-of-frames approach to music similarity [5], and the pe-
riodicity pattern approach to rhythm analysis [13], which are both independent
of the order of musical notes, whereas temporal order is an essential feature of
melody, rhythm and harmonic progression. Perhaps surprisingly, much progress
has been made in music informatics in recent years¹, despite the naïveté of the
musical models used and the claims that some tasks have reached a “glass
ceiling” [6].
The continuing progress can be explained in terms of a combination of factors:
the high level of redundancy in music, the simplicity of many of the tasks which
are attempted, and the limited scope of the algorithms which are developed. In
this regard we agree with [14], who review the first 10 years of ISMIR confer-
ences and list some challenges which the community “has not fully engaged with
before”. One of these challenges is to “dig deeper into the music itself”, which
would enable researchers to address more musically complex tasks; another is to
“expand ... musical horizons”, that is, broaden the scope of MIR systems.
In this paper we present two approaches to modelling musical harmony, aiming
at capturing the type of musical knowledge and reasoning a musician might use
in performing similar tasks. The first task we address is that of chord transcrip-
tion from audio recordings. We present a system which uses a high-level model
of musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian framework, and
generates the content of a “lead-sheet” containing the sequence of chord sym-
bols, including their bass notes and metrical positions, and the key signature and
any modulations over time. This system achieves state-of-the-art performance,
being rated first in its category in the 2009 and 2010 MIREX evaluations. The
second task to which we direct our attention is the machine learning of logical
descriptions of harmonic sequences in order to characterise particular styles or
genres. For this work we use inductive logic programming to obtain represen-
tations such as decision trees which can be used to classify unseen examples or
provide insight into the characteristics of a data corpus.
Computational models of harmony are important for many application areas
of music informatics, as well as for music psychology and musicology itself. For
example, a harmony model is a necessary component of intelligent music no-
tation software, for determining the correct key signature and pitch spelling of
accidentals where music is obtained from digital keyboards or MIDI files. Like-
wise, processes such as automatic transcription benefit from tracking the
harmonic context at each point in the music [24]. It has been shown that har-
monic modelling improves search and retrieval in music databases, for example
in order to find variations of an example query [36], which is useful for musi-
cological research. Theories of music cognition, if expressed unambiguously, can
be implemented and tested on large data corpora and compared with human
annotations, in order to verify or refine concepts in the theory.
¹ Progress is evident for example in the annual MIREX series of evaluations of
music information retrieval systems (http://www.music-ir.org/mirex/wiki/2010:Main_Page).
The remainder of the paper is structured as follows. The next section provides
an overview of research in harmony modelling. This is followed by a section
describing our probabilistic model of chord transcription. In section 4, we present
our logic-based approach to modelling of harmony, and show how this can be
used to characterise and classify music. The final section is a brief conclusion
and outline of future work.
2 Background
Research into computational analysis of harmony has a history of over four
decades since [44] proposed a grammar-based analysis that required the user
to manually remove any non-harmonic notes (e.g. passing notes, suspensions
and ornaments) before the algorithm processed the remaining chord sequence.
A grammar-based approach was also taken by [40], who developed a set of chord
substitution rules, in the form of a context-free grammar, for generating 12-bar
Blues sequences. [31] addressed the problem of extracting patterns and substitu-
tion rules automatically from jazz standard chord sequences, and discussed how
the notions of expectation and surprise are related to the use of these patterns
and rules.
Closely related to grammar-based approaches are rule-based approaches, which
were used widely in early artificial intelligence systems. [21] used an elimination
process combined with heuristic rules in order to infer the tonality given a fugue
melody from Bach’s Well-Tempered Clavier. [15] presents an expert system con-
sisting of about 350 rules for generating 4-part harmonisations of melodies in the
style of Bach Chorales. The rules cover the chord sequences, including cadences
and modulations, as well as the melodic lines of individual parts, including voice
leading. [28] developed an expert system with a complex set of rules for recognis-
ing consonances and dissonances in order to infer the chord sequence. Maxwell’s
approach was not able to infer harmony from a melodic sequence, as it considered
the harmony at any point in time to be defined by a subset of the simultaneously
sounding notes.
[41] addressed some of the weaknesses of earlier systems with a combined
rhythmic and harmonic analysis system based on preference rules [20]. The
system assigns a numerical score to each possible interpretation based on the
preference rules which the interpretation satisfies, and searches the space of all
solutions using dynamic programming restricted with a beam search. The sys-
tem benefits from the implementation of rules relating harmony and metre, such
as the preference rule which favours non-harmonic notes occurring on weak met-
rical positions. One claimed strength of the approach is the transparency of the
preference rules, but this is offset by the opacity of the system parameters such
as the numeric scores which are assigned to each rule.
[33] proposed a counting scheme for matching performed notes to chord tem-
plates for variable-length segments of music. The system is intentionally simplis-
tic, in order that the framework might easily be extended or modified. The main
contributions of the work are the graph search algorithms, inspired by Temper-
ley’s dynamic programming approach, which determine the segmentation to be
used in the analysis. The proposed graph search algorithm is shown to be much
more efficient than standard algorithms without differing greatly in the quality
of analyses it produces.
As an alternative to the rule-based approaches, which suffer from the cu-
mulative effects of errors, [38] proposed a probabilistic approach to functional
harmonic analysis, using a hidden Markov model. For each time unit (measure
or half-measure), their system outputs the current key and the scale degree of
the current chord. In order to make the computation tractable, a number of
simplifying assumptions were made, such as the symmetry of all musical keys.
Although this reduced the number of parameters by at least two orders of mag-
nitude, the training algorithm was only successful on a subset of the parameters,
and the remaining parameters were set by hand.
An alternative stream of research has been concerned with multidimensional
representations of polyphonic music [10,11,42] based on the Viewpoints approach
of [12]. This representation scheme is for example able to preserve information
about voice leading which is otherwise lost by approaches that treat harmony as
a sequence of chord symbols.
Although most research has focussed on analysing musical works, some work
investigates the properties of entire corpora. [25] compared two corpora of chord
sequences, belonging to jazz standards and popular (Beatles) songs respectively,
and found key- and context-independent patterns of chords which occurred fre-
quently in each corpus. [26] examined the statistics of the chord sequences of sev-
eral thousand songs, and compared the results to those from a standard natural
language corpus in an attempt to find lexical units in harmony that correspond
to words in language. [34,35] investigated whether stochastic language models in-
cluding naive Bayes classifiers and 2-, 3- and 4-grams could be used for automatic
genre classification. The models were tested on both symbolic and audio data,
where an off-the-shelf chord transcription algorithm was used to convert the audio
data to a symbolic representation. [39] analysed the Beatles corpus using proba-
bilistic N-grams in order to show that the dependency of a chord on its context
extends beyond the immediately preceding chord (the first-order Markov assump-
tion). [9] studied differences in the use of harmony across various periods of classi-
cal music history, using root progressions (i.e. the sequence of root notes of chords
in a progression) reduced to 2 categories (dominant and subdominant) to give a
representation called harmonic vectors. The use of root progressions is one of the
representations we use in our own work in section 4 [2].
All of the above systems process symbolic input, such as that found in a score,
although most of the systems do not require the level of detail provided by the
score (e.g. key signature, pitch spelling), which they are able to reconstruct from
the pitch and timing data. In recent years, the focus of research has shifted to the
analysis of audio files, starting with the work of [16], who computed a chroma
representation (salience of frequencies representing the 12 Western pitch classes,
independent of octave) which was matched to a set of chord templates using the
inner product. Alternatively, [7] modelled chords with a 12-dimensional Gaussian
distribution, where chord notes had a mean of 1, non-chord notes had a mean of 0,
and the covariance matrix had high values between pairs of chord notes. A hidden
Markov model was used to infer the most likely sequence of chords, where state
transition probabilities were initialised based on the distance between chords on
a special circle of fifths which included minor chords near to their relative major
chord. Further work on audio-based harmony analysis is reviewed thoroughly in
three recent doctoral theses, to which the interested reader is referred [22,18,32].
Music theory, perceptual studies, and musicians themselves generally agree that
no musical quality can be treated individually. When a musician transcribes
the chords of a piece of music, the chord labels are not assigned solely on the
basis of local pitch content of the signal. Musical context such as the key, met-
rical position and even the large-scale structure of the music play an important
role in the interpretation of harmony. [17, Chapter 4] conducted a survey among
human music transcription experts, and found that they use several musical con-
text elements to guide the transcription process: not only is a prior rough chord
detection the basis for accurate note transcription, but the chord transcription
itself depends on the tonal context and other parameters such as beats, instru-
mentation and structure.
The goal of our recent work on chord transcription [24,22,23] is to propose
computational models that integrate musical context into the automatic chord
estimation process. We employ a dynamic Bayesian network (DBN) to combine
models of metrical position, key, chord, bass note and beat-synchronous bass and
treble chroma into a single high-level musical context model. The most probable
sequence of metrical positions, keys, chords and bass notes is estimated via
Viterbi inference.
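To make the flavour of this joint estimation concrete, the following toy sketch decodes a sequence of (key, chord) states from chroma observations with the Viterbi algorithm. All vocabularies, transition weights and emission parameters here are invented for illustration and are far smaller and simpler than those of the actual model.

import numpy as np

# Toy joint-state Viterbi sketch: states are (key, chord) pairs and the
# observation at each beat is a 12-dimensional treble chroma vector.
# All parameters here are invented for illustration only.
KEYS = ["C", "G"]                                            # tiny key vocabulary
CHORDS = {"C": [0, 4, 7], "G": [7, 11, 2], "Am": [9, 0, 4]}  # chord pitch classes
STATES = [(k, c) for k in KEYS for c in CHORDS]              # joint states

def log_emission(chroma, chord):
    """Isotropic Gaussian likelihood: chord notes expect salience 1, others 0."""
    mean = np.zeros(12)
    mean[CHORDS[chord]] = 1.0
    return -0.5 * np.sum((chroma - mean) ** 2)

def log_transition(prev, cur):
    """Favour staying in the same key and the same chord (invented weights)."""
    score = 0.0 if prev[0] == cur[0] else -3.0   # key changes are rare
    score += 0.0 if prev[1] == cur[1] else -1.0  # chords change more freely
    return score

def viterbi(chromagram):
    n, s = len(chromagram), len(STATES)
    delta = np.full((n, s), -np.inf)
    psi = np.zeros((n, s), dtype=int)
    for j, st in enumerate(STATES):
        delta[0, j] = log_emission(chromagram[0], st[1])
    for t in range(1, n):
        for j, st in enumerate(STATES):
            scores = [delta[t - 1, i] + log_transition(STATES[i], st)
                      for i in range(s)]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_emission(chromagram[t], st[1])
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return [STATES[j] for j in reversed(path)]

# Three beats of noisy chroma that roughly outline C major, then G major.
rng = np.random.default_rng(0)
obs = np.clip(np.array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
                        [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
                        [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]], float)
              + 0.1 * rng.random((3, 12)), 0, 1)
print(viterbi(obs))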
A DBN is a graphical model representing a succession of simple Bayesian
networks in time. These are assumed to be Markovian and time-invariant, so
the model can be expressed recursively in two time slices: the initial slice and
the recursive slice. Our DBN is shown in Figure 1. Each node in the network
represents a random variable, which might be an observed node (in our case
the bass and treble chroma) or a hidden node (the key, metrical position, chord
and bass pitch class nodes). Edges in the graph denote dependencies between
variables. In our DBN the musically interesting behaviour is modelled in the
recursive slice, which represents the progress of all variables from one beat to
the next. In the following paragraphs we explain the function of each node.
Chord. Technically, the dependencies of the random variables are described in the
conditional probability distribution of the dependent variable. Since the highest
number of dependencies join at the chord variable, it takes a central position
in the network. Its conditional probability distribution is also the most com-
plex: it depends not only on the key and the metrical position, but also on the
chord variable in the previous slice. The chord variable has 121 different chord
states (see below), and its dependency on the previous chord variable enables
(Figure 1 depicts, for slices i−1 and i, nodes including key Ki, chord Ci, bass Bi,
bass chroma Xi^bs and treble chroma Xi^tr.)
Fig. 1. Our network model topology, represented as a DBN with two slices and six
layers. The clear nodes represent random variables, while the observed ones are shaded
grey. The directed edges represent the dependency structure. Intra-slice dependency
edges are drawn solid, inter-slice dependency edges are dashed.
Key and metrical position. The dependency structure of the key and metrical
position variables is comparatively simple, since they depend only on their re-
spective predecessors. The emphasis on smooth, stable key sequences is handled
in the same way as it is in chords, but the 24 states representing major and minor
keys have even higher self-transition probability, and hence they will persist for
longer stretches of time. The metrical position model represents a 4/4 meter and
hence has four states. The conditional probability distribution strongly favours
“normal” beat transitions, i.e. from one beat to the next, but it also allows for
irregular transitions in order to accommodate temporary deviations from 4/4 meter
and occasional beat tracking errors. In Figure 2a, black arrows represent a
transition probability of 1−ε (where ε = 0.05) to the following beat. Grey arrows
represent a probability of ε/2 to jump to different beats through self-transition
or omission of the expected beat.
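As a concrete illustration of this beat-position model, the description above can be written as a 4 × 4 transition matrix; this is a sketch only, and the exact treatment of irregular transitions is our reading of the text.

import numpy as np

# Sketch of the metrical-position transition matrix described above:
# 4 beat positions, probability 1 - eps for the regular "next beat",
# eps/2 each for staying put or skipping a beat (irregular transitions).
eps = 0.05
M = np.zeros((4, 4))
for b in range(4):
    M[b, (b + 1) % 4] = 1.0 - eps      # regular transition to the next beat
    M[b, b] = eps / 2.0                # self-transition (extra beat inserted)
    M[b, (b + 2) % 4] = eps / 2.0      # expected beat omitted (jump over one)

assert np.allclose(M.sum(axis=1), 1.0)  # each row is a valid distribution
print(M)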
Bass. The random variable that models the bass has 13 states, one for each of
the pitch classes, and one “no bass” state. It depends on both the current chord
and the previous chord. The current chord determines which bass notes are most
probable. The highest probability is assigned to the “nominal”
chord bass pitch class², lower probabilities to the remaining chord pitch classes,
and the rest of the probability mass is distributed between the remaining pitch
classes. The additional use of the dependency on the previous chord allows us
to model the behaviour of the bass note on the first beat of the chord differently
from its behaviour on later beats. We can thus model the tendency for the played
bass note to coincide with the “nominal” bass note of the chord (e.g. the note B
in the B7 chord), while there is more variation in the bass notes played during
the rest of the duration of the chord.

² The chord symbol itself always implies a bass note, but the bass line might include
other notes not specified by the chord symbol, as in the case of walking bass.

Chroma. The chroma nodes provide models of the bass and treble chroma audio
features. Unlike the discrete nodes previously discussed, they are continuous
because the 12 elements of the chroma vector represent relative salience, which
can assume any value between zero and unity. We represent both bass and treble
chroma as multidimensional Gaussian random variables. The bass chroma vari-
able has 13 different Gaussians, one for every bass state, and the treble chroma
node has 121 Gaussians, one for every chord state. The means of the Gaussians
are set to reflect the nature of the chords: to unity for pitch classes that are
part of the chord, and to zero for the rest. A single variate in the 12-dimensional
Gaussian treble chroma distribution models one pitch class, as illustrated in Fig-
ure 2b. Since the chroma values are normalised to the unit interval, the Gaussian
model functions similarly to a regression model: for a given chord the Gaussian
density increases with increasing salience of the chord notes (solid line), and
decreases with increasing salience of non-chord notes (dashed line). For more
details see [22].
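A minimal sketch of this regression-like behaviour, with an invented variance rather than the trained parameters of [22], is the following log-likelihood computation: the density rises with the salience of chord notes and falls with the salience of non-chord notes.

import numpy as np

def chord_chroma_loglik(chroma, chord_pcs, sigma=0.2):
    """Log-density of a 12-dim chroma frame under a diagonal Gaussian whose
    mean is 1 for chord pitch classes and 0 elsewhere (variance invented)."""
    mean = np.zeros(12)
    mean[list(chord_pcs)] = 1.0
    return float(-0.5 * np.sum(((chroma - mean) / sigma) ** 2)
                 - 12 * np.log(sigma * np.sqrt(2 * np.pi)))

c_major = {0, 4, 7}                            # pitch classes C, E, G
frame = np.zeros(12); frame[[0, 4, 7]] = 0.9   # strong C major salience
print(chord_chroma_loglik(frame, c_major))     # relatively high likelihood
frame[1] = 0.8                                 # add a salient non-chord note
print(chord_chroma_loglik(frame, c_major))     # likelihood drops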
One important aspect of the model is the wide variety of chords it uses.
It models ten different chord types (maj, min, maj/3, maj/5, maj6, 7, maj7,
min7, dim, aug) and the “no chord” class N. The chord labels with slashes
denote chords whose bass note differs from the chord root, for example D/3
represents a D major chord in first inversion (sometimes written D/F). The
recognition of these chords is a novel feature of our chord recognition algorithm.
Figure 3 shows a score rendered using exclusively the information in our model.
In the last four bars, marked with a box, the second chord is correctly annotated
as D/F♯. The position of the bar lines is obtained from the metrical position
variable, the key signature from the key variable, and the bass notes from the
bass variable. The chord labels are obtained from the chord variable, replicated
as notes in the treble staff for better visualisation. The crotchet rest on the first
beat of the piece indicates that here, the Viterbi algorithm inferred that the “no
chord” model fits best.
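The size of this chord vocabulary follows directly from the description above: ten chord types on each of the twelve roots plus the “no chord” class gives the 121 states mentioned earlier. A trivial enumeration sketch (the root:type label syntax is purely illustrative):

# Enumerate the chord vocabulary described above: 10 types x 12 roots + "N".
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TYPES = ["maj", "min", "maj/3", "maj/5", "maj6", "7", "maj7", "min7", "dim", "aug"]

chord_states = [f"{root}:{ctype}" for root in ROOTS for ctype in TYPES] + ["N"]
print(len(chord_states))   # 121
print(chord_states[:4])    # ['C:maj', 'C:min', 'C:maj/3', 'C:maj/5']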
Using a standard test set of 210 songs from the MIREX chord detection
task, our basic model achieved an accuracy of 73%, with each component of the
model contributing significantly to the result. This improves on the best result at
(Chord symbols in the score excerpt: G, B7, Em, G7, C, F, C, G, D/F♯, Em, Bm, G7.)
Fig. 3. Excerpt of automatic output of our algorithm (top) and song book version
(bottom) of the pop song “Friends Will Be Friends” (Deacon/Mercury). The song
book excerpt corresponds to the four bars marked with a box.
Fig. 4. Segmentation and its effect on chord transcription for the Beatles’ song “It
Won’t Be Long” (Lennon/McCartney). The top 2 rows show the human and automatic
segmentation respectively. Although the structure is different, the main repetitions are
correctly identified. The bottom 2 rows show (in black) where the chord was transcribed
correctly by our algorithm using (respectively not using) the segmentation information.
MIREX 2009 for pre-trained systems. Further improvements have been made via
two extensions of this model: taking advantage of repeated structural segments
(e.g. verses or choruses), and refining the front-end audio processing.
Most musical pieces have segments which occur more than once in the piece,
and there are two reasons for wishing to identify these repetitions. First, multiple
sets of data provide us with extra information which can be shared between the
repeated segments to improve detection performance. Second, in the interest of
consistency, we can ensure that the repeated sections are labelled with the same
set of chord symbols. We developed an algorithm that automatically extracts the
repetition structure from a beat-synchronous chroma representation [27], which
ranked first in the 2009 MIREX Structural Segmentation task.
After building a similarity matrix based on the correlation between beat-
synchronous chroma vectors, the method finds sets of repetitions whose ele-
ments have the same length in beats. A repetition set composed of n elements
with length d receives a score of (n − 1)d, reflecting how much space a hypothet-
ical music editor could save by typesetting a repeated segment only once. The
repetition set with the maximum score (“part A” in Figure 4) is added to the
final list of structural elements, and the process is repeated on the remainder of
the song until no valid repetition sets are left.
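The following sketch illustrates this greedy selection step under simplifying assumptions: the candidate repetition sets are taken as given, and the overlap handling is a simplification of the actual procedure in [27].

# Greedy structural-segmentation sketch: each candidate repetition set is a
# list of (start_beat, length_in_beats) occurrences of the same segment.
# The score (n - 1) * d rewards sets whose repetitions would save the most space.
def score(rep_set):
    n = len(rep_set)
    d = rep_set[0][1]          # all occurrences share the same length in beats
    return (n - 1) * d

def select_structure(candidates):
    chosen, covered = [], set()
    # Repeatedly pick the highest-scoring set that does not overlap chosen ones.
    for rep_set in sorted(candidates, key=score, reverse=True):
        beats = {b for start, length in rep_set
                 for b in range(start, start + length)}
        if beats & covered:
            continue           # overlaps an already selected structural element
        chosen.append(rep_set)
        covered |= beats
    return chosen

# Hypothetical candidates: a 32-beat "part A" repeated 3 times, a 16-beat
# bridge repeated twice, and a set overlapping part A.
candidates = [
    [(0, 32), (64, 32), (128, 32)],   # score (3 - 1) * 32 = 64
    [(32, 16), (112, 16)],            # score (2 - 1) * 16 = 16
    [(0, 16), (64, 16)],              # overlaps part A, will be skipped
]
print(select_structure(candidates))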
The resulting structural segmentation is then used to merge the chroma repre-
sentations of matching segments. Despite the inevitable errors propagated from
incorrect segmentation, we found a significant performance increase (to 75% on
the MIREX score) by using the segmentation. In Figure 4 the beneficial effect
of using the structural segmentation can clearly be observed: many of the white
stripes representing chord recognition errors are eliminated by the structural
segmentation method, compared to the baseline method.
A further improvement was achieved by modifying the front end audio pro-
cessing. We found that by learning chord profiles as Gaussian mixtures, the
recognition rate of some chords can be improved. However this did not result
in an overall improvement, as the performance on the most common chords de-
creased. Instead, an approximate pitch transcription method using non-negative
least squares was employed to reduce the effect of upper harmonics in the chroma
representations [23]. This results in both a qualitative (reduction of specific er-
rors) and quantitative (a substantial overall increase in accuracy) improvement
in results, with a MIREX score of 79% (without using segmentation), which
again is significantly better than the state of the art. By combining both of the
above enhancements we reach an accuracy of 81%, a statistically significant im-
provement over the best result (74%) in the 2009 MIREX Chord Detection tasks
and over our own previously mentioned results.
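The idea behind the non-negative least squares front end can be sketched as follows; the note-template dictionary, partial weights and resolution below are invented, and the real feature extraction of [23] is considerably more refined.

import numpy as np
from scipy.optimize import nnls

# Toy NNLS transcription sketch: explain a magnitude spectrum as a
# non-negative combination of note templates (fundamental plus two
# octave-spaced partials), then fold the recovered note activations
# into a 12-bin chroma vector.
n_bins, n_notes = 48, 24          # invented log-frequency resolution and range
templates = np.zeros((n_bins, n_notes))
for note in range(n_notes):
    for k, w in enumerate([1.0, 0.5, 0.25]):   # partial weights (invented)
        idx = note + 12 * k                    # octave-spaced partials
        if idx < n_bins:
            templates[idx, note] = w

true_act = np.zeros(n_notes); true_act[[0, 4, 7]] = 1.0   # a C major triad
spectrum = templates @ true_act + 0.01 * np.random.default_rng(1).random(n_bins)

activations, _ = nnls(templates, spectrum)   # non-negative note salience
chroma = np.zeros(12)
for note, a in enumerate(activations):
    chroma[note % 12] += a                   # upper partials no longer leak
print(np.round(chroma, 2))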
directly, but only indirectly, by using the learnt models to classify unseen exam-
ples. Thus the following harmony modelling experiments are evaluated via the
task of genre classification.
(The tree nodes test gap-separated chord subsequences such as gap(A,C),
degAndCat(5,maj,C,D,Key), degAndCat(1,min,D,E,Key); the leaves assign the class
academic or jazz.)
Fig. 5. Part of the decision tree for a binary classifier for the classes Jazz and Academic
Table 1. Results compared with the baseline for 2-class, 3-class and 9-class classifica-
tion tasks
Classification Task Baseline Symbolic Audio
Academic – Jazz 0.55 0.947 0.912
Academic – Popular 0.55 0.826 0.728
Jazz – Popular 0.61 0.891 0.807
Academic – Popular – Jazz 0.40 0.805 0.696
All 9 subgenres 0.21 0.525 0.415
scale degree, chord category, and intervals between successive root notes, and we
constrained the learning algorithm to generate rules containing subsequences of
length at least two chords. The model can be expressed as a decision tree, as
shown in Figure 5, where the choice of branch taken is based on whether or not
the chord sequence matches the predicates at the current node, and the class
to which the sequence belongs is given by the leaf of the decision tree reached
by following these choices. The decision tree is equivalent to an ordered set of
rules or a Prolog program. Note that a rule at a single node of a tree cannot
necessarily be understood outside of its context in the tree. In particular, a rule
by itself cannot be used as a classifier.
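The flavour of this representation can be conveyed by a small sketch in Python (the actual experiments use a Prolog/DCG encoding): a rule is a gap-delimited subsequence of (degree, category) pairs which a key-relative chord sequence either contains or does not.

# Sketch of degree-and-category rule matching: a rule is a list of
# (scale_degree, chord_category) pairs that must occur consecutively,
# anywhere in the piece ("gap" before and after), relative to the key.
def matches(rule, chord_seq):
    n, m = len(chord_seq), len(rule)
    return any(chord_seq[i:i + m] == rule for i in range(n - m + 1))

# Hypothetical key-relative transcriptions of two fragments.
jazz_fragment = [(2, "min7"), (5, "7"), (1, "maj7"), (6, "min7")]
academic_fragment = [(4, "maj"), (5, "maj"), (1, "maj")]

ii7_V7 = [(2, "min7"), (5, "7")]               # the ii7 - V7 pattern from the text
perfect_cadence_triads = [(5, "maj"), (1, "maj")]

print(matches(ii7_V7, jazz_fragment))                      # True
print(matches(ii7_V7, academic_fragment))                  # False
print(matches(perfect_cadence_triads, academic_fragment))  # True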
The results for various classification tasks are shown in Table 1. All results are
significantly above the baseline, but performance clearly decreases for more dif-
ficult tasks. Perfect classification is not to be expected from harmony data, since
other aspects of music such as instrumentation (timbre), rhythm and melody
are also involved in defining and recognising musical styles.
Analysis of the most common rules extracted from the decision tree models
built during these experiments reveals some interesting and well-known jazz,
academic and popular music harmony patterns. For each rule shown below, the
coverage expresses the fraction of songs in each class that match the rule. For
example, while a perfect cadence is common to both academic and jazz styles,
the chord categories distinguish the styles very well, with academic music using
triads and jazz using seventh chords:
genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(5,maj,C,D,Key),
degreeAndCategory(1,maj,D,E,Key),
gap(E,B).
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(5,7,C,D,Key),
degreeAndCategory(1,maj7,D,E,Key),
gap(E,B).
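% The following rule encodes the pattern ... - I7 - IV7 - ..., characteristic of blues.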
genre(blues,A,B,Key) :- gap(A,C),
degreeAndCategory(1,7,C,D,Key),
degreeAndCategory(4,7,D,E,Key),
gap(E,B).
On the other hand, jazz is characterised (but not exclusively) by the sequence:
... - ii7 - V7 - ...
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(2,min7,C,D,Key),
degreeAndCategory(5,7,D,E,Key),
gap(E,B).
The representation also allows for longer rules to be expressed, such as the
following rule describing a modulation to the dominant key and back again in
academic music: ... - II7 - V - ... - I - V7 - ...
genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(2,7,C,D,Key),
degreeAndCategory(5,maj,D,E,Key),
gap(E,F),
degreeAndCategory(1,maj,F,G,Key),
degreeAndCategory(5,7,G,H,Key),
gap(H,B).
Although none of the rules are particularly surprising, these examples illus-
trate some meaningful musicological concepts that are captured by the rules. In
general, we observed that Academic music is characterised by rules establishing
the tonality, e.g. via cadences, while Jazz is less about tonality and more about
harmonic colour, e.g. the use of 7th, 6th, augmented and more complex chords.
Popular music harmony tends to have simpler harmonic rules, as melody is
predominant in this style. The system is also able to find longer rules
that a human might not spot easily. Working from audio data, even though the
transcriptions are not fully accurate, the classification and rules still capture the
same general trends as for symbolic data.
For genre classification we are not advocating a harmony-based approach
alone. It is clear that other musical features are better predictors of genre.
Nevertheless, the positive results encouraged a further experiment in which we
integrated the current classification approach with a state-of-the-art genre classi-
fication system, to test whether the addition of a harmony feature could improve
its performance.
Table 2. Best mean classification results (and number of features used) for the two
data sets using 5×5-fold cross-validation and feature selection
that the classification rate of the harmony-based classifier alone is poor. For
both datasets the improvements over the standard classifier (as shown in table
2) were found to be statistically significant.
5 Conclusion
We have looked at two approaches to the modelling of harmony which aim to “dig
deeper into the music”. In our probabilistic approach to chord transcription, we
demonstrated the advantage of modelling musical context such as key, metrical
structure and bass line, and simultaneously estimating all of these variables
along with the chord. We also developed an audio feature using non-negative
least squares that reflects the notes played better than the standard chroma
feature, and therefore reduces interference from harmonically irrelevant partials
and noise. A further improvement of the system was obtained by modelling the
global structure of the music, identifying repeated sections and averaging features
over these segments. One promising avenue of further work is the separation of
the audio (low-level) and symbolic (high-level) models which are conceptually
distinct but modelled together in current systems. A low-level model would be
concerned only with the production or analysis of audio — the mapping from
notes to features; while a high-level model would be a musical model handling
the mapping from chord symbols to notes.
Using a logic-based approach, we showed that it is possible to automatically
discover patterns in chord sequences which characterise a corpus of data, and
to use such models as classifiers. The advantage with a logic-based approach is
that models learnt by the system are transparent: the decision tree models can
be presented to users as sets of human readable rules. This explanatory power is
particularly relevant for applications such as music recommendation. The DCG
representation allows chord sequences of any length to coexist in the same model,
as well as context information such as key. Our experiments found that the more
musically meaningful Degree-and-Category representation gave better classifica-
tion results than using root intervals. The results using transcription from audio
data were encouraging in that although some information was lost in the tran-
scription process, the classification results remained well above the baseline, and
thus this approach is still viable when symbolic representations of the music are
not available. Finally, we showed that the combination of high-level harmony
features with low-level features can lead to genre classification accuracy im-
provements in a state-of-the-art system, and believe that such high-level models
provide a promising direction for genre classification research.
While these methods have advanced the state of the art in music informatics,
it is clear that in several respects they are not yet close to an expert musician’s
understanding of harmony. Limiting the representation of harmony to a list of
chord symbols is inadequate for many applications. Such a representation may
be sufficient as a memory aid for jazz and pop musicians, but it allows only a very
limited specification of chord voicing (via the bass note), and does not permit
analysis of polyphonic texture such as voice leading, an important concept in
many harmonic styles, unlike the recent work of [11] and [29]. Finally, we note
that the current work provides little insight into harmonic function, for example
the ability to distinguish harmony notes from ornamental and passing notes and
to recognise chord substitutions, both of which are essential characteristics of a
system that models a musician’s understanding of harmony. We hope to address
these issues in future work.
References
1. Anglade, A., Benetos, E., Mauch, M., Dixon, S.: Improving music genre classi-
fication using automatically induced harmony rules. Journal of New Music Re-
search 39(4), 349–361 (2010)
2. Anglade, A., Dixon, S.: Characterisation of harmony with inductive logic program-
ming. In: 9th International Conference on Music Information Retrieval, pp. 63–68
(2008)
3. Anglade, A., Ramirez, R., Dixon, S.: First-order logic classification models of mu-
sical genres based on harmony. In: 6th Sound and Music Computing Conference,
pp. 309–314 (2009)
4. Anglade, A., Ramirez, R., Dixon, S.: Genre classification using harmony rules in-
duced from automatic chord transcriptions. In: 10th International Society for Music
Information Retrieval Conference, pp. 669–674 (2009)
5. Aucouturier, J.J., Defréville, B., Pachet, F.: The bag-of-frames approach to audio
pattern recognition: A sufficient model for urban soundscapes but not for poly-
phonic music. Journal of the Acoustical Society of America 122(2), 881–891 (2007)
6. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky?
Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
7. Bello, J.P., Pickens, J.: A robust mid-level representation for harmonic content in
music signals. In: 6th International Conference on Music Information Retrieval,
pp. 304–311 (2005)
8. Benetos, E., Kotropoulos, C.: Non-negative tensor factorization applied to music
genre classification. IEEE Transactions on Audio, Speech, and Language Process-
ing 18(8), 1955–1967 (2010)
9. Cathé, P.: Harmonic vectors and stylistic analysis: A computer-aided analysis of
the first movement of Brahms’ String Quartet Op. 51-1. Journal of Mathematics
and Music 4(2), 107–119 (2010)
10. Conklin, D.: Representation and discovery of vertical patterns in music. In:
Anagnostopoulou, C., Ferrand, M., Smaill, A. (eds.) ICMAI 2002. LNCS (LNAI),
vol. 2445, pp. 32–42. Springer, Heidelberg (2002)
11. Conklin, D., Bergeron, M.: Discovery of contrapuntal patterns. In: 11th Interna-
tional Society for Music Information Retrieval Conference, pp. 201–206 (2010)
12. Conklin, D., Witten, I.: Multiple viewpoint systems for music prediction. Journal
of New Music Research 24(1), 51–73 (1995)
13. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: 4th International Conference on Music Information Retrieval, pp.
159–165 (2003)
14. Downie, J., Byrd, D., Crawford, T.: Ten years of ISMIR: Reflections on challenges
and opportunities. In: 10th International Society for Music Information Retrieval
Conference, pp. 13–18 (2009)
15. Ebcioğlu, K.: An expert system for harmonizing chorales in the style of J. S. Bach.
In: Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp.
294–333. MIT Press, Cambridge (1992)
16. Fujishima, T.: Realtime chord recognition of musical sound: A system using Com-
mon Lisp Music. In: Proceedings of the International Computer Music Conference,
pp. 464–467 (1999)
17. Hainsworth, S.W.: Techniques for the Automated Analysis of Musical Audio. Ph.D.
thesis, University of Cambridge, Cambridge, UK (2003)
18. Harte, C.: Towards Automatic Extraction of Harmony Information from Music
Signals. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music
(2010)
19. Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press,
Oxford (1990)
20. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
21. Longuet-Higgins, H., Steedman, M.: On interpreting Bach. Machine Intelligence 6,
221–241 (1971)
22. Mauch, M.: Automatic Chord Transcription from Audio Using Computational
Models of Musical Context. Ph.D. thesis, Queen Mary University of London, Cen-
tre for Digital Music (2010)
23. Mauch, M., Dixon, S.: Approximate note transcription for the improved identi-
fication of difficult chords. In: 11th International Society for Music Information
Retrieval Conference, pp. 135–140 (2010)
24. Mauch, M., Dixon, S.: Simultaneous estimation of chords and musical context from
audio. IEEE Transactions on Audio, Speech and Language Processing 18(6), 1280–
1289 (2010)
25. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering chord idioms
through Beatles and Real Book songs. In: 8th International Conference on Music
Information Retrieval, pp. 111–114 (2007)
26. Mauch, M., Müllensiefen, D., Dixon, S., Wiggins, G.: Can statistical language mod-
els be used for the analysis of harmonic progressions? In: International Conference
on Music Perception and Cognition (2008)
27. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic
chord transcription. In: 10th International Society for Music Information Retrieval
Conference, pp. 231–236 (2009)
28. Maxwell, H.: An expert system for harmonizing analysis of tonal music. In:
Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp.
334–353. MIT Press, Cambridge (1992)
29. Mearns, L., Tidhar, D., Dixon, S.: Characterisation of composer style using high-
level musical features. In: 3rd ACM Workshop on Machine Learning and Music
(2010)
30. Morales, E.: PAL: A pattern-based first-order inductive system. Machine Learn-
ing 26(2-3), 227–252 (1997)
31. Pachet, F.: Surprising harmonies. International Journal of Computing Anticipatory
Systems 4 (February 1999)
32. Papadopoulos, H.: Joint Estimation of Musical Content Information from an Audio
Signal. Ph.D. thesis, Université Pierre et Marie Curie – Paris 6 (2010)
33. Pardo, B., Birmingham, W.: Algorithms for chordal analysis. Computer Music
Journal 26(2), 27–49 (2002)
34. Pérez-Sancho, C., Rizo, D., Iñesta, J.M.: Genre classification using chords and
stochastic language models. Connection Science 21(2-3), 145–159 (2009)
35. Pérez-Sancho, C., Rizo, D., Iñesta, J.M., de León, P.J.P., Kersten, S., Ramirez,
R.: Genre classification of music by tonal harmony. Intelligent Data Analysis 14,
533–545 (2010)
36. Pickens, J., Bello, J., Monti, G., Sandler, M., Crawford, T., Dovey, M., Byrd, D.:
Polyphonic score retrieval using polyphonic audio queries: A harmonic modelling
approach. Journal of New Music Research 32(2), 223–236 (2003)
37. Ramirez, R.: Inducing musical rules with ILP. In: Proceedings of the International
Conference on Logic Programming, pp. 502–504 (2003)
38. Raphael, C., Stoddard, J.: Functional harmonic analysis using probabilistic models.
Computer Music Journal 28(3), 45–52 (2004)
39. Scholz, R., Vincent, E., Bimbot, F.: Robust modeling of musical chord sequences us-
ing probabilistic N-grams. In: IEEE International Conference on Acoustics, Speech
and Signal Processing, pp. 53–56 (2009)
40. Steedman, M.: A generative grammar for jazz chord sequences. Music Percep-
tion 2(1), 52–77 (1984)
41. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference rule ap-
proach. Computer Music Journal 23(1), 10–27 (1999)
42. Whorley, R., Wiggins, G., Rhodes, C., Pearce, M.: Development of techniques for
the computational modelling of harmony. In: First International Conference on
Computational Creativity, pp. 11–15 (2010)
43. Widmer, G.: Discovering simple rules in complex data: A meta-learning algorithm
and some surprising musical discoveries. Artificial Intelligence 146(2), 129–148
(2003)
44. Winograd, T.: Linguistics and the computer analysis of tonal harmony. Journal of
Music Theory 12(1), 2–49 (1968)
Interactive Music Applications and Standards

R. Stewart, P. Kudumakis, and M. Sandler
1 Introduction
The advent of the Internet and the exploding popularity of file sharing web sites
have challenged the music industry’s traditional supply model that relied on the
physical distribution of music recordings such as vinyl records, cassettes, CDs,
etc. [5], [3]. In this direction, new interactive music services have emerged [1],
[6], [7]. However, a standardized file format is required to provide
interoperability between the various interactive music players and interactive
music applications.
Video games and music consumption, once discrete markets, are now merging.
Games for dedicated gaming consoles such as the Microsoft XBox, Nintendo Wii
and Sony Playstation and applications for smart phones using the Apple iPhone
and Google Android platforms are incorporating music creation and manipulation
into applications which encourage users to purchase music. These games can even
be centered around specific performers such as the Beatles [11] or T-Pain [14].
Many of these games follow a format inspired by karaoke. In its simplest case,
audio processing for karaoke applications involves removing the lead vocals so
that a live singer can perform with the backing tracks. This arrangement grew in
complexity by including automatic lyric following as well. Karaoke performance
used to be relegated to a setup involving a sound system with microphone and
playback capabilities within a dedicated space such as a karaoke bar or living
room, but it has found a revitalized market with mobile devices such as smart
phones. Karaoke is now no longer limited to a certain place or equipment, but can
be performed with a group of friends with a gaming console in a home, or performed
with a smart phone, recorded and uploaded online to share with others.
A standard format is needed to allow for the same musical content to be pro-
duced once and used with multiple applications. We will look at the current com-
mercial applications for interactive music and discuss what requirements need to
be met. We will then look at three standards that address these requirements:
the MPEG-A Interactive Music Application Format (IM AF), IEEE 1599 and the
Interactive eXtensible Music Format (iXMF). We conclude by discussing what
improvements still need to be made for these standards to meet the requirements
of currently commercially-available applications.
2 Applications
3 Requirements
If the music industry continues to produce content for interactive music applica-
tions, a standard distribution format is needed. Content then will not need to be
individually authored for each application. At the most basic level, a standard
needs to allow the application to:
– Separate tracks or groups of tracks
– Apply signal processing to those tracks or groups
– Mark up those tracks or stems to include time-based symbolic information
Once tracks or groups of tracks are separated from the full mix of the song,
additional processing or information can be included to enhance the interactivity
with those tracks.
4 MPEG-A IM AF
The MPEG-A Interactive Music Application Format (IM AF) standard struc-
tures the playback of songs that have multiple, unmixed audio tracks [8], [9], [10].
IM AF creates a container for the tracks, the associated metadata and symbolic
data while also managing how the audio tracks are played. Creating an IM AF
file involves formatting different types of media data, especially multiple audio
tracks with interactivity data and storing them into an ISO-Base Media File
Format. An IM AF file is composed of:
– Multiple audio tracks representing the music (e.g. instruments and/or voices).
– Groups of audio tracks – a hierarchical structure of audio tracks (e.g. all
guitars of a song can be gathered in the same group).
– Preset data – pre-defined mixing information on multiple audio tracks (e.g.
karaoke and rhythmic version).
– User mixing data and interactivity rules – information related to user
interaction (e.g. track/group selection, volume control).
– Metadata used to describe a song, music album, artist, etc.
– Additional media data that can be used to enrich the user’s interaction space
(e.g. timed text synchronized with audio tracks which can represent the lyrics
of a song, images related to the song, music album, artist, etc.).
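As a rough sketch of this logical structure only (the standard itself stores these elements as ISO Base Media File Format boxes; the class and field names below are invented), the content of an IM AF file might be modelled as:

from dataclasses import dataclass, field
from typing import List, Optional

# Rough logical model of IM AF content (illustration only; the real format
# stores these as ISO Base Media File Format boxes, not Python objects).
@dataclass
class AudioTrack:
    name: str                      # e.g. "lead vocal", "bass"
    codec: str                     # e.g. "AAC", "MP3", "PCM"
    group: Optional[str] = None    # e.g. "guitars"

@dataclass
class Preset:
    name: str                      # e.g. "karaoke", "a cappella"
    gains: dict                    # track name -> playback gain

@dataclass
class IMAFSong:
    tracks: List[AudioTrack]
    presets: List[Preset]
    rules: List[str] = field(default_factory=list)      # selection/mixing rules
    lyrics: Optional[str] = None                         # timed text, simplified
    metadata: dict = field(default_factory=dict)         # song/album/artist info

song = IMAFSong(
    tracks=[AudioTrack("lead vocal", "AAC"), AudioTrack("guitar", "AAC", "guitars")],
    presets=[Preset("karaoke", {"lead vocal": 0.0, "guitar": 1.0})],
    metadata={"title": "Example song"},
)
print(len(song.tracks), song.presets[0].name)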
4.1 Mixes
The multiple audio tracks are combined to produce a mix. The mix is defined
by the playback level of tracks and may be determined by the music content
creator or by the end-user.
An interactive music player utilizing IM AF could allow users to re-mix music
tracks by enabling them to select the number of instruments to be listened to
and adjust the volume of individual tracks to their particular taste. Thus, IM
AF enables users to publish and exchange this re-mixing data, enabling other
users with IM AF players to experience their particular music taste creations.
Preset mixes of tracks could also be available. In particular IM AF supports two
possible mix modes for interaction and playback: preset-mix mode and user-mix
mode.
In the preset-mix mode, the user selects one preset among the presets stored
in IM AF, and then the audio tracks are mixed using the preset parameters
associated with the selected preset. Some preset examples are:
– General preset – composed of multiple audio tracks by the music producer.
– Karaoke preset – composed of multiple audio tracks except the vocal tracks.
– A cappella preset – composed of vocal and chorus tracks.
Figure 1 shows an MPEG-A IM AF player. In user-mix mode, the user selects/de-
selects the audio tracks/groups and controls the volume of each of them. Thus,
in user-mix mode, audio tracks are mixed according to the user’s control and
taste; however, they should comply with the interactivity rules stored in the
IM AF. User interaction should conform to certain rules defined by the music
composers with the aim of fitting their artistic creation. However, the rules definition
is optional and up to the music composer; the rules are not imposed by the IM
AF format. In general there are two categories of rules in IM AF: selection and
Fig. 1. An interactive music application. The player on the left shows the song being
played in a preset mix mode and the player on the right shows the user mix mode.
mixing rules. The selection rules relate to the selection of the audio tracks and
groups at rendering time whereas the mixing rules relate to the audio mixing.
Note that the interactivity rules allow the music producer to define the amount
of freedom available in IM AF users’ mixes. The interactivity rules analyser in the
player verifies whether the user interaction conforms to the music producer’s rules.
Figure 2 depicts in a block diagram the logic for both the preset-mix and the
user-mix usage modes.
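A simplified sketch of this playback logic is given below; the gain-based mixing and the single volume-range check are stand-ins for the richer preset and rule definitions of the actual format.

import numpy as np

# Sketch of preset-mix vs. user-mix playback. Tracks are mono float arrays;
# the "interactivity rules analyser" is reduced to a per-track volume range.
def render_mix(tracks, gains):
    return sum(g * tracks[name] for name, g in gains.items())

def check_rules(gains, volume_rules):
    # volume_rules: track name -> (min_gain, max_gain) allowed by the producer
    return all(lo <= gains.get(name, 0.0) <= hi
               for name, (lo, hi) in volume_rules.items())

tracks = {"vocal": np.ones(4), "guitar": 0.5 * np.ones(4)}
preset_karaoke = {"vocal": 0.0, "guitar": 1.0}        # preset-mix mode
user_mix = {"vocal": 1.2, "guitar": 0.8}              # user-mix mode
rules = {"vocal": (0.0, 1.0)}                         # producer caps vocal gain

print(render_mix(tracks, preset_karaoke))             # [0.5 0.5 0.5 0.5]
print(check_rules(user_mix, rules))                   # False: exceeds the cap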
IM AF supports four types of selection rules, as follows:
File format: ISO Base Media File Format (ISO-BMFF) – ISO/IEC 14496-12:2008
Audio: MPEG-4 AAC Profile – ISO/IEC 14496-3:2005
Audio: MPEG-D SAOC – ISO/IEC 23003-2:2010
Audio: MPEG-1 Audio Layer III (MP3) – ISO/IEC 11172-3:1993
Audio: PCM
Image: JPEG – ISO/IEC 10918-1:1994
Text: 3GPP Timed Text – 3GPP TS 26.245:2004
Metadata: MPEG-7 MDS – ISO/IEC 15938-5:2003
Table 2. Brands supported by IM AF. For im04 and im12, simultaneously decoded
audio tracks consist of tracks related to SAOC, which are a downmix signal and SAOC
bitstream. The downmix signal should be encoded using AAC or MP3. For all brands,
the maximum channel number of each track is restricted to 2 (stereo).
im01: X X – 4 tracks – AAC/Level 2 – Mobile, 48 kHz/16 bits
im02: X X – 6 tracks – AAC/Level 2 – Mobile, 48 kHz/16 bits
im03: X X – 8 tracks – AAC/Level 2 – Mobile, 48 kHz/16 bits
im04: X X X – 2 tracks – AAC/Level 2, SAOC Baseline/2 – Mobile, 48 kHz/16 bits
im11: X X X – 16 tracks – AAC/Level 2 – Normal
im12: X X X – 2 tracks – AAC/Level 2, SAOC Baseline/3 – Normal
5 Related Formats
While the IM AF packages together the relevant metadata and content that an
interactive music application would require, other file formats have also been
developed as a means to organize and describe synchronized streams of infor-
mation for different applications. The two that will be briefly reviewed here are
IEEE 1599 [12] and iXMF [4].
of symbols by both humans and machines, hence the decision to represent all
information that is not audio or video sample data within XML.
The standard is developed primarily for applications that provide additional
information surrounding a piece of music. Example applications include being
able to easily navigate between a score, multiple recordings of performances of
that score and images of the performers in the recordings [2].
The format consists of six layers that communicate with each other, but there
can be multiple instances of the same layer type. Figure 4 illustrates how the
layers interact. The layers are referred to as General, Logic, Structural, Notational,
Performance and Audio.
5.2 iXMF
Another file format that performs a similar task, with a particular focus on video
games, is iXMF (the Interactive eXtensible Music Format) [4]. The iXMF standard
is targeted at interactive audio within games development. XMF is a meta file
format that bundles multiple files together and iXMF uses this same meta file
format as its structure.
iXMF uses a structure in which a moment in time can trigger an event. The
triggered event can encompass a wide array of activities such as the playing of
an audio file or the execution of specific code. The overall structure is described
in [4] as:
The format allows for both audio and symbolic information such
as MIDI to be included. The Scripts then allow for real-time adaptive audio
effects. iXMF has been developed to create interactive soundtracks for video
games environments, so the audio can be generated in real-time based on a user’s
actions and other external factors. There are a number of standard Scripts that
perform basic tasks such as starting or stopping a Cue, but this set of Scripts
can also be extended.
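The event-driven character of this design can be sketched as follows; the class and script names are hypothetical and only hint at how cues, media and scripts interact in iXMF.

# Minimal event-driven sketch in the spirit of iXMF: time points trigger cues,
# and a cue either plays media or runs a script (names are hypothetical).
class Cue:
    def __init__(self, name, media=None, script=None):
        self.name, self.media, self.script = name, media, script

    def fire(self, context):
        if self.media:
            print(f"play {self.media}")
        if self.script:
            self.script(context)           # real-time adaptive behaviour

def duck_music(context):
    context["music_gain"] = 0.2            # e.g. lower music under dialogue

timeline = {0.0: Cue("intro", media="intro_loop.wav"),
            12.5: Cue("dialogue", media="line01.wav", script=duck_music)}

context = {"music_gain": 1.0}
for t in sorted(timeline):                 # a game engine would poll real time
    timeline[t].fire(context)
print(context)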
6 Discussion
Current commercial applications built around interactive music require real-time
playback and interaction with multiple audio tracks. Additionally, symbolic in-
formation, including text, is needed to accommodate the new karaoke-like games
such as Guitar Hero. The IM AF standard fulfils most of the requirements, but
not all. In particular it lacks the ability to include symbolic information like MIDI
note and instrument data. IEEE 1599 and iXMF both can accommodate MIDI
data, though lack some of the advantages of IM AF such as direct integration
with a number of MPEG formats.
One of the strengths of iXMF is its Scripts which can define time-varying
audio effects. These kinds of effects are needed for applications such as I Am
T-Pain and Glee Karaoke. IM AF is beginning to consider integrating these
effects such as equalization, but greater flexibility will be needed so that the
content creators can create and manipulate their own audio signal processing
algorithms. The consumer will also need to be able to manually adjust the audio
effects applied to the audio in order to build applications like the MXP4 Studio
[7] with IM AF.
As interactive music applications may be used in a variety of settings, from
dedicated gaming consoles to smart phones, any spatialization of the audio needs
to be flexible and automatically adjust to the most appropriate format. This
could range from stereo speakers to surround sound systems or binaural audio
over headphones. IM AF is beginning to support SAOC (Spatial Audio Object
Coding) which addresses this very problem and differentiates it from similar
standards.
While there are a number of standard file formats that have been developed in
parallel to address slightly differing application areas within interactive music,
IM AF is increasingly the best choice for karaoke-style games. There are still
References
1. Audizen, http://www.audizen.com (last viewed, February 2011)
2. Ludovico, L.A.: The new standard IEEE 1599, introduction and examples. J. Mul-
timedia 4(1), 3–8 (2009)
3. Goel, S., Miesing, P., Chandra, U.: The Impact of Illegal Peer-to-Peer File Sharing
on the Media Industry. California Management Review 52(3) (Spring 2010)
4. IASIG Interactive XMF Workgroup: Interactive XMF specification: file for-
mat specification. Draft 0.9.1a (2008), http://www.iasig.org/pubs/ixmf_
draft-v091a.pdf
5. IFPI Digital Music Report 2009: New Business Models for a Changing Environment.
International Federation of the Phonographic Industry (January 2009)
6. iKlax Media, http://www.iklaxmusic.com (last viewed February 2011)
7. Interactive Music Studio by MXP4, Inc., http://www.mxp4.com/
interactive-music-studio (last viewed February 2011)
8. ISO/IEC 23000-12, Information technology – Multimedia application for-
mat (MPEG-A) – Part 12: Interactive music application format (2010),
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.
htm?csnumber=53644
9. ISO/IEC 23000-12/FDAM 1 IM AF Conformance and Reference Software, N11746,
95th MPEG Meeting, Daegu, S. Korea (2011)
10. Kudumakis, P., Jang, I., Sandler, M.: A new interactive MPEG format for the
music industry. In: 7th Int. Symposium on Computer Music Modeling and Retrieval
(CMMR 2010), Málaga, Spain (2010)
11. Kushner, D.: The Making of the Beatles: Rock Band. IEEE Spectrum 46(9), 30–35
(2009)
12. Ludovico, L.A.: IEEE 1599: a multi-layer approach to music description. J. Multi-
media 4(1), 9–14 (2009)
13. Smule, Inc.: Glee Karaoke iPhone Application, http://glee.smule.com/ (last
viewed February 2011)
14. Smule, Inc.: I Am T-Pain iPhone Application, http://iamtpain.smule.com/ (last
viewed February 2011)
Interactive Music with Active Audio CDs
Abstract. With a standard compact disc (CD) audio player, the only
possibility for the user is to listen to the recorded track, passively: the
interaction is limited to changing the global volume or the track. Imagine
now that the listener can turn into a musician, playing with the sound
sources present in the stereo mix, changing their respective volumes and
locations in space. For example, a given instrument or voice can be either
muted, amplified, or more generally moved in the acoustic space. This
will be a kind of generalized karaoke, useful for disc jockeys and also for
music pedagogy (when practicing an instrument). Our system shows that
this dream has come true, with active CDs fully backward compatible
while enabling interactive music. The magic is that “the music is in the
sound”: the structure of the mix is embedded in the sound signal itself,
using audio watermarking techniques, and the embedded information is
exploited by the player to perform the separation of the sources (patent
pending) used in turn by a spatializer.
1 Introduction
Composers of acousmatic music go through several stages in the composition
process, from sound recording (generally stereophonic) to diffusion (multiphonic).
During live interpretation, they intervene decisively in the spatialization
and coloration of pre-recorded sonorities. For this purpose, the musicians gen-
erally use a(n un)mixing console. With only two hands, this requires some skill and
becomes hardly tractable with many sources or speakers.
Nowadays, the public is also eager to interact with the musical sound. In-
deed, more and more commercial CDs come with several versions of the same
musical piece. Some are instrumental versions (for karaoke), others are remixes.
The karaoke phenomenon is being generalized from voice to instruments, in musical
video games such as Rock Band¹. But in this case, to get the interaction the
user has to buy the video game, which includes the multitrack recording.
Yet, the music industry is still reluctant to release the multitrack version of
musical hits. The only thing the user can get is a standard CD, thus a stereo
¹ See URL: http://www.rockband.com
mix, or its dematerialized version available for download. The CD is not dead:
imagine a CD fully backward compatible while permitting musical interaction...
We present a proof of concept of the active audio CD: a player that
can read any active disc – in fact, any 16-bit PCM stereo sound file – decode the
musical structure embedded in the sound signal, and use it to perform high-quality
source separation. The listener can then see and manipulate the sound sources
in the acoustic space. Our system is composed of two parts.
First, a CD reader extracts the audio data of the stereo track and decodes
the musical structure embedded in the audio signal (Section 2). This additional
information consists of the combination of active sources for each time-frequency
atom. As shown in [16], this permits an informed source separation of high quality
(patent pending). In our current system, we get up to 5 individual tracks out of
the stereo mix.
Second, a sound spatializer is able to map in real time all the sound sources
to any position in the acoustic space (Section 3). Our system supports either
binaural (headphones) or multi-loudspeaker configurations. As shown in [14],
the spatialization is done in the spectral domain, is based on acoustics and
interaural cues, and the listener can control the distance and the azimuth of
each source.
Finally, the corresponding software implementation is described in Section 4.
2 Source Separation
In this section, we present a general overview of the informed source separation
technique which is at the heart of the active CD player. This technique is based
on a two-step coder-decoder configuration [16][17], as illustrated in Fig. 1. The
decoder is the active CD player, which can perform separation only on mix signals
that have been generated by the coder. At the coder, the mix signal is generated
as a linear instantaneous stationary stereo (LISS) mixture, i.e., a summation of
source signals with constant-gain panning coefficients. Then, the system looks
for the two sources that best "explain" the mixture (i.e., the two source signals
that are predominant in the mix signal) at different time intervals and frequency
channels, and the corresponding source indexes are embedded into the mixture
signal as side-information using watermarking. The watermarked mix signal is
then quantized to 16-bit PCM. At the decoder, the only available signal is the
watermarked and quantized mix signal. The side-information is extracted from
the mix signal and used to separate the source signals by a local time/frequency
mixture inversion process.
Fig. 1. Coder-decoder scheme of the informed source separation: the source signals s_i[n] and the mix are analyzed by MDCT; at the decoder, the watermark is extracted from the MDCT coefficients X_LW[f, t], X_RW[f, t] of the mix, the codes I_ft are decoded, a 2×2 local inversion using A is performed, and the estimated sources ŝ_i[n] are re-synthesized by IMDCT.
Audio signals are known to be sparse in the time-frequency (TF) domain [29][7][15][20]. Therefore, the separation of source signals can be carried out more
efficiently in the TF domain. The Modified Discrete Cosine Transform (MDCT)
[21] is used as the TF decomposition since it presents several properties very
suitable for the present problem: good energy concentration (hence emphasizing
audio signals sparsity), very good robustness to quantization (hence robustness
to quantization-based watermarking), orthogonality and perfect reconstruction.
A detailed description of the MDCT equations is not provided in the present
paper, since it can be found in many papers, e.g., [21]. The MDCT is applied on
the source signals and on the mixture signal at the input of the coder to enable
the selection of predominant sources in the TF domain. Watermarking of the
resulting side-information is applied on the MDCT coefficients of the mix signal
and the time samples of the watermarked mix signal are provided by inverse
MDCT (IMDCT). At the decoder, the (PCM-quantized) mix signal is MDCT-
transformed and the side-information is extracted from the resulting coefficients.
Source separation is also carried out in the MDCT domain, and the resulting
separated MDCT coefficients are used to reconstruct the corresponding time-
domain separated source signals by IMDCT. Technically, the MDCT / IMDCT
is applied on signal time frames of W = 2048 samples (46.5ms for a sampling
frequency fs = 44.1kHz), with a 50%-overlap between consecutive frames (of
1024 frequency bins). The frame length W is chosen to follow the dynamics of
music signals while providing a frequency resolution suitable for the separation.
Appropriate windowing is applied at both analysis and synthesis to ensure the
“perfect reconstruction” property [21].
Since the MDCT is a linear transform, the LISS source separation problem
remains LISS in the transformed domain. For each frequency bin f and time bin
t, we thus have:
X(f, t) = A · S(f, t) (1)
where X(f, t) = [X_1(f, t), X_2(f, t)]^T denotes the stereo mixture coefficients vector
and S(f, t) = [S_1(f, t), ..., S_N(f, t)]^T denotes the N-source coefficients vector.
Because of audio signal sparsity in the TF domain, at most 2 sources are
assumed to be relevant, i.e., of significant energy, at each TF bin (f, t). Therefore,
the mixture is locally given by:

X(f, t) ≈ A_{I_ft} · S_{I_ft}(f, t)    (2)

where I_ft denotes the set of 2 relevant sources at TF bin (f, t) and A_{I_ft} represents
the 2 × 2 mixing sub-matrix made of the columns A_i of A, i ∈ I_ft. If Ī_ft
denotes the complementary set of non-active (or at least poorly active) sources
at TF bin (f, t), the source signals at bin (f, t) are estimated by [7]:

Ŝ_{I_ft}(f, t) = A_{I_ft}⁻¹ X(f, t),   Ŝ_{Ī_ft}(f, t) = 0    (3)
where A_{I_ft}⁻¹ denotes the inverse of A_{I_ft}. Note that such a separation technique
exploits the 2-channel spatial information of the mixture signal and relaxes the
restrictive assumption of a single active source at each TF bin, as made in
[29][2][3].
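As an illustration of Equations (1)-(3), here is a minimal Python sketch (not the authors' implementation) of the per-bin inversion, together with the coder-side oracle choice of the dominant pair described in the next paragraphs:

    # Sketch of Eqs. (1)-(3): at each TF bin, only the two dominant sources I_ft
    # are recovered by inverting the corresponding 2x2 sub-matrix of A; the
    # other sources are set to zero.
    from itertools import combinations
    import numpy as np

    def separate_bin(X_bin, A, I_ft):
        """X_bin: (2,) stereo MDCT coefficients at one TF bin; A: (2, N) mixing
        matrix; I_ft: pair of indexes of the dominant sources. Returns the (N,)
        vector of estimated source coefficients (Eq. 3)."""
        S_hat = np.zeros(A.shape[1])
        idx = list(I_ft)
        S_hat[idx] = np.linalg.solve(A[:, idx], X_bin)   # A_{I_ft}^{-1} X(f, t)
        return S_hat

    def oracle_pair(X_bin, S_bin, A):
        """Coder-side selection of I_ft: the pair of sources minimizing the MSE
        between the original and the re-estimated source coefficients."""
        pairs = combinations(range(A.shape[1]), 2)
        return min(pairs, key=lambda p: np.sum((separate_bin(X_bin, A, p) - S_bin) ** 2))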
The side-information that is transmitted between coder and decoder (in ad-
dition to the mix signal) mainly consists of the coefficients of the mixing matrix
A and the combination of indexes I_ft that identifies the predominant sources in
each TF bin. This contrasts with classic blind and semi-blind separation meth-
ods where both types of information have to be estimated from the mix signal
only, generally in two steps, each of which can be a very challenging task and a
source of significant errors.
As for the mixing matrix, the number of coefficients to be transmitted is quite
low in the present LISS configuration². Therefore, the transmission cost of A is
negligible compared to the transmission cost of I_ft, and it occupies a very small
portion of the watermarking capacity.
As for the source indexes, I_ft is determined at the coder for each TF bin
using the source signals, the mixture signal, and the mixing matrix A, as the
combination that provides the lowest mean squared error (MSE) between the
original source signals and the estimated source signals obtained with Equation
(3) (see [17] for details). This process follows the line of oracle estimators, as in-
troduced in [26] for the general purpose of evaluating the performances of source
² For 5-source signals, if A is made of normalized column vectors depending on source azimuths, then we have only 5 coefficients.
Fig. 2. Example of QIM using a set of quantizers for C(t, f) = 2 and the resulting
global grid. We have Δ(t, f) = 2^{C(t,f)} · Δ_QIM. The binary code 01 is embedded into
the MDCT coefficient X(t, f) by quantizing it to X^w(t, f) using the quantizer indexed
by 01.
The watermarking technique embeds the side-information into the MDCT coefficients of the
mixture signal, in combination with the use of a psycho-acoustic model (PAM) for
the control of inaudibility. It has been presented in detail in [19][18]. Therefore,
we just present the general lines of the watermarking process in this section, and
we refer the reader to these papers for technical details.
The embedding principle is the following. Let us denote by C(t, f ) the capac-
ity at TF bin (t, f ), i.e. the maximum size of the binary code to be embedded
in the MDCT coefficient at that TF bin (under inaudibility constraint). We will
see below how C(t, f) is determined for each TF bin. For each TF bin (t, f), a
set of 2^{C(t,f)} uniform quantizers is defined, whose quantization levels are inter-
twined, and each quantizer represents a C(t, f)-bit binary code. Embedding a
given binary code in a given MDCT coefficient is done by quantizing this coef-
ficient with the corresponding quantizer (i.e., the quantizer indexed by the code
to transmit; see Fig. 2). At the decoder, recovering the code is done by compar-
ing the transmitted MDCT coefficient (potentially corrupted by transmission
noise) with the 2^{C(t,f)} quantizers, and selecting the quantizer with the quan-
tization level closest to the transmitted MDCT coefficient. Note that because
the capacity values depend on (f, t), those values must be transmitted to the
decoder to select the right set of quantizers. For this, a fixed-capacity embedding
"reservoir" is allocated in the higher-frequency region of the spectrum, and the
capacity values are actually defined within subbands (see [18] for details). Note
also that the complete binary message to transmit (here the set of I_ft codes) is
split and spread across the different MDCT coefficients according to the local
capacity values, so that each MDCT coefficient carries a small part of the com-
plete message. Conversely, the decoded elementary messages have to be concate-
nated to recover the complete message. The embedding rate R is given by the
average total number of embedded bits per second of signal. It is obtained by
summing the capacity C(t, f ) over the embedded region of the TF plane and
dividing the result by the signal duration.
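A minimal sketch of this QIM principle (see Fig. 2), embedding and recovering a C-bit code in a single MDCT coefficient with 2^C intertwined quantizers of step 2^C · Δ_QIM, offset from one another by Δ_QIM (not the authors' code):

    import numpy as np

    def qim_embed(x, code, C, delta_qim):
        """Quantize the MDCT coefficient x with the quantizer indexed by `code`
        (levels: code * delta_qim + k * 2^C * delta_qim, k integer)."""
        step = (2 ** C) * delta_qim                   # Delta(t,f) = 2^C * Delta_QIM
        return np.round((x - code * delta_qim) / step) * step + code * delta_qim

    def qim_decode(y, C, delta_qim):
        """Recover the code: pick the quantizer whose nearest level is closest to y."""
        step = (2 ** C) * delta_qim
        codes = np.arange(2 ** C)
        levels = np.round((y - codes * delta_qim) / step) * step + codes * delta_qim
        return int(codes[np.argmin(np.abs(y - levels))])

    # The code survives any perturbation smaller than delta_qim / 2:
    x_w = qim_embed(0.3127, code=0b01, C=2, delta_qim=1e-3)
    assert qim_decode(x_w + 4e-4, C=2, delta_qim=1e-3) == 0b01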
The performance of the embedding process is determined by two related con-
straints: the watermark decoding must be robust to the 16-bit PCM conversion
of the mix signal (which is the only source of noise because the “perfect recon-
struction” property of MDCT ensures transparency of IMDCT/MDCT chained
processes), and the watermark must be inaudible. The time-domain PCM quan-
tization leads to additive white Gaussian noise on MDCT coefficients, which
induces a lower bound for ΔQIM the minimum distance between two different
levels of all QIM quantizers (see Fig. 2). Given that lower bound, the inaudibility
constraint induces an upper bound on the number of quantizers, hence a cor-
responding upper bound on the capacity C(t, f ) [19][18]. More specifically, the
constraint is that the power of the embedding error in the worst case remains
under the masking threshold M (t, f ) provided by a psychoacoustic model. The
PAM is inspired by the MPEG-AAC model [11] and adapted to the present
data hiding problem. It is shown in [18] that the optimal capacity is given by:
C^α(t, f) = ⌊ (1/2) · log₂( M(t, f) · 10^{α/10} / Δ²_QIM + 1 ) ⌋    (4)

where ⌊·⌋ denotes the floor function, and α is a scaling factor (in dB) that enables
users to control the trade-off between signal degradation and embedding rate
by translating the masking threshold. Signal quality is expected to decrease as
embedding rate increases, and vice-versa. When α > 0dB, the masking threshold
is raised. Larger values of the quantization error allow for larger capacities (and
thus a higher embedding rate), at the price of potentially lower quality. Conversely,
when α < 0 dB, the masking threshold is lowered, leading to a "safety
margin” for the inaudibility of the embedding process, at the price of lower
embedding rate. It can be shown that the embedding rate R^α corresponding to
C^α and the basic rate R = R^0 are related by [18]:

R^α ≈ R + α · (log₂(10) / 10) · F_u    (5)

(F_u being the bandwidth of the embedded frequency region). This linear relation
makes it easy to control the embedding rate by setting α.
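A small sketch of Equations (4) and (5), assuming M holds the per-bin masking thresholds (in power) returned by the PAM:

    import numpy as np

    def capacity(M, delta_qim, alpha_db=0.0):
        """Eq. (4): floor(0.5 * log2(M * 10^(alpha/10) / Delta_QIM^2 + 1))."""
        return np.floor(0.5 * np.log2(M * 10 ** (alpha_db / 10.0) / delta_qim ** 2 + 1)).astype(int)

    def embedding_rate(C, duration_s):
        """Total number of embedded bits divided by the signal duration (bits/s)."""
        return C.sum() / duration_s

    # Raising the masking threshold (alpha > 0 dB) increases the rate, as in Eq. (5).
    M = np.abs(np.random.randn(100, 1024)) ** 2     # stand-in masking thresholds
    assert embedding_rate(capacity(M, 1e-3, 6.0), 2.3) >= embedding_rate(capacity(M, 1e-3), 2.3)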
The inaudibility of the watermarking process has been assessed by subjective
and objective tests. In [19][18], Objective Difference Grade (ODG) scores [24][12]
were calculated for a large range of embedding rates and different musical styles.
ODG remained very close to zero (hence imperceptibility of the watermark)
for rates up to about 260kbps for musical styles such as pop, rock, jazz, funk,
bossa, fusion, etc. (and “only” up to about 175kbps for classical music). Such
rates generally correspond to the basic level of the masking curve allowed by
the PAM (i.e. α = 0dB). More “comfortable” rates can be set between 150
and 200kbits/s to guarantee transparent quality for the embedded signal. This
flexibility is used in our informed source separation system to fit the embedding
capacity with the bit-rate of the side-information, which is at the very reasonable
value of 64kbits/s/channel. Here, the watermarking is guaranteed to be “highly
inaudible”, since the masking curve is significantly lowered to fit the required
capacity.
3 Sound Spatialization
Now that we have recovered the different sound sources present in the original
mix, we can allow the user to manipulate them in space. We consider each sound
source as point-like and omni-directional in the horizontal plane, located by its (ρ, θ)
coordinates, where ρ is the distance of the source to the head center and θ is the
azimuth angle. Indeed, as a first approximation in most musical situations, both
the listeners and the instrumentalists are standing on the (same) ground, with no rel-
ative elevation. Moreover, we consider that the distance ρ is large enough for the
acoustic wave to be regarded as planar when reaching the ears.
where α(f) is the average scaling factor that best suits our model, in the least-
squares sense, for each listener of the CIPIC database (see Fig. 3). The overall
error of this model over the CIPIC database for all subjects, azimuths, and
frequencies is 4.29 dB.
Fig. 3. Level scaling factor α (top) and time scaling factor β (bottom) as functions of frequency (kHz).
Interaural Time Differences. Because of the head shadowing, Viste uses for
the ITDs a model based on sin(θ) + θ, after Woodworth [28]. However, from the
theory of the diffraction of a harmonic plane wave by a sphere (the head), the
ITDs should be proportional to sin(θ). Contrary to the model by Kuhn [13], our
model takes into account the inter-subject variation and the full frequency band.
The ITD model is then expressed as

ITD(θ, f) = β(f) · r · sin(θ) / c

where β is the average scaling factor that best suits our model, in the least-
squares sense, for each listener of the CIPIC database (see Fig. 3), r denotes the
head radius, and c is the sound celerity. The overall error of this model over the
CIPIC database is 0.052 ms (thus comparable to the 0.045 ms error of the model
by Viste).
Because of the atmospheric absorption, the sound spectrum changes with the distance. More precisely, the spectral cen-
troid moves towards the low frequencies as the distance increases. In [4], the
authors show that the frequency-dependent attenuation due to atmospheric at-
tenuation is roughly proportional to f 2 , similarly to the ISO 9613-1 norm [10].
Here, we manipulate the magnitude spectrum to simulate the distance between
the source and the listener. Conversely, one could measure the spectral centroid
(related to brightness) to estimate the distance from the source to the listener.
In a concert room, the distance is often simulated by placing the speaker near
/ away from the auditorium, which is sometimes physically restricted in small
rooms. In fact, the architecture of the room plays an important role and can
lead to severe modifications in the interpretation of the piece.
Here, simulating the distance is a matter of changing the magnitude of each
short-term spectrum X. More precisely, the ISO 9613-1 norm [10] gives the
frequency-dependent attenuation factor in dB for given air temperature, humid-
ity, and pressure conditions. At distance ρ, the magnitudes of X(f ) should be
attenuated by D(f, ρ) decibels:
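A minimal sketch of this attenuation step, assuming a single hypothetical absorption constant a such that D(f, ρ) ≈ a · f² · ρ, consistent with the rough f² proportionality mentioned above (the actual ISO 9613-1 values depend on air temperature, humidity, and pressure):

    import numpy as np

    def attenuate_with_distance(X, freqs_hz, rho_m, a=1.5e-9):
        """X: complex short-term spectrum; freqs_hz: bin frequencies (Hz);
        rho_m: source-listener distance (m). The default a gives roughly
        0.15 dB/m of attenuation at 10 kHz (hypothetical value)."""
        D_db = a * freqs_hz ** 2 * rho_m      # D(f, rho) in decibels
        return X * 10.0 ** (-D_db / 20.0)     # attenuate the magnitudes only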
the best panning coefficients under CIPIC conditions for the pair of speakers to
match the binaural signals at the ears (see Equations (11) and (12)) are then
given by:

K_L(t, f) = C · (C_RR H_L − C_LR H_R),    (17)
K_R(t, f) = C · (−C_RL H_L + C_LL H_R)    (18)

with the determinant computed as:

C = 1 / (C_LL C_RR − C_RL C_LR).    (19)

During diffusion, the left and right signals (Y_L, Y_R) to feed the left and right
speakers are obtained by multiplying the short-term spectra X with K_L and
K_R, respectively:

Y_L(t, f) = K_L(t, f) X(t, f) = C · (C_RR X_L − C_LR X_R),    (20)
Y_R(t, f) = K_R(t, f) X(t, f) = C · (−C_RL X_L + C_LL X_R).    (21)
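As a small sketch, Equations (17)-(21) can be transcribed directly (all quantities per TF bin): the speaker feeds Y_L, Y_R are obtained from the binaural signals X_L, X_R by inverting the 2 × 2 system of acoustic paths of Fig. 4.

    def transaural_feeds(X_L, X_R, C_LL, C_LR, C_RL, C_RR):
        """Compute the speaker feeds from the binaural signals and the four
        acoustic paths (Fig. 4), per TF bin."""
        C = 1.0 / (C_LL * C_RR - C_RL * C_LR)     # Eq. (19)
        Y_L = C * (C_RR * X_L - C_LR * X_R)       # Eq. (20)
        Y_R = C * (-C_RL * X_L + C_LL * X_R)      # Eq. (21)
        return Y_L, Y_R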
Fig. 4. Stereophonic loudspeaker display: the sound source X reaches the ears L, R
through four acoustic paths (C_LL, C_LR, C_RL, C_RR)
Fig. 5. Pairwise paradigm: for a given sound source, signals are dispatched only to the
two speakers closest to it (in azimuth)
In a setup with many speakers we use the classic pair-wise paradigm [9],
which consists in choosing, for a given source, only the two speakers closest to it (in
Fig. 6. Overview of the software system architecture
azimuth): one at the left of the source, the other at its right (see Fig. 5). The
left and right signals computed for the source are then dispatched accordingly.
4 Software System
Our methods for source separation and sound spatialization have been imple-
mented as a real-time software system, programmed in C++ and using the
Qt4, JACK, and FFTW libraries. These libraries were chosen to ensure portability and
performance on multiple platforms. The current implementation has been tested
on the Linux and Mac OS X operating systems, but should work with very minor
changes on other platforms, e.g., Windows.
Fig. 6 shows an overview of the architecture of our software system. Source
separation and sound spatialization are implemented as two different modules.
We rely on the JACK audio port system to route audio streams between these two
modules in real time.
This separation into two modules was mainly dictated by a different choice of
distribution license: the source separation of the active player should be patented
and released without sources, while the spatializer will be freely available under
the GNU General Public License.
4.1 Usage
Player. The active player is presented as a simple audio player, based on JACK.
The graphical user interface (GUI) is a very common player interface. It allows
the user to play or pause the reading/decoding. The player reads "activated" stereo files,
from an audio CD or file, and then decodes the stereo mix in order to extract
the N (mono) sources. These sources are then transferred to N JACK output
ports, currently named QJackPlayerSeparator:output_i, with i in [1, N].
Fig. 7. From the stereo mix stored on the CD, our player allows the listener
(center) to manipulate 5 sources in the acoustic space, using here an octophonic display
(top) or headphones (bottom)
Fig. 7 shows the current interface of the spatializer, which displays a bird’s
eye view of the audio scene. The user’s avatar is in the middle, represented by
a head viewed from above. He is surrounded by various sources, represented as
The binaural configuration is distinguished by the fact that it has only two speakers with neither azimuth nor distance
specified. Fig. 8 shows the speaker configuration files for the binaural and octophonic
(8-speaker) configurations.
4.2 Implementation
Player. The current implementation is divided into three threads. The main
thread is the Qt GUI. A second thread reads and buffers data from the stereo
file, to be able to compensate for any physical CD reader latency. The third
thread is the JACK process function. It separates the data for the N sources and
feeds the output ports accordingly. In the current implementation, the number
of output sources is fixed to N = 5.
Our source separation implementation is rather efficient: for a Modified
Discrete Cosine Transform (MDCT) of W samples, we only perform a Fast Fourier
Transform (FFT) of size W/4. Indeed, an MDCT of length W is essentially equivalent
to a type-IV DCT of length W/2, which can be computed with an FFT of length
W/4. Thus, as we use MDCT and IMDCT of size W = 2048, we only perform FFTs
and IFFTs of 512 samples.
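For reference, a direct (and deliberately unoptimized, O(W²)) sketch of the windowed MDCT/IMDCT pair with the parameters of Section 2 is given below; the actual player uses the FFT-of-size-W/4 computation just described.

    import numpy as np

    def mdct(frame, window):
        """Direct MDCT of one frame of W = 2N samples -> N coefficients."""
        W = len(frame); N = W // 2
        n = np.arange(W)[:, None]; k = np.arange(N)[None, :]
        basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
        return (window * frame) @ basis

    def imdct(coeffs, window):
        """Inverse MDCT of N coefficients -> one windowed frame of 2N samples."""
        N = len(coeffs); W = 2 * N
        n = np.arange(W)[:, None]; k = np.arange(N)[None, :]
        basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
        return (2.0 / N) * window * (basis @ coeffs)

    # Perfect reconstruction with W = 2048 and 50% overlap, using a sine window
    # (which satisfies the Princen-Bradley condition), as described in Section 2.
    W = 2048; N = W // 2
    win = np.sin(np.pi / W * (np.arange(W) + 0.5))
    x = np.random.randn(6 * N)
    y = np.zeros_like(x)
    for start in range(0, len(x) - W + 1, N):
        y[start:start + W] += imdct(mdct(x[start:start + W], win), win)
    assert np.allclose(x[N:-N], y[N:-N])    # interior samples are recovered exactly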
Spatializer. For each source i, the spatializer first computes the left and right binaural
spectra X_iL and X_iR from the source spectrum (see Equations (11) and (12)). The
dispatcher then chooses the pair (j, j + 1) of speakers surrounding the azimuth θ_i,
transforms the spectra X_iL and X_iR by the coefficients corresponding to this speaker
pair (see Equations (20) and (21)), and adds the resulting spectra Y_j and Y_{j+1} to the
spectra of these speakers. Finally, for each speaker, its spectrum is transformed with an
IFFT to obtain back in the time domain the mono signal y_j for the corresponding output.
Source spatialization is more computation-intensive than source separation,
mainly because it requires more transforms (N FFTs and M IFFTs) of larger
size W = 2048. For now, source spatialization is implemented as a serial pro-
cess. However, we can see that this pipeline is highly parallel. Indeed, almost
everything operates on separate data. Only the spectra of the speakers may be
accessed concurrently, to accumulate the spectra of sources that would be spa-
tialized to the same or neighbouring speaker pairs. These spectra should then
be protected with mutual exclusion mechanisms. A future version will take ad-
vantage of multi-core processor architectures.
4.3 Experiments
Our current prototype has been tested on an Apple MacBook Pro, with an Intel
Core 2 Duo 2.53 GHz processor, connected to headphones or to an 8-speaker sys-
tem, via a MOTU 828 MKII soundcard. For such a configuration, the processing
load is well contained. In order to run in real time, given a signal sampling
Fig. 10. Enhanced graphical interface with pictures of instruments for sources and
propagating sound waves represented as colored circles
frequency of 44.1kHz and windows of 2048 samples, the overall processing time
should be less than 23ms. With our current implementation, 5-source separation
and 8-speaker spatialization, this processing time is in fact less than 3ms on the
laptop mentioned previously. Therefore, the margin to increase the number of
sources to separate and/or the number of loudspeakers is quite comfortable. To
confirm this, we exploited the split of the source separation and spatialization
modules to test the spatializer without the active player, since the latter is cur-
rently limited to 5 sources. We connected to the spatializer a multi-track player
that reads several files simultaneously and exposes these tracks as JACK output
ports. Tests showed that the spatialization can be applied to roughly 48 sources
on 8 speakers, or 40 sources on 40 speakers on this computer.
This performance leaves some processing power available for other com-
putations, to improve the user experience for example. Fig. 10 shows an example of
an enhanced graphical interface where the sources are represented with pictures
of the instruments, and the propagation of the sound waves is represented for
each source by time-evolving colored circles. The color of each circle is computed
from the color (spectral envelope) of the spectrum of each source and updated
in real time as the sound changes.
5 Conclusion

We have presented a real-time system for musical interaction from stereo files,
fully backward-compatible with standard audio CDs. This system consists of a
source separator and a spatializer.
The source separation is based on the sparsity of the source signals in the
spectral domain and the exploitation of the stereophony. This system is char-
acterized by a quite simple separation process and by the fact that some side-
information is inaudibly embedded in the signal itself to guide the separation
process. Compared to (semi-)blind approaches also based on sparsity and lo-
cal mixture inversion, the informed aspect of the separation guarantees the optimal
combination of the sources, thus leading to a remarkable increase in the quality of
the separated signals.
The sound spatialization is based on a simplified model of the head-related
transfer functions, generalized to any multi-loudspeaker configuration using a
transaural technique for the best pair of loudspeakers for each sound source.
Although this quite simple technique does not compete with the 3D accuracy of
Ambisonics or holophony (Wave Field Synthesis), it is very flexible (no specific
loudspeaker configuration) and suitable for a large audience (no hot-spot effect)
with sufficient sound quality.
The resulting software system is able to separate 5-source stereo mixtures
(read from audio CD or 16-bit PCM files) in real time and it enables the user to
remix the piece of music during playback with basic functions such as volume
and spatialization control. The system has been demonstrated in several coun-
tries with excellent feedback from the users/listeners, with a clear potential in
terms of musical creativity, pedagogy, and entertainment.
For now, the mixing model imposed by the informed source separation is
generally over-simplistic when professional/commercial music production is
at stake. Extending the source separation technique to high-quality convolutive
mixing is part of our future research.
As shown in [14], the model we use for the spatialization is more general, and
can be used as well to localize audio sources. Thus we would like to add the
automatic detection of the speaker configuration to our system, from a pair of
microphones placed in the audience, as well as the automatic fine tuning of the
spatialization coefficients to improve the 3D sound effect.
Regarding performance, many operations act on separate data and thus
could easily be parallelized on modern hardware architectures. Last but not least,
we are also porting the whole application to mobile touch devices, such as smart
phones and tablets. Indeed, we believe that these devices are perfect targets for
a system in between music listening and gaming, and gestural interfaces with
direct interaction to move the sources are very intuitive.
Acknowledgments
This research was partly supported by the French ANR (Agence Nationale de la
Recherche) DReaM project (ANR-09-CORD-006).
References
1. Algazi, V.R., Duda, R.O., Thompson, D.M., Avendano, C.: The CIPIC HRTF
database. In: Proceedings of the IEEE Workshop on Applications of Signal Pro-
cessing to Audio and Acoustics (WASPAA), New Paltz, New York, pp. 99–102
(2001)
2. Araki, S., Sawada, H., Makino, S.: K-means based underdetermined blind speech
separation. In: Makino, S., et al. (eds.) Blind Source Separation, pp. 243–270.
Springer, Heidelberg (2007)
3. Araki, S., Sawada, H., Mukai, R., Makino, S.: Underdetermined blind sparse source
separation for arbitrarily arranged multiple sensors. Signal Processing 87(8), 1833–
1847 (2007)
4. Bass, H., Sutherland, L., Zuckerwar, A., Blackstock, D., Hester, D.: Atmospheric
absorption of sound: Further developments. Journal of the Acoustical Society of
America 97(1), 680–683 (1995)
5. Berg, R.E., Stork, D.G.: The Physics of Sound, 2nd edn. Prentice Hall, Englewood
Cliffs (1994)
6. Blauert, J.: Spatial Hearing. revised edn. MIT Press, Cambridge (1997); Transla-
tion by J.S. Allen
7. Bofill, P., Zibulevski, M.: Underdetermined blind source separation using sparse
representations. Signal Processing 81(11), 2353–2362 (2001)
8. Chen, B., Wornell, G.: Quantization index modulation: A class of provably good
methods for digital watermarking and information embedding. IEEE Transactions
on Information Theory 47(4), 1423–1443 (2001)
9. Chowning, J.M.: The simulation of moving sound sources. Journal of the Audio
Engineering Society 19(1), 2–6 (1971)
Pitch Gestures in Generative Modeling of Music

Kristoffer Jensen
1 Introduction
Music generation has more and more uses in today's media. Whether in computer games,
interactive music performances, or interactive films, the emotional effect of the
music is essential to the appreciation of the media. While music has traditionally
been generated from pre-recorded loops that are mixed on the fly, or recorded with
traditional orchestras, a better understanding of, and better models for, generative
music are believed to push interactive generative music into multimedia. Papadopoulos
and Wiggins (1999) gave an early overview of the methods of algorithmic
composition, deploring "that the music that they produce is meaningless: the
computers do not have feelings, moods or intentions". While vast progress has been
made in the decade since this statement, there is still room for improvement.
The cognitive understanding of musical time perception is the basis of the work
presented here. According to Kühl (2007), musical time perception can be separated
into three time-scales: the short, microtemporal, related to microstructure; the
mesotemporal, related to gesture; and the macrotemporal, related to form. These
time-scales are named (Kühl and Jensen 2008) subchunk, chunk and superchunk;
subchunks extend from 30 ms to 300 ms, the conscious mesolevel of chunks from
300 ms to 3 sec, and the reflective macrolevel of superchunks from 3 sec to roughly
30−40 sec. The subchunk is related to individual notes, the chunk to meter and
gesture, and the superchunk to form. The superchunk was analyzed and used in a
generative model in Kühl and Jensen (2008), and the chunks were analyzed in Jensen
and Kühl (2009). Further analysis of the implications of how temporal perception is
related to the durations and timing of existing music, together with anatomical and
perceptual findings from the literature, is given in Section 2, along with an overview
of previous work on rhythm. Section 3 presents the proposed model for the inclusion
of pitch gestures in music generation using statistical methods, and Section 4 discusses
the integration of the pitch gesture into the generative music model. Finally, Section 5
offers a conclusion.
Fraisse (1982) distinguishes between temps longs (>400 msec) and temps courts
(<400 msec), with two-to-one ratios only found between temps longs
and courts. As for natural tempo, when subjects are asked to reproduce temporal
intervals, they tend to overestimate short intervals (making them longer) and under-
estimate long intervals (making them shorter). At an interval of about 500 msec to
600 msec, there is little over- or under-estimation. However, there are large
differences across individuals: the spontaneous tempo is found to be between 1.1 and 5
taps per second, with 1.7 taps per second being the most common. There are also many
spontaneous motor movements that occur at a rate of approximately 2 per second, such as
walking, sucking in the newborn, and rocking.
Friberg (1991) and Widmer (2002) give rules for how the dynamics and timing
should be changed according to the musical position of the notes. Dynamic changes
include a 6 dB increase (doubling), and deviations of up to 100 msec in the duration,
depending on the musical position of the notes. With respect to these timing changes,
Snyder (2000) indicates the categorical perception of beats, measures and patterns. The
perception of timing deviations is an example of within-category distinctions:
even with large deviations from the nominal score, the notes are recognized as falling
on the beats.
As for melodic perception, Thomassen (1982) investigated the role of intervals as
melodic accents. In a controlled experiment, he modeled the anticipation using an
attention span of three notes, and found that accent perception is described 'fairly
well'. The first of two opposite frequency changes gives the strongest accentuation,
two changes in the same direction are equally effective, and the larger of two changes
is more powerful, as are frequency rises compared to frequency falls.
Fig. 1. Different shapes of a chunk. Positive (a-c) or negative arches (g-i), rising (a,d,g) or
falling slopes (c,f,i).
Fig. 2. Note (top) and interval probability density function obtained from The Digital Tradition
folk database
The note and interval probability density functions (pdf) obtained from The Digital
Tradition folk database (figure 2) are used to model the pitch of music. According to
Vos and Troost (1989), the smaller intervals occur more often in descending form, while
the larger ones occur mainly in ascending form. However, since the slope and arch are
modelled in this work, the pdf of the intervals is mirrored and added around zero, and
subsequently weighted and copied back to recreate the full interval pdf. This later makes
it possible to create a melodic contour with given slope and arch characteristics, as
detailed below.
In order to generate pitch contours with gestures, the model in figure 1 is used.
For the pitch contour, only the neutral gesture (e) in figure 1, the falling and rising
slopes (d) and (f), and the positive and negative arches (b) and (h) are modeled here.
The gestures are obtained by weighting the positive and negative slopes of the interval
probability density function with a weight w.
Here, pdf_i+ is the mirrored/added positive interval pdf, and w is the weight. If
w = 0.5, a neutral gesture is obtained; if w < 0.5, a positive slope is obtained; and if
w > 0.5, a negative slope is obtained. In order to obtain an arch, the value of the
weight is changed to w = 1 − w in the middle of the gesture.
In order to obtain a musical scale, the probability density function for the intervals
(pdf_i) is multiplied by a suitable pdf for the scale, pdf_s, such as the one illustrated in
figure 2 (top),

pdf = shift(pdf_i, n_0) · pdf_s · w_r.    (2)

As pdf_s is only defined for one octave, it is circularly repeated. The interval
probabilities pdf_i are shifted for each note n_0. This is done under the hypothesis that
the intervals and scale notes are independent. So as to approximately retain the register,
a register weight w_r is further multiplied into the pdf. This weight is one
for one octave, and decreases exponentially on both sides, in order to lower the
probability of obtaining notes far from the original register.
In order to obtain successive notes, the cumulative density function (cdf) is
calculated from eq. (2). If r is a random variable with uniform distribution in the
interval (0, 1), then the next note n_0 is found as the index of the first occurrence of
cdf > r.
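The following sketch (with toy, hypothetical distributions and parameters) illustrates this sampling scheme; the weight w is assumed here to apply to the descending-interval side, consistent with w < 0.5 yielding a positive (rising) slope.

    import numpy as np

    def next_note(n0, pdfi_pos, pdfs_octave, w, register_center, rng, sigma=12.0):
        """n0: current (MIDI) note; pdfi_pos: one-sided interval pdf for sizes
        0..K; pdfs_octave: 12-value scale pdf; w: gesture weight. Returns the
        next note, drawn by inverse transform sampling of the combined pdf."""
        K = len(pdfi_pos) - 1
        notes = np.arange(n0 - K, n0 + K + 1)
        intervals = notes - n0
        pdf_i = np.where(intervals < 0, w, 1.0 - w) * pdfi_pos[np.abs(intervals)]
        pdf_s = pdfs_octave[notes % 12]                         # circularly repeated scale pdf
        w_r = np.exp(-np.abs(notes - register_center) / sigma)  # register weight (toy shape)
        cdf = np.cumsum(pdf_i * pdf_s * w_r)
        cdf /= cdf[-1]
        return int(notes[np.searchsorted(cdf, rng.random())])   # first index with cdf > r

    rng = np.random.default_rng(0)
    pdfi_pos = np.exp(-np.arange(13) / 3.0)                     # toy one-sided interval pdf
    c_major = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], float)
    melody = [60]
    for _ in range(16):                                         # rising gesture: w = 0.2
        melody.append(next_note(melody[-1], pdfi_pos, c_major, 0.2, 60, rng))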
Examples of pitch contours obtained by setting w=0, and w=1, respectively, are
shown in figure 3. The rising and falling pitches are reset after each gesture, in order
to stay at the same register throughout the melody.
The positive and negative slopes are easily recognized when listening to the
resulting melodies, because of the abrupt pitch fall at the end of each gesture. The
arches, in comparison, need loudness and/or brightness variations in order to be
perceptually recognized. Without this, a positive slope can be confused with a
negative arch that is shifted in time, or with a positive or negative slope, likewise
shifted in time. Normally, an emphasis at the beginning of each gesture is
sufficient for the slopes, while the arches may be in need of an emphasis at the peak
of the arch as well.
Fig. 3. Pitch contours of four melodies with positive arch, rising slope, negative arch and
falling slope
Fig. 4. The generative model including meter, gesture and form. Structural changes to the note
values, the intensity and the rhythm are made approximately every 30 seconds, and gesture
changes are made on average every seven notes
The notes are created using a simple envelope model and the synthesis method
dubbed the brightness creation function (bcf, Jensen 1999), which creates a sound with
exponentially decreasing amplitudes and allows continuous control of the
brightness. An accent affects the note so that the loudness and brightness are doubled, and
the duration is increased by 25%, with 75% of the elongation made by advancing the
start of the note, as found in Jensen (2010).
These findings are incorporated into a generative model of tonal music. A subset of notes
(3-5) is chosen at each new form (superchunk), together with a new dynamic level. At
the chunk level, new notes are created in a metrical loop, and the gestures are added
to the pitch contour and used for additional gesture emphasis. Finally, at the
microtemporal (subchunk) level, expressive deviations are added in order to render
the loops musical. The interaction of the rigid meter with the looser pitch gesture
gives the generated notes a more musical sense, through the uncertainty and the double
stream that result. The pure rising and falling pitch gestures are still clearly
perceptible, while the arches are less present. By setting w in eq. (1) to something in
between (0, 1), e.g. 0.2 or 0.8, more realistic, agreeable rising and falling gestures
result. Still, the arches are more natural to the ear, while the rising and falling
gestures demand more attention, in particular perhaps the rising gestures.
5 Conclusion
The automatic generation of music needs a model to render the music expressive.
This model is found using knowledge from studies of the time perception of music, and
further studies of the cognitive and perceptual aspects of rhythm. Indeed, the
generative model consists of three sources, corresponding to the immediate
microtemporal, the present mesotemporal and the long-term macrotemporal memory.
This corresponds to the note, the gesture and the form in music. While a single stream
in each of the sources may not be sufficient, so far the model incorporates the
macrotemporal superchunk, the metrical mesotemporal chunk and the microtemporal
expressive enhancements. The work presented here has introduced gestures in the
pitch contour, corresponding to the rising and falling slopes, and to the positive and
negative arches, which adds a perceptual stream to the more rigid meter stream.
The normal beat is given by different researchers as approximately 100 BPM,
and Fraisse (1982) furthermore shows the existence of two main note durations, one
above and one below 0.4 secs, with a ratio of two. Indications as to subjective time,
given by Zwicker and Fastl (1999), are yet to be investigated, but these may well
create uneven temporal intervals in conflict with the pulse.
The inclusion of the pitch gesture model certainly, in the author’s opinion, renders
the music more enjoyable, but more work remains before the generative model is
ready for general-purpose use.
References
1. Fraisse, P.: Rhythm and Tempo. In: Deutsch, D. (ed.) The Psychology of Music, 1st edn.,
pp. 149–180. Academic Press, New York (1982)
2. Friberg, A.: Performance Rules for Computer-Controlled Contemporary Keyboard Music.
Computer Music Journal 15(2), 49–55 (1991)
3. Gordon, J.W.: The perceptual attack time of musical tones. Journal of the Acoustical
Society of America, 88–105 (1987)
4. Handel, S.: Listening. MIT Press, Cambridge (1989)
5. Huron, D.: The Melodic Arch in Western Folk songs. Computing in Musicology 10, 3–23
(1996)
6. Jensen, K.: Timbre Models of Musical Sounds, PhD Dissertation, DIKU Report 99/7
(1999)
7. Jensen, K.: Investigation on Meter in Generative Modeling of Music. In: Proceedings of
the CMMR, Malaga, June 21-24 (2010)
8. Jensen, K., Kühl, O.: Towards a model of musical chunks. In: Ystad, S., Kronland-
Martinet, R., Jensen, K. (eds.) CMMR 2008. LNCS, vol. 5493, pp. 81–92. Springer,
Heidelberg (2009)
9. Kühl, O., Jensen, K.: Retrieving and recreating musical form. In: Kronland-Martinet, R.,
Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 270–282. Springer,
Heidelberg (2008)
10. Kühl, O.: Musical Semantics. Peter Lang, Bern (2007)
11. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. The MIT Press,
Cambridge (1983)
12. Malbrán, S.: Phases in Children’s Rhythmic Development. In: Zatorre, R., Peretz, I. (eds.)
The Biological Foundations of Music. Annals of the New York Academy of Sciences
(2000)
13. Papadopoulos, G., Wiggins, G.: AI methods for algorithmic composition: a survey, a
critical view and future prospects. In: AISB Symposium on Musical Creativity, pp.
110–117 (1999)
14. Patel, A., Peretz, I.: Is music autonomous from language? A neuropsychological appraisal.
In: Deliège, I., Sloboda, J. (eds.) Perception and cognition of music, pp. 191–215.
Psychology Press, Hove (1997)
15. Samson, S., Ehrlé, N., Baulac, M.: Cerebral Substrates for Musical Temporal Processes.
In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New
York Academy of Sciences (2000)
16. Snyder, B.: Music and Memory. An Introduction. The MIT Press, Cambridge (2000)
17. The Digital Tradition (2010), http://www.mudcat.org/AboutDigiTrad.cfm
(visited December 1, 2010)
18. Thomassen, J.M.: Melodic accent: Experiments and a tentative model. J. Acoust. Soc.
Am. 71(6), 1596–1605 (1982)
19. Vos, P.G., Troost, J.M.: Ascending and Descending Melodic Intervals: Statistical Findings
and Their Perceptual Relevance. Music Perception 6(4), 383–396 (1989)
20. Widmer, G.: Machine discoveries: A few simple, robust local expression principles.
Journal of New Music Research 31, 37–50 (2002)
21. Zwicker, E., Fastl, H.: Psychoacoustics: facts and models, 2nd edn. Springer series in
information sciences. Springer, Berlin (1999)
An Entropy Based Method for Local
Time-Adaptation of the Spectrogram
1 Introduction
Far from being restricted to entertainment, sound processing techniques are re-
quired in many different domains: they find applications in medical sciences,
security instruments, and communications, among others. The most challenging class
of signals to consider is indeed music: the completely new perspective opened
by contemporary music, assigning a fundamental role to concepts such as noise and
timbre, gives musical potential to every sound.
The standard techniques of digital analysis are based on the decomposition
of the signal into a system of elementary functions, and the choice of a specific
system necessarily has an influence on the result. Traditional methods based on
single sets of atomic functions have important limits: a Gabor frame imposes a
fixed resolution over the whole time-frequency plane, while a wavelet frame gives a
strictly determined variation of the resolution; moreover, the user is frequently
This work is supported by grants from Region Ile-de-France.
asked to define the analysis window features himself, which in general is not a
simple task even for experienced users. This motivates the search for adaptive
methods of sound analysis and synthesis, and for algorithms whose parameters
are designed to change according to the analyzed signal features. Our research
is focused on the development of mathematical models and tools based on the
local automatic adaptation of the system of functions used for the decomposition
of the signal: we are interested in a complete framework for analysis, spectral
transformation and re-synthesis; thus we need to define an efficient strategy to
reconstruct the signal through the adapted decomposition, which must give a
perfect recovery of the input if no transformation is applied.
Here we propose a method for local automatic time-adaptation of the Short
Time Fourier Transform window function, through a minimization of the Rényi
entropy [22] of the spectrogram; we then define a re-synthesis technique with
an extension of the method proposed in [11]. Our approach can be presented
schematically in three parts:
A‖f‖² ≤ Σ_{γ∈Γ} |⟨f, φ_γ⟩|² ≤ B‖f‖².    (1)
For any frame {φ_k}_{k∈Z} there exist dual frames {φ̃_k}_{k∈Z} such that for all
f ∈ L²(R)

f = Σ_{k∈Z} ⟨f, φ_k⟩ φ̃_k = Σ_{k∈Z} ⟨f, φ̃_k⟩ φ_k,    (3)
so that given a frame it is always possible to perfectly reconstruct a signal f
using the coefficients of its decomposition through the frame. The inverse of the
frame operator allows the calculation of the canonical dual frame

φ̃_k = U⁻¹ φ_k    (4)

which guarantees minimal-norm coefficients in the expansion.
A Gabor frame is obtained by time-shifting and frequency-transposing a win-
dow function g according to a regular grid. Such frames are particularly interesting in
applications, as the analysis coefficients are simply given by sampling the
STFT of f with window g at the nodes of a specified lattice. Given a
time step a and a frequency step b, we write {u_n}_{n∈Z} = {an} and {ξ_k}_{k∈Z} = {bk};
these two sequences generate the nodes of the time-frequency lattice Λ for the
frame {g_{n,k}}_{(n,k)∈Z²} defined as

g_{n,k}(t) = g(t − u_n) e^{2πi ξ_k t};    (5)
the nodes are the centers of the Heisenberg boxes associated to the windows in
the frame. The lattice has to satisfy certain conditions for {gn,k } to be a frame
[7], which impose limits on the choice of the time and frequency steps: for certain
choices [6] which are often adopted in standard applications, the frame operator
takes the form of a multiplication,
Uf (t) = b−1 |g(t − un )|2 f (t) , (6)
n∈Z
Thus we see that the frame bounds also provide information on the redundancy
of the decomposition of the signal within the frame.
Here, if N(s) = Σ_l (1/b_l) Σ_n |g_{n(l)}(s)|² is bounded above and below by positive
constants, then U is invertible and the set (9) is a frame, whose dual frame is given by

g̃_{n(l),k}(t) = (1/N(t)) g_{n(l)}(t) e^{2πi b_l k t}.    (11)
Nonstationary Gabor frames belong to the recently introduced class of quilted
frames [9]: in this kind of decomposing system the choice of the analysis window
depends on both the time and the frequency location, causing more difficulties
H_α^G[PS_f] = (1 / (1 − α)) · log₂ Σ_{[n,k]∈G} ( PS_f[n, k] / Σ_{[n′,k′]∈G} PS_f[n′, k′] )^α + log₂(ab).    (15)
As we are working with finite discrete densities, we can also consider the case
α = 0, which is simply the logarithm of the number of non-zero elements in P; as a
consequence, H_0(P) ≥ H_α(P) for every admissible order α.
A third basic fact is that for every order α the Rényi entropy H_α is maximum
when P is uniformly distributed, while it is minimum and equal to zero when P
has a single non-zero value.
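As an illustration, a minimal sketch of the measure defined in Eq. (15), together with the two extreme cases just mentioned:

    import numpy as np

    def renyi_entropy(PS, a, b, alpha):
        """Renyi entropy of Eq. (15): PS are the non-negative spectrogram values
        over the region G; a, b are the time and frequency steps; alpha != 1."""
        p = PS / PS.sum()                              # normalize to a discrete density
        return np.log2(np.sum(p ** alpha)) / (1.0 - alpha) + np.log2(a * b)

    # A single non-zero value gives the minimum, a flat density the maximum:
    flat = np.ones((8, 8)); peaky = np.zeros((8, 8)); peaky[0, 0] = 1.0
    assert renyi_entropy(peaky, 1, 1, 0.7) < renyi_entropy(flat, 1, 1, 0.7)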
All of these results give useful information on the values of different measures
on a single density P as in (15), while the relations between the entropies of two
different densities P and Q are in general hard to determine analytically; in our
problem, P and Q are two spectrograms of a signal in the same time-frequency
area, based on two window functions with different scaling as in (8). In some
basic cases such a relation is achievable, as shown in the following example.
The sparsity measure we are using chooses as best window the one which mini-
mizes the entropy measure: we deduce from (17) that it is the one obtained with
the largest scaling factor available, therefore with the largest time-support. This
is coherent with our expectation as stationary signals, such as sinusoids, are best
analyzed with a high frequency resolution, because time-independency allows a
small time resolution. Moreover, this is true for any order α used for the entropy
calculation. Symmetric considerations apply whenever the spectrogram of a signal
does not depend on frequency, as for impulses.
and then normalize to obtain a unitary sum. We then apply Rényi entropy
measures with α varying between 0 and 30: as we see from figure 1, there is a
relation between M and the slope of the entropy curves for the different values
of α. For α = 0, H0 [DM ] is the logarithm of the number of non-zero coefficients
and it is therefore constant; when α increases, we see that densities with a small
amount of large coefficients gradually decrease their entropy, faster than the
almost flat vectors corresponding to larger values of M . This means that by
increasing α we emphasize the difference between the entropy values of a peaky
distribution and that of a nearly flat one. The sparsity measure we consider
selects as best analysis the one with minimal entropy, so reducing α raises the
probability of less peaky distributions being chosen as sparsest: in principle, this
is desirable, as weaker components of the signal, such as partials, have to be
taken into account in the sparsity evaluation. However, this principle should
be applied with care, as a small coefficient in a spectrogram could be produced
by a partial as well as by a noise component; with an extremely small α,
the best window chosen could vary, depending on the noise level within the sound,
without a reliable relation to the spectral concentration.
Fig. 1. Rényi entropy evaluations of the DM vectors with varying α; the distribution
becomes flatter as M increases
with χ the indicator function of the specified interval, but it is obviously possible
to generalize the results thus obtained to the entire class of compactly supported
window functions. In both versions of our algorithm we create a multiple
Gabor frame as in (5), using as mother functions scaled versions of h,
obtained as in (8) with a finite set of positive real scaling factors L.
We consider consecutive segments of the signal, and for each segment we
calculate |L| spectrograms with the |L| scaled windows: the length of the anal-
ysis segment and the overlap between two consecutive segments are given as
parameters.
In the first version of the algorithm the different frames composing the multi-
frame have the same time step a and frequency step b: this guarantees that for
each signal segment the different frames have Heisenberg boxes whose centers
lie on the same lattice in the time-frequency plane, as illustrated in figure 2. To
Fig. 2. An analysis segment: time locations of the Heisenberg boxes associated to the
multi-frame used in the first version of our algorithm
Fig. 3. Spectrograms of the same sound obtained with a 512-sample Hanning window (top) and a 4096-sample Hanning window (bottom).
guarantee that all the |L| scaled windows constitute a frame when translated
and modulated according to this global lattice, the time step a must be set to
the hop size assigned to the smallest window frame. On the other hand, as
the FFT of a discrete signal has the same number of points as the signal itself,
the frequency step b has to be the FFT size of the largest window analysis: for
the smaller windows, zero-padding is performed.
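A compact sketch of this selection step, using scipy's STFT as a stand-in for the multi-frame analysis: each scaled Hanning window is evaluated on the common grid (hop of the smallest window, FFT size of the largest) and the one with minimal Rényi entropy is kept. Since all windows share the same lattice here, the log₂(ab) term of Eq. (15) is a common offset and is omitted.

    import numpy as np
    from scipy.signal import stft

    def best_window(segment, fs, sizes=(512, 1024, 2048, 4096), alpha=0.7):
        """Return the window length (in samples) whose spectrogram of `segment`
        has minimal Renyi entropy; `segment` must hold at least sizes[-1] samples."""
        hop = sizes[0] // 2                     # time step: hop of the smallest window
        nfft = sizes[-1]                        # frequency step: FFT size of the largest
        entropies = {}
        for w in sizes:
            _, _, S = stft(segment, fs, window='hann', nperseg=w,
                           noverlap=w - hop, nfft=nfft, boundary=None)
            p = np.abs(S) ** 2                  # spectrogram over the segment
            p /= p.sum()
            entropies[w] = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
        return min(entropies, key=entropies.get)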
Fig. 4. Example of an adaptive analysis performed by the first version of our algorithm
with four Hanning windows of different sizes (512, 1024, 2048 and 4096 samples) on a
B4 note played by a marimba: on top, the best window chosen as a function of time; at
the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis
segment contains twenty-four analyses frames with a sixteen-frames overlap between
consecutive segments.
Fig. 5. An analysis segment: time locations of the Heisenberg boxes associated to the
multi-frame used in the second version of our algorithm
An adapted analysis of this sound should combine the advantage of a
good time resolution at the strike with that of a good frequency resolution on
the harmonic resonance. This is fully provided by the algorithm, as shown in
the adaptive spectrogram at the bottom of figure 4. Moreover, we see that
the pre-echo of the analysis at the bottom of figure 3 is completely removed in
the adapted spectrogram.
The main difference in the second version of our algorithm concerns the indi-
vidual frames composing the multi-frame, which have the same frequency step
b but different time steps {a_l : l ∈ L}: the smallest and largest window sizes are
given as parameters together with |L|, the number of different windows com-
posing the multi-frame, and the global overlap needed for the analyses. The
algorithm fixes the intermediate sizes so that, for each signal segment, the dif-
ferent frames have the same overlap between consecutive windows, and so the
same redundancy.
This choice highly reduces the computational cost by avoiding unnecessarily
small hop sizes for the larger windows, and as we have observed in the previous
section it does not affect the entropy evaluation. Such a structure generates an
irregular time disposition of the multi-frame elements in each signal segment,
as illustrated in figure 5; in this way we also avoid the problem of unshared
parts of signal between the systems, but we still have a different influence of the
boundary parts depending on the analysis frame: the beginning and the end of
the signal segment have a higher energy when windowed in the smaller frames.
This is avoided with a preliminary weighting: the beginning and the end of each
signal segment are windowed respectively with the first and second half of the
largest analysis window.
As for the first implementation, the weighting does not concern the decomposition used for re-synthesis purposes, but only the analyses used for the entropy evaluation.
Fig. 6. Example of an adaptive analysis performed by the second version of our algo-
rithm with eight Hanning windows of different sizes from 512 to 4096 samples, on a
B4 note played by a marimba sampled at 44.1 kHz: on top, the best window chosen
as a function of time; at the bottom, the adaptive spectrogram. The entropy order is
α = 0.7 and each analysis segment contains four frames of the largest window analysis
with a two-frames overlap between consecutive segments.
After the pre-weighting, the algorithm follows the same steps described above:
calculation of the |L| local spectrograms, evaluation of their entropy, selection of
the window providing minimum entropy, computation of the adapted spectro-
gram with the best window at each time point, thus creating an analysis with
time-varying resolution and hop size.
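To make the selection step concrete, the following is a minimal NumPy/SciPy sketch of the per-segment procedure (spectrograms at several window lengths, Rényi entropy of order α, choice of the minimizing window). The function names, the 75% overlap and the zero-padding to a common FFT size are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import stft
from scipy.signal.windows import hann

def renyi_entropy(S, alpha=0.7):
    """Rényi entropy of a spectrogram, normalized to a distribution."""
    P = S / np.sum(S)
    return np.log2(np.sum(P ** alpha)) / (1.0 - alpha)

def best_window(segment, sr, win_lengths=(512, 1024, 2048, 4096), alpha=0.7):
    """Return the window length whose spectrogram of `segment` has minimum
    Rényi entropy, i.e. the most concentrated time-frequency picture."""
    entropies = []
    for L in win_lengths:
        hop = L // 4                                  # assumption: 75% overlap
        _, _, Z = stft(segment, fs=sr, window=hann(L), nperseg=L,
                       noverlap=L - hop, nfft=max(win_lengths))  # zero-pad to a common FFT size
        entropies.append(renyi_entropy(np.abs(Z) ** 2, alpha))
    return win_lengths[int(np.argmin(entropies))]
```

Running `best_window` on consecutive segments yields the sequence of best windows from which the adaptive spectrogram is then assembled.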
In figure 6 we give a first example of an adaptive analysis performed by the
second version of our algorithm with eight Hanning windows of different sizes:
the sound is still the B4 note of a marimba, and we can see that the two versions
give very similar results. Thus, if the considered application does not specifically require a fixed hop size for the overall analysis, the second version is preferable, as it greatly reduces the computational cost without affecting the best window choice.
In figure 8 we give a second example with a synthetic sound, a sinusoid with si-
nusoidal frequency modulation: as figure 7 shows, a small window is best adapted
where the frequency variation is fast compared to the window length; on the
other hand, the largest window is better where the signal is almost stationary.
In our case, after the automatic selection step we have a temporal sequence of best windows, one for each time position; in the first version the hop is fixed for all the windows, while in the second one every window has its own time step. In both cases we have thus reduced the initial multi-frame to a nonstationary Gabor frame: we extend the technique of (21), using a variable window h and time step a according to the composition of the reduced multi-frame, and obtain perfect reconstruction as well. The interest of (21) is that the given distribution does not need to be the STFT of a signal: for example, a transformation S*[n, k] of the STFT of a signal could be considered. In this case, (21) gives the signal whose STFT has minimal least-squares error with S*[n, k].
As seen from equations (9) and (11), the theoretical existence and the mathematical definition of the canonical dual frame for a nonstationary Gabor frame like the one we employ have been provided [14]: it is thus possible to define the whole analysis and re-synthesis framework within Gabor theory. We are at
present working on the interesting analogies between the two approaches, to
establish a unified interpretation and develop further extensions.
Fig. 8. Example of an adaptive analysis performed by the second version of our al-
gorithm with eight Hanning windows of different sizes from 512 to 4096 samples, on
a sinusoid with sinusoidal frequency modulation synthesized at 44.1 kHz: on top, the
best window chosen as a function of time; at the bottom, the adaptive spectrogram.
The entropy order is α = 0.7 and each analysis segment contains four frames of the
largest window analysis with a three-frames overlap between consecutive segments.
5 Conclusions
We have presented an algorithm for time-adaptation of the spectrogram resolution, which can easily be integrated into existing frameworks for analysis, transformation and re-synthesis of audio signals: the adaptation is obtained locally through an entropy minimization within a finite set of resolutions, which can be defined by the user or left at default values. The user can also specify the time duration and overlap of the analysis segments where the entropy minimization is performed, to favor more or less discontinuous adapted analyses.
Future improvements of this method will concern the spectrogram adaptation
in both time and frequency dimensions: this will provide a decomposition of the
signal in several layers of analysis frames, thus requiring an extension of the
proposed technique for re-synthesis.
References
1. Baraniuk, R.G., Flandrin, P., Janssen, A.J.E.M., Michel, O.J.J.: Measuring Time-
Frequency Information Content Using the Rényi Entropies. IEEE Trans. Info. The-
ory 47(4) (2001)
2. Borichev, A., Gröchenig, K., Lyubarskii, Y.: Frame Constants of Gabor Frames Near the Critical Density. J. Math. Pures Appl. 94(2) (2010)
Sequential MCMC for Musical Transcription
Abstract. In this paper models and algorithms are presented for tran-
scription of pitch and timings in polyphonic music extracts. The data
are decomposed framewise into the frequency domain, where a Poisson
point process model is used to write a polyphonic pitch likelihood func-
tion. From here Bayesian priors are incorporated both over time (to link
successive frames) and also within frames (to model the number of notes
present, their pitches, the number of harmonics for each note, and inhar-
monicity parameters for each note). Inference in the model is carried out
via Bayesian filtering using a powerful Sequential Markov chain Monte
Carlo (MCMC) algorithm that is an MCMC extension of particle fil-
tering. Initial results with guitar music, both laboratory test data and
commercial extracts, show promising levels of performance.
1 Introduction
The audio signal generated by a musical instrument as it plays a note is com-
plex, containing multiple frequencies, each with a time-varying amplitude and
phase. However, the human brain perceives such a signal as a single note, with
associated “high-level” properties such as timbre (the musical “texture”) and
expression (loud, soft, etc.). A musician playing a piece of music takes as input
a score, which describes the music in terms of these high-level properties, and
produces a corresponding audio signal. An accomplished musician is also able to
reverse the process, listening to a musical audio signal and transcribing a score.
A desirable goal is to automate this transcription process. Further developments
in computer “understanding” of audio signals of this type can be of assistance
to musicologists; they can also play an important part in source separation sys-
tems, as well as in automated mark-up systems for content-based annotation of
music databases.
Perhaps the most important property to extract in the task of musical tran-
scription is the note or notes playing at each instant. This will be the primary
Fig. 1. An example of a single note spectrum, with the associated median threshold
(using a window of ±4 frequency bins) and peaks identified by the peak detection
algorithm (circles)
where Y = {y1, y2, . . . , yK} are the observed peak data in the K frequency bins, such that yk = 1 if a peak is observed in the kth bin, and yk = 0 otherwise.
It only remains to formulate the intensity function μ(f), and hence μ_k = ∫_{f ∈ kth bin} μ(f) df. For this purpose, the Gaussian mixture model of Peel-
ing et al.[8] is used. Note that in this formulation we can regard each harmonic
of each note to be an independent Poisson process itself, and hence by the union
property of Poisson processes, all of the individual Poisson intensities add to
give a single overall intensity μ, as follows:
μ(f) = Σ_{j=1}^{N} μ_j(f) + μ_c    (4)

μ_j(f) = Σ_{h=1}^{H_j} A/√(2π σ_{j,h}²) · exp(−(f − f_{j,h})² / (2σ_{j,h}²))    (5)
where j indicates the note number, h indicates the partial number, and N and H_j are the numbers of notes and harmonics in each note, respectively. μ_c is a constant that accounts for detected “clutter” peaks due to noise and non-musical sounds. σ_{j,h}² = κ²h² sets the variance of each Gaussian. A and κ are constant parameters, chosen so as to give good performance on a set of test pieces. f_{j,h} is the frequency of the hth partial of the jth note, given by the inharmonic model [4]:

f_{j,h} = f_{0,j} h √(1 + B_j h²)    (6)

f_{0,j} is the fundamental frequency of the jth note. B_j is the inharmonicity parameter for the note (of the order 10⁻⁴).
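As a concrete illustration, the intensity of Eqs. (4)-(6) can be evaluated as in the short Python sketch below; the constants A, κ and μ_c are placeholders rather than the values tuned by the authors, and the Gaussian normalization is written in its standard form.

```python
import numpy as np

def partial_freqs(f0, H, B):
    """Inharmonic partial frequencies f_{j,h} = f0 * h * sqrt(1 + B h^2), Eq. (6)."""
    h = np.arange(1, H + 1)
    return f0 * h * np.sqrt(1.0 + B * h ** 2)

def note_intensity(f, f0, H, B, A=1.0, kappa=1.0):
    """Per-note Poisson intensity mu_j(f) of Eq. (5): one Gaussian bump per partial."""
    h = np.arange(1, H + 1)
    fjh = partial_freqs(f0, H, B)
    var = (kappa * h) ** 2                       # sigma_{j,h}^2 = kappa^2 h^2
    bumps = A / np.sqrt(2 * np.pi * var) * np.exp(-(f[:, None] - fjh) ** 2 / (2 * var))
    return bumps.sum(axis=1)

def total_intensity(f, notes, mu_c=1e-3, **kw):
    """Overall intensity mu(f) of Eq. (4): sum of per-note intensities plus clutter.
    `f` is a 1-D array of frequencies (Hz); `notes` is a list of (f0, H, B) tuples."""
    mu = np.full_like(f, mu_c, dtype=float)
    for (f0, H, B) in notes:
        mu += note_intensity(f, f0, H, B, **kw)
    return mu
```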
Three parameters for each note are variable and to be determined by the
inference engine: the fundamental frequency, the number of partials, and the inharmonicity.
Moreover, the number of notes N is also treated as unknown in the fully Bayesian
framework.
P(θ) = P(N) × Π_{j=1}^{N} P(f_{0,j}) × P(H_j | f_{0,j}) × P(B_j | H_j, f_{0,j})    (7)
In fact, we have here assumed all priors to be uniform over their expected ranges,
except for f0,j and N , which are stochastically linked to their values in previous
frames. To consider this linkage explicitly, we now introduce a frame number
label t and the corresponding parameters for frame t as θ_t, with frame peak data Y_t. In order to carry out optimal sequential updating we require a transition density p(θ_t | θ_{t−1}), and assume that the {θ_t} process is Markovian. Then we can write the required sequential update as:

p(θ_{t−1:t} | Y_{1:t}) ∝ p(θ_{t−1} | Y_{1:t−1}) p(θ_t | θ_{t−1}) p(Y_t | θ_t)    (8)
To see how this can be implemented in a sequential MCMC framework, assume that at time t − 1 the inference problem is solved and a set of M ≫ 1 Monte Carlo (dependent) samples {θ_{t−1}^(i)} are available from the previous time's target distribution p(θ_{t−1} | Y_{1:t−1}). These samples are then formed into an empirical distribution p̂(θ_{t−1}) which is used as an approximation to p(θ_{t−1} | Y_{1:t−1}) in Eq. (8). This enables the (approximated) time-updated distribution p(θ_{t−1:t} | Y_{1:t}) to be evaluated pointwise, and hence a new MCMC chain can be run with Eq. (8) as its target distribution. The converged samples from this chain are then used to approximate the posterior distribution at time t, and the whole procedure repeats as time step t increases.
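Structurally, this filter amounts to the following schematic Python loop, where run_mcmc is a stub standing in for the MCMC kernel described in the next paragraph; it is an outline of the recursion only, not the authors' implementation.

```python
def sequential_mcmc(peak_frames, run_mcmc, n_samples=1000):
    """Schematic filter: at each frame t, run an MCMC chain whose target is
    p(theta_{t-1:t} | Y_{1:t}) ∝ p_hat(theta_{t-1}) p(theta_t | theta_{t-1}) p(Y_t | theta_t),
    where p_hat is the empirical approximation built from the previous samples (Eq. (8))."""
    prev_samples = None                  # samples approximating p(theta_{t-1} | Y_{1:t-1})
    posteriors = []
    for Y_t in peak_frames:
        # run_mcmc is assumed to accept the current peak data and the previous
        # (possibly collapsed) sample set, and to return converged samples of theta_t
        samples_t = run_mcmc(Y_t, prev_samples, n_samples=n_samples)
        posteriors.append(samples_t)
        prev_samples = samples_t         # becomes p_hat(theta_t) for the next frame
    return posteriors
```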
The implementation of the MCMC at each time step is quite complex, since
it will involve updating all elements of the parameter vector θ t , including the
number of notes, the fundamental frequencies, the number of harmonics in each
note and the inharmonicity parameter for each note. This is carried out via a
combination of Gibbs sampling and Metropolis-within-Gibbs sampling, using a
Reversible Jump formulation wherever the parameter dimension (i.e. the number
of notes in the frame) needs to change, see [7] for further details of how such
schemes can be implemented in tracking and finance applications and [3] for
general information about MCMC. In order to enhance the practical performance
we modified the approximating density at t − 1, p̂(θ t−1 ) to be a univariate
density over one single fundamental frequency, which can be thought of as the
posterior distribution of fundamental frequency at time t − 1 with all the other
parameters marginalised, including the number of notes, and a univariate density
over the number of notes. This collapsing of the posterior distribution onto a
univariate marginal, although introducing an additional approximation into the
updating formula, was found to enhance the MCMC exploration at the next
time step significantly, since it avoids combinatorial updating issues that increase
dramatically with the dimension of the full parameter vector θt .
Having carried out the MCMC sampling at each time step, the fundamental
frequencies and their associated parameters (inharmonicity and number of har-
monics, if required) may be estimated. This estimation is based on extracting
maxima from the collapsed univariate distribution over fundamental frequency,
as described in the previous paragraph.
Fig. 2. Reversible Jump MCMC Results: Dots indicate note estimates. Line below in-
dicates estimate of the number of notes. Crosses in panels (a) and (b) indicate notes
estimated by the MCMC algorithm but removed by post-processing. A manually ob-
tained ground-truth is shown overlayed in panel (c).
3 Results
The methods have been evaluated on a selection of guitar music extracts, recorded
both in the laboratory and taken from commercial recordings. See Fig. 2 in which
three guitar extracts, two lab-generated (a) and (b) and one from a commercial
recording (c) are processed. Note that a few spurious note estimates arise, par-
ticularly around instants of note change, and many of these have been removed
by a post-processing stage which simply eliminates note estimates which last
for a single frame. The results are quite accurate, agreeing well with manually
obtained transcriptions.
When two notes an octave apart are played together, the upper note is not
found. See final chord of panel (a) in Figure 2. This is attributable to the two
notes sharing many of the same partials, making discrimination difficult based
on peak frequencies alone.
In the case of strong notes, the algorithm often correctly identifies up to 35
partial frequencies. In this regard, the use of inharmonicity modelling has proved successful: without this feature, the estimate of the number of harmonics is often
lower, due to the inaccurate partial frequencies predicted by the linear model.
The effect of the sequential formulation is to provide a degree of smoothing
when compared to the frame-wise algorithm. Fewer single-frame spurious notes
appear, although these are not entirely removed, as shown in Figure 2. Octave
errors towards the end of notes are also reduced.
The new algorithms have shown significant promise, especially given that the
likelihood function takes account only of peak frequencies and not amplitudes
or other information that may be useful for a transcription system. The good
performance so far obtained is a result of several novel modelling and algorith-
mic features, notably the formulation of a flexible frame-based model that can
account robustly for inharmonicities, unknown numbers of notes and unknown
numbers of harmonics in each note. A further key feature is the ability to link
frames together via a probabilistic model; this makes the algorithm more robust
in estimation of continuous fundamental frequency tracks from the data. A final
important component is the implementation through sequential MCMC, which
allows us to obtain reasonably accurate inferences from the models as posed.
The models may be improved in several ways, and work is underway to address
these issues. A major point is that the current Poisson model accounts only
for the frequencies of the peaks present. It is likely that performance may be
improved by including the peak amplitudes in the model. For example, this
might make it possible to distinguish more robustly when two notes an octave
apart are being played. Improvements are also envisaged in the dynamical prior
linking one frame to the next, which is currently quite crudely formulated. Thus,
further improvements will be possible if the dependency between frames is more
carefully considered, incorporating melodic and harmonic principles to generate
likely note and chord transitions over time. Ideally also, the algorithm should be
able to run in real time, processing a piece of music as it is played. Currently, however, the Matlab-based processing runs at many times real time, and we will
study the parallel processing possibilities (as a simple starting point, the MCMC
runs can be split into several shorter parallel chains at each time frame within
a parallel architecture).
References
1. Cemgil, A., Godsill, S.J., Peeling, P., Whiteley, N.: Bayesian statistical methods for
audio and music processing. In: O’Hagan, A., West, M. (eds.) Handbook of Applied
Bayesian Analysis, OUP (2010)
2. Davy, M., Godsill, S., Idier, J.: Bayesian analysis of polyphonic western tonal music.
Journal of the Acoustical Society of America 119(4) (April 2006)
3. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.): Markov Chain Monte Carlo
in Practice. Chapman and Hall, Boca Raton (1996)
4. Godsill, S.J., Davy, M.: Bayesian computational models for inharmonicity in musical
instruments. In: Proc. of IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, New Paltz, NY (October 2005)
5. Kashino, K., Nakadai, K., Kinoshita, T., Tanaka, H.: Application of the Bayesian
probability network to music scene analysis. In: Rosenthal, D.F., Okuno, H. (eds.)
Computational Audio Scene Analysis, pp. 115–137. Lawrence Erlbaum Associates,
Mahwah (1998)
6. Klapuri, A., Davy, M.: Signal processing methods for music transcription. Springer,
Heidelberg (2006)
7. Pang, S.K., Godsill, S.J., Li, J., Septier, F.: Sequential inference for dynamically
evolving groups of objects. To appear: Barber, Cemgil, Chiappa (eds.) Inference and
Learning in Dynamic Models, CUP (2009)
8. Peeling, P.H., Li, C., Godsill, S.J.: Poisson point process modeling for poly-
phonic music transcription. Journal of the Acoustical Society of America Express
Letters 121(4), EL168–EL175 (2007)
Single Channel Music Sound Separation Based
on Spectrogram Decomposition and Note
Classification
1 Introduction
X = WH    (1)

X = Σ_{r=1}^{R} w_r h_r    (2)
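Equations (1) and (2) express the nonnegative spectrogram X as a product WH, i.e. as a sum of R rank-one terms w_r h_r. As a hedged illustration only (the specific NMF variant and its settings are not restated in this excerpt), such a decomposition can be prototyped with scikit-learn's NMF:

```python
import numpy as np
from sklearn.decomposition import NMF

def decompose_spectrogram(X, R):
    """Factorize a nonnegative spectrogram X (freq x time) as X ≈ WH, Eq. (1),
    and return the R rank-one components w_r h_r of Eq. (2)."""
    model = NMF(n_components=R, beta_loss='kullback-leibler',
                solver='mu', max_iter=500, init='nndsvda')
    W = model.fit_transform(X)          # F x R basis spectra
    H = model.components_               # R x N time activations
    components = [np.outer(W[:, r], H[r, :]) for r in range(R)]
    return W, H, components
```

Each returned component is a candidate note spectrogram, such as the two shown in Fig. 3 below.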
Fig. 2. The contour plot of a sound mixture (i.e. the matrix X) containing two different
musical notes G4 and A3
Fig. 3. The contour plots of the individual musical notes which were obtained by
applying an NMF algorithm to the sound mixture X. The separated notes G4 and A3
are shown in the left and right plots, respectively.
Fig. 4. The 13-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Vio-
lin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
Fig. 5. The 20-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Vio-
lin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
Fig. 6. The 7-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) “Piano.ff.A0.wav”, (b) “Piano.ff.B0.wav”, (c) “Vio-
lin.pizz.mf.sulG.C4B4.wav”, and (d) “Violin.pizz.pp.sulG.C4B4.wav”. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
Table 1. The main steps used in the classification system.
1) Calculate the 13-D MFCC feature vectors of all the musical examples in the training database with class labels. This creates a feature space for the training data.
2) Similarly, extract the MFCC feature vectors of all separated components whose class labels need to be determined.
3) Assign the feature vectors of the separated components to the appropriate classes via the K-NN algorithm.
4) The majority vote of feature vectors determines the class label of the
separated components.
5) Optimize the classification results by different choices of K.
of piano and violin notes. The basic steps in music note classification include
preprocessing, feature extraction or selection, classifier design and optimization.
The main steps used in our system are detailed in Table 1.
The main disadvantage of the classification technique based on simple “ma-
jority voting” is that the classes with more frequent examples tend to come up in
the K-nearest neighbors when the neighbors are computed from a large number
of training examples [5]. Therefore, the class with more frequent training exam-
ples tends to dominate the prediction of the new vector. One possible technique
to solve this problem is to weight the classification based on the distance from
the test pattern to all of its K nearest neighbors.
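A compact sketch of such a distance-weighted vote over MFCC frames could look as follows; the helper name and the inverse-distance weights are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def knn_label(test_frames, train_frames, train_labels, K=30):
    """Classify a separated component: each MFCC frame votes for the classes of its
    K nearest training frames, with votes weighted by inverse distance; the
    component receives the label with the largest accumulated weight."""
    votes = {}
    for x in test_frames:
        d = np.linalg.norm(train_frames - x, axis=1)   # distance to every training frame
        for i in np.argsort(d)[:K]:
            w = 1.0 / (d[i] + 1e-12)                   # weighting to counter class imbalance
            votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + w
    return max(votes, key=votes.get)
```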
where N is the total number of examples in the dataset, V is the volume surrounding the unknown pattern x and K is the number of examples within V. The class prior probability depends on the number of examples in the dataset,

P(ω_i) = N_i / N    (4)

and the measurement distribution of patterns in class ω_i is defined as

P(x | ω_i) = K_i / (N_i V)    (5)

According to Bayes' theorem, the posterior probability becomes

P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x)    (6)

P(ω_i | x) = K_i / K    (7)

The discriminant function g_i(x) = K_i / K assigns the class label to an unknown
the frames is used to avoid discontinuities between the neighboring frames. The
similarity measure of the feature vectors of the separated components to the
feature vectors obtained from the training process determines which class the
separated notes belong to. This is achieved by the K-NN classifier. If majority
vote goes to the piano, then a piano label is assigned to the separated component
and vice-versa.
3 Evaluations
Two music sources (played by two different instruments, i.e. piano and violin) with different numbers of notes overlapping each other in the time domain were used to artificially generate an instantaneous mixture signal. The lengths of the piano and violin source signals are both 20 seconds, containing 6 and 5 notes respectively. The K-NN classifier constant K was selected as K = 30. The signal-to-noise ratio (SNR), defined as follows, was used to measure the quality of both the separated notes and the whole source signal,
SNR(m, j) = Σ_{s,t} [X_m]²_{s,t} / Σ_{s,t} ([X_m]_{s,t} − [X_j]_{s,t})²    (8)
where s and t are the row and column indices of the matrix respectively. The
SNR was computed based on the magnitude spectrograms Xm and Xj of the
mth reference and the jth separated component to prevent the reconstruction
Fig. 7. The collection of the audio features from a typical piano signal (i.e. “Pi-
ano.ff.A0.wav”) in the training process. In total, 999 frames of features were computed.
Fig. 8. The collection of the audio features from a typical violin signal (i.e. “Vio-
lin.pizz.pp.sulG.C4B4.wav”) in the training process. In total, 999 frames of features
were computed.
process from affecting the quality [22]. For the same note, j = m. In general, higher SNR values represent better separation quality of the separated notes and source signals, and vice versa. The training database used in the classification process was provided by the McGill University Master Samples Collection [16] and the University of Iowa website [21]. It contains 53 music signals, 29 of which are piano signals and the rest violin signals. All the signals were sampled at 44100 Hz. The reference source signals were stored for the measurement of separation quality.
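For reference, the SNR of Eq. (8) can be computed directly from the two magnitude spectrograms; the conversion to decibels below is our assumption, made because the results later in the text are quoted in dB.

```python
import numpy as np

def snr_db(X_ref, X_est):
    """SNR of Eq. (8), computed on magnitude spectrograms and expressed in dB."""
    num = np.sum(X_ref ** 2)
    den = np.sum((X_ref - X_est) ** 2)
    return 10.0 * np.log10(num / den)
```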
For the purpose of training, the signals were firstly segmented into frames,
and then the MFCC feature vectors were computed from these frames. In total,
Fig. 9. The collection of the audio features from a separated speech component in the
testing process. Similar to the training process, 999 frames of features were computed.
Fig. 10. The MSEs between the feature vector of a frame of the music component to
be classified and those from the training data. The frame indices in the horizontal axis
are ranked from the lower to the higher. The frame index 28971 is the highest frame
number of the piano signals. Therefore, on this plot, to the left of this frame are those
from piano signals, and to the right are those from the violin signals.
999 frames were computed for each signal. Figures 7 and 8 show the collection of
the features from the typical piano and violin signals (i.e. “Piano.ff.A0.wav” and
“Violin.pizz.pp.sulG.C4B4.wav”) respectively. In both figures, it can be seen that
there exist features whose coefficients are all zeros due to the silence part of the
signals. Before running the training algorithm, we performed feature selection by
removing such frames of features. In the testing stage, the MFCC feature vectors
of the individual music components that were separated by the NMF algorithm
were calculated. Figure 9 shows the feature space of the 15th separated component
Fig. 11. The MSE values obtained in Figure 10 were sorted from the lower to the
higher. The frame indices in the horizontal axis, associated with the MSEs, are shuffled
accordingly.
Fig. 12. The MSE values of the K nearest neighbors (i.e. the frames with the K minimal
MSEs) are selected based on the K-NN clustering. In this experiment, K was set to 30.
Fig. 13. The frame indices of the 30 nearest neighbors to the frame of the decomposed
music note obtained in Figure 12. In our experiment, the maximum frame index for
the piano signals is 28971, shown by the dashed line, while the frame indices of violin
signals are all greater than 28971. Therefore, this typical audio frame under testing
can be classified as a violin signal.
Fig. 14. A separation example of the proposed system. (a) and (b) are the piano and
violin sources respectively, (c) is the single channel mixture of these two sources, and
(d) and (e) are the separated sources respectively. The vertical axes are the amplitude
of the signals.
than 28971, which was the highest index number of the piano signals in the
training data. As a result, this component was classified as a violin signal.
Figure 14 shows a separation example of the proposed system, where (a) and
(b) are the piano and violin sources respectively, (c) is the single channel mixture
of these two sources, and (d) and (e) are the separated sources respectively. From this figure, we can observe that, although most notes are correctly separated and classified into the corresponding sources, some notes were wrongly classified. The separated note with the highest SNR is the first note of the violin signal, for which the SNR equals 9.7 dB, while the highest SNR of a note within the piano signal is 6.4 dB. The average SNRs for piano and violin are 3.7 dB and 1.3 dB, respectively. According to our observation, the separation quality varies from note to note. On average, the separation quality of the piano signal is better than that of the violin signal.
4 Discussions
At present, for the components separated by the NMF algorithm, we calculate their MFCC features in the same way as for the signals in the training data. As a result, the evaluation of the MSEs becomes straightforward, which consequently facilitates the K-NN classification. It is, however, possible to use the dictionary returned by the NMF algorithm (and possibly the activation coefficients as well) as a set of features. In such a case, the NMF algorithm needs to be applied to the training data in the same way as to the separated components obtained in the testing and classification process. Similar to principal component analysis (PCA), which has been widely used to generate features in many classification systems, using NMF components directly as features has great potential. Compared to using the MFCC features, the computational cost associated with the NMF features could be higher due to the iterations required for the NMF algorithms to converge. However, its applicability as a feature for classification deserves further investigation in the future.
Another important issue in applying NMF algorithms is the selection of the order of the NMF model (i.e. the rank R). In our study, this determines the number of components that will be learned from the signal. In general, for a higher rank R, the NMF algorithm learns components that are more likely to correspond to individual notes. However, there is a trade-off between the decomposition rank and the computational load, as a larger R incurs a higher computational cost. Also, it is known that NMF produces not only harmonic dictionary components but also, sometimes, ad-hoc spectral shapes corresponding to drums, transients, residual noise, etc. In our recognition system, these components were treated in the same way as the harmonic components. In other words, the feature vectors of these components were calculated and evaluated in the same way as those of the harmonic components. The final decision was made from the labelling scores and the K-NN classification results.
We note that many other classification algorithms could also be applied for labelling the separated components, such as Gaussian Mixture Models (GMMs), which have been used in both automatic speech/speaker recognition and music information retrieval. In this work, we chose the K-NN algorithm due to its simplicity. Moreover, the performance of the single channel source separation system developed here is largely dependent on the separated components provided by the
NMF algorithm. Although the music components obtained by the NMF algorithm are somewhat sparse, their sparsity is not explicitly controlled. Also, we did not explicitly use information from the music signals, such as pitch information and harmonic structure. According to Li et al. [14], information about pitch and common amplitude modulation can be used to improve the separation quality.
5 Conclusions
We have presented a new system for the single channel music sound separation problem. The system essentially integrates two techniques: automatic note decomposition using NMF, and note classification based on the K-NN algorithm. A main assumption of the proposed system is that we have prior knowledge about the type of instruments used for producing the music sounds. The simulation results show that the system produces a reasonable performance for this challenging source separation problem. Future work includes using more robust classification algorithms to improve the note classification accuracy, and incorporating pitch and common amplitude modulation information into the learning algorithm to improve the separation performance of the proposed system.
References
1. Abdallah, S.A., Plumbley, M.D.: Polyphonic Transcription by Non-Negative Sparse
Coding of Power Spectra. In: International Conference on Music Information Re-
trieval, Barcelona, Spain (October 2004)
2. Barry, D., Lawlor, B., Coyle, E.: Real-time Sound Source Separation: Azimuth
Discrimination and Re-synthesis, AES (2004)
3. Brown, G.J., Cooke, M.P.: Perceptual Grouping of Musical Sounds: A Computa-
tional Model. J. New Music Res. 23, 107–132 (1994)
4. Casey, M.A., Westner, W.: Separation of Mixed Audio Sources by Independent
Subspace Analysis. In: Proc. Int. Comput. Music Conf. (2000)
5. Devijver, P.A., Kittler, J.: Pattern Recognition - A Statistical Approach. Prentice
Hall International, Englewood Cliffs (1982)
6. Every, M.R., Szymanski, J.E.: Separation of Synchronous Pitched Notes by Spec-
tral Filtering of Harmonics. IEEE Trans. Audio Speech Lang. Process. 14, 1845–
1856 (2006)
7. Fevotte, C., Bertin, N., Durrieu, J.-L.: Nonnegative Matrix Factorization With the
Itakura-Saito Divergence. With Application to Music Analysis. Neural Computa-
tion 21, 793–830 (2009)
8. FitzGerald, D., Cranitch, M., Coyle, E.: Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation. Computational Intelligence and Neuroscience, Article ID 872425, 15 pages (2008)
9. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press Inc., London (1990)
29. Wang, W., Luo, Y., Chambers, J.A., Sanei, S.: Note Onset Detection via
Non-negative Factorization of Magnitude Spectrum. EURASIP Journal on
Advances in Signal Processing, Article ID 231367, 15 pages (June 2008);
doi:10.1155/2008/231367
30. Wang, W., Cichocki, A., Chambers, J.A.: A Multiplicative Algorithm for Convo-
lutive Non-negative Matrix Factorization Based on Squared Euclidean Distance.
IEEE Transactions on Signal Processing 57, 2858–2864 (2009)
31. Webb, A.: Statistical Pattern Recognition, 2nd edn. Wiley, New York (2005)
32. Woodruff, J., Pardo, B.: Using Pitch, Amplitude Modulation and Spatial Cues for
Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP J.
Adv. Signal Process. (2007)
Notes on Nonnegative Tensor Factorization of
the Spectrogram for Audio Source Separation:
Statistical Insights and Towards Self-Clustering
of the Spatial Cues
1 Introduction
|X_i| ≈ Σ_{k=1}^{K} q_ik |C_k|    (1)

and |C_k| is the matrix containing the modulus of the coefficients of some “latent” components whose precise meaning we will attempt to clarify in this paper. Equivalently, Eq. (1) writes

|x_ifn| ≈ Σ_{k=1}^{K} q_ik w_fk h_nk    (2)

with

v̂_ifn ≜ Σ_{k=1}^{K} q_ik w_fk h_nk    (4)
and where the constraint A ≥ 0 means that the coefficients of matrix A are nonnegative, and d(x|y) is a scalar cost function, taken as the generalized Kullback-Leibler (KL) divergence in [5] or as the Euclidean distance in [11]. Complex-valued STFT estimates Ĉ_k are subsequently constructed using the phase of the observations (typically, ĉ_kfn is given the phase of x_ifn, where i = arg max_i {q_ik} [6]) and then inverted to produce time-domain components. The components pertaining to the same “sources” (e.g., instruments) can then be grouped either manually or via clustering of the estimated spatial cues {q_k}_k.
In this paper we build on these previous works and bring the following
contributions:
where J is the number of sources and s_j(t) = [s_{1j}(t), . . . , s_{ij}(t), . . . , s_{Ij}(t)]^T is the multichannel contribution of source j to the data. Under the common assumptions of point-sources and linear instantaneous mixing, we have

s_ij(t) = s_j(t) a_ij    (6)

where the coefficients {a_ij} define an I × J mixing matrix A, with columns denoted [a_1, . . . , a_J]. In the following we will show that the NTF techniques described in this paper correspond to maximum likelihood (ML) estimation of source and mixing parameters in a model where the point-source assumption is dropped and replaced by

s_ij(t) = s_j^(i)(t) a_ij    (7)

where the signals s_j^(i)(t), i = 1, . . . , I are assumed to share a certain “resemblance”, as modelled by being two different realizations of the same random
x_ifn = Σ_{k=1}^{K} m_ik c_kfn^(i)    (10)

where x_ifn and c_kfn^(i) are the complex-valued STFTs of x_i(t) and c_k^(i)(t), and where f = 1, . . . , F is a frequency bin index and n = 1, . . . , N is a time frame index.
where P(λ) denotes the Poisson distribution, defined in Appendix A, and the KL divergence d_KL(·|·) is defined as

d_KL(x|y) = x log(x/y) + y − x.    (14)
The link between KL-NMF/KL-NTF and inference in composite models with Poisson components has been established in many previous publications, see, e.g., [2,12]. In our opinion, model (12)-(13) suffers from two drawbacks. First, the linearity of the mixing model is assumed on the magnitude of the STFT frames - see Eq. (12) - instead of the frames themselves - see Eq. (10) -, which inherently assumes that the components {c_kfn^(i)}_k have the same phase and that the mixing parameters {m_ik}_k have the same sign, or that only one component is active in every time-frequency tile (t, f). Second, the Poisson distribution is formally only defined on integers, which impairs rigorous statistical interpretation of KL-NTF on non-countable data such as audio spectra.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square
Error (MMSE) estimates of the component amplitudes are given by
Then, time-domain components c_k^(i)(t) are reconstructed through inverse-STFT of c_kfn^(i) = |c_kfn^(i)| arg(x_ifn), where arg(x) denotes the phase of complex-valued x.
Note that our notations are abusive in the sense that the mixing parameters |m_ik| and the components |c_kfn| appearing through their modulus in Eq. (12) are in no way the modulus of the mixing parameters and the components appearing in Eq. (17). Similarly, the matrices W and H represent different types of quantities in each case; in Eq. (13) their product is homogeneous to component magnitudes while in Eq. (18) their product is homogeneous to component variances. Formally we should have introduced variables |c_kfn^KL|, W^KL, H^KL to be distinguished from variables c_kfn^IS, W^IS, H^IS, but we have not in order to avoid cluttering the notations. The difference between these quantities should be clear from the context.
Model (17)-(18) is a truly generative model in the sense that the linear mixing assumption is made on the STFT frames themselves, which is a realistic assumption in audio. Eq. (18) defines a Gaussian variance model of c_kfn^(i); the zero mean assumption reflects the property that the audio frames taken as the input of the STFT can be considered centered, for typical window sizes of about 20 ms or more. The proper Gaussian assumption means that the phase of c_kfn^(i) is assumed to be a uniform random variable [9], i.e., the phase is taken into the model, but in a noninformative way. This contrasts with model (12)-(13), which simply discards the phase information.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square
Error (MMSE) estimates of the components are given by
We would like to underline that the MMSE estimator of components in the STFT domain (21) is equivalent (thanks to the linearity of the STFT and its inverse) to the MMSE estimator of components in the time domain, while the MMSE estimator of STFT magnitudes (15) for KL-NTF is not consistent with time-domain MMSE. Equivalence of an estimator with time-domain signal squared error minimization is an attractive property, at least because it is consistent with a popular objective source separation measure such as the signal to distortion ratio (SDR) defined in [16].
The differences between the two models, termed “KL-NTF.mag” and “IS-NTF.pow”, are summarized in Table 1.

Table 1.
                     KL-NTF.mag                                   IS-NTF.pow
Model
  Mixing model       |x_ifn| = Σ_k |m_ik| |c_kfn^(i)|             x_ifn = Σ_k m_ik c_kfn^(i)
  Comp. distribution |c_kfn^(i)| ∼ P(w_fk h_nk)                   c_kfn^(i) ∼ Nc(0, w_fk h_nk)
ML estimation
  Data               V = |X|                                      V = |X|²
  Parameters         W, H, Q = |M|                                W, H, Q = |M|²
  Approximate        v̂_ifn = Σ_k q_ik w_fk h_nk                   v̂_ifn = Σ_k q_ik w_fk h_nk
  Optimization       min_{Q,W,H≥0} Σ_ifn d_KL(v_ifn|v̂_ifn)        min_{Q,W,H≥0} Σ_ifn d_IS(v_ifn|v̂_ifn)
Reconstruction       |c_kfn^(i)| = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) |x_ifn|
                     ĉ_kfn^(i) = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) x_ifn
We note in the following G the I × F × N tensor with entries g_ifn = d′(v_ifn | v̂_ifn), where d′(x|y) denotes the derivative of the cost with respect to its second argument. For the KL and IS cost functions we have

d′_KL(x|y) = 1 − x/y    (29)

d′_IS(x|y) = 1/y − x/y²    (30)
Let A and B be F × K and N × K matrices. We denote A ◦ B the F × N × K tensor with elements a_fk b_nk, i.e., each frontal slice k contains the outer product a_k b_k^T.¹ Now we note <S, T>_{K_S,K_T} the contracted product between tensors S and T, defined in Appendix B, where K_S and K_T are the sets of mode indices over which the summation takes place. With these definitions we get

W ← W . ( <G⁻, Q ◦ H>_{{1,3},{1,2}} / <G⁺, Q ◦ H>_{{1,3},{1,2}} )    (35)

H ← H . ( <G⁻, Q ◦ W>_{{1,2},{1,2}} / <G⁺, Q ◦ W>_{{1,2},{1,2}} )    (36)
The resulting algorithm can easily be shown to nonincrease the cost function at
each iteration by generalizing existing proofs for KL-NMF [13] and for IS-NMF
[1]. In our implementation normalization of the variables is carried out at the
end of every iteration by dividing every column of Q by its ℓ1 norm and scaling the columns of W accordingly, then dividing the columns of W by their ℓ1 norm and scaling the columns of H accordingly.
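For concreteness, the following NumPy sketch mirrors these multiplicative updates for the KL cost, with G⁺ and G⁻ denoting the positive and negative parts of the derivative in Eq. (29); it is our own illustration (random initialization, fixed iteration count), not the authors' Matlab code.

```python
import numpy as np

def kl_ntf(V, K, n_iter=200, eps=1e-12, seed=0):
    """PARAFAC-NTF of a nonnegative tensor V (I x F x N) under the KL cost,
    with multiplicative updates of Q (I x K), W (F x K) and H (N x K)."""
    rng = np.random.default_rng(seed)
    I, F, N = V.shape
    Q, W, H = rng.random((I, K)), rng.random((F, K)), rng.random((N, K))
    for _ in range(n_iter):
        for target in ('W', 'H', 'Q'):
            Vhat = np.einsum('ik,fk,nk->ifn', Q, W, H) + eps
            Gm = V / Vhat                       # negative part of the KL derivative, Eq. (29)
            Gp = np.ones_like(V)                # positive part of the KL derivative
            if target == 'W':                   # contracted product over modes i and n, Eq. (35)
                W *= np.einsum('ifn,ik,nk->fk', Gm, Q, H) / (np.einsum('ifn,ik,nk->fk', Gp, Q, H) + eps)
            elif target == 'H':                 # contracted product over modes i and f, Eq. (36)
                H *= np.einsum('ifn,ik,fk->nk', Gm, Q, W) / (np.einsum('ifn,ik,fk->nk', Gp, Q, W) + eps)
            else:                               # analogous update for the spatial cues Q
                Q *= np.einsum('ifn,fk,nk->ik', Gm, W, H) / (np.einsum('ifn,fk,nk->ik', Gp, W, H) + eps)
        # l1-normalize the columns of Q and W, rescaling W and H to keep the product unchanged
        s = Q.sum(axis=0); Q /= s; W *= s
        s = W.sum(axis=0); W /= s; H *= s
    return Q, W, H
```

The same skeleton extends to the IS cost by replacing Gm and Gp with the corresponding parts of the derivative in Eq. (30).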
M = AL (37)
1
This is similar to the Khatri-Rao product of A and B, which returns a matrix of
dimensions F N × K with column k equal to the Kronecker product of ak and bk .
110 C. Févotte and A. Ozerov
l_jk = 1 iff k ∈ K_j    (38)
l_jk = 0 otherwise.    (39)
Q = DL (40)
where
i.e.,
∇_D D(V|V̂) = <G, W ◦ H>_{{2,3},{1,2}} L^T    (45)

so that multiplicative updates for D can be obtained as

D ← D . ( <G⁻, W ◦ H>_{{2,3},{1,2}} L^T / <G⁺, W ◦ H>_{{2,3},{1,2}} L^T )    (46)
As before, we normalize the columns of D by their ℓ1 norm at the end of every iteration, and scale the columns of W accordingly.
In our Matlab implementation the resulting multiplicative algorithm for IS-cNTF.pow is 4 times faster than the one presented in [10] (for linear instantaneous mixtures), which was based on sequential updates of the matrices [q_k]_{k∈K_j}, [w_k]_{k∈K_j}, [h_k]_{k∈K_j}. The Matlab code of this new algorithm as well as the other algorithms described in this paper can be found online at http://perso.telecom-paristech.fr/~fevotte/Samples/cmmr10/.
4 Results
We consider source separation of simple audio mixtures taken from the Signal
Separation Evaluation Campaign (SiSEC 2008) website. More specifically, we
used some “development data” from the “underdetermined speech and music
mixtures task” [18]. We considered the following datasets:
Fig. 1. Mixing parameters estimation and ground truth. Top : wdrums dataset. Bot-
tom : nodrums dataset. Left : results of KL-NTF.mag and KL-cNTF.mag; ground
truth mixing vectors {|aj |}j (red), mixing vectors {dj }j estimated with KL-cNTF.mag
(blue), spatial cues {qk }k given by KL-NTF.mag (dashed, black). Right : results of IS-
NTF.pow and IS-cNTF.pow; ground truth mixing vectors {|aj |2 }j (red), mixing vectors
{dj }j estimated with IS-cNTF.pow (blue), spatial cues {qk }k given by IS-NTF.pow
(dashed, black).
Each of the four algorithms was run 10 times from 10 random initializations, for 1000 iterations. For every algorithm we then selected the solutions Q, W and H yielding the smallest cost value. Time-domain components were reconstructed as discussed in Section 2.2 for KL-NTF.mag and KL-cNTF.mag, and as discussed in Section 2.3 for IS-NTF.pow and IS-cNTF.pow. Given these reconstructed components, source estimates were formed as follows:
Note that we are here not reconstructing the original single-channel sources s_j(t) but their multichannel contribution [s_j^(1)(t), . . . , s_j^(I)(t)] to the multichannel data (i.e., their spatial image). The quality of the source image estimates was assessed using the standard Signal to Distortion Ratio (SDR), source Image to Spatial distortion Ratio (ISR), Source to Interference Ratio (SIR) and Source to Artifacts Ratio (SAR) defined in [17]. The numerical results are reported in Table 2. The source estimates may also be listened to online at http://perso.telecom-paristech.fr/~fevotte/Samples/cmmr10/. Figure 1 displays the estimated spatial cues together with the ground truth mixing matrix, for every method and dataset.
Table 2. SDR, ISR, SIR and SAR of source estimates for the two considered datasets.
Higher values indicate better results. Values in bold font indicate the results with best
average SDR.
wdrums dataset:
                   s1 (Hi-hat)   s2 (Drums)   s3 (Bass)
KL-NTF.mag   SDR      -0.2          0.4         17.9
             ISR      15.5          0.7         31.5
             SIR       1.4         -0.9         18.9
             SAR       7.4         -3.5         25.7
KL-cNTF.mag  SDR     -0.02        -14.2          1.9
             ISR      15.3          2.8          2.1
             SIR       1.5        -15.0         18.9
             SAR       7.8         13.2          9.2
IS-NTF.pow   SDR      12.7          1.2         17.4
             ISR      17.3          1.7         36.6
             SIR      21.1         14.3         18.0
             SAR      15.2          2.7         27.3
IS-cNTF.pow  SDR      13.1          1.8         18.0
             ISR      17.0          2.5         35.4
             SIR      22.0         13.7         18.7
             SAR      15.9          3.4         26.5

nodrums dataset:
                   s1 (Bass)   s2 (Lead G.)   s3 (Rhythmic G.)
KL-NTF.mag   SDR      13.2         -1.8            1.0
             ISR      22.7          1.0            1.2
             SIR      13.9         -9.3            6.1
             SAR      24.2          7.4            2.6
KL-cNTF.mag  SDR       5.8         -9.9            3.1
             ISR       8.0          0.7            6.3
             SIR      13.5        -15.3            2.9
             SAR       8.3          2.7            9.9
IS-NTF.pow   SDR       5.0        -10.0           -0.2
             ISR       7.2          1.9            4.2
             SIR      12.3        -13.5            0.3
             SAR       7.2          3.3           -0.1
IS-cNTF.pow  SDR       3.9        -10.2           -1.9
             ISR       6.2          3.3            4.6
             SIR      10.6        -10.9           -3.7
             SAR       3.7          1.0            1.5
contain much bass and lead guitar.² Results from all four methods on this dataset are overall much worse than with dataset wdrums, corroborating the established idea that percussive signals are favorably modeled by NMF models [7]. Increasing the total number of components K did not seem to solve the observed deficiencies of the 4 approaches on this dataset.

² The numerical evaluation criteria were computed using the bss_eval.m function available from the SiSEC website. The function automatically pairs source estimates with ground truth signals according to best mean SIR. This resulted here in pairing left, middle and right blue directions with respectively left, middle and right red directions, i.e., preserving the panning order.
5 Conclusions
In this paper we have attempted to clarify the statistical models latent to audio source separation using PARAFAC-NTF of the magnitude or power spectrogram. In particular we have emphasized that PARAFAC-NTF does not optimally exploit interchannel redundancy in the presence of point-sources. This may still be sufficient to estimate spatial cues correctly in linear instantaneous mixtures, in particular when the NMF model suits the sources well, as seen from the results on dataset wdrums, but it may also lead to incorrect results in other cases, as seen from the results on dataset nodrums. In contrast, methods fully exploiting interchannel dependencies, such as the EM algorithm based on model (17)-(18) with c_kfn^(i) = c_kfn in [10], can successfully estimate the mixing matrix in both datasets. The latter method is however about 10 times more computationally demanding than IS-cNTF.pow.
In this paper we have considered a variant of PARAFAC-NTF in which the loading matrix Q is given a structure such that Q = DL. We have assumed that L is a known labelling matrix that reflects the partition K_1, . . . , K_J. An important perspective of this work is to let the labelling matrix free and automatically estimate it from the data, either under the constraint that every column l_k of L may contain only one nonzero entry, akin to hard clustering, i.e., ‖l_k‖_0 = 1, or more generally under the constraint that ‖l_k‖_0 is small, akin to soft clustering. This should be made feasible using NTF under sparse ℓ1 constraints and is left for future work.
References
1. Cao, Y., Eggermont, P.P.B., Terebey, S.: Cross Burg entropy maximization and its
application to ringing suppression in image reconstruction. IEEE Transactions on
Image Processing 8(2), 286–292 (1999)
2. Cemgil, A.T.: Bayesian inference for nonnegative matrix factorisation models.
Computational Intelligence and Neuroscience (Article ID 785152), 17 pages (2009);
doi:10.1155/2009/785152
3. Févotte, C.: Itakura-Saito nonnegative factorizations of the power spectrogram
for music signal decomposition. In: Wang, W. (ed.) Machine Audition: Principles,
Algorithms and Systems, ch. 11. IGI Global Press (August 2010), http://perso.
telecom-paristech.fr/~fevotte/Chapters/isnmf.pdf
4. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with
the Itakura-Saito divergence. With application to music analysis. Neural Com-
putation 21(3), 793–830 (2009), http://www.tsi.enst.fr/~fevotte/Journals/
neco09_is-nmf.pdf
5. FitzGerald, D., Cranitch, M., Coyle, E.: Non-negative tensor factorisation for sound
source separation. In: Proc. of the Irish Signals and Systems Conference, Dublin,
Ireland (September 2005)
6. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisa-
tion models for musical sound source separation. Computational Intelligence and
Neuroscience (Article ID 872425), 15 pages (2008)
7. Helén, M., Virtanen, T.: Separation of drums from polyphonic music using non-
negative matrix factorization and support vector machine. In: Proc. 13th European
Signal Processing Conference (EUSIPCO 2005) (2005)
8. Lee, D.D., Seung, H.S.: Learning the parts of objects with nonnegative matrix
factorization. Nature 401, 788–791 (1999)
9. Neeser, F.D., Massey, J.L.: Proper complex random processes with applications to
information theory. IEEE Transactions on Information Theory 39(4), 1293–1302
(1993)
Notes on NTF for Audio Source Separation 115
10. Ozerov, A., Févotte, C.: Multichannel nonnegative matrix factorization in convolu-
tive mixtures for audio source separation. IEEE Transactions on Audio, Speech and
Language Processing 18(3), 550–563 (2010), http://www.tsi.enst.fr/~fevotte/
Journals/ieee_asl_multinmf.pdf
11. Parry, R.M., Essa, I.: Estimating the spatial position of spectral components in
audio. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006.
LNCS, vol. 3889, pp. 666–673. Springer, Heidelberg (2006)
12. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to
statistics and computer vision. In: Proc. 22nd International Conference on Machine
Learning, pp. 792–799. ACM, Bonn (2005)
13. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomogra-
phy. IEEE Transactions on Medical Imaging 1(2), 113–122 (1982)
14. Smaragdis, P.: Convolutive speech bases and their application to speech separation.
IEEE Transactions on Audio, Speech, and Language Processing 15(1), 1–12 (2007)
15. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music
transcription. In: IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (WASPAA 2003) (October 2003)
16. Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind au-
dio source separation. IEEE Transactions on Audio, Speech and Language Pro-
cessing 14(4), 1462–1469 (2006), http://www.tsi.enst.fr/~fevotte/Journals/
ieee_asl_bsseval.pdf
17. Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.P.: First stereo audio source
separation evaluation campaign: Data, algorithms and results. In: Davies, M.E.,
James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666,
pp. 552–559. Springer, Heidelberg (2007)
18. Vincent, E., Araki, S., Bofill, P.: Signal Separation Evaluation Campaign.
In: (SiSEC 2008) / Under-determined speech and music mixtures task re-
sults (2008), http://www.irisa.fr/metiss/SiSEC08/SiSEC_underdetermined/
dev2_eval.html
19. Virtanen, T.: Monaural sound source separation by non-negative matrix factor-
ization with temporal continuity and sparseness criteria. IEEE Transactions on
Audio, Speech and Language Processing 15(3), 1066–1074 (2007)
A Standard Distributions

Proper complex Gaussian: Nc(x|μ, Σ) = |πΣ|⁻¹ exp(−(x − μ)^H Σ⁻¹ (x − μ))

Poisson: P(x|λ) = exp(−λ) λ^x / x!
The contracted tensor product should be thought of as a form of generalized dot product of two tensors along common modes of the same dimensions.
What Signal Processing Can Do for the Music
1 Introduction
Signal processing techniques are a powerful set of mathematical tools that allow one to obtain from a signal the information required for a certain purpose. Signal processing techniques can be used for any type of signal: communication signals, medical signals, speech signals, multimedia signals, etc. In this contribution, we focus on the application of signal processing techniques to music information: audio and scores.
Signal processing techniques can be used for music database exploration. In
this field, we present a 3D adaptive environment for music content exploration
that allows the exploration of musical contents in a novel way. The songs are
analyzed and a series of numerical descriptors are computed to characterize
their spectral content. Six main musical genres are defined as axes of a multidi-
mensional framework, where the songs are projected. A three-dimensional sub-
domain is defined by choosing three of the six genres at a time and the user is
allowed to navigate in this space, browsing, exploring and analyzing the elements
of this musical universe. Also, inside this field of music database exploration, a
novel method for music similarity evaluation is presented. The evaluation of mu-
sic similarity is one of the core components of the field of Music Information
Retrieval (MIR). In this study, rhythmic and spectral analyses are combined to
extract the tonal profile of musical compositions and evaluate music similarity.
Music signal processing can be used also for the preservation of the cultural
heritage. In this sense, we have developed a complete system with an interactive
graphical user interface for Optical Music Recognition (OMR), specially adapted
for scores written in white mensural notation. Color photographs of ancient scores taken at the Archivo de la Catedral de Málaga have been used as input to
the system. A series of pre-processing steps are aimed to improve their quality
and return binary images to be processed. The music symbols are extracted and
classified, so that the system is able to transcribe the ancient music notation
into modern notation and make it sound.
Music signal processing can also be focused in developing tools for technology-
enhanced learning and revolutionary learning appliances. In this sense, we present
different applications we have developed to help learning different instruments:
piano, violin and guitar. The graphical tool for piano learning we have developed is able to detect whether a person is playing the proper piano chord. The graphical tool
shows to the user the time and frequency response of each frame of piano sound
under analysis and a piano keyboard in which the played notes are highlighted
as well as the name of the played notes. The core of the designed tool is a poly-
phonic transcription system able to detect the played notes, based on the use of
spectral patterns of the piano notes. The designed tool is useful both for users with knowledge of music and for users without such knowledge. The violin learning
tool is based on a transcription system able to detect the pitch and duration of
the violin notes and to identify the different expressiveness techniques: détaché
with and without vibrato, pizzicato, tremolo, spiccato, flageolett-töne. The in-
terface is a pedagogical tool to aid violin learning. For the guitar, we have
developed a system able to perform in real time string and fret estimation of
guitar notes. The system works in three modes: it is able to estimate the string
and fret of a single note played on a guitar, strummed chords from a predefined
list and it is also able to make a free estimation if no information of what is
being played is given. Also, we have developed a lightweight pitch detector for
embedded systems to be used in toys. The detector is based on neural networks
in which the signal preprocessing is a frequency analysis. The selected neural
network is a perceptron-type network. For the preprocessing, the Goertzel Al-
gorithm is the selected technique for the frequency analysis because it is a light
alternative to FFT computing and it is very well suited when only few spectral
points are enough to extract the relevant information.
Therefore, the outline of the paper is as follows. In Section 2, musical content
management related tools are presented. Section 3 is devoted to the presentation
of the tool directly related to the preservation of the cultural heritage. Section 4
will present the different tools developed for technology-enhanced music learning.
Finally, the conclusions are presented in Section 5.
musical contents and it also gives the opportunity to the user to get to know
certain types of music that he would not have found with other more traditional
ways of searching musical contents.
In order to use a 3D environment as the one presented, or other types of
methods for music information retrieval, the evaluation of music similarity is one
of the core components. In subsection 2.2, the rhythmic and spectral analyses of
music contents are combined to extract the tonal profile of musical compositions
and evaluate music similarity.
Fig. 1. The graphical user interface for the 3D exploration of musical audio
the extraction of low-level time and frequency descriptors [25] or on the compu-
tation of rhythmic patterns [21]. Logan and Salomon [26] use the Mel Frequency
Cepstral Coefficients (MFCCs) as a main tool to compare audio tracks, based
on their spectral content. Ellis et al. [13] adopt the cross-correlation of rhythmic
patterns to identify common parts among songs.
In this study, rhythmic and spectral analyses are combined to extract the tonal
profile of musical compositions and evaluate music similarity. The processing
stage comprises two main steps: the computation of the main rhythmic meter of
the song and the estimation of the distribution of contributions of tonalities to
the overall tonal content of the composition. The calculus of the cross-correlation
of the rhythmic pattern of the envelope of the raw signal allows a quantitative
estimation of the main melodic motif of the song. Such temporal unit has to be
employed as a base for the temporal segmentation of the signal, aimed to extract
the pitch class profile of the song [14] and, consequently, the vector of tonality
contributions. Finally, this tonal behavior vector is employed as the main feature
to describe the song and it is used to evaluate similarity.
Estimation of the melodic cell. In order to characterize the main melodic
motif of the track, the songs are analyzed to estimate the tempo. More than a
real quantitative metrical calculus of the rhythmic pattern, the method aims at
delivering measures for guiding the temporal segmentation of the musical signal,
and at subsequently improving the representation of the song dynamics. This
is aimed at optimizing the step for the computation of the tonal content of the
audio signal, supplying the reference temporal frame for the audio windowing.
The aim of the tempo induction is to estimate the width of the window used for
windowing, so that the stage for the computation of the tonal content of the song
120 I. Barbancho et al.
1. The raw signal is half-way rectified and filtered with a low-pass Butterworth
filter, with a cut-off frequency of 100 Hz [12].
2. The envelope of the filtered signal is computed, using a low-pass Butterworth
filter with a cut-off frequency of 1 Hz.
3. The first order derivative is computed on the envelope.
4. The zero-crossing points of the derivative are found (the crests and the
troughs of the envelope).
5. The difference between crests and troughs is computed and its empirical
cumulative distribution is evaluated.
6. Only the values exceeding the 75th percentile of their cumulative distribu-
tions are kept.
7. The temporal distances among the selected troughs (or crests) are computed
and the average value is calculated.
Table 1. Relative and absolute differences among the widths of the melodic window
manually evaluated by the listeners and the ones automatically computed by the pro-
posed algorithm
In Table 1, the differences between the widths of the window manually measured
and automatically computed are shown.
The best results are obtained for the Disco music tracks (6.2%), where the
clear drummed bass background is well detected and the pulse coincides most of
times with the tempo. The worst results are related to the lack of a clear driving
bass in Classical music (21.2%), where the changes in time can be frequent and
a uniform tempo measure is hardly detectable.
However, the beats, or lower-level metrical features are, most of the times,
submultiples of such tempo value, which make them usable for the melodic cell
computation.
Tonal behavior. Most of the music similarity systems aim at imitating the
human perception of a song. This capacity is complex to analyze. The human
brain carries out a series of subconscious processes, as the computation of the
rhythm, the instruments richness, the musical complexity, the tonality, the mode,
the musical form or structure, the presence of modulations, etc., even without
any technical musical knowledge [29].
A novel technique for the determination of the tonal behavior of music signals
based on the extraction of the pattern of tonality contributions is presented.
The main process is based on the calculus of the contributions of each note of
the chromatic scale (Pitch Class Profile - PCP), and the computation of the
possible matching tonalities. The outcome is a vector reflecting the variation of
the spectral contribution of each tonality throughout the entire piece. The song
is time windowed with no overlapping windows, whose width is determined on
the basis of the tempo induction algorithm.
The Pitch Class Profile is based on the contribution of the twelve semitone
pitch classes to the whole spectrum. Fujishima [14] employed the PCPs as main
tool for chord recognition, while İzmirli [22] defined them as ‘Chroma Template’
and used them for audio key finding. Gomez and Herrera [16] applied machine
learning methods to the ‘Harmonic Pitch Class Profile’, to estimate the tonalities
of polyphonic audio tracks.
The spectrum of the whole audio is analyzed, and the distribution of the
strengths of all the tones is evaluated. The different octaves are grouped to
measure the contribution of the 12 basic tones. A detailed description follows.
122 I. Barbancho et al.
where Xs is the simplified spectrum, the index k covers the twelve semitone
pitches and i is used to index each octave. The subscript t stands for the temporal
frame for which the PCP is computed.
In order to estimate the predominant tonality of a track, it is important to define
a series of PCPs for all the possible tonalities, to be compared with its own PCP.
The shape of the PCP mainly depends on the modality of the tonality (Major or
Minor). Hence, by assembling only two global profiles, for major and minor modes,
and by shifting each of them twelve times according to the tonic pitch of the twelve
possible tonalities of each mode, 24 tonalities profiles are obtained.
Krumhansl [24] defined the profiles empirically, on the base of a series of listen-
ing sessions carried out on a group of undergraduates from University of Harvard,
who had to evaluate the correspondence among test tracks and probe tones. The
author presented two global profiles, one for major and one for minor mode, rep-
resenting the global contribution of each tone to all the tonalities for each mode.
More recently, Temperley [35] presented a modified less biased version of
Krumhansl profiles. In this context we propose a revised version of the Krumhansl’s
profiles with the aim of avoiding the bias of the system for a particular mode. Ba-
sically, the two mode profiles are normalized to show the same sum of values and,
then, their profiles are divided by their corresponding maximums.
For each windowed frame of the track, the squared Euclidean distance be-
tween the PCP of the frame and each tonality profile is computed to define a
24-elements vector. Each element of the vector is the sum of the squared differ-
ences between the amplitudes of the PCP and the tonality profiles. The squared
distance is defined as follow:
⎧ 11
⎪
⎪
⎨ [(PM (j + 1)) − P CPt ((j + k − 1) mod 12 + 1)]2 1 ≤ k ≤ 12
j=0
Dt (k) =
⎪
⎪ 11
⎩ [(Pm (j + 1)) − P CPt ((j + k − 1) mod 12 + 1)]2 13 ≤ k ≤ 24
j=0
(2)
where Dt (k) is the squared distance computed at time t of the k-th tonality, with
k ∈ {1, 2, ..., 24}, and PM /Pm are, respectively, the major and minor profile.
The predominant tonality of each frame corresponds to the minimum of
the distance vector Dt (k), where the index k, with k ∈ {1, ..., 12}, refers to
the twelve major tonalities (from C to B) and k, with k ∈ {13, ..., 24}, refers
What Signal Processing Can Do for the Music 123
Tonal behavior
1
0.8
Normalized Amplitude
0.6
0.4
0.2
0
C C# D Eb E F F# G Ab A Bb B c c# d d# e f f# g g# a a# b
Tonalities
Fig. 2. An example of the tonal behavior of the Beatles’ song “I’ll be back”, where the
main tonality is E major
to the twelve minor tonalities (from c to b). Usually major and minor tonalities
are represented with capital and lower-case letter respectively.
The empirical distribution of all the predominant tonalities, estimated through-
out the entire piece, is calculated in order to represent the tonality contributions
to the tonal content of the song. This is defined as the ‘tonal behavior’ of the com-
position. In Figure 2, an example of the distribution of the tonality contributions
for the Beatles’ song “I’ll be back” is shown.
Music similarity. The vectors describing the tonal behavior of the songs are
employed to measure their reciprocal degree of similarity. In fact the human
brain is able to detect the main melodic pattern, even by means of subconscious
processes and its perception of musical similarity is partially based on it [24].
The tonal similarity among the songs is computed by the Euclidean distance
of the tonal vector calculated, following the equation:
T SAB = TA − TB (3)
where T SAB stands for the coefficient of tonal similarity between the songs A
and B and TA and TB are the empirical tonality distributions for song A and
B, respectively.
A robust evaluation of the performance of the proposed method for evalua-
tion of music similarity is very hard to achieve. The judgment of the similarity
among audio files is a very subjective issue, showing the complex reality of hu-
man perception. Nevertheless, a series of tests have been performed on some
predetermined lists of songs.
Four lists of 11 songs have been submitted to a group of ten listeners. They
were instructed to sort the songs according to their perceptual similarity and tonal
similarity. For each list, a reference song was defined and the remaining 10 songs
had to be sorted with respect to their degree of similarity with the reference one.
124 I. Barbancho et al.
A series of 10-element lists were returned by the users, as well as by the automatic
method. Two kinds of experimental approaches were carried out: in the first ex-
periment, the users had to listen to the songs and sort them according to a global
perception of their degree of similarity. In the second framework, they were asked
to focus only on the tonal content. The latter was the hardest target to obtain,
because of the complexity of discerning the parameters to be taken into account
when listening to a song and evaluating its similarity with respect to other songs.
The degree of coherence among the list manually sorted and the ones auto-
matically processed was obtained. A weighted matching score for each pair of
lists was computed, the reciprocal distance of the songs (in terms of the position
index in the lists) was calculated. Such distances were linearly weighted, so that
the first songs in the lists reflected more importance than the last ones. In fact, it
is easier to evaluate which is the most similar song among pieces that are similar
to the reference one, than performing the same selection among very different
songs. The weights aid to compensate for this bias.
Let Lα and Lβ represent two different ordered lists of n songs, for the same
reference song. The matching score C has been computed as follows:
n
C= |i − j| · ωi (4)
i=1
where i and j are the indexes for lists Lα and Lβ , respectively, such that j is the
index of the j-th song in list Lβ with Lα (i) ≡ Lβ (j). The absolute difference is
linearly weighted by the weights
n ωi normalized such to sum to one, defined by
the following expression: i=1 ωi = 1. Finally, the scores are transformed to be
represented as percentage of the maximum score attainable.
The efficiency of the automatic method was evaluated by measuring its co-
herence with the users’ response. The closer the two set of values, the better the
performance of the automatic method. As expected, the evaluation of the auto-
matic method in the first experimental framework did not return reliable results
because of the extreme deviation of the marks, due to the scarce relevance of the
tones distribution in the subjective judgment of the song. As mentioned before,
the tonal behavior of the song is only one of the parameters taken into account
subconsciously by the human ear. Nevertheless, if the same songs were asked to
be evaluated only by their tonal content, the scores drastically decreased, reveal-
ing the extreme lack of abstraction of the human ear. In Table 2 the results for
both experimental frameworks are shown.
The differences between the results of the two experiments are evident. Con-
cerning the first experiment, the mean score correspondence is 74.2%, among
the users lists and 60.1% among the users and the automatic list. That is, the
automatic method poorly reproduces the choices made by the users, taking into
account a global evaluation of music similarity. Conversely, in the second ex-
periment, better results were obtained. The mean correspondence score for the
users’ lists decrease to 61.1%, approaching the value returned by the users and
automatic list together, 59.8%. The performance of the system can be considered
to be similar to the behavior of a mean human user, regarding the perception of
tonal similarities.
What Signal Processing Can Do for the Music 125
Table 2. Means and standard deviations of the correspondence scores obtained com-
puting equation (4). The raws ‘Auto+Users’ and ‘Users’ refer to the correspondence
scores computed among the users lists together with the automatic list and among
only the users lists, respectively. The ‘Experiment 1’ is done listening and sorting the
songs on the base of a global perception of the track, while ‘Experiment 2’ is performed
trying to take into account only the tone distributions.
Experiment 1 Experiment 2
Lists Method
Mean St. Dev. Mean St. Dev.
Auto+Users 67.6 7.1 66.6 8.8
List A
Users 72.3 13.2 57.9 11.5
Auto+Users 63.6 1.9 66.3 8.8
List B
Users 81.8 9.6 66.0 10.5
Auto+Users 61.5 4.9 55.6 10.2
List C
Users 77.2 8.2 57.1 12.6
Auto+Users 47.8 8.6 51.0 9.3
List D
Users 65.7 15.4 63.4 14.4
Fig. 3. A snapshot of some of the main windows of interface of the OMR system
Fig. 4. A snapshot of the main windows of the interface of the tool for piano learning
– A piano keyboard in which the played notes are highlighted as well as the
name of the played notes is shown at the bottom.
selected position and ask the application to correct it, returning the errors made.
Otherwise, in the ‘free practice’ sub-section, any kind of violin recording can be
analyzed for its melodic content, detecting the pitch, the duration of the notes
and the techniques employed (e.g.:détaché with and without vibrato, pizzicato,
tremolo, spiccato, flageolett-töne). The user can also visualize the envelope and
the spectrum of the each note and listen to the MIDI transcription generated.
In Figure 5, some snapshots of the interface are shown. The overall performance
attained by our system in the detection and correction of the notes and expres-
siveness is 95.4%.
The guitar is one of the most popular musical instruments nowadays. In contrast
to other instruments like the piano, in the guitar the same note can be played
plucking different strings at different positions. Therefore, the algorithms used
for piano transcription [10] cannot be used for guitar. In guitar transcription it
is important to estimate the string used to play a note [7].
Fig. 5. Three snapshots of the interface for violin learning are shown. Clockwise from
top left: the main window, the analysis window and a plot of the MIDI melody.
130 I. Barbancho et al.
The system presented in this demonstration is able to estimate the string and
the fret of a single note played with a very low error probability. In order to keep
a low error probability when a chord is strummed on a guitar, the system chooses
which chord has been most likely played from a predefined list. The system works
with classical guitars as well as acoustic or electric guitars. The sound has to
be captured with a microphone connected to the computer soundcard. It is also
possible to plug a cable from an electric guitar to the sound card directly.
The graphical interface consists of a main window (Figure 6(a)) with a pop-up
menu where you can choose the type of guitar you want to use with the interface.
The main window includes a panel (Estimación) with three push buttons, where
you can choose between three estimation modes:
– The mode Nota única (Figure 6(b)) estimates the string and fret of a single
note that is being played and includes a tuner (afinador ).
– The mode Acorde predeterminado estimates strummed chords that are being
played. The system estimates the chord by choosing the most likely one from
a predefined list.
– The last mode, Acorde libre, makes a free estimation of what is being played.
In this mode the system does not have the information of how many notes
are being played, so this piece of information is also estimated.
Each mode includes a window that shows the microphone input, a window with
the Fourier Transform of the sound sample, a start button, a stop button and an
exit button (Salir ). At the bottom of the screen there is a panel that represents
a guitar. Each row stands for a string on the guitar and the frets are numbered
from one to twelve. The current estimation of the sound sample, either note or
chord, is shown on the panel with a red dot.
Audio in
Pitch
I2C out Preprocessing
Detection
AVR ATMEGA168
Figure 7 shows the block diagram of the detection system. This figure shows
the hardware connected to the microcontroller’s A/D input, which consists of a
preamplifier in order to accommodate the input from the electret microphone
into the A/D input range and an anti-aliasing filter. The anti-aliasing filter pro-
vides 58dB of attenuation at cutoff which is enough to ensure the anti-aliasing
function. After the filter, the internal A/D converter of the microcontroller is
used. After conversion, a buffer memory is required in order to store enough sam-
ples for the preprocessing block. The output of the preprocessing block is used for
pitch detection using a neural network. Finally, an I2C (Inter-Integrated Circuit)
[32] interface is used for connecting the microcontroller with other boards.
We use the open source Arduino environment [1] with the AVR ATMEGA168
microcontroller [2] for development and testing of the pitch detection implemen-
tation. The system will be configured to detect the notes between A3 (220Hz)
What Signal Processing Can Do for the Music 133
and G#5 (830.6Hz), following the well-tempered scale, as it is the system mainly
used in Western music. This range of notes has been selected because one of the
applications of the proposed system is the detection of vocal music of children
and adolescents.
The aim of the preprocessing stage is to transform the samples of the audio
signal from the time domain to the frequency domain. The Goertzel algorithm
[15], [30] is a light alternative to FFT computing if the interest is focused only
in some of the spectrum points, as in this case. Given the frequency range of
the musical notes in which the system is going to work, along with the sampling
restriction of the selected processor, the selected sampling frequency is fs =
4KHz and the number of input samples N = 400, that obtain a precision of
10Hz, which is sufficient for the pitch detection system. On the other hand, in
the preprocessing block, the number of frequencies fk , in which the Goertzel
p
algorithm is computed, is 50 and are given according to fp = 440 · 2 12 Hz
with p = −24, −23, ..., 0, ..., 24, 25, so that, each note in the range of interest
have, at least, one harmonic and one subharmonic to improve the detection
performance of notes with octave or perfect fifth relation. Finally, the output of
the preprocessing stage is a vector that contains the squared modulus of the 50
points of interest of the Goertzel algorithm: the points of the power spectrum of
the input audio signal in the frequencies of interest.
For the algorithm implemented using fixed-point arithmetic, the execution
time is less than 3ms on a 16 MIPS AVR microcontroller. The number of points
of the Goertzel algorithm are limited by the available memory. The Eq. 5 shows
the number of bytes required to implement the algorithm.
N
nbytes = 2 + 2N + m (5)
4
In this expression, m represents the number of desired frequency points. Thus
with m = 50 points and N = 400, the algorithm requires 1900bytes of RAM
memory for signal input/processing/output buffering. Since the microcontroller
has 1024bytes of RAM memory, it is necessary to use an external high-speed SPI
RAM memory in order to have enough memory for buffering audio samples.
Once the Goertzel Algorithm has been performed and the points are stored in
the RAM memory, a recognition algorithm has to be executed for pitch detection.
A useful alternative to spectral processing techniques consist of using artificial
intelligence techniques. We use a statically trained neural network storing the
network weights vectors in a EEPROM memory. Thus, the network training is
performed in a computer with the same algorithm implemented and the embed-
ded system only runs the network. Figure 8 depicts the structure of the neural
network used for pitch recognition. It is a multilayer feed-forward perceptron
with a back-propagation training algorithm.
In our approach, sigmoidal activation has been used for each neuron as well
as no neuron bias. This provides a fuzzy set of values, yj , at the output of each
neural layer. The fuzzy set is controlled by the shape factor, α, of the sigmoid
function, which is set to 0.8, and it is applied to a threshold-based decision func-
tion. Hence, outputs below 0.5 does not activate output neurons while values
134 I. Barbancho et al.
1
1
1
2 2
2
3
3
50 4
24
5
Hidden Output
Input Layer
Layer Layer
G#5
G5 Ideal Output
F#5 Validation Test
F5 Learning Test
E5
D#5
D5
C#5
C5
B4
Output (Note)
A#4
A4
G#4
G4
F#4
F4
E4
D#4
D4
C#4
C4
B3
A#3
A3
0
A3 A#3 B3 C4 C#4 D4 D#4 E4 F4 F#4 G4 G#4 A4 A#4 B4 C5 C#5 D5 D#5 E5 F5 F#5 G5 G#5
Input (Note)
Fig. 9. Learning test, validation test and ideal output of the designed neural network
above 0.5 activate output neurons. The neural network parameters such as the
number of neurons in the hidden layer or the shape factor of the sigmoid func-
tion has been determined experimentally. The neural network has been trained
by running the BPN (Back Propagation Neural Network) on a PC. Once the
network convergence is achieved, the weight vectors are stored. Regarding the
output layer of the neural network, we use five neurons to encode 24 different
outputs corresponding to each note in two octaves (A3 − G#5 notation).
The training and the evaluation of the proposed system has been done using
independent note samples taken from the Musical Instrument Data Base RWC
[19]. The selected instruments have been piano and human voice. The training
of the neural network has been performed using 27 samples for each note in the
What Signal Processing Can Do for the Music 135
range of interest. Thus, we used 648 input vectors to train the network. This
way, the network convergence was achieved with an error of 0.5%.
In Figure 9, we show the learning characteristic of the network when simulating
the network with the training vectors. At the same time, we show the validation
test using 96 input vectors (4 per note) which corresponds to about 15% of new
inputs. As shown in Figure 9, the inputs are correctly classified due to the small
difference among the outputs for the ideal, learning and validation inputs.
5 Conclusions
Acknowledgments
This work has been funded by the Ministerio de Ciencia e Innovación of the Span-
ish Government under Project No. TIN2010-21089-C03-02, by the Junta de
Andalucı́a under Project No. P07-TIC-02783 and by the Ministerio de Industria,
Turismo y Comercio of the Spanish Government under Project No. TSI-020501-
2008-117. The authors are grateful to the person in charge of the Archivo de la
Catedral de Málaga, who allowed the utilization of the data sets used in this work.
References
1. Arduino board, http://www.arduino.cc (last viewed February 2011)
2. Atmel corporation web side, http://www.atmel.com (last viewed February 2011)
3. Aliev, R.: Soft Computing and its Applications. World Scientific Publishing Com-
pany, Singapore (2001)
136 I. Barbancho et al.
4. Barbancho, A.M., Barbancho, I., Fernandez, J., Tardón, L.J.: Polyphony number
estimator for piano recordings using different spectral patterns. In: 128th Audio
Engineering Society Convention (AES 2010), London, UK (2010)
5. Barbancho, A.M., Tardón, L., Barbancho, I.: CDMA systems physical function level
simulation. In: IASTED International Conference on Advances in Communication,
Rodas, Greece (2001)
6. Barbancho, A.M., Tardón, L.J., Barbancho, I.: PIC detector for piano chords.
EURASIP Journal on Advances in Signal Processing (2010)
7. Barbancho, I., Tardón, L.J., Barbancho, A.M., Sammartino, S.: Pitch and played
string estimation in classic and acoustic guitars. In: Proc. of the 126th Audio
Engineering Society Convention (AES 126th), Munich, Germany (May 2009)
8. Barbancho, I., Bandera, C., Barbancho, A.M., Tardón, L.J.: Transcription and
expressiveness detection system for violin music. In: IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), Taipei, Taiwan, pp. 189–192 (2009)
9. Barbancho, I., Segura, C., Tardón, L.J., Barbancho, A.M.: Automatic selection of
the region of interest in ancient scores. In: IEEE Mediterranean Electrotechnical
Conference (MELECON 2010), Valletta, Malta (May 2010)
10. Bello, J.: Automatic piano transcription using frequency and time-domain informa-
tion. IEEE Transactions on Audio, Speech and Language Processing 14(6), 2242–
2251 (2006)
11. Boo, W., Wang, Y., Loscos, A.: A violin music transcriber for personalized learning.
In: IEEE Int. Conf. on Multimdia and Expo. (ICME), Toronto, Canada, pp. 2081–
2084 (2006)
12. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: Proceedings of the International Conference on Music Information
Retrieval (ISMIR 2003), October 26-30, pp. 159–165. John Hopkins University,
Baltimore, USA (2003)
13. Ellis, D.P.W., Cotton, C.V., Mandel, M.I.: Cross-correlation of beat-synchronous
representations for music similarity. In: Proceedings of the IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, USA, pp.
57–60 (2008), http://mr-pc.org/work/icassp08.pdf (last viewed February 2011)
14. Fujishima, T.: Realtime chord recognition of musical sound: a system using com-
mon lisp music. In: Proc. International Computer Music Association, ICMC 1999,
pp. 464–467 (1999), http://ci.nii.ac.jp/naid/10013545881/en/ (last viewed,
February 2011)
15. Goertzel, G.: An algorithm for the evaluation of finite trigonomentric series. The
American Mathematical Monthly 65(1), 34–35 (1958)
16. Gómez, E., Herrera, P.: Estimating the tonality of polyphonic audio files: Cognitive
versus machine learning modelling strategies. In: Proc. Music Information Retrieval
Conference (ISMIR 2004), Barcelona, Spain, pp. 92–95 (2004)
17. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall
Inc., Upper Saddle River (2006)
18. Goto, M.: Development of the RWC music database. In: 18th Int. Congress on
Acoustics., pp. I-553–I-556 (2004)
19. Goto, M.: Development of the RWC music database. In: Proc. of the 18th Interna-
tional Congress on Acoustics ICA 2004, Kyoto, Japan, pp. 553–556 (April 2004)
20. Gouyon, F.: A computational approach to rhythm description — Audio features
for the computation of rhythm periodicity functions and their use in tempo in-
duction and music content processing. Ph.D. thesis, Ph.D. Dissertation. UPF
(2005), http://www.mtg.upf.edu/files/publications/9d0455-PhD-Gouyon.pdf
(last viewed February 2011)
What Signal Processing Can Do for the Music 137
21. Holzapfel, A., Stylianou, Y.: Rhythmic similarity of music based on dynamic pe-
riodicity warping. In: IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2008, Las Vegas, USA, March 31- April 4, pp. 2217–
2220 (2008)
22. Izmirli, Ö.: Audio key finding using low-dimensional spaces. In: Proc. Music Infor-
mation Retrieval Conference, ISMIR 2006, Victoria, Canada, pp. 127–132 (2006)
23. Klapuri, A.: Automatic music transcription as we know it today. Journal of New
Music Research 33(3), 269–282 (2004)
24. Krumhansl, C.L., Kessler, E.J.: Tracing the dynamic changes in perceived tonal
organization in a spatial representation of musical keys. Psychological Review 89,
334–368 (1982)
25. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Individualization of mu-
sic similarity perception via feature subset selection. In: Proc. Int. Conference on
Systems, Man and Cybernetics, Massachusetts, USA, vol. 1, pp. 552–556 (2004)
26. Logan, B., Salomon, A.: A music similarity function based on signal analysis.
In: IEEE International Conference on Multimedia and Expo., ICME 2001,Tokyo,
Japan, pp. 745–748 (August 2001)
27. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proc. Music
Information Retrieval Conference(ISMIR 2000) (2000)
28. Marolt, M.: A connectionist approach to automatic transcription of polyphonic
piano music. IEEE Transactions on Multimedia 6(3), 439–449 (2004)
29. Ockelford, A.: On Similarity, Derivation and the Cognition of Musical Structure.
Psychology of Music 32(1), 23–74 (2004), http://pom.sagepub.com/cgi/content/
abstract/32/1/23 (last viewed February 2011)
30. Oppenheim, A., Schafer, R.: Discrete-Time Signal Processing. Prentice-Hall, En-
glewood Cliffs (1989)
31. Pampalk, E.: Islands of music - analysis, organization, and visualization of music
archives. Vienna University of Technology, Tech. rep. (2001)
32. Philips: The I2C bus specification v.2.1. (2000), http://www.nxp.com (last viewed
February 2011)
33. Prasad, B., Mahadeva, S.: Speech, Audio, Image and Biomedical Signal Processing
using Neural Networks. Springer, Heidelberg (2004)
34. Tardón, L.J., Sammartino, S., Barbancho, I., Gómez, V., Oliver, A.J.: Optical
music recognition for scores written in white mensural notation. EURASIP Jour-
nal on Image and Video Processing 2009, Article ID 843401, 23 pages (2009),
doi:10.1155/2009/843401
35. Temperley, D.: The Cognition of Basic Musical Structures. The MIT Press, Cam-
bridge (2004)
36. William, W.K.P.: Digital image processing, 2nd edn. John Wiley & Sons Inc., New
York (1991)
Speech/Music Discrimination in Audio Podcast
Using Structural Segmentation and Timbre
Recognition
Centre for Digital Music, Queen Mary University of London, Mile End Road, London
E1 4NS, United Kingdom
{mathieu.barthet,steven.hargreaves,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic/
1 Introduction
Increasing amounts of broadcast material are being made available in the pod-
cast format which is defined in reference [52] as a “digital audio or video file
that is episodic; downloadable; programme-driven, mainly with a host and/or
theme; and convenient, usually via an automated feed with computer software”
(the word podcast comes from the contraction of webcast, a digital media file
distributed over the Internet using streaming technology, and iPod, the portable
media player by Apple). New technologies have indeed emerged allowing users
Correspondence should be addressed to Mathieu Barthet.
S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 138–162, 2011.
c Springer-Verlag Berlin Heidelberg 2011
Speech/Music Discrimination Using Timbre Models 139
to access audio podcasts material either online (on radio websites such as the
one from the BBC used in this study: http://www.bbc.co.uk/podcasts), or
offline, after downloading the content on personal computers or mobile devices
using dedicated services. A drawback of the podcast format, however, is its lack of
indexes for individual songs and sections, such as speech. This makes navigation
through podcasts a difficult, manual process, and software built on top of auto-
mated podcasts segmentation methods would therefore be of considerable help
for end-users. Automatic segmentation of podcasts is a challenging task in speech
processing and music information retrieval since the nature of the content from
which they are composed is very broad. A non-exhaustive list of type of content
commonly found in podcast includes: spoken parts of various types depending
on the characteristics of the speakers (language, gender, number, etc.) and the
recording conditions (reverberation, telephonic transmission, etc.), music tracks
often belonging to disparate musical genres (classical, rock, jazz, pop, electro,
etc.) and which may include a predominant singing voice (source of confusion
since the latter intrinsically shares properties with the spoken voice), jingles and
commercials which are usually complex sound mixtures including voice, music,
and sound effects. One step of the process of automatically segmenting and an-
notating podcasts therefore is to segregate sections of speech from sections of
music. In this study, we propose two computational models for speech/music
discrimination based on structural segmentation and/or timbre recognition and
evaluate their performances in the classification of audio podcasts content. In
addition to their use with audio broadcast material (e.g. music shows, inter-
views) as assessed in this article, speech/music discrimination models may also
be of interest to enhance navigation into archival sound recordings that con-
tain both spoken word and music (e.g. ethnomusicology interviews available on
the online sound archive from the British Library: https://sounds.bl.uk/). If
speech/music discrimination models find a direct application in automatic audio
indexation, they may also be used as a preprocessing stage to enhance numerous
speech processing and music information retrieval tasks such as speech and mu-
sic coding, automatic speaker recognition (ASR), chord recognition, or musical
instrument recognition.
The speech/music discrimination methods proposed in this study rely on
timbre models (based on various features such as the line spectral frequen-
cies [LSF], and the mel-frequency cepstral coefficients [MFCC]), and machine
learning techniques (K-means clustering and hidden Markov models [HMM]).
The first proposed method comprises an automatic timbre recognition (ATR)
stage using the model proposed in [7] and [16] trained here with speech and
music content. The results of the timbre recognition system are then post-
processed using a median filter to minimize the undesired inter-class switches.
The second method utilizes the automatic structural segmentation (ASS) model
proposed in [35] to divide the signal into a set of segments which are homoge-
neous with respect to timbre before applying the timbre recognition procedure.
A database of classical music, jazz, and popular music podcasts from the BBC
was manually annotated for training and testing purposes (approximately 2,5
140 M. Barthet, S. Hargreaves, and M. Sandler
hours of speech and music). The methods were both evaluated at the semantic
level to measure the accuracy of the machine estimated classifications, and at the
temporal level to measure the accuracy of the machine estimated boundaries be-
tween speech and music sections. Whilst studies on speech/music discrimination
techniques usually provide the first type of evaluation (classification accuracy),
boundary retrieval performances are not reported to our knowledge, despite their
interest. The results of the proposed methods were also compared with those ob-
tained with a state-of-the-art’s speech/music discrimination algorithm based on
support vector machine (SVM) [44].
The remainder of the article is organized as follows. In section 2, a review
of related works on speech/music discrimination is proposed. Section 3, we give
a brief overview of timbre research in psychoacoustics, speech processing and
music information retrieval, and then describe the architecture of the proposed
timbre-based methods. Section 4 details the protocols and databases used in the
experiments, and specifies the measures used to evaluate the algorithms. The
results of the experiments are given and discussed in section 5. Finally, section
6 is devoted to the summary and the conclusions of this work.
2 Related Work
samples above and below the mean). When both the ZCR and energy-based
features were used jointly with a supervised machine learning technique relying
on a multivariate-Gaussian classifier, a 98% accuracy was obtained on average
(speech and music) using 2.4 s-long audio segments. The good performance of
the algorithm can be explained by the fact that the zero-crossing rate is a good
candidate to discern unvoiced speech (fricatives) with a modulated noise spec-
trum (relatively high ZCR) from voiced speech (vowels) with a quasi-harmonic
spectrum (relatively low ZCR): speech signals whose characteristic structure is
a succession of syllabes made of short periods of fricatives and long periods of
vowels present a marked rise in the ZCR during the periods of fricativity, which
do not appear in music signals, which are largely tonal (this however depends on
the musical genre which is considered). Secondly, the energy contour dip mea-
sure characterizes the differences between speech (whose systematic changeovers
between voiced vowels and fricatives produce marked and frequent change in the
energy envelope), and music (which tends to have a more stable energy envelope)
well. However, the algorithm proposed by Saunders is limited in time resolution
(2.4 s). In [48], Scheirer and Slaney proposed a multifeature approach and ex-
amined various powerful classification methods. Their system relied on the 13
following features and, in some cases, their variance: 4 Hz modulation energy
(characterizing the syllabic rate in speech [30]), the percentage of low-energy
frames (more silences are present in speech than in music), the spectral rolloff,
defined as the 95th percentile of the power spectral distribution (good candidate
to discriminate voiced from unvoiced sounds), the spectral centroid (often higher
for music with percussive sounds than for speech whose pitches stay in a fairly
low range), the spectral flux, which is a measure of the fluctuation of the short-
term spectrum (music tends to have a higher rate of spectral flux change than
speech), the zero-crossing rate as in [46], the cepstrum resynthesis residual mag-
nitude (the residual is lower for unvoiced speech than for voiced speech or mu-
sic), and a pulse metric (indicating whether or not the signal contains a marked
beat, as is the case in some popular music). Various classification frameworks
were tested by the authors, a multidimensional Gaussian maximum a posteriori
(MAP) estimator as in [46], a Gaussian mixture model (GMM), a k-nearest-
neighbour estimator (k-NN), and a spatial partitioning scheme (k-d tree), and
all led to similar performances. The best average recognition accuracy using the
spatial partitioning classification was of 94.2% on a frame-by-frame basis, and of
98.6% when integrating on 2.4 s long segments of sound, the latter results being
similar to those obtained by Saunders. Some authors used extensions or corre-
lates of the previous descriptors for the speech/music discrimination task such
as the higher order crossings (HOC) which is the zero-crossing rate of filtered
versions of the signal [37] [20] originally proposed by Kedem [33], the spectral
flatness (quantifying how tonal or noisy a sound is) and the spectral spread (the
second central moment of the spectrum) defined in the MPEG-7 standard [9],
and a rhythmic pulse computed in the MPEG compressed domain [32]. Carey
et al. introduced the use of the fundamental frequency f0 (strongly correlated
to the perceptual attribute of pitch) and its derivative in order to characterize
142 M. Barthet, S. Hargreaves, and M. Sandler
some prosodic aspects of the signals (f0 changes in speech are more evenly dis-
tributed than in music where they are strongly concentrated about zero due to
steady notes, or large due to shifts between notes) [14]. The authors obtained a
recognition accuracy of 96% using the f0 -based features with a Gaussian mix-
ture model classifier. Descriptors quantifying the shape of the spectral envelope
were also widely used, such as the Mel Frequency Cepstral Coefficients (MFCC)
[23] [25] [2], and the Linear Prediction Coefficients (LPC) [23] [1]. El-Maleh et
al. [20] used descriptors quantifying the formant structure of the spectral enve-
lope, the line spectral frequencies (LSF), as in this study (see section 3.1). By
coupling the LSF and HOC features with a quadratic Gaussian classifier, the au-
thors obtained a 95.9% average recognition accuracy with decisions made over 1
s long audio segments, procedure which performed slightly better than the algo-
rithm by Scheirer and Slaney tested on the same dataset (an accuracy increase
of approximately 2%). Contrary to the studies described above that relied on
generative methods, Ramona and Richard [44] developed a discriminative classi-
fication system relying on support vector machines (SVM) and median filtering
post-processing, and compared diverse hierarchical and multi-class approaches
depending on the grouping of the learning classes (speech only, music only, speech
with musical background, and music with singing voice). The most relevant fea-
tures amongst a large collection of about 600 features are selected using the
inertia ratio maximization with feature space projection (IRMFSP) technique
introduced in [42] and integrated on 1 s long segments. The method provided an
F-measure of 96.9% with a feature vector dimension of 50. Those results repre-
sent an error reduction of about 50% compared to the results gathered by the
French ESTER evaluation campaign [22]. As will be further shown in section
5, we obtained performances favorably comparable to those provided by this
algorithm. Surprisingly, all the mentioned studies evaluated the speech/music
classes recognition accuracy, but none, to our knowledge, evaluated the bound-
ary retrieval performance commonly used to evaluate structural segmentation
algorithms [35] (see section 4.3), which we also investigate in this work.
3 Classification Frameworks
Testing audio
Structural
S, D
segmentation
Testing audio
homogeneous segments
Post-processsing Post-processsing
W
(median filtering) (class decision)
Fig. 1. Architecture of the two proposed audio segmentation systems. The tuning pa-
rameters of the systems’ components are also reported: number of line spectral frequen-
cies (LSF), number of codevectors K, latency L for the automatic timbre recognition
module, size of the sliding window W used in the median filtering (post-processing),
maximal number S of segment types, and minimal duration D of segments for the
automatic structural segmentation module.
The two proposed systems rely on the assumption that speech and music can
be discriminated based on their differences in timbre. Exhaustive computational
models of timbre have not yet been found and the common definition used by
scholars remains vague: “timbre is that attribute of auditory sensation in terms
of which a listener can judge that two sounds similarly presented and having the
same loudness and pitch are dissimilar; Timbre depends primarily upon the spec-
trum of the stimulus, but it also depends on the waveform, the sound pressure,
the frequency location of the spectrum, and the temporal characteristics of the
stimulus.” [3]. Research in psychoacoustics [24] [10], [51], analysis/synthesis [45],
144 M. Barthet, S. Hargreaves, and M. Sandler
music perception [4] [5], speech recognition [19], and music information retrieval
[17] have however developed acoustical correlates of timbre characterizing some
of the facets of this complex and multidimensional variable.
a mostly fixed formant1 structure, i.e. zones of high spectral energy (however, in
the case of large pitch changes the formant structure needs to be slightly shifted
for the timbral identity of the sounds to remain unchanged): “The popular no-
tion that a particular timbre depends upon the presence of certain overtones (if
that notion is interpreted as the “relative pitch” theory of timbre) is seen [...] to
lead not to invariance but to large differences in musical timbre with changes in
fundamental frequency. The “fixed pitch” or formant theory of timbre is seen in
those same results to give much better predictions of the minimum differences in
musical timbre with changes in fundamental frequency. The results [...] suggest
that the formant theory may have to be modified slightly. A precise determina-
tion of minimum differences in musical timbre may require a small shift of the
lower resonances, or possibly the whole spectrum envelope, when the fundamen-
tal frequency changes drastically.” [49]. The findings by Slawson have causal
and cognitive explanations. Sounds produced by the voice (spoken or sung) and
most musical instruments present a formant structure closely linked to reso-
nances generated by one or several components implicated in their production
(e.g. the vocal tract for the voice, the body for the guitar, the mouthpiece for
the trumpet). It seems therefore legitimate from the perceptual point of view to
suggest that the auditory system relies on the formant structure of the spectral
envelope to discriminate such sounds (e.g. two distinct male voices of same pitch,
loudness, and duration), as proposed by the “source” or identity mode of timbre
perception hypothesis mentioned earlier.
The timbre models used in this study to discriminate speech and music rely
on features modeling the spectral envelope (see the next section). In these timbre
models, the temporal dynamics of timbre are captured up to a certain extent by
performing signal analysis on successive frames where the signal is assumed to
be stationary, and by the use of hidden markov model (HMM), as described in
section 3.3. Temporal (e.g. attack time) and spectro-temporal parameters (e.g.
spectral flux) have also shown to be major correlates of timbre spaces but these
findings were obtained in studies which did not include speech sounds but only
musical instrument tones either produced on different instruments (e.g. [40]),
or within the same instrument (e.g. [6]). In situations where we discriminate
timbres from various sources either implicitly (e.g. in everyday life’s situations)
or explicitly (e.g. in a controlled experiment situation), it is most probable that
the auditory system uses different acoustical clues depending on the typological
differences of the considered sources. Hence, the descriptors used to account for
timbre differences between musical instruments’ tones may not be adapted for
the discrimination between speech and music sounds. If subtle timbre differences
are possible within a same instrument, large timbre differences are expected to
occur between disparate classes, such as speech, and music, and those are li-
able to be captured by spectral envelope correlates. Music generally being a
mixture of musical instrument sounds playing either synchronously in a poly-
phonic way, or solo, may exhibit complex formant structures induced by its
1
In this article, a formant is considered as being a broad band of enhanced power
present within the spectral envelope.
146 M. Barthet, S. Hargreaves, and M. Sandler
The method is based on the timbre recognition system proposed in [7] and [16],
which we describe in the remainder of this section.
Fig. 2. Automatic timbre recognition system based on line spectral frequencies and
K-means clustering
148 M. Barthet, S. Hargreaves, and M. Sandler
Fig. 3. Podcast ground truth annotations (a), classification results at 1 s intervals (b)
and post-processed results (c)
discrimination relies on the fact that a higher level of similarity is expected be-
tween the various spoken parts one one hand, and between the various music
parts, on the other hand.
The algorithm, implemented as a Vamp plugin [43], is based on a frequency-
domain representation of the audio signal using either a constant-Q transform,
a chromagram or mel-frequency cepstral coefficients (MFCC). For the reasons
mentioned earlier in section 3.1, we chose the MFCCs as underlying features in
this study. The extracted features are normalised in accordance with the MPEG-
7 standard (normalized audio spectrum envelope [NASE] descriptor [34]), by ex-
pressing the spectrum in the decibel scale and normalizing each spectral vector
by the root mean square (RMS) energy envelope. This stage is followed by the
extraction of 20 principal components per block of audio data using principal
component analysis. The 20 PCA components and the RMS envelope consti-
tute a sequence of 21 dimensional feature vectors. A 40-state hidden markov
model (HMM) is then trained on the whole sequence of features (Baum-Welsh
algorithm), each state of the HMM being associated to a specific timbre quality.
After training and decoding (Viterbi algorithm) the HMM, the signal is assigned
a sequence of timbre features according to specific timbre quality distributions
for each possible structural segment. The minimal duration D of expected struc-
tural segments can be tuned. The segmentation is then computed by clustering
timbre quality histograms. A series of histograms are created using a sliding
window and are then grouped into S clusters with an adapted soft K-means
Speech/Music Discrimination Using Timbre Models 151
Automatic Timbre Recognition. Once the signal has been divided into seg-
ments assumed to be homogeneous in timbre, the latter are processed with the
automatic timbre recognition technique described in section 3.2 (see Figure 1(b)).
This yields intermediate classification decisions defined on a short-term basis
(depending on the latency L used in the ATR model).
4 Experiments
Several experiments were conducted to evaluate and compare the performances
of the speech/music discrimination ATR and ASS/ATR methods respectively
presented in sections 3.2 and 3.3. In this section, we first describe the experi-
mental protocols, and the training and testing databases. The evaluation mea-
sures computed to assess the class identification and boundary accuracy of the
systems are then specified.
4.1 Protocols
Influence of the Training Class Taxonomy. In a first set of experiments,
we evaluated the precision of the ATR model according to the taxonomy used to
represent speech and music content in the training data. The classes associated
to the two taxonomic levels schematized in Figure 5 were tested to train the
ATR model. The first level correspond to a coarse division of the audio content
into two classes: speech and music. Given that common spectral differences may
be observed between male and female speech signals due to vocal tract morphol-
ogy changes, and that musical genres are often associated with different sound
textures or timbres due to changes of instrumentation, we sought to establish
whether there was any benefit to be gained by training the LSF/K-means algo-
rithm on a wider, more specific set of classes. Five classes were chosen: two to
represent speech (male speech and female speech), and three to represent music
according to the genre (classical, jazz, and rock & pop). The classifications ob-
tained using the algorithm trained on the second, wider, set of classes are later
152 M. Barthet, S. Hargreaves, and M. Sandler
Fig. 5. Taxonomy used to train the automatic timbre recognition model in the
speech/music discrimination task. The first taxonomic level is associated to a training
stage with two classes: speech and music. The second taxonomic level is associated to
a training stage with five classes: male speech (speech m), female speech (speech f),
classical, jazz, and rock & pop music.
4.2 Database
The training data used in the automatic timbre recognition system consisted of
a number of audio clips extracted from a wide variety of radio podcasts from
BBC 6 Music (mostly pop) and BBC Radio 3 (mostly classical and jazz) emis-
sions. The clips were manually auditioned and then, classified as either speech
or music when the ATR model was trained with two classes, or as male speech,
female speech, classical music, jazz music, and rock & pop music when the ATR
model was trained with five classes. These manual classifications constituted the
ground truth annotations further used in the algorithm evaluations. All speech
was english language, and the training audio clips, whose durations are shown
in Table 1, gathered approximately 30 min. of speech, and 15 min. of music.
For testing purposes, four podcasts different from the ones used for training
(hence containing different speakers and music excerpts) were manually anno-
tated using terms from the following vocabulary: speech, multi-voice speech,
music, silence, jingle, efx (effects), tone, tones, beats. Mixtures of these terms
were also employed (e.g. “speech + music”, to represent speech with background
music). The music class included cases where a singing voice was predominant
(opera and choral music). More detailed descriptions of the podcast material
used for testing are given in Tables 2 and 3.
Table 1. Audio training data durations. Durations are expressed in the following
format: HH:MM:SS (hours:mn:s).
Table 3. Audio testing data durations. Durations are expressed in the following format:
HH:MM:SS (hours:mn:s).
where {.} denotes a set of segments, and —.— their duration. When comparing
the machine estimated segments with the manually annotated ones, any sections
not labelled as speech (male or female), multi-voice speech, or music (classical,
jazz, rock & pop) were disregarded due to their ambiguity (e.g. jingle). The
durations of these disregarded parts are stated in the results section.
correctly detected boundaries (true positives tp), false detections (false positives
f p), and missed detections (false negatives f n) as follows:
tp
P = (2)
tp + f p
tp
R= (3)
tp + f n
Hence, the precision and the recall can be viewed as measures of exactness
and completeness, respectively. As in [35] and [41], the number of true positives
were determined using a tolerance window of duration ΔT = 3 s: a retrieved
boundary is considered to be a “hit” (correct) if its time position l lies within
ΔT ΔT
the range l − ≤ l ≤ l+ . This method to compute the F-measure is also
2 2
used in onset detector evaluation [18] (the tolerance window in the latter case
being much shorter). Before comparing the manually and the machine estimated
boundaries, a post-processing was performed on the ground-truth annotations in
order to remove the internal boundaries between two or more successive segments
whose type was discarded in the classification process (e.g. the boundary between
a jingle and a sound effect section).
In this section, we present and discuss the results obtained for the two sets
of experiments described in section 4.1. In both sets of experiments, all audio
training clips were extracted from 128 kbps, 44.1 kHz, 16 bit stereo mp3 files
(mixed down to mono) and the podcasts used in the testing stage were full
duration mp3 files of the same format.
Table 4. Influence of the training class taxonomy on the performances of the automatic timbre recognition model assessed at the semantic level with the relative correct overlap (RCO) measure

                     ATR model - RCO measure (%)
Podcast              Speech                    Music
                     Two class   Five class    Two class   Five class
1 (Rock & pop)       90.5        91.9          93.7        94.5
2 (Classical)        91.8        93.0          97.8        99.4
3 (Classical)        88.3        91.0          76.1        82.7
4 (Rock & pop)       48.7        63.6          99.8        99.9
Overall              85.2        89.2          96.6        97.8
jazz, and rock & pop). We see from that table that training the ATR model on five classes instead of two improved classification performance in all cases, most notably for the speech classifications of podcast number 4 (an increase of 14.9 percentage points, from 48.7% to 63.6%) and for the music classifications of podcast number 3 (up from 76.1% to 82.7%, an increase of 6.6 percentage points). In all other cases, the increase is more modest, between 0.1 and 2.7 percentage points. The combined results show an increase in RCO of 4 percentage points for speech and 1.2 percentage points for music when training on five classes instead of two.
Analysis Parameters. The automatic timbre recognition model was trained with five classes, since this configuration gave the best RCO performance. Regarding the ATR method, the short-term analysis was performed with a window of 1024 samples, a hop size of 256 samples, and K = 32 codevectors, as in the first set of experiments. However, in this set of experiments the number of line spectral frequencies (LSF) was varied between 8 and 32 in steps of 8, and the durations of the median filtering windows were tuned accordingly, based on experimentation. The automatic structural segmenter Vamp plugin was used with the default window and hop sizes (26460 samples, i.e. 0.6 s, and 8820 samples, i.e. 0.2 s, respectively), parameters chosen according to typical beat lengths in music [35]. Five different numbers of segments S were tested (S ∈ {5, 7, 8, 10, 12}). The best relative correct overlap and boundary retrieval performances were obtained with S = 8 and S = 7, respectively.
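As an illustration of this pipeline (not the authors' implementation), the sketch below trains per-class K-means codebooks on LSF frames and classifies a test frame sequence by minimum quantization distortion, followed by median filtering of the label track; the LSF extraction itself is assumed to be done elsewhere, and the median window length is a placeholder, since the paper tunes it per LSF order.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from scipy.signal import medfilt

def train_codebooks(class_features, k=32):
    # class_features: dict mapping class name -> (n_frames, n_lsf) array of LSF vectors
    return {c: kmeans(feats.astype(float), k)[0] for c, feats in class_features.items()}

def classify_frames(frames, codebooks, median_len=51):
    classes = sorted(codebooks)
    # quantization distortion of every frame against each class codebook
    dists = np.stack([vq(frames.astype(float), codebooks[c])[1] for c in classes])
    labels = np.argmin(dists, axis=0)
    # median filtering removes spurious short-term label switches (median_len is odd)
    smoothed = medfilt(labels.astype(float), median_len).astype(int)
    return [classes[i] for i in smoothed]
```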
Table 5. Comparison of the relative correct overlap performances for the ATR and
ASS/ATR methods, as well as the SVM-based algorithm from [44]. For each method,
the best average result (combining speech and music) is indicated in bold
ATR method (the computation time was not measured in these experiments). The lower performance obtained by the three compared methods for the speech class of the fourth podcast should be put into perspective, given the very small proportion of spoken excerpts within this podcast (see Table 3), which therefore does not affect the overall results much. The good performance obtained with a low-dimensional LSF vector can be explained by the fact that the voice has a limited number of formants, which are therefore well characterized by a small number of line spectral frequencies (LSF = 8 corresponds to the characterization of 4 formants). Improving the recognition accuracy for the speech class reduces the confusion with the music class, which explains the concurrent increase of RCO for the music class when LSF = 8. When considering the class identification accuracy, the ATR method used with a low number of LSFs hence appears attractive, since it is computationally inexpensive relative to the capabilities of modern CPUs (linear predictive filter estimation, computation of 8 LSFs, K-means clustering and distance computation). For feature vectors of higher dimensions, the higher-order LSFs may contain information associated with noise in the case of the voice, which would explain the drop in overall performance obtained with LSF = 16 and LSF = 24. However, the RCOs obtained when LSF = 32 are very close to those obtained when LSF = 8. In this case, the higher number of LSFs may be better suited to capturing the more complex formant structures of music.
outclassed the ATR method regarding the boundary retrieval accuracy. The best overall F-measure of the ASS/ATR method (50.1% with LSF = 8) is approximately 15 percentage points higher than the one obtained with the ATR method (35.1% for LSF = 16). This shows the benefit of using the automatic structural segmenter prior to the timbre recognition stage to locate the transitions between the speech and music sections. As in the previous set of experiments, the best configuration is obtained with a small number of LSF features (ASS/ATR method with LSF = 8), which stems from the fact that the boundary positions are a consequence of the classification decisions. For all the tested podcasts, the ASS/ATR method yields a better precision than the SVM-based algorithm. The most notable difference occurs for the second podcast, where the precision of the ASS/ATR method (72.7%) is approximately 14 percentage points higher than the one obtained with the SVM-based algorithm (58.2%). The resulting increase in overall precision achieved with the ASS/ATR method (62.3%) compared with the SVM-based method (47.0%) is approximately 15 percentage points. The SVM-based method, however, obtains a better overall boundary recall (54.8%) than the ASS/ATR method (42.4%), which makes the boundary F-measures of the two methods very close (50.6% and 50.1%, respectively).
audio podcasts. The first method (ATR) relies on automatic timbre recognition
(LSF/K-means) and median filtering. The second method (ASS/ATR) performs
an automatic structural segmentation (MFCC, RMS / HMM, K-means) before
applying the timbre recognition system. The algorithms were tested with more than 2.5 hours of speech and music content extracted from popular and classical music podcasts from the BBC. Some of the music tracks contained a predominant singing voice, which can be a source of confusion with the spoken voice. The
algorithms were evaluated both at the semantic level to measure the quality of
the retrieved segment-type labels (classification relative correct overlap), and at
the temporal level to measure the accuracy of the retrieved boundaries between
sections (boundary retrieval F-measure). Both methods obtained similar and relatively high segment-type labeling performances. The ASS/ATR method led to an RCO of 92.8% for speech and 96.2% for music, yielding an average performance of 94.5%. The boundary retrieval performances were higher for the ASS/ATR method (F-measure = 50.1%), showing the benefit of using a structural segmentation technique to locate transitions between different timbral qualities. The results were compared against the SVM-based algorithm proposed in [44], which provides a good benchmark for state-of-the-art speech/music discriminators. The performances obtained by the ASS/ATR method were approximately 3 percentage points lower than those obtained with the SVM-based method for the segment-type labeling evaluation, but led to a better boundary retrieval precision (approximately 15 percentage points higher).
The boundary retrieval scores were clearly lower for the three compared methods, relative to the segment-type labeling performances, which were fairly high (up to 100% correct identification in some cases). Future work will be dedicated to refining the accuracy of the section boundaries, either by performing a new analysis of the feature variations locally around the retrieved boundaries, or by including descriptors complementary to the timbre ones, e.g. rhythmic information such as tempo, whose fluctuations around speech/music transitions may provide additional cues for detecting them accurately. The discrimination of intricate mixtures of music, speech, and sometimes strong post-production sound effects (e.g. jingles) will also be investigated.
Acknowledgments. This work was partly funded by the Musicology for the
Masses (M4M) project (EPSRC grant EP/I001832/1, http://www.elec.qmul.
ac.uk/digitalmusic/m4m/), the Online Music Recognition and Searching 2
(OMRAS2) project (EPSRC grant EP/E017614/1, http://www.omras2.org/),
and a studentship (EPSRC grant EP/505054/1). The authors wish to thank
Matthew Davies from the Centre for Digital Music for sharing his F-measure
computation Matlab toolbox, as well as György Fazekas for fruitful discussions
on the structural segmenter. Many thanks to Mathieu Ramona from the Institut
de Recherche et Coordination Acoustique Musique (IRCAM) for sending us the
results obtained with his speech/music segmentation algorithm.
References
1. Ajmera, J., McCowan, I., Bourlard, H.: Robust HMM-Based Speech/Music Seg-
mentation. In: Proc. ICASSP 2002, vol. 1, pp. 297–300 (2002)
2. Alexandre-Cortizo, E., Rosa-Zurera, M., Lopez-Ferreras, F.: Application of
Fisher Linear Discriminant Analysis to Speech Music Classification. In: Proc.
EUROCON 2005, vol. 2, pp. 1666–1669 (2005)
3. ANSI: USA Standard Acoustical Terminology. American National Standards In-
stitute, New York (1960)
4. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Acoustical Correlates
of Timbre and Expressiveness in Clarinet Performance. Music Perception 28(2),
135–153 (2010)
5. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Analysis-by-Synthesis
of Timbre, Timing, and Dynamics in Expressive Clarinet Performance. Music Per-
ception 28(3), 265–278 (2011)
6. Barthet, M., Guillemain, P., Kronland-Martinet, R., Ystad, S.: From Clarinet Con-
trol to Timbre Perception. Acta Acustica United with Acustica 96(4), 678–689
(2010)
7. Barthet, M., Sandler, M.: Time-Dependent Automatic Musical Instrument Recog-
nition in Solo Recordings. In: 7th Int. Symposium on Computer Music Modeling
and Retrieval (CMMR 2010), Malaga, Spain, pp. 183–194 (2010)
8. Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., Sandler, M.: A
Tutorial on Onset Detection in Music Signals. IEEE Transactions on Speech and
Audio Processing (2005)
9. Burred, J.J., Lerch, A.: Hierarchical Automatic Audio Signal Classification. Journal
of the Audio Engineering Society 52(7/8), 724–739 (2004)
10. Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic Correlates of Timbre
Space Dimensions: A Confirmatory Study Using Synthetic Tones. J. Acoust. Soc.
Am. 118(1), 471–482 (2005)
11. Cannam, C.: Queen Mary University of London: Sonic Annotator, http://omras2.
org/SonicAnnotator
12. Cannam, C.: Queen Mary University of London: Sonic Visualiser, http://www.
sonicvisualiser.org/
13. Cannam, C.: Queen Mary University of London: Vamp Audio Analysis Plugin
System, http://www.vamp-plugins.org/
14. Carey, M., Parris, E., Lloyd-Thomas, H.: A Comparison of Features for Speech,
Music Discrimination. In: Proc. of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), vol. 1, pp. 149–152 (1999)
15. Castellengo, M., Dubois, D.: Timbre ou Timbres? Propriété du Signal, de
l’Instrument, ou Construction Cognitive (Timbre or Timbres? Property of the
Signal, the Instrument, or Cognitive Construction?). In: Proc. of the Conf. on
Interdisciplinary Musicology (CIM 2005), Montréal, Québec, Canada (2005)
16. Chétry, N., Davies, M., Sandler, M.: Musical Instrument Identification using LSF
and K-Means. In: Proc. AES 118th Convention (2005)
17. Childers, D., Skinner, D., Kemerait, R.: The Cepstrum: A Guide to Processing.
Proc. of the IEEE 65, 1428–1443 (1977)
18. Davies, M.E.P., Degara, N., Plumbley, M.D.: Evaluation Methods for Musical Au-
dio Beat Tracking Algorithms. Technical report C4DM-TR-09-06, Queen Mary
University of London, Centre for Digital Music (2009), http://www.eecs.qmul.
ac.uk/~matthewd/pdfs/DaviesDegaraPlumbley09-evaluation-tr.pdf
19. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Mono-
syllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing ASSP-28(4), 357–366 (1980)
20. El-Maleh, K., Klein, M., Petrucci, G., Kabal, P.: Speech/Music Discrimination for
Multimedia Applications. In: Proc. ICASSP 2000, vol. 6, pp. 2445–2448 (2000)
21. Fazekas, G., Sandler, M.: Intelligent Editing of Studio Recordings With the Help
of Automatic Music Structure Extraction. In: Proc. of the AES 122nd Convention,
Vienna, Austria (2007)
22. Galliano, S., Georois, E., Mostefa, D., Choukri, K., Bonastre, J.F., Gravier, G.:
The ESTER Phase II Evaluation Campaign for the Rich Transcription of French
Broadcast News. In: Proc. Interspeech (2005)
23. Gauvain, J.L., Lamel, L., Adda, G.: Audio Partitioning and Transcription for
Broadcast Data Indexation. Multimedia Tools and Applications 14(2), 187–200
(2001)
24. Grey, J.M., Gordon, J.W.: Perception of Spectral Modifications on Orchestral In-
strument Tones. Computer Music Journal 11(1), 24–31 (1978)
25. Hain, T., Johnson, S., Tuerk, A., Woodland, P.C., Young, S.: Segment Generation
and Clustering in the HTK Broadcast News Transcription System. In: Proc. of the
DARPA Broadcast News Transcription and Understanding Workshop, pp. 133–137
(1998)
26. Hajda, J.M., Kendall, R.A., Carterette, E.C., Harshberger, M.L.: Methodologi-
cal Issues in Timbre Research. In: Deliége, I., Sloboda, J. (eds.) Perception and
Cognition of Music, 2nd edn., pp. 253–306. Psychology Press, New York (1997)
27. Handel, S.: Hearing. In: Timbre Perception and Auditory Object Identification,
2nd edn., pp. 425–461. Academic Press, San Diego (1995)
28. Harte, C.: Towards Automatic Extraction of Harmony Information From Music
Signals. Ph.D. thesis, Queen Mary University of London (2010)
29. Helmholtz, H.v.: On the Sensations of Tone. Dover, New York (1954); based on the 1877 edition. English translation, with notes and appendix, by A.J. Ellis
30. Houtgast, T., Steeneken, H.J.M.: The Modulation Transfer Function in Room
Acoustics as a Predictor of Speech Intelligibility. Acustica 28, 66–73 (1973)
31. Itakura, F.: Line Spectrum Representation of Linear Predictive Coefficients of
Speech Signals. J. Acoust. Soc. Am. 57(S35) (1975)
32. Jarina, R., O’Connor, N., Marlow, S., Murphy, N.: Rhythm Detection For Speech-
Music Discrimination In MPEG Compressed Domain. In: Proc. of the IEEE 14th
International Conference on Digital Signal Processing (DSP), Santorini (2002)
33. Kedem, B.: Spectral Analysis and Discrimination by Zero-Crossings. Proc.
IEEE 74, 1477–1493 (1986)
34. Kim, H.G., Berdahl, E., Moreau, N., Sikora, T.: Speaker Recognition Using MPEG-
7 Descriptors. In: Proc. of EUROSPEECH (2003)
35. Levy, M., Sandler, M.: Structural Segmentation of Musical Audio by Constrained
Clustering. IEEE. Transac. on Audio, Speech, and Language Proc. 16(2), 318–326
(2008)
36. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE
Transactions on Communications 28, 702–710 (1980)
37. Lu, L., Jiang, H., Zhang, H.J.: A Robust Audio Classification and Segmentation
Method. In: Proc. ACM International Multimedia Conference, vol. 9, pp. 203–211
(2001)
38. Marozeau, J., de Cheveigné, A., McAdams, S., Winsberg, S.: The Dependency
of Timbre on Fundamental Frequency. Journal of the Acoustical Society of
America 114(5), 2946–2957 (2003)
39. Mauch, M.: Automatic Chord Transcription from Audio using Computational
Models of Musical Context. Ph.D. thesis, Queen Mary University of London (2010)
40. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual
Scaling of Synthesized Musical Timbres: Common Dimensions, Specificities, and
Latent Subject Classes. Psychological Research 58, 177–192 (1995)
41. Music Information Retrieval Evaluation Exchange Wiki: Structural Segmentation
(2010), http://www.music-ir.org/mirex/wiki/2010:Structural_Segmentation
42. Peeters, G.: Automatic Classification of Large Musical Instrument Databases Us-
ing Hierarchical Classifiers with Inertia Ratio Maximization. In: Proc. AES 115th
Convention, New York (2003)
43. Queen Mary University of London: QM Vamp Plugins, http://www.omras2.org/
SonicAnnotator
44. Ramona, M., Richard, G.: Comparison of Different Strategies for a SVM-Based
Audio Segmentation. In: Proc. of the 17th European Signal Processing Conference
(EUSIPCO 2009), pp. 20–24 (2009)
45. Risset, J.C., Wessel, D.L.: Exploration of Timbre by Analysis and Synthesis. In:
Deutsch, D. (ed.) Psychology of Music, 2nd edn. Academic Press, London (1999)
46. Saunders, J.: Real-Time Discrimination of Broadcast Speech Music. In: Proc.
ICASSP 1996, vol. 2, pp. 993–996 (1996)
47. Schaeffer, P.: Traité des Objets Musicaux (Treatise on Musical Objects). Éditions du Seuil (1966)
48. Scheirer, E., Slaney, M.: Construction and Evaluation of a Robust Multifeature
Speech/Music Discriminator. In: Proc. ICASSP 1997, vol. 2, pp. 1331–1334 (1997)
49. Slawson, A.W.: Vowel Quality and Musical Timbre as Functions of Spectrum En-
velope and Fundamental Frequency. J. Acoust. Soc. Am. 43(1) (1968)
50. Sundberg, J.: Articulatory Interpretation of the ‘Singing Formant’. J. Acoust. Soc.
Am. 55, 838–844 (1974)
51. Terasawa, H., Slaney, M., Berger, J.: A Statistical Model of Timbre Perception.
In: ISCA Tutorial and Research Workshop on Statistical And Perceptual Audition
(SAPA 2006), pp. 18–23 (2006)
Computer Music Cloud
2 Cloud Computing
IT continues to evolve. Cloud computing, a new term defined in various different ways [8], involves a new paradigm in which computing infrastructure and software are provided as a service [5]. These services themselves have been referred to as Software as a Service (SaaS). Google Apps is a clear example of SaaS [10].
Computing infrastructure is also offered as a service (IaaS), enabling users to run their own software. Several providers currently offer resizable compute capacity as a public cloud, such as the Amazon Elastic Compute Cloud (EC2) [4] and the Google AppEngine [9].
This situation offers new possibilities for both software developers and users. For instance, this paper was written and revised in GoogleDocs [11], a Google SaaS application.
3 EvMusic Representation
The first step when planning a composition system should be choosing a proper music representation. The chosen representation will set the boundaries of the system’s capabilities. As a result, our CM research developed a solid and versatile representation for music composition. The EvMetamodel [3] was used to model the music knowledge representation behind EvMusic. A prior, in-depth analysis of music knowledge was carried out to ensure that the representation meets music composition requirements. This multilevel representation is not only compatible with traditional notation but also capable of representing highly abstract music elements. It can also represent symbolic pitch entities [1] from both music theory and algorithmic composition, keeping the door open to representing music elements of a higher symbolic level conceived by the composer’s creativity. It is based on real composition experience and was designed to support CM composition, including experiments in musical artificial intelligence (MAI).
The current music representation is described in a platform-independent UML format [15]. It is therefore not confined to its original LISP system, but can be used in any system or language: a valuable feature when approaching a cloud system. Fig. 2 is a UML class diagram of the representation core of the EvMetamodel, the base representation for the time dimension. The three main classes are shown: event, parameter and dynamic object. High-level music elements are represented as subclasses of metaevent, the interface that provides the develop functionality. The special dynamic object changes is also shown. This is a very useful option for the graphical editing of parameters, since it represents a dynamic object as a sequence of parameter-change events which can be easily moved in time.
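Purely as an illustration of the classes just described (the actual EvMusic implementation is in LISP, and all class and attribute names below are hypothetical), the core of Fig. 2 could be sketched as follows:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DynamicObject:
    """A parameter value that evolves over time."""
    def value_at(self, t: float):
        raise NotImplementedError

@dataclass
class Parameter:
    name: str                 # e.g. "pitch", "dynamics"
    value: object             # a static value or a DynamicObject

@dataclass
class Event:
    time: float               # position on the time dimension
    duration: float
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Changes(DynamicObject):
    """Dynamic object expressed as a sequence of parameter-change events,
    which makes it easy to move the changes in time when editing graphically."""
    change_events: List[Event] = field(default_factory=list)

class MetaEvent(Event):
    """High-level music element; develop() expands it into lower-level events."""
    def develop(self) -> List[Event]:
        raise NotImplementedError
```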
needed tool or library was not available at that time. At other times the available tool was suitable at that moment, but offered no long-term availability. As stated above, the recent shift of IT towards cloud computing brings new opportunities for evolution. In CM, system development can benefit from computing distribution and specialization. Splitting the system into several specialized services avoids the limitations imposed by a single programming language or platform. Therefore, individual music services can be developed and evolved independently of the others. Each component service can be implemented on the most appropriate platform for its particular task, regardless of the other services, without being constrained by the technologies required to implement them. In the previous paradigm, all services were performed by a single system, and the selection of technologies for a particular task affected or even constrained the implementation of other tasks. The new approach frees the system design, making it more platform-independent. In addition, widely available tools can be used for specific tasks, thus benefiting from tool development in other areas such as database storage and web application design.
Input. This group includes the services aimed particularly at incorporating new
music elements and translating them from other input formats.
Agents. These are services capable of inspecting and modifying the music composition, as well as introducing new elements. This type includes human user interfaces, but may also include other intelligent elements taking part in the composition by introducing decisions, suggestions or modifications. In our prototype, we have developed a web application acting as a user interface for editing music elements. This service is described in the next section.
Storage. At this step, only music object instances and relations are stored,
but the hypothetical model also includes procedural information. Three music
storage services are implemented in the prototype. Main lib stores shared music
elements as global definitions. This content may be seen as some kind of mu-
sic culture. User-related music objects are stored in the user lib. These include
music objects defined by the composer which can be reused in several parts and
complete music sections, or represent the composer’s style. The editing storage service is provided as temporary storage for the editing session. The piece is progressively composed in this database, which holds the composition environment (i.e., everything related to the piece under composition). This is the environment in which several composing agents can integrate and interact by reading and writing on the database-stored score. All three storage services in this experiment were written in Python and deployed in the cloud with Google AppEngine.
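The sketch below only illustrates the kind of read/write interface such a database-stored score could expose to the composing agents; it uses an in-memory store instead of the actual Google AppEngine datastore, and all names are hypothetical.

```python
import json
import threading

class EditingStorage:
    """Temporary storage for an editing session: a score shared by several agents."""
    def __init__(self):
        self._objects = {}            # object id -> MusicJSON-like dict
        self._lock = threading.Lock()

    def put(self, obj_id, music_object):
        with self._lock:
            self._objects[obj_id] = music_object

    def get(self, obj_id):
        with self._lock:
            return self._objects.get(obj_id)

    def export_score(self):
        """Serialize the whole editing session, e.g. for local saving."""
        with self._lock:
            return json.dumps(self._objects)

# A user-interface agent and an algorithmic agent both write to the same store:
store = EditingStorage()
store.put("motive1", {"type": "motive", "events": [{"pitch": 60, "duration": 1}]})
print(store.export_score())
```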
[Figure: the editor web application is organized in three layers: proxy & data, component layer, and DOM.]
The remote storage service is synchronized with the edits and updates that the editor writes in its proxy. Several editors can share the same proxy, so all listening editors are updated when the data are modified in the proxy. The intermediate layer is a symbolic zone for all components. It includes both graphic interface components, such as editor windows and container views, and robjects: representations of the objects currently being edited. Interface components are subclassed from ExtJS components. Every editor is displayed in its own window on the working desktop and optionally contains a contentView displaying its child objects as robjects.
Fig. 6 is a browser-window capture showing the working desktop and some editor windows. The application menu is shown in the lower left-hand corner, including the user settings. The central area of the capture shows a diatonic sequence editor
based on our TclTk Editor [1]. A list editor and a form-based note editor are also
shown. In the third layer or DOM (Document Object Model) [19], all components
are rendered as DOM elements (i.e., HTML document elements to be visualized).
The code shows four sections in the content. Library is an array of libraries to be loaded with object definitions. Both main and user libraries can be addressed. The following section includes local definitions of objects. As an example, a motive and a chord type are defined. The next section establishes instrumentation assignments by means of the arrangement object role. The last section is the score itself, where all events are placed in a tree structure using parts. Using MusicJSON as the intermediate communication format enables us to connect several music services, forming a cloud composition system.
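Purely as an illustration of this four-section layout (the normative MusicJSON schema is defined in [2]; all field names and values below are hypothetical), such a document could be assembled and serialized as follows:

```python
import json

piece = {
    "library": ["main_lib", "user_lib"],        # libraries of shared definitions to load
    "definitions": {                            # local object definitions
        "motive1": {"type": "motive",
                    "events": [[0.0, "C4", 0.5], [0.5, "E4", 0.5]]},
        "maj7": {"type": "chord", "intervals": [0, 4, 7, 11]},
    },
    "arrangement": {                            # instrumentation assignments by role
        "melody": "flute",
        "accompaniment": "piano",
    },
    "score": {                                  # events placed in a tree of parts
        "parts": [
            {"role": "melody", "events": [{"ref": "motive1", "time": 0.0}]},
            {"role": "accompaniment",
             "events": [{"ref": "maj7", "time": 0.0, "root": "C3"}]},
        ],
    },
}

print(json.dumps(piece, indent=2))   # the document exchanged between music services
```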
7 Conclusion
The present paper puts forward an experiment in music composition under a distributed computation approach as a viable solution for Computer Music Composition in the Cloud. The system is split into several music services hosted by common IaaS providers such as Google or Amazon. Different music systems can be built by the joint operation of some of these music services in the cloud.
In order to cooperate and deal with music objects, each service in the music cloud must understand the same music knowledge. The music knowledge representation they share must therefore be standardized. The EvMusic representation is proposed for this purpose, since it is a solid multilevel representation successfully tested in real CM compositions in recent years.
Furthermore, MusicJSON is proposed as an exchange data format between
services. Example descriptions of music elements, as well as a file format for
local saving of a musical composition, are given. A graphic environment is also
proposed for the creation of user interfaces for object editing as a web application.
As an example, the EvEditor application is described.
This CMC approach opens up multiple possibilities for derivative work. New music creation interfaces can be developed as web applications, benefiting from upcoming web technologies such as the promising HTML5 standard [20]. The described music cloud, together with the EvMusic representation, provides a ground environment for MAI research, where specialised agents can cooperate in a music composition environment sharing the same music representation.
References
1. Alvaro, J.L.: Symbolic Pitch: Composition Experiments in Music Representation. Research Report, http://cml.fauno.org/symbolicpitch.html (retrieved December 10, 2010; last viewed February 2011)
2. Alvaro, J.L., Barros, B.: MusicJSON: A Representation for the Computer Mu-
sic Cloud. In: Proceedings of the 7th Sound and Music Computer Conference,
Barcelona (2010)
3. Alvaro, J.L., Miranda, E.R., Barros, B.: Music knowledge analysis: Towards an
efficient representation for composition. In: Marı́n, R., Onaindı́a, E., Bugarı́n, A.,
Santos, J. (eds.) CAEPIA 2005. LNCS (LNAI), vol. 4177, pp. 331–341. Springer,
Heidelberg (2006)
Abstract Sounds and Their Applications
1 Introduction
How do sounds convey meaning? How can acoustic characteristics that con-
vey the relevant information in sounds be identified? These questions interest
researchers within various research fields such as cognitive neuroscience, musi-
cology, sound synthesis, sonification, etc. Recognition of sound sources, identi-
fication, discrimination and sonification deal with the problem of linking signal
properties and perceived information. In several domains (linguistics, music analysis), this problem is known as “semiotics” [21]. The analysis-by-synthesis approach [28] has made it possible to understand some important features that characterize the sounds of vibrating objects or of interactions between objects. A similar
approach was also adopted in [13] where the authors use vocal imitations in
order to study human sound source identification with the assumption that vo-
cal imitations are simplifications of original sounds that still contain relevant
information.
Recently, there has been an important development in the use of sounds to
convey information to a user (of a computer, a car, etc.) within a new research
community called auditory display [19] which deals with topics related to sound
design, sonification and augmented reality. In such cases, it is important to use
sounds that are meaningful independently of cultural references, taking into account that sounds are presented through loudspeakers concurrently with other audio/visual information.
Depending on the research topic, authors have focused on different sound categories (i.e. speech, environmental sounds, music or calibrated synthesized stimuli). In [18], the author proposed a classification of everyday sounds according to the physical interactions from which the sounds originate. When working within the synthesis and/or sonification domains, the aim is often to reproduce the acoustic properties responsible for the attribution of meaning, and thus sound categories can be considered from the point of view of semiotics, i.e. focusing on the information that can be gathered from sounds.
In this way, we considered a specific category of sounds that we call “abstract
sounds”. This category includes any sound that cannot be associated with an
identifiable source. It includes environmental sounds that cannot be easily iden-
tified by listeners or that give rise to many different interpretations depending on
listeners and contexts. It also includes synthesized sounds, and laboratory gener-
ated sounds if they are not associated with a clear origin. For instance, alarm or
warning sounds cannot be considered as abstract sounds. In practice, recordings
with a microphone close to the sound source and some synthesis methods like
granular synthesis are especially efficient for creating abstract sounds. Note that
in this paper, we mainly consider acoustically complex stimuli since they best
meet our needs in the different applications (as discussed further).
Various labels that refer to abstract sounds can be found in the literature: “con-
fused” sounds [6], “strange” sounds [36], “sounds without meaning” [16]. Con-
versely, [34] uses the term “source-bonded” and the expression “source bonding”
for “the natural tendency to relate sounds to supposed sources and causes”.
Chion introduced “acousmatic sounds” [9] in the context of cinema and audio-
visual applications with the following definition: “sound one hears without seeing
their originating cause - an invisible sound source” (for more details see section 2).
The most common expression is “abstract sounds” [27,14,26], particularly within the domain of auditory display, in connection with “earcons” [7]. “Abstract” used as an adjective means “based on general ideas and not on any particular real person, thing or situation” and also “existing in thought or as an idea but not having a physical reality”1. For sounds, we can consider another definition, used for art: “not representing people or things in a realistic way”1. “Abstract” as a noun is “a short piece of writing containing the main ideas in a document”1 and thus shares the idea of essential attributes, which is suitable in
the context of semiotics. In [4], authors wrote: “Edworthy and Hellier (2006)
suggested that abstract sounds can be interpreted very differently depending on
the many possible meanings that can be linked to them, and in large depending
on the surrounding environment and the listener.”
In fact, there is a general agreement for the use of the adjective “abstract”
applied to sounds that express both ideas of source recognition and different
possible interpretations.
1 Definitions from http://dictionary.cambridge.org/
This paper will first present the existing framework for the use of abstract
sounds by electroacoustic music composers and researchers. We will then discuss
some important aspects that should be considered when conducting listening
tests with a special emphasis on the specificities of abstract sounds. Finally, three
practical examples of experiments with abstract sounds in different research
domains will be presented.
Even if the term “abstract sounds” was not used in the context of electroacoustic
music, it seems that this community was one of the first to consider the issue
related to the recognition of sound sources and to use such sounds. In 1966,
P. Schaeffer, who was both a musician and a researcher, wrote the Traité des
objets musicaux [29], in which he reported more than ten years of research on
electroacoustic music. With a multidisciplinary approach, he intended to carry
out fundamental music research that included both “Concrète”2 and traditional
music. One of the first concepts he introduced was the so-called “acousmatic” listening, related to the experience of listening to a sound without paying attention to the source or the event. The word “acousmatic” has given rise to many discussions and is now mainly employed to describe a musical trend. Discussion about “acousmatic” listening was kept alive by a fundamental problem in Concrète music. Indeed, for music composers the problem is to create new meaning from sounds that already carry information about their origins. In compositions where sounds are organized according to their intrinsic properties, thanks to the acousmatic approach, information on the origins of sounds is still
There was an important divergence of points of view between Concrète and
Elektronische music (see [10] for a complete review), since the Elektronische
music composers used only electronically generated sounds and thus avoided the
problem of meaning [15]. Both Concrète and Elektronische music have developed
a research tradition on acoustics and perception, but only Schaeffer adopted a
scientific point of view. In [11], the author wrote: “Schaeffer’s decision to use
recorded sounds was based on his realization that such sounds were often rich
in harmonic and dynamic behaviors and thus had the largest potential for his
project of musical research”. This work was of importance for electroacoustic musicians, but is almost unknown to researchers in auditory perception, since there is no published translation of his book except for concomitant works [30] and Chion’s Guide des objets musicaux3. As reported in [12], translating Schaeffer’s writing is extremely difficult since he used neologisms and very specific
2 The term “concrete” refers to a composition method based on concrete material, i.e. recorded or synthesized sounds, in opposition to “abstract” music, which is composed in an abstract manner, i.e. from ideas written on a score, and becomes “concrete” afterwards.
3 Translation by J. Dack, available at http://www.ears.dmu.ac.uk/spip.php?page=articleEars&id_article=3597
Fig. 1. Schaeffer’s typology. Note that some column labels are redundant since the
table must be read from center to borders. For instance, the “Non existent evolution”
column in the right part of the table corresponds to endless iterations whereas the
“Non existent evolution” column in the left part concerns sustained sounds (with no
amplitude variations).
Translation from [12]
meanings of French words. However, there has recently been a growing interest in this book, in particular in the domain of music information retrieval, for morphological sound description [27,26,5]. The authors indicate that in the case of what they call “abstract” sounds, classical approaches based on sound source recognition are not relevant, and thus base their algorithms on Schaeffer’s morphology and typology classifications.
Morphology and typology have been introduced as analysis and creation tools
for composers as an attempt to construct a music notation that includes electroa-
coustic music and therefore any sound. The typology classification (cf. Figure 1) is based on a characterization of spectral (mass) and dynamical (facture4) “profiles” of sounds with respect to their complexity, and consists of twenty-eight categories. There are nine central categories of “balanced” sounds for which the variations are neither too rapid and random nor too slow or nonexistent. These nine categories include three facture profiles (sustained, impulsive or iterative) and three mass profiles (tonic, complex and varying). On both sides of the “balanced objects” in the table, there are nineteen additional categories for which mass and facture profiles are very simple/repetitive or vary a lot.
Note that some automatic classification methods are available [26]. In [37] the
authors proposed an extension of Schaeffer’s typology that includes graphical
notations.
Since the 1950s, electroacoustic music composers have addressed the problem
of meaning of sounds and provided an interesting tool for classification of sounds
with no a priori differentiation on the type of sound. For sound perception
research, a classification of sounds according to these categories may be useful
4 As discussed in [12], even if facture is not a common English word, there is no better translation from French.
since they are suitable for any sound. The next section will detail the use of such
classification for the design of listening tests.
3.1 Stimuli
It is common to assume that perception differs as a function of sound categories (e.g. speech, environmental sounds, music). Moreover, these categories are often the underlying elements that define a research area. Consequently, it is difficult to determine a general property of human perception based on results collected from different studies. For instance, results concerning loudness obtained with elementary synthesized stimuli (sinusoids, noise, etc.) cannot be directly transferred to complex environmental sounds, as reported in [31]. Furthermore, listeners’ judgements might differ for sounds belonging to the same category. For instance, within the environmental sound category, [14] have shown specific categorization strategies for sounds that involve human activity.
When there is no hypothesis regarding the signal properties, it is important to gather sounds that present a large variety of acoustic characteristics, as discussed in [33]. Schaeffer’s typology offers an objective selection tool that can help the experimenter construct a very general sound corpus, representative of most existing sound characteristics, by covering all the typology categories. As a comparison, environmental sounds can be classified only in certain rows of Schaeffer’s typology categories (mainly the “balanced” objects). In addition, abstract sounds may constitute a good compromise in terms of acoustic properties between elementary (sinusoids, noise, etc.) and ecological (speech, environmental sounds and music) stimuli.
A corpus of abstract sounds can be obtained in different ways. Many databases available for audiovisual applications contain such sounds (see [33]). Different synthesis techniques (such as granular or FM synthesis) are also effective for creating abstract sounds. In [16] and further works [38,39], the authors presented techniques to transform any recognizable sound into an abstract sound while preserving several signal characteristics. Conversely, many transformations drastically alter the original (environmental or vocal) sounds when important acoustic attributes are modified. For instance, [25] has shown that applying high- and low-pass filtering influences the perceived naturalness of speech and music sounds.
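As a minimal sketch of how granular synthesis can turn a recording into such an abstract texture (parameter values are arbitrary examples, not those used in the cited studies), short windowed grains drawn from random positions of a source signal can be overlap-added at random output times:

```python
import numpy as np

def granulate(source, sr, out_dur=5.0, grain_dur=0.05, density=200, seed=0):
    """source: mono float array at sampling rate sr, longer than one grain."""
    rng = np.random.default_rng(seed)
    grain_len = int(grain_dur * sr)
    window = np.hanning(grain_len)
    out = np.zeros(int(out_dur * sr) + grain_len)
    for _ in range(int(density * out_dur)):
        src_pos = rng.integers(0, len(source) - grain_len)   # random read position
        out_pos = rng.integers(0, len(out) - grain_len)      # random write position
        out[out_pos:out_pos + grain_len] += window * source[src_pos:src_pos + grain_len]
    return out / (np.max(np.abs(out)) + 1e-12)               # simple normalization
```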
Since abstract sounds do not convey a univocal meaning, it is possible to use them in different ways according to the aim of the experiment. For instance, the same sound corpus can be evaluated in different contexts (by drawing the listener’s
3.2 Procedure
randomly. This step allowed us to validate the abstract sounds since no label refer-
ring to the actual source was given. Indeed, when listeners were asked to explicitly label abstract sounds, the labels collected were varied and more related to sound quality. In a first experiment, a written word (prime) was visually presented
before a sound (target) and subjects had to decide whether or not the sound and
the word fit together. In a second experiment, presentation order was reversed (i.e.
sound presented before word). Results showed that participants were able to evalu-
ate the semiotic relation between the prime and the target in both sound-word and
word-sound presentations with relatively low inter-subject variability and good
consistency (see [32] for details on experimental data and related analysis). This
result indicated that abstract sounds are suitable for studying conceptual process-
ing. Moreover, their contextualization by the presentation of a word reduced the
variability of interpretations and led to a consensus between listeners. The study
also revealed similarities in the electrophysiological patterns (Event Related Po-
tentials) between abstract sounds and word targets, supporting the assumption
that similar processing is involved for linguistic and non-linguistic sounds.
5 Conclusion
In this paper, we presented the advantages of using abstract sounds in audio
and perception research based on a review of studies in which we exploited their
distinctive features. The richness of abstract sounds in terms of their acoustic
characteristics and potential evocations open various perspectives. Indeed, they
are generally perceived as “unrecognizable”, “synthetic” and “bizarre” depend-
ing on context and task and these aspects can be relevant to help listeners to
focus on the intrinsic properties of sounds, to orient the type of listening, to
evoke specific emotions or to better investigate individual differences. Moreover,
they constitute a good compromise between elementary and ecological stimuli.
We addressed the design of the sound corpus and of specific procedures for
listening tests using abstract sounds. In auditory perception research, sound
categories based on well-identified sound sources are most often considered (verbal/non-verbal sounds, environmental sounds, music). The use of abstract sounds may allow more general sound categories to be defined, based on other criteria such as listeners’ evocations or intrinsic sound properties. Based on empirical research from electroacoustic music trends, the sound typology proposed by P. Schaeffer should enable the definition of such new sound categories and may be relevant for future listening tests including any sound. Otherwise, since abstract sounds
References
1. Association, A.P.: The Diagnostic and Statistical Manual of Mental Disorders,
Fourth Edition (DSM-IV). American Psychiatric Association (1994), http://www.
psychiatryonline.com/DSMPDF/dsm-iv.pdf (last viewed February 2011)
2. Ballas, J.A.: Common factors in the identification of an assortment of brief every-
day sounds. Journal of Experimental Psychology: Human Perception and Perfor-
mance 19, 250–267 (1993)
3. Bentin, S., McCarthy, G., Wood, C.C.: Event-related potentials, lexical decision
and semantic priming. Electroencephalogr Clin. Neurophysiol. 60, 343–355 (1985)
4. Bergman, P., Skold, A., Vastfjall, D., Fransson, N.: Perceptual and emotional cat-
egorization of sound. The Journal of the Acoustical Society of America 126, 3156–
3167 (2009)
5. Bloit, J., Rasamimanana, N., Bevilacqua, F.: Towards morphological sound de-
scription using segmental models. In: DAFX, Milan, Italie (2009)
6. Bonebright, T.L., Miner, N.E., Goldsmith, T.E., Caudell, T.P.: Data collection
and analysis techniques for evaluating the perceptual qualities of auditory stimuli.
ACM Trans. Appl. Percept. 2, 505–516 (2005)
7. Bonebright, T.L., Nees, M.A.: Most earcons do not interfere with spoken passage
comprehension. Applied Cognitive Psychology 23, 431–445 (2009)
8. Bregman, A.S.: Auditory Scene Analysis. The MIT Press, Cambridge (1990)
9. Chion, M.: Audio-vision, Sound on Screen. Columbia University Press, New-York
(1993)
10. Cross, L.: Electronic music, 1948-1953. Perspectives of New Music (1968)
11. Dack, J.: Abstract and concrete. Journal of Electroacoustic Music 14 (2002)
12. Dack, J., North, C.: Translating pierre schaeffer: Symbolism, literature and music.
In: Proceedings of EMS 2006 Conference, Beijing (2006)
13. Dessein, A., Lemaitre, G.: Free classification of vocal imitations of everyday sounds.
In: Sound And Music Computing (SMC 2009), Porto, Portugal, pp. 213–218 (2009)
14. Dubois, D., Guastavino, C., Raimbault, M.: A cognitive approach to urban sound-
scapes: Using verbal data to access everyday life auditory categories. Acta Acustica
United with Acustica 92, 865–874 (2006)
15. Eimert, H.: What is electronic music. Die Reihe 1 (1957)
16. Fastl, H.: Neutralizing the meaning of sound for sound quality evaluations. In:
Proc. Int. Congress on Acoustics ICA 2001, Rome, Italy, vol. 4, CD-ROM (2001)
17. Gaver, W.W.: How do we hear in the world? explorations of ecological acoustics.
Ecological Psychology 5, 285–313 (1993)
18. Gaver, W.W.: What in the world do we hear? an ecological approach to auditory
source perception. Ecological Psychology 5, 1–29 (1993)
19. Hermann, T.: Taxonomy and definitions for sonification and auditory display.
In: Proceedings of the 14th International Conference on Auditory Display, Paris,
France (2008)
20. Hoffman, M., Cook, P.R.: Feature-based synthesis: Mapping acoustic and percep-
tual features onto synthesis parameters. In: Proceedings of the 2006 International
Computer Music Conference (ICMC), New Orleans (2006)
21. Jekosch, U.: 8. Assigning Meaning to Sounds - Semiotics in the Context of Product-
Sound Design. J. Blauert, 193–221 (2005)
22. McKay, C., McEnnis, D., Fujinaga, I.: A large publicly accessible prototype audio
database for music research (2006)
23. Merer, A., Ystad, S., Kronland-Martinet, R., Aramaki, M.: Semiotics of sounds
evoking motions: Categorization and acoustic features. In: Kronland-Martinet, R.,
Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 139–158. Springer,
Heidelberg (2008)
24. Micoulaud-Franchi, J.A., Cermolacce, M., Vion-Dury, J.: Bizzare and familiar
recognition troubles of auditory perception in patient with schizophrenia (2010)
(in preparation)
25. Moore, B.C.J., Tan, C.T.: Perceived naturalness of spectrally distorted speech and
music. The Journal of the Acoustical Society of America 114, 408–419 (2003)
26. Peeters, G., Deruty, E.: Automatic morphological description of sounds. In: Acous-
tics 2008, Paris, France (2008)
27. Ricard, J., Herrera, P.: Morphological sound description computational model and
usability evaluation. In: AES 116th Convention (2004)
28. Risset, J.C., Wessel, D.L.: Exploration of timbre by analysis and synthesis. In:
Deutsch, D. (ed.) The psychology of music. Series in Cognition and Perception,
pp. 113–169. Academic Press, London (1999)
29. Schaeffer, P.: Traité des objets musicaux. Editions du seuil (1966)
30. Schaeffer, P., Reibel, G.: Solfège de l’objet sonore. INA-GRM (1967)
31. Schlauch, R.S.: 12 - Loudness. In: Ecological Psychoacoustics, pp. 318–341. Else-
vier, Amsterdam (2004)
32. Schön, D., Ystad, S., Kronland-Martinet, R., Besson, M.: The evocative power
of sounds: Conceptual priming between words and nonverbal sounds. Journal of
Cognitive Neuroscience 22, 1026–1035 (2010)
33. Shafiro, V., Gygi, B.: How to select stimuli for environmental sound research and
where to find them. Behavior Research Methods, Instruments, & Computers 36,
590–598 (2004)
34. Smalley, D.: Defining timbre — refining timbre. Contemporary Music Review 10,
35–48 (1994)
35. Smalley, D.: Space-form and the acousmatic image. Org. Sound 12, 35–58 (2007)
36. Tanaka, K., Matsubara, K., Sato, T.: Study of onomatopoeia expressing strange
sounds: Cases of impulse sounds and beat sounds. Transactions of the Japan Society
of Mechanical Engineers C 61, 4730–4735 (1995)
37. Thoresen, L., Hedman, A.: Spectromorphological analysis of sound objects: an
adaptation of pierre schaeffer’s typomorphology. Organised Sound 12, 129–141
(2007)
38. Zeitler, A., Ellermeier, W., Fastl, H.: Significance of meaning in sound quality
evaluation. Fortschritte der Akustik, CFA/DAGA 4, 781–782 (2004)
39. Zeitler, A., Hellbrueck, J., Ellermeier, W., Fastl, H., Thoma, G., Zeller, P.: Method-
ological approaches to investigate the effects of meaning, expectations and context
in listening experiments. In: INTER-NOISE 2006, Honolulu, Hawaii (2006)
Pattern Induction and Matching in Music Signals
Anssi Klapuri
1 Introduction
Pattern induction and matching plays an important part in understanding the
structure of a given music piece and in detecting similarities between two differ-
ent music pieces. The term pattern is here used to refer to sequential structures
that can be characterized by a time series of feature vectors x1 , x2 , . . . , xT . The
vectors xt may represent acoustic features calculated at regularly time intervals
or discrete symbols with varying durations. Many different elements of music
can be represented in this form, including melodies, drum patterns, and chord
sequences, for example.
In order to focus on the desired aspect of music, such as the drums track or
the lead vocals, it is often necessary to extract that part from a polyphonic music
signal. Section 2 of this paper will discuss methods for separating meaningful
musical objects from polyphonic recordings.
Contrary to speech, there is no global dictionary of patterns or ”words” that
would be common to all music pieces, but in a certain sense, the dictionary of
patterns is created anew in each music piece. The term pattern induction here
refers to the process of learning to recognize sequential structures from repeated
exposure [63]. Repetition plays an important role here: rhythmic patterns are
repeated, melodic phrases recur and vary, and even entire sections, such as the
chorus in popular music, are repeated. This kind of self-reference is crucial for
imposing structure on a music piece and enables the induction of the underlying
prototypical patterns. Pattern induction will be discussed in Sec. 3.
Pattern matching, in turn, consists of searching a database of music for seg-
ments that are similar to a given query pattern. Since the target matches can
S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 188–204, 2011.
c Springer-Verlag Berlin Heidelberg 2011
Pattern Induction and Matching in Music Signals 189
in principle be located at any temporal position and are not necessarily scaled
to the same length as the query pattern, temporal alignment of the query and
target patterns poses a significant computational challenge in large databases.
Given that the alignment problem can be solved, another pre-requisite for mean-
ingful pattern matching is to define a distance measure between musical patterns
of different kinds. These issues will be discussed in Sec. 4.
Pattern processing in music has several interesting applications, including
music information retrieval, music classification, cover song identification, and
creation of mash-ups by blending matching excerpts from different music pieces.
Given a large database of music, quite detailed queries can be made, such as
searching for a piece that would work as an accompaniment for a user-created
melody.
Musical sounds, like most natural sounds, tend to be sparse in the time-frequency
domain, meaning that the sounds can be approximated using a small number of
non-zero elements in the time-frequency domain. This facilitates sound source
separation and audio content analysis. Usually the short-time Fourier transform
(STFT) is used to represent a given signal in the time-frequency domain. A
viable alternative to the STFT is the constant-Q transform (CQT), where the center
frequencies of the frequency bins are geometrically spaced [9,68]. CQT is often
ideally suited for the analysis of music signals, since the fundamental frequencies
(F0s) of the tones in Western music are geometrically spaced.
Spatial information can sometimes be used to assign time-frequency components to their respective sound sources [83]. In the case of stereophonic audio,
time-frequency components can be clustered based on the ratio of left-channel
amplitude to the right, for example. This simple principle has been demonstrated
to be quite effective for some music types, such as jazz [4], despite the fact that
overlapping partials partly undermine the idea. Duda et al. [18] used stereo
information to extract the lead vocals from complex audio for the purpose of
query-by-humming.
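A rough sketch of this panning-based grouping (not the method of [4] or [18]; thresholds and STFT parameters are arbitrary) could assign time-frequency bins to a "centre" stream, where lead vocals typically sit, and a "sides" stream:

```python
import numpy as np
from scipy.signal import stft, istft

def split_by_panning(left, right, fs, centre_tol=0.15):
    _, _, L = stft(left, fs, nperseg=2048)
    _, _, R = stft(right, fs, nperseg=2048)
    ratio = np.abs(L) / (np.abs(L) + np.abs(R) + 1e-12)   # 0 = fully right, 1 = fully left
    centre_mask = np.abs(ratio - 0.5) < centre_tol        # roughly equal left/right energy
    mid = (L + R) / 2
    centre = istft(centre_mask * mid, fs)[1]              # e.g. lead vocals, bass, drums
    sides = istft(~centre_mask * mid, fs)[1]              # panned accompaniment
    return centre, sides
```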
It is often desirable to analyze the drum track of music separately from the
harmonic part. The sinusoids+noise model is the most widely-used technique
for this purpose [71]. It produces quite robust quality for the noise residual,
although the sinusoidal (harmonic) part often suffers quality degradation for
music with dense sets of sinusoids, such as orchestral music.
Ono et al. proposed a method which decomposes the power spectrogram X (an F × T matrix) of a mixture signal into a harmonic part H and a percussive part P so that X = H + P [52]. The decomposition is done by minimizing an objective
function that measures variation over time n for the harmonic part and varia-
tion over frequency k for the percussive part. The method is straightforward to
implement and produces good results.
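The sketch below is not the exact iterative algorithm of [52], but a widely used approximation in the same spirit: harmonic energy is smooth along time and percussive energy is smooth along frequency, so median filtering the power spectrogram in the two directions and soft masking yields H and P with X approximately equal to H + P. The kernel length is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import median_filter

def harmonic_percussive(X, kernel=17):
    """X: power spectrogram of shape (n_freq, n_frames)."""
    harm = median_filter(X, size=(1, kernel))   # smooth each frequency bin over time
    perc = median_filter(X, size=(kernel, 1))   # smooth each frame over frequency
    total = harm + perc + 1e-12
    H = X * harm / total                        # soft (Wiener-like) masks
    P = X * perc / total
    return H, P
```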
Non-negative matrix factorization (NMF) is a technique that decomposes the
spectrogram of a music signal into a linear sum of components that have a
fixed spectrum and time-varying gains [41,76]. Helen and Virtanen used the
NMF to separate the magnitude spectrogram of a music signal into a couple of
dozen components and then used a support vector machine (SVM) to classify
each component as either pitched instruments or drums, based on features
extracted from the spectrum and the gain function of each component [31].
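A minimal NMF sketch along these lines (Euclidean multiplicative updates; the component classification of [31] would be a separate step, and all parameter values are arbitrary):

```python
import numpy as np

def nmf(X, n_components=20, n_iter=200, seed=0):
    """Approximate a non-negative spectrogram X (n_freq x n_frames) as W @ H,
    where columns of W are fixed component spectra and rows of H their gains."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = X.shape
    W = rng.random((n_freq, n_components)) + 1e-6
    H = rng.random((n_components, n_frames)) + 1e-6
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)   # multiplicative update for the gains
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)   # multiplicative update for the spectra
    return W, H
```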
Vocal melody is usually the main focus of attention for an average music listener,
especially in popular music. It tends to be the part that makes music memorable
and easily reproducible by singing or humming [69].
Several different methods have been proposed for the main melody extraction
from polyphonic music. The task was first considered by Goto [28] and later
various methods for melody tracking have been proposed by Paiva et al. [54],
Ellis and Poliner [22], Dressler [17], and Ryynänen and Klapuri [65]. Typically,
the methods are based on framewise pitch estimation followed by tracking or
streaming over time. Some methods involve a timbral model [28,46,22] or a
musicological model [67]. For comparative evaluations of the different methods,
see [61] and [www.music-ir.org/mirex/].
Melody extraction is closely related to vocals separation: extracting the melody
facilitates lead vocals separation, and vice versa. Several different approaches
have been proposed for separating the vocals signal from polyphonic music, some
based on tracking the pitch of the main melody [24,45,78], some based on timbre
models for the singing voice and for the instrumental background [53,20], and
yet others utilizing stereo information [4,18].
Bass line is another essential part in many music types and usually contains
a great deal of repetition and note patterns that are rhythmically and tonally
interesting. Indeed, high-level features extracted from the bass line and the play-
ing style have been successfully used for music genre classification [1]. Methods
for extracting the bass line from polyphonic music have been proposed by Goto
[28], Hainsworth [30], and Ryynänen [67].
3 Pattern Induction
Pattern induction deals with the problem of detecting repeated sequential struc-
tures in music and learning the pattern underlying these repetitions. In the
following, we discuss the problem of musical pattern induction from a general
Fig. 1. A “piano-roll” representation for an excerpt from Mozart’s Turkish March. The
vertical lines indicate a possible grouping of the component notes into phrases.
The good news here is that meter analysis is a well-understood and feasible problem
for audio signals too (see e.g. [39]). Furthermore, melodic phrase boundaries often
coincide with strong beats, although this is not always the case. For melodic
patterns, for example, this segmenting rule effectively requires two patterns to be
similarly positioned with respect to the musical measure boundaries in order for
them to be similar, which may sometimes be too strong an assumption. However,
for drum patterns this requirement is well justified.
Bertin-Mahieux et al. performed harmonic pattern induction for a large
database of music in [7]. They calculated a 12-dimensional chroma vector for
each musical beat in the target songs. The beat-synchronous chromagram data
was then segmented at barline positions and the resulting beat-chroma patches
were vector quantized to obtain a couple of hundred prototype patterns.
A third strategy is to avoid segmentation altogether by using shift-invariant
features. As an example, let us consider a sequence of one-dimensional features
x1 , x2 , . . . , xT . The sequence is first segmented into partly-overlapping frames
that have length approximately the same as the patterns being sought. Then the
sequence within each frame is Fourier transformed and the phase information is
discarded in order to make the features shift-invariant. The resulting magnitude
spectra are then clustered to find repeated patterns. The modulation spectrum
features (aka fluctuation patterns) mentioned in the beginning of Sec. 2 are an
example of such a shift-invariant feature [15,34].
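A minimal sketch of this idea for a one-dimensional feature sequence (the frame length and hop size are illustrative choices, not values from the cited works):

```python
import numpy as np

def shift_invariant_features(x, frame_len=64, hop=16):
    """Frame the feature sequence and keep only the magnitude of each frame's
    Fourier transform; discarding the phase makes the result (approximately)
    invariant to a temporal shift of the pattern inside the frame."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.abs(np.fft.rfft(x[s:s + frame_len])) for s in starts])
```

The resulting magnitude spectra can then be clustered, for example with k-means, to find repeated patterns.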
Fig. 2. A self-distance matrix for Chopin’s Etude Op 25 No 9, calculated using beat-
synchronous chroma features. As the off-diagonal dark stripes indicate, the note se-
quence between 1s and 5s starts again at 5s, and later at 28s and 32s in a varied
form.
In a self-distance matrix (SDM), each element stores the distance between the features
extracted at two time positions; repeated sequences then appear as off-diagonal stripes
running exactly parallel to the main diagonal. Figure 2 shows an example SDM calculated
using beat-synchronous chroma features.
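As a sketch of how such a matrix can be computed (assuming the librosa library is available; the file name is a placeholder, and the distance metric is an illustrative choice):

```python
import numpy as np
import librosa
from scipy.spatial.distance import cdist

y, sr = librosa.load("piece.wav")                 # placeholder file name
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # 12 x frames chromagram
_, beats = librosa.beat.beat_track(y=y, sr=sr)    # beat positions in frames
beat_chroma = librosa.util.sync(chroma, beats, aggregate=np.median)
# Cosine self-distance matrix: repeated sections appear as dark stripes
# parallel to the main diagonal.
sdm = cdist(beat_chroma.T, beat_chroma.T, metric="cosine")
```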
Self-distance matrices have been widely used for audio-based analysis of the
sectional form (structure) of music pieces [12,57]. In that domain, several dif-
ferent methods have been proposed for localizing the off-diagonal stripes that
indicate repeating sequences in the music [59,27,55]. Goto, for example, first
calculates a marginal histogram which indicates the diagonal bands that con-
tain considerable repetition, and then finds the beginning and end points of the
repeated segments in a second step [27]. Serra has proposed an interesting method
for detecting locally similar sections in two feature sequences [70].
Repeated patterns are heavily utilized in universal lossless data compression al-
gorithms. The Lempel-Ziv-Welch (LZW) algorithm, in particular, is based on
matching and replacing repeated patterns with code values [80]. Let us denote a
sequence of discrete symbols by s1 , s2 , . . . , sT . The algorithm initializes a dictio-
nary which contains codes for individual symbols that are possible at the input.
At the compression stage, the input symbols are gathered into a sequence until
the next character would make a sequence for which there is no code yet in the
dictionary, and a new code for that sequence is then added to the dictionary.
The usefulness of the LZW algorithm for musical pattern matching is limited
by the fact that it requires a sequence of discrete symbols as input, as opposed to
real-valued feature vectors. This means that a given feature vector sequence has to be
quantized into a sequence of discrete symbols before the algorithm can be applied.
4 Pattern Matching
This section considers the problem of searching a database of music for segments
that are similar to a given pattern. The query pattern is represented by a feature sequence.
To allow flexibility in pattern scaling and to mitigate the effect of tempo estimation errors,
it is sometimes useful to further time-scale the beat-synchronized query pattern
by factors 1/2, 1, and 2, and match each of these separately.
A remaining problem to be solved is the temporal shift: if the target database
is very large, comparing the query pattern at every possible temporal position
in the database can be infeasible. Shift-invariant features are one way of dealing
with this problem: they can be used for approximate pattern matching to prune
the target data, after which the temporal alignment is computed for the best-
matching candidates. This allows the first stage of matching to be performed an
order of magnitude faster.
Another potential solution for the time-shift problem is to segment the target
database by meter analysis or grouping analysis, and then match the query
pattern only at temporal positions determined by estimated bar lines or group
boundaries. This approach was already discussed in Sec. 3.
Finally, efficient indexing techniques exist for dealing with extremely large
databases. In practice, these require that the time-scale problem is eliminated
(e.g. using beat-synchronous features) and the number of time-shifts is greatly
reduced (e.g. using shift-invariant features or pre-segmentation). If these con-
ditions are satisfied, locality-sensitive hashing (LSH), for example, enables
sublinear search complexity for retrieving the approximate nearest neighbours
of the query pattern from a large database [14]. Ryynänen et al. used LSH for
melodic pattern matching in [64].
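As an illustration of the indexing idea, here is a minimal sketch of a sign-random-projection LSH index (a simpler relative of the p-stable scheme of [14]; the number of hash bits is an illustrative parameter):

```python
import numpy as np
from collections import defaultdict

class SignRandomProjectionLSH:
    """Hash vectors by the sign pattern of random projections; vectors with
    small angular distance tend to collide in the same bucket."""
    def __init__(self, dim, n_bits=16, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, index, v):
        self.buckets[self._key(v)].append(index)

    def candidates(self, query):
        # Approximate nearest neighbours: these should be verified with
        # exact distances in a second stage.
        return self.buckets.get(self._key(query), [])
```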
In query-by-humming, three main factors can cause the query pattern and the target matches to differ: 1) low quality of the sung queries (espe-
cially in the case of musically untrained users), 2) errors in extracting the main
melodies automatically from music recordings, and 3) musical variation, such as
fragmentation (elaboration) or consolidation (reduction) of a given melody [43].
One approach that works quite robustly in the presence of all these factors is
to calculate Euclidean distance between temporally aligned log-pitch trajecto-
ries. Musical key normalization can be implemented simply by normalizing the
two pitch contours to zero mean. More extensive review of research on melodic
similarity can be found in [74].
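A sketch of this distance for two equal-length, temporally aligned pitch trajectories given in semitones (i.e., log-pitch; the per-sample normalization is an illustrative choice):

```python
import numpy as np

def melodic_distance(query_semitones, target_semitones):
    """Euclidean distance between two equal-length, temporally aligned
    log-pitch contours; removing the means normalizes the musical key."""
    q = np.asarray(query_semitones, dtype=float)
    t = np.asarray(target_semitones, dtype=float)
    q, t = q - q.mean(), t - t.mean()
    return float(np.linalg.norm(q - t) / np.sqrt(len(q)))
```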
Instead of using only the main melody for music retrieval, polyphonic pitch
data can be processed directly. Multipitch estimation algorithms (see [11,38] for
review) can be used to extract multiple pitch values in successive time frames, or
alternatively, a mapping from time-frequency to a time-pitch representation can
be employed [37]. Both of these approaches yield a representation in the time-
pitch plane, the difference being that multipitch estimation algorithms yield a
discrete set of pitch values, whereas mapping to a time-pitch plane yields a
more continuous representation. Matching a query pattern against a database
of music signals can be carried out by a two-dimensional correlation analysis in
the time-pitch plane.
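A sketch of such a two-dimensional matching step, assuming non-negative time-pitch matrices for the database and the query (the score interpretation as transposition/time-shift offsets follows from the correlation lags):

```python
import numpy as np
from scipy.signal import correlate2d

def match_in_time_pitch_plane(database_tp, query_tp):
    """Correlate a query patch against a time-pitch representation of the
    database; each output cell scores one (pitch transposition, time shift)
    hypothesis, so peaks indicate likely matches."""
    scores = correlate2d(database_tp, query_tp, mode="valid")
    pitch_shift, time_shift = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, pitch_shift, time_shift
```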
Fig. 4. Major and minor triads arranged in a two dimensional chord space. Here the
Euclidean distance between each two points can be used to approximate the distance
between chords. The dotted lines indicate the four distance parameters that define this
particular space.
Measuring the distance between two chord sequences requires that the distance
between each pair of different chords is defined. Often this distance is approxi-
mated by arranging chords in a one- or two-dimensional space, and then using
the geometric distance between chords in this space as the distance measure [62],
see Fig. 4 for an example. In the one-dimensional case, the circle of fifths is often
used.
It is often useful to compare two chord sequences in a key-invariant manner.
This can be done by expressing chords in relation to tonic (that is, using chord
degrees instead of the “absolute” chords), or by comparing all the 12 possible
transformations and choosing the minimum distance.
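A sketch of the transposition-based variant, using an illustrative chord distance built from the circle of fifths (the distance weights are toy values, not those of [62]):

```python
def fifths_distance(pc_a, pc_b):
    """Distance between two pitch classes measured along the circle of fifths."""
    ka, kb = (pc_a * 7) % 12, (pc_b * 7) % 12   # positions on the circle of fifths
    d = abs(ka - kb)
    return min(d, 12 - d)

def chord_distance(a, b, mode_penalty=1.0):
    """Toy distance between chords given as (root_pitch_class, is_minor)."""
    return fifths_distance(a[0], b[0]) + (mode_penalty if a[1] != b[1] else 0.0)

def key_invariant_distance(seq_a, seq_b):
    """Compare two equal-length chord sequences under all 12 transpositions
    of the second one and keep the minimum total distance."""
    best = float("inf")
    for shift in range(12):
        shifted = [((root + shift) % 12, minor) for root, minor in seq_b]
        total = sum(chord_distance(a, b) for a, b in zip(seq_a, shifted))
        best = min(best, total)
    return best
```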
5 Conclusions
This paper has discussed the induction and matching of sequential patterns
in musical audio. Such patterns are neglected by the commonly used "bag-of-features"
approach to music retrieval, where statistics over feature vectors are
calculated to collapse the time structure altogether. Processing sequential struc-
tures poses computational challenges, but also enables musically interesting re-
trieval tasks beyond those possible with the bag-of-features approach. Some of
these applications, such as query-by-humming services, are already available for
consumers.
Acknowledgments. Thanks to Jouni Paulus for the Matlab code for comput-
ing self-distance matrices. Thanks to Christian Dittmar for the idea of using
repeated patterns to improve the accuracy of source separation and analysis.
References
1. Abesser, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification us-
ing bass-related high-level features and playing styles. In: Intl. Society on Music
Information Retrieval Conference, Kobe, Japan (2009)
2. Badeau, R., Emiya, V., David, B.: Expectation-maximization algorithm for multi-
pitch estimation and separation of overlapping harmonic spectra. In: Proc. IEEE
ICASSP, Taipei, Taiwan, pp. 3073–3076 (2009)
3. Barbour, J.: Analytic listening: A case study of radio production. In: International
Conference on Auditory Display, Sydney, Australia (July 2004)
4. Barry, D., Lawlor, B., Coyle, E.: Sound source separation: Azimuth discrimination
and resynthesis. In: 7th International Conference on Digital Audio Effects, Naples,
Italy, pp. 240–244 (October 2004)
5. Bartsch, M.A., Wakefield, G.H.: To catch a chorus: Using chroma-based repre-
sentations for audio thumbnailing. In: IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New Paltz, USA, pp. 15–18 (2001)
6. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov
models. J. of Artificial Intelligence Research 22, 385–421 (2004)
7. Bertin-Mahieux, T., Weiss, R.J., Ellis, D.P.W.: Clustering beat-chroma patterns in
a large music database. In: Proc. of the Int. Society for Music Information Retrieval
Conference, Utrecht, Netherlands (2010)
8. Bever, T.G., Chiarello, R.J.: Cerebral dominance in musicians and nonmusicians.
The Journal of Neuropsychiatry and Clinical Neurosciences 21(1), 94–97 (2009)
9. Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc.
Am. 89(1), 425–434 (1991)
10. Burred, J., Röbel, A., Sikora, T.: Dynamic spectral envelope modeling for the
analysis of musical instrument sounds. IEEE Trans. Audio, Speech, and Language
Processing (2009)
11. de Cheveigné, A.: Multiple F0 estimation. In: Wang, D., Brown, G.J. (eds.) Compu-
tational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley–
IEEE Press (2006)
12. Dannenberg, R.B., Goto, M.: Music structure analysis from acoustic signals. In:
Havelock, D., Kuwano, S., Vorländer, M. (eds.) Handbook of Signal Processing in
Acoustics, pp. 305–331. Springer, Heidelberg (2009)
13. Dannenberg, R.B., Hu, N.: Pattern discovery techniques for music audio. Journal
of New Music Research 32(2), 153–163 (2003)
14. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing
scheme based on p-stable distributions. In: ACM Symposium on Computational
Geometry, pp. 253–262 (2004)
15. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: 4th International Conference on Music Information Retrieval, Balti-
more, MD, pp. 159–165 (2003)
16. Downie, J.S.: The music information retrieval evaluation exchange (2005–2007): A
window into music information retrieval research. Acoustical Science and Technol-
ogy 29(4), 247–255 (2008)
17. Dressler, K.: An auditory streaming approach on melody extraction. In: Intl. Conf.
on Music Information Retrieval, Victoria, Canada (2006); MIREX evaluation
18. Duda, A., Nürnberger, A., Stober, S.: Towards query by humming/singing on audio
databases. In: International Conference on Music Information Retrieval, Vienna,
Austria, pp. 331–334 (2007)
19. Durrieu, J.L., Ozerov, A., Févotte, C., Richard, G., David, B.: Main instrument
separation from stereophonic audio signals using a source/filter model. In: Proc.
EUSIPCO, Glasgow, Scotland (August 2009)
20. Durrieu, J.L., Richard, G., David, B., Fevotte, C.: Source/filter model for unsu-
pervised main melody extraction from polyphonic audio signals. IEEE Trans. on
Audio, Speech, and Language Processing 18(3), 564–575 (2010)
21. Ellis, D., Arroyo, J.: Eigenrhythms: Drum pattern basis sets for classification
and generation. In: International Conference on Music Information Retrieval,
Barcelona, Spain
22. Ellis, D.P.W., Poliner, G.: Classification-based melody transcription. Machine
Learning 65(2-3), 439–456 (2006)
23. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation
models for musical source separation. Computational Intelligence and Neuroscience
(2008)
24. Fujihara, H., Goto, M.: A music information retrieval system based on singing voice
timbre. In: Intl. Conf. on Music Information Retrieval, Vienna, Austria (2007)
25. Gersho, A., Gray, R.: Vector Quantization and Signal Compression. Kluwer Aca-
demic Publishers, Dordrecht (1991)
26. Ghias, A., Logan, J., Chamberlin, D.: Query by humming: Musical information
retrieval in an audio database. In: ACM Multimedia Conference 1995. Cornell
University, San Francisco (1995)
27. Goto, M.: A chorus-section detecting method for musical audio signals. In: IEEE
International Conference on Acoustics, Speech, and Signal Processing, Hong Kong,
China, vol. 5, pp. 437–440 (April 2003)
28. Goto, M.: A real-time music scene description system: Predominant-F0 estimation
for detecting melody and bass lines in real-world audio signals. Speech Communi-
cation 43(4), 311–329 (2004)
29. Guo, L., He, X., Zhang, Y., Lu, Y.: Content-based retrieval of polyphonic music
objects using pitch contour. In: IEEE International Conference on Audio, Speech
and Signal Processing, Las Vegas, USA, pp. 2205–2208 (2008)
30. Hainsworth, S.W., Macleod, M.D.: Automatic bass line transcription from poly-
phonic music. In: International Computer Music Conference, Havana, Cuba, pp.
431–434 (2001)
31. Helén, M., Virtanen, T.: Separation of drums from polyphonic music using non-
negative matrix factorization and support vector machine. In: European Signal Pro-
cessing Conference, Antalya, Turkey (2005)
32. Jang, J.S.R., Gao, M.Y.: A query-by-singing system based on dynamic program-
ming. In: International Workshop on Intelligent Systems Resolutions (2000)
33. Jang, J.S.R., Hsu, C.L., Lee, H.R.: Continuous HMM and its enhancement for
singing/humming query retrieval. In: 6th International Conference on Music Infor-
mation Retrieval, London, UK (2005)
34. Jensen, K.: Multiple scale music segmentation using rhythm, timbre, and harmony.
EURASIP Journal on Advances in Signal Processing (2007)
35. Jurafsky, D., Martin, J.H.: Speech and language processing. Prentice Hall, New
Jersey (2000)
36. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Instrogram: Prob-
abilistic representation of instrument existence for polyphonic music. IPSJ Jour-
nal 48(1), 214–226 (2007)
37. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals.
In: Intl. Society on Music Information Retrieval Conference, Kobe, Japan (2009)
38. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription.
Springer, New York (2006)
39. Klapuri, A., Eronen, A., Astola, J.: Analysis of the meter of acoustic musical sig-
nals. IEEE Trans. Speech and Audio Processing 14(1) (2006)
40. Lartillot, O., Dubnov, S., Assayag, G., Bejerano, G.: Automatic modeling of mu-
sical style. In: International Computer Music Conference (2001)
41. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401, 788–791 (1999)
42. Lemström, K.: String Matching Techniques for Music Retrieval. Ph.D. thesis, Uni-
versity of Helsinki (2000)
43. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
44. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic
atoms for mid-level music representation. IEEE Trans. Audio, Speech, and Lan-
guage Processing 16(1), 116–128 (2008)
45. Li, Y., Wang, D.L.: Separation of singing voice from music accompaniment for
monaural recordings. IEEE Trans. on Audio, Speech, and Language Process-
ing 15(4), 1475–1487 (2007)
46. Marolt, M.: Audio melody extraction based on timbral similarity of melodic frag-
ments. In: EUROCON (November 2005)
47. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic
chord transcription. In: Proc. 10th Intl. Society for Music Information Retrieval
Conference, Kobe, Japan (2009)
48. McNab, R., Smith, L., Witten, I., Henderson, C., Cunningham, S.: Towards the
digital music library: Tune retrieval from acoustic input. In: First ACM Interna-
tional Conference on Digital Libraries, pp. 11–18 (1996)
49. Meek, C., Birmingham, W.: Applications of binary classification and adaptive
boosting to the query-by-humming problem. In: Intl. Conf. on Music Information
Retrieval, Paris, France (2002)
50. Müller, M., Ewert, S., Kreuzer, S.: Making chroma features more robust to timbre
changes. In: Proceedings of IEEE International Conference on Acoustics, Speech,
and Signal Processing, Taipei, Taiwan, pp. 1869–1872 (April 2009)
51. Nishimura, T., Hashiguchi, H., Takita, J., Zhang, J.X., Goto, M., Oka, R.: Music
signal spotting retrieval by a humming query using start frame feature dependent
continuous dynamic programming. In: 2nd Annual International Symposium on
Music Information Retrieval, Bloomington, Indiana, USA, pp. 211–218 (October
2001)
52. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of
a monaural audio signal into harmonic/percussive components by complementary
diffusion on spectrogram. In: European Signal Processing Conference, Lausanne,
Switzerland, pp. 240–244 (August 2008)
53. Ozerov, A., Philippe, P., Bimbot, F., Gribonval, R.: Adaptation of Bayesian models
for single-channel source separation and its application to voice/music separation
in popular songs. IEEE Trans. on Audio, Speech, and Language Processing 15(5),
1564–1578 (2007)
54. Paiva, R.P., Mendes, T., Cardoso, A.: On the detection of melody notes in poly-
phonic audio. In: 6th International Conference on Music Information Retrieval,
London, UK, pp. 175–182
55. Paulus, J.: Signal Processing Methods for Drum Transcription and Music Structure
Analysis. Ph.D. thesis, Tampere University of Technology (2009)
56. Paulus, J., Klapuri, A.: Measuring the similarity of rhythmic patterns. In: Intl.
Conf. on Music Information Retrieval, Paris, France (2002)
57. Paulus, J., Müller, M., Klapuri, A.: Audio-based music structure analysis. In: Proc.
of the Int. Society for Music Information Retrieval Conference, Utrecht, Nether-
lands (2010)
58. Paulus, J., Virtanen, T.: Drum transcription with non-negative spectrogram fac-
torisation. In: European Signal Processing Conference, Antalya, Turkey (Septem-
ber 2005)
59. Peeters, G.: Sequence representations of music structure using higher-order similar-
ity matrix and maximum-likelihood approach. In: Intl. Conf. on Music Information
Retrieval, Vienna, Austria, pp. 35–40 (2007)
60. Peeters, G.: A large set of audio features for sound description (similarity and
classification) in the CUIDADO project. Tech. rep., IRCAM, Paris, France (April
2004)
61. Poliner, G., Ellis, D., Ehmann, A., Gómez, E., Streich, S., Ong, B.: Melody tran-
scription from music audio: Approaches and evaluation. IEEE Trans. on Audio,
Speech, and Language Processing 15(4), 1247–1256 (2007)
62. Purwins, H.: Profiles of Pitch Classes – Circularity of Relative Pitch and Key:
Experiments, Models, Music Analysis, and Perspectives. Ph.D. thesis, Berlin Uni-
versity of Technology (2005)
63. Rowe, R.: Machine musicianship. MIT Press, Cambridge (2001)
64. Ryynänen, M., Klapuri, A.: Query by humming of MIDI and audio using locality
sensitive hashing. In: IEEE International Conference on Audio, Speech and Signal
Processing, Las Vegas, USA, pp. 2249–2252
65. Ryynänen, M., Klapuri, A.: Transcription of the singing melody in polyphonic
music. In: Intl. Conf. on Music Information Retrieval, Victoria, Canada, pp. 222–
227 (2006)
66. Ryynänen, M., Klapuri, A.: Automatic bass line transcription from streaming poly-
phonic audio. In: IEEE International Conference on Audio, Speech and Signal
Processing, pp. 1437–1440 (2007)
67. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and
chords in polyphonic music. Computer Music Journal 32(3), 72–86 (2008)
68. Schörkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing.
In: 7th Sound and Music Computing Conference, Barcelona, Spain (2010)
69. Selfridge-Field, E.: Conceptual and representational issues in melodic comparison.
Computing in Musicology 11, 3–64 (1998)
70. Serra, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local
alignment applied to cover song identification. IEEE Trans. on Audio, Speech, and
Language Processing 16, 1138–1152 (2007)
71. Serra, X.: Musical sound modeling with sinusoids plus noise. In: Roads, C., Pope,
S., Picialli, A., Poli, G.D. (eds.) Musical Signal Processing, Swets & Zeitlinger
(1997)
72. Song, J., Bae, S.Y., Yoon, K.: Mid-level music melody representation of polyphonic
audio for query-by-humming system. In: Intl. Conf. on Music Information Retrieval,
Paris, France, pp. 133–139 (October 2002)
73. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S.: Mel-generalized cepstral analysis
– a unified approach to speech spectral estimation. In: IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Adelaide, Australia (1994)
74. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Universiteit
Utrecht (2007)
75. Vincent, E., Bertin, N., Badeau, R.: Harmonic and inharmonic nonnegative matrix
factorization for polyphonic pitch transcription. In: IEEE ICASSP, Las Vegas, USA
(2008)
76. Virtanen, T.: Unsupervised learning methods for source separation in monaural
music signals. In: Klapuri, A., Davy, M. (eds.) Signal Processing Methods for Music
Transcription, pp. 267–296. Springer, Heidelberg (2006)
77. Virtanen, T.: Monaural sound source separation by non-negative matrix factoriza-
tion with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech,
and Language Processing 15(3), 1066–1074 (2007)
78. Virtanen, T., Mesaros, A., Ryynänen, M.: Combining pitch-based inference and
non-negative spectrogram factorization in separating vocals from polyphonic music.
In: ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition,
Brisbane, Australia (September 2008)
79. Wang, L., Huang, S., Hu, S., Liang, J., Xu, B.: An effective and efficient method
for query by humming system based on multi-similarity measurement fusion. In:
International Conference on Audio, Language and Image Processing, pp. 471–475
(July 2008)
80. Welch, T.A.: A technique for high-performance data compression. Computer 17(6),
8–19 (1984)
81. Wu, X., Li, M., Yang, J., Yan, Y.: A top-down approach to melody match in pitch
contour for query by humming. In: International Conference of Chinese Spoken
Language Processing (2006)
82. Yeh, C.: Multiple fundamental frequency estimation of polyphonic recordings.
Ph.D. thesis, University of Paris VI (2008)
83. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Trans. on Signal Processing 52(7), 1830–1847 (2004)
Unsupervised Analysis and Generation of Audio
Percussion Sequences
Marco Marchini and Hendrik Purwins
1 Introduction
The Continuator [13] interacts with musicians, producing jazz-style music. Another system with the same char-
acteristics as the Continuator, called OMax, was able to learn an audio stream
employing an indexing procedure explained in [5]. Hazan et al. [8] built a system
which first segments the musical stream and extracts timbre and onsets. An un-
supervised clustering process yields a sequence of symbols that is then processed
by n-grams. The method by Marxer and Purwins [12] consists of a conceptual
clustering algorithm coupled with a hierarchical N-gram. Our method presented
in this article was first described in detail in [11].
First, we define the system design and the interaction of its parts. Starting
from low-level descriptors, we translate them into a “fuzzy score representation”,
where two sounds can either be discretized yielding the same symbol or yielding
different symbols according to which level of interpretation is chosen (Section 2).
Then we perform skeleton subsequence extraction and tempo detection to align
the score to a grid. At the end, we get a homogeneous sequence in time, on which
we perform the prediction. For the generation of new sequences, we reorder the
parts of the score, respecting the statistical properties of the sequence while at
the same time maintaining the metrical structure (Section 3). In Section 4, we
discuss an example.
Fig. 1. Overview of the system: Segmentation (yielding audio segments), Symbolization (yielding the multilevel representation), the Statistic Model (yielding aligned continuation indices), and the Generation of Audio Sequences.
2.1 Segmentation
First, the audio input signal is analyzed by an onset detector that segments
the audio file into a sequence of musical events. Each event is characterized by
its position in time (onset) and an audio segment, the audio signal starting at
the onset position and ending at the following contiguous onset. In the further
processing, these events will serve two purposes. On one side, the events are
stored as an indexed sequence of audio fragments which will be used for the re-
synthesis in the end. On the other side, these events will be compared with each
other to generate a reduced score-like representation of the percussion patterns
to base a tempo analysis on (cf. Fig. 1 and Sec. 2.2).
We used the onset detector implemented in the MIR toolbox [9] that is based
only on the energy envelope, which proves to be sufficient for our purpose of
analyzing percussion sounds.
2.2 Symbolization
Feature Extraction. We have chosen to define the salient part of the event as
the first 200 ms after the onset position. This duration value is a compromise
between capturing enough information about the attack for representing the
sound reliably and still avoiding irrelevant parts at the end of the segment which
may be due to pauses or interfering other instruments. In the case that the
segment is shorter than 200 ms, we use the entire segment for the extraction
of the feature vector. Across the salient part of the event we calculate the Mel
Frequency Cepstral Coefficient (MFCC) vector frame-by-frame. Over all MFCCs
of the salient event part, we take the weighted mean, weighted by the RMS
energy of each frame. The frame rate is 100 frames per second, the FFT size is
512 samples, and the window size is 256 samples.
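A sketch of this feature extraction step (assuming librosa is available; the exact framing of the original implementation may differ, and the onset time and number of MFCCs are illustrative parameters):

```python
import numpy as np
import librosa

def event_feature(y, sr, onset_s, max_dur=0.2, n_mfcc=13):
    """RMS-weighted mean MFCC vector over (at most) the first 200 ms after an
    onset, using a 512-sample FFT, a 256-sample window and 100 frames/s."""
    hop = int(sr / 100)                              # 100 frames per second
    start = int(onset_s * sr)
    seg = y[start:start + int(max_dur * sr)]
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, win_length=256, hop_length=hop)
    rms = librosa.feature.rms(y=seg, frame_length=512, hop_length=hop)[0]
    n = min(mfcc.shape[1], len(rms))
    weights = rms[:n] / (rms[:n].sum() + 1e-10)      # weight frames by RMS energy
    return (mfcc[:, :n] * weights).sum(axis=1)       # one feature vector per event
```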
We used the single linkage algorithm to discover event clusters in this space
(cf. [6] for details). This algorithm recursively performs clustering in a bottom-
up manner. Points are grouped into clusters. Then clusters are merged with
additional points and clusters are merged with clusters into super clusters. The
distance between two clusters is defined as the shortest distance between two
points, each being in a different cluster, yielding a binary tree representation of
the point similarities (cf. Fig. 2). The leaf nodes correspond to single events. Each
node of the tree occurs at a certain height, representing the distance between
the two child nodes. Figure 2 (top) shows an example of a clustering tree of the
onset events of a sound sequence.
The height threshold controls the (number of) clusters. Clusters are generated
with inter-cluster distances higher than the height threshold. Two thresholds
lead to the same cluster configuration if and only if their values are both within
the range delimited by the previous lower node and the next upper node in
the tree. It is therefore evident that by changing the height threshold, we can
get as many different cluster configurations as the number of events we have
in the sequence. Each cluster configuration leads to a different symbol alphabet
size and therefore to a different symbol sequence representing the original audio
file. We will refer to those sequences as representation levels or simply levels.
These levels are implicitly ordered. On the leaf level at the bottom of the tree
we find the lowest inter-cluster distances, corresponding to a sequence with each
event being encoded by a unique symbol due to weak quantization. On the root
level on top of the tree we find the cluster configuration with the highest inter-
cluster distances, corresponding to a sequence with all events denoted by the
same symbol due to strong quantization. Given a particular level, we will refer
to the events denoted by the same symbol as the instances of that symbol. We do
not consider the implicit inheritance relationships between symbols of different
levels.
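A sketch of how the tree and the different representation levels can be obtained with SciPy (the threshold values in the usage comment are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def representation_levels(event_features, thresholds):
    """Build the single-linkage tree over the event feature vectors and, for
    each height threshold, cut the tree to obtain one symbol per event;
    every threshold yields one representation level."""
    tree = linkage(pdist(np.asarray(event_features)), method="single")
    return {t: fcluster(tree, t=t, criterion="distance") for t in thresholds}

# Example usage (illustrative threshold values):
# levels = representation_levels(features, thresholds=[0.5, 1.0, 2.0])
```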
Fig. 3. A continuous audio signal (top) is discretized via clustering yielding a sequence
of symbols (bottom). The colors inside the colored triangles denote the cluster of the
event, related to the type of sound, i.e. bass drum, hi-hat, or snare.
onsets given by this subsequence. This sequence can be seen as a set of points
on a time line. We are interested to quantify the degree of temporal regularity of
those onsets. Firstly, we compute the histogram1 of the time differences (CIOIH)
between all possible combinations of two onsets (middle Fig. 4). What we obtain
is a sort of harmonic series of peaks that are more or less prominent according
to the self-similarity of the sequence on different scales. Secondly, we compute
the autocorrelation ac(t) (where t is the time in seconds) of the CIOIH which, in
case of a regular sequence, has peaks at multiples of its tempo. Let t_{usp} be the
positive time value corresponding to its upper side peak. Given the sequence of
m onsets x = (x_1, ..., x_m), we define the regularity of the sequence of onsets x to be

Regularity(x) = \frac{ac(t_{usp})}{\frac{1}{t_{usp}} \int_0^{t_{usp}} ac(t)\,dt} \cdot \log(m)
This definition was motivated by the observation that the higher this value the
more equally the onsets are spaced in time. The logarithm of the number of
onsets was multiplied by the ratio to give more importance to symbols with
more instances.
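A sketch of this computation (the choice of the upper side peak as the first local maximum of the autocorrelation is an assumption of this sketch, and the denominator integral is approximated by the mean of the autocorrelation up to that lag):

```python
import numpy as np

def regularity(onsets, bin_s=0.1):
    """Regularity of an onset list: histogram of all pairwise inter-onset
    intervals (100 ms bins), its autocorrelation ac, and the ratio of
    ac(t_usp) to the mean of ac over [0, t_usp], multiplied by log(m)."""
    onsets = np.sort(np.asarray(onsets, dtype=float))
    m = len(onsets)
    if m < 3:
        return 0.0
    ioi = np.abs((onsets[:, None] - onsets[None, :])[np.triu_indices(m, k=1)])
    hist, _ = np.histogram(ioi, bins=np.arange(0.0, ioi.max() + 2 * bin_s, bin_s))
    ac = np.correlate(hist, hist, mode="full")[len(hist) - 1:]   # lags 0, 1, 2, ...
    peaks = [i for i in range(1, len(ac) - 1)
             if ac[i] >= ac[i - 1] and ac[i] >= ac[i + 1]]
    if not peaks:
        return 0.0
    usp = peaks[0]                                 # upper side peak (lag index)
    mean_ac = ac[:usp + 1].mean() + 1e-10          # ~ (1/t_usp) * integral of ac
    return float(ac[usp] / mean_ac * np.log(m))
```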
Fig. 4. The procedure applied for computing the regularity value of an onset sequence
(top) is outlined. Middle: the histogram of the complete IOI between onsets. Bottom:
the autocorrelation of the histogram is shown for a subrange of IOI with relevant peaks
marked.
Then we extended, for each level, the regularity concept to an overall regularity
of the level. This simply corresponds to the mean of the regularities for all the
appropriate symbols of the level. The regularity of the level is defined to be zero
in case there is no appropriate symbol.
1. We used a discretization of 100 ms for the onset bars.
After the regularity value has been computed for each level, we select the level
where the maximum regularity is reached. The resulting level will be referred to
as the regular level.
We also decided to keep the levels where we have a local maximum because
they generally refer to the levels where a partially regular interpretation of the
sequence is achieved. In the case where consecutive levels of a sequence share
the same regularity, only the one derived from the higher cluster distance
threshold is kept. Figure 5 shows the regularity of the sequence for different
levels.
Fig. 5. Sequence regularity for a range of cluster distance thresholds (x-axis). An ENST
audio excerpt was used for the analysis. The regularity reaches its global maximum
value in a central position. Towards the right, regularity increases and then remains
constant. The selected peaks are marked with red crosses implying a list of cluster
distance threshold values.
Tempo Alignment (Inter Beat Interval and Beat Phase). Once the skele-
ton subsequence is found, the inter beat interval is estimated with the procedure
explained in [4]. The tempo is detected considering the intervals between all
possible onset pairs of the sequence using a score voting criterion. This method
gives higher scores to the intervals that occur more often and that are related
by integer ratios to other occurring inter onset intervals.
Then the onsets of the skeleton subsequence are parsed in order to detect
a possible alignment of the grid to the sequence. We allow a tolerance of 6%
of the inter beat interval duration for the alignment of an onset to a
grid position. We chose the interpretation that aligns the highest number of
instances to the grid. After discarding the onsets that are not aligned we obtain
the preliminary skeleton grid. In Fig. 6 the procedure is visually explicated.
Fig. 7. The event sequence derived from a segmentation by onset detection is indi-
cated by triangles. The vertical lines show the division of the sequence into blocks of
homogeneous tempo. The red solid lines represent the beat position (as obtained by
the skeleton subsequence). The other black lines (either dashed if aligned to a detected
onset or dotted if no close onset is found) represent the subdivisions of the measure
into four blocks.
Because of the tolerance used for building such a grid, the effective measure
duration can sometimes be slightly longer or slightly shorter. This implements
the idea that the grid should be elastic in the sense
that, up to a certain degree, it adapts to the (expressive) timing variations of
the actual sequence.
The skeleton grid captures a part of the complete list of onsets, but we would
like to build a grid where most of the onsets are aligned. Therefore, starting from
the skeleton grid, the intermediate point between every two subsequent beats is
found and aligned with an onset (if it exists in a tolerance region otherwise a
place-holding onset is added). The procedure is recursively repeated until at least
80% of the onsets are aligned to a grid position or the number of created onsets
exceeds the total number of onsets. In Fig. 7, an example is presented along
with the resulting grid, where the skeleton grid, its aligned subdivisions, and the
non-aligned subdivisions are indicated by different line markers.
Note that, for the sake of simplicity, our approach assumes that the metrical
structure is binary. This may cause the sequence to be split erroneously.
However, we will see in a ternary tempo example that this is not a limiting factor
for the generation, because the statistical representation compensates for it to
some extent, even if less variable generations are achieved. A more general
approach could be implemented with small modifications.
The final grid is made of blocks of time of almost equal duration that can
contain none, one, or more onset events. It is important that the sequence given
to the statistical model is almost homogeneous in time so that a certain number
of blocks corresponds to a defined time duration.
We used the following rules to assign a symbol to a block (cf. Fig. 7):
– blocks starting on an aligned onset are denoted by the symbol of the aligned
onset,
– blocks starting on a non-aligned grid position are denoted by the symbol of
the previous block.
Finally, a metrical phase value is assigned to each block describing the number of
grid positions passed after the last beat position (corresponding to the metrical
position of the block). For each representation level the new representation of
the sequence will be the Cartesian product of the instrument symbol and the
phase.
– Set a maximal context length l̂ and compute the list of indices for each level
using the appropriate suffix tree. Store the achieved length of the context
for each level.
– Count the number of indices provided by each level. Select only the levels
that provide less than 75% of the total number of blocks.
– Among these level candidates, select only the ones that have the longest
context.
– Merge all the continuation indices across the selected levels and remove the
trivial continuation (the next onset).
– In case there is no level providing such a context and the current block is
not the last, use the next block as a continuation.
– Otherwise, decide randomly with probability p whether to select the next
block or rather to generate the actual continuation by selecting randomly
between the merged indices.
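A simplified sketch of this selection procedure (here the per-level context search is done by brute force rather than with suffix trees, and the 75% bound, context length and probability are parameters of the sketch):

```python
import random

def continuations(level_seq, history, max_context):
    """For one level: find the longest suffix of the generated history (up to
    max_context symbols) that occurs in the sequence, and return the indices
    of the blocks that follow each occurrence plus the context length used."""
    for n in range(min(max_context, len(history)), 0, -1):
        ctx = tuple(history[-n:])
        idx = [i + n for i in range(len(level_seq) - n)
               if tuple(level_seq[i:i + n]) == ctx]
        if idx:
            return idx, n
    return [], 0

def choose_next(levels, generated, max_context=8, p_next=0.2):
    """Pick the index of the next block given the symbol sequences of all
    levels and the list of already generated block indices."""
    n_blocks = len(levels[0])
    pos = generated[-1]
    per_level = [continuations(seq, [seq[i] for i in generated], max_context)
                 for seq in levels]
    # Keep levels that are selective enough (fewer than 75% of all blocks).
    usable = [(idx, n) for idx, n in per_level
              if idx and len(idx) < 0.75 * n_blocks]
    if not usable:
        return min(pos + 1, n_blocks - 1)          # no usable context: step forward
    longest = max(n for _, n in usable)            # keep only longest-context levels
    merged = {i for idx, n in usable if n == longest for i in idx if i < n_blocks}
    merged.discard(pos + 1)                        # drop the trivial continuation
    if not merged or random.random() < p_next:
        return min(pos + 1, n_blocks - 1)
    return random.choice(sorted(merged))
```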
We tested the system on two audio databases. The first one is the ENST
database (see [7]) that provided a collection of around forty drum recording
examples. For a descriptive evaluation, we asked two professional percussion-
ists to judge several examples of generations as if they were performances of a
student. Moreover, we asked one of them to record beat boxing excerpts trying
to push the system to the limits of complexity and to critically assess the se-
quences that the system had generated from these recordings. The evaluations
of the generations created from the ENST examples revealed that the style of
the original had been maintained and that the generations had a high degree of
interestingness [10].
Some examples are available on the website [1] along with graphical anima-
tions visualizing the analysis process. In each video, we see the original sound
fragment and the generation derived from it. The horizontal axis corresponds
to the time in seconds and the vertical axis to the clustering quantization res-
olution. Each video shows an animated graphical representation in which each
block is represented by a triangle. At each moment, the context and the currently
played block is represented by enlarged and colored triangles.
In the first part of the video, the original sound is played and the animation
shows the extracted block representation. The currently played block is repre-
sented by an enlarged colored triangle and highlighted by a vertical dashed red
line. The other colored triangles highlight all blocks from the starting point of
the bar up to the current block. In the second part of the video, only the skele-
ton subsequence is played. The sequence on top is derived from applying the
largest clustering threshold (smallest number of clusters) and the one on the
bottom corresponds to the lowest clustering threshold (highest number of clus-
ters). In the final part of the video, the generation is shown. The colored triangles
represent the current block and the current context. The size of the colored
triangles decreases monotonically from the current block backwards displaying
the past time context window considered by the system. The colored triangles
appear only on the levels selected by the generation strategy.
In Figure 8, we see an example of successive states of the generation. The
levels used by the generator to compute the continuation and the context are
highlighted showing colored triangles that decrease in size from the largest, cor-
responding to the current block, to the smallest that is the furthest past context
block considered by the system. In Frame I, the generation starts with block
no 4, belonging to the event class indicated by light blue. In the beginning, no
previous context is considered for the generation. In Frame II, a successive block
no 11 of the green event class has been selected using all five levels α - ε and a
context history of length 1 just consisting of block no 4 of the light blue event
class. Note that the context given by only one light blue block matches the con-
tinuation no 11, since the previous block (no 10) is also denoted by light blue at
all the five levels. In Frame III, the context is the bi-gram of the event classes
light blue (no 4) and green (no 11). Only level α is selected since at all other
levels the bi-gram that corresponds to the colors light blue and green appears
only once. However, at level α the system finds three matches (blocks no 6, 10
and 12) and randomly selects no 10. In Frame IV, the levels differ in the length
of the maximal past context. At level α one, but only one, match (no 11) is found
for the 3-gram light blue - green - light blue, and thus this level is discarded. At
levels β, γ and δ, no matches for 3-grams are found, but all these levels include
2 matches (blocks no 5 and 9) for the bi-gram (green - light blue). At level ε, no
match is found for a bi-gram either, but 3 occurrences of the light blue triangle
are found.
5 Discussion
Our system effectively generates sequences respecting the structure and the
tempo of the original sound fragment for medium to high complexity rhythmic
patterns.
A descriptive evaluation of a professional percussionist confirmed that the
metrical structure is correctly managed and that the statistical representation
generates musically meaningful sequences. He noticed explicitly that the drum
fills (short musical passages which help to sustain the listener’s attention during
a break between the phrases) were handled adequately by the system.
The percussionist's criticism was directed at the lack of dynamics, agogics,
and musically meaningful long-term phrasing, which we did not address in our
approach.
Part of those features could be addressed in the future by extending the system
to the analysis of non-binary meter. Achieving musically sensible dynamics and
agogics (rallentando, accelerando, rubato, ...) in the generated musical continuation,
for example by extrapolation [14], remains a challenge for future work.
Fig. 8. Nine successive frames of the generation. The red vertical dashed line marks the currently played event. In each frame, the largest
colored triangle denotes the last played event that influences the generation of the next event. The size of the triangles decreases going
back in time. Only for the selected levels the triangles are enlarged. We can see how the length of the context as well as the number of
selected levels dynamically change during the generation. Cf. Section 4 for a detailed discussion of this figure.
Acknowledgments
Many thanks to Panos Papiotis for his patience during lengthy recording sessions
and for providing us with beat boxing examples, the evaluation feedback, and
inspiring comments. Thanks a lot to Ricard Marxer for his helpful support. The
first author (MM) expresses his gratitude to Mirko Degli Esposti and Anna Rita
Addessi for their support and for motivating this work. The second author (HP)
was supported by a Juan de la Cierva scholarship of the Spanish Ministry of
Science and Innovation.
References
1. (December 2010), www.youtube.com/user/audiocontinuation
2. Buhlmann, P., Wyner, A.J.: Variable length Markov chains. Annals of Statistics 27,
480–513 (1999)
3. Cope, D.: Virtual Music: Computer Synthesis of Musical Style. MIT Press, Cam-
bridge (2004)
4. Dixon, S.: Automatic extraction of tempo and beat from expressive performances.
Journal of New Music Research 30(1), 39–58 (2001)
5. Dubnov, S., Assayag, G., Cont, A.: Audio oracle: A new algorithm for fast learning
of audio structures. In: Proceedings of International Computer Music Conference
(ICMC), pp. 224–228 (2007)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. Wiley, Chichester
(2001)
7. Gillet, O., Richard, G.: Enst-drums: an extensive audio-visual database for drum
signals processing. In: ISMIR, pp. 156–159 (2006)
8. Hazan, A., Marxer, R., Brossier, P., Purwins, H., Herrera, P., Serra, X.: What/when
causal expectation modelling applied to audio signals. Connection Science 21, 119–
143 (2009)
9. Lartillot, O., Toiviainen, P., Eerola, T.: A Matlab toolbox for music information
retrieval. In: Annual Conference of the German Classification Society (2007)
10. Marchini, M.: Unsupervised Generation of Percussion Sequences from a Sound
Example. Master’s thesis (2010)
11. Marchini, M., Purwins, H.: Unsupervised generation of percussion sound sequences
from a sound example. In: Sound and Music Computing Conference (2010)
12. Marxer, R., Purwins, H.: Unsupervised incremental learning and prediction of au-
dio signals. In: Proceedings of 20th International Symposium on Music Acoustics
(2010)
13. Pachet, F.: The continuator: Musical interaction with style. In: Proceedings of
ICMC, pp. 211–218. ICMA (2002)
14. Purwins, H., Holonowicz, P., Herrera, P.: Polynomial extrapolation for prediction of
surprise based on loudness - a preliminary study. In: Sound and Music Computing
Conference, Porto (2009)
15. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic
automata with variable memory length. Mach. Learn. 25(2-3), 117–149 (1996)
Identifying Attack Articulations
in Classical Guitar
Tan Hakan Özaslan, Enric Guaus, Eric Palacios, and Josep Lluis Arcos
1 Introduction
Throughout this paper, we refer to the hand that plucks the strings as the right hand and to the hand that presses the
frets as the left hand.
As a first stage of our research, we are developing a tool able to automatically
identify, from a recording, the use of guitar articulations. According to Norton
[22], guitar articulations can be divided into three main groups related to the
place of the sound where they act: attack, sustain, and release articulations.
In this research we are focusing on the identification of attack articulations
such as legatos and glissandos. Specifically, we present an automatic detection
and classification system that uses audio recordings as input. We can divide
our system into two main modules (Figure 1): extraction and classification. The
extraction module determines the expressive articulation regions of a classical
guitar recording, whereas the classification module analyzes these regions and
determines the kind of articulation (legato or glissando).
In both legato and glissando, the left hand is involved in the creation of the note
onset. In the case of ascending legato, after plucking the string with the right
hand, one of the fingers of the left hand (not already used for pressing one of the
frets), presses a fret causing another note onset. Descending legato is performed
by plucking the string with a left-hand finger that was previously used to play
a note (i.e. pressing a fret).
The case of glissando is similar but this time after plucking one of the strings
with the right hand, the left hand finger that is pressing the string is slipped to
another fret also generating another note onset.
When playing legato or glissando on the guitar, it is common for the performer to
play more notes within a beat than the stated timing, enriching the music that
is played. A pronounced legato and glissando can easily be distinguished from each
other by ear. However, in the context of a musical phrase, where the legato and
glissando are not isolated, it is hard to differentiate between these two expressive
articulations.
The structure of the paper is as follows: Section 2 briefly describes the current
state of the art of guitar analysis studies. Section 3 describes our methodology
for articulation determination and classification. Section 4 focuses on the exper-
iments conducted to evaluate our approach. Last section, Section 5, summarizes
current results and presents the next research steps.
2 Related Work
The guitar is one of the most popular instruments in Western music. Thus, most
music genres include the guitar. Although plucked instruments and guitar
synthesis have been studied extensively (see [9,22]), the analysis of expressive
articulations from real guitar recordings has not been fully tackled. This analysis
is complex because guitar is an instrument with a rich repertoire of expressive
articulations and because, when playing guitar melodies, several strings may be
vibrating at the same time. Moreover, even the synthesis of a
single tone is a complex subject [9].
Expressive studies go back to the early twentieth century. In 1913, Johnstone
[15] analyzed piano performers. Johnstone's analysis can be considered one
of the first studies focusing on musical expressivity. Advances in audio processing
techniques have made it possible to analyze audio recordings at a finer level
(see [12] for an overview). Up to now, there are several studies focused on the
analysis of expressivity of different instruments. Although the instruments ana-
lyzed differ, most of them focus on analyzing monophonic or single instrument
recordings.
For instance, Mantaras et al. [20] presented a survey of computer music sys-
tems based on Artificial Intelligence techniques. Examples of AI-based systems
are SaxEx [1] and TempoExpress [13]. SaxEx is a case-based reasoning system
that generates expressive jazz saxophone melodies from recorded examples of
human performances. More recently, TempoExpress performs tempo transfor-
mations of audio recordings taking into account the expressive characteristics of
a performance and using a CBR approach.
Regarding guitar analysis, an interesting line of research comes from Stanford
University. Traube [28] estimated the plucking point on a guitar string by using a
frequency-domain technique applied to acoustically recorded signals. The pluck-
ing point of a guitar string affects the sound envelope and influences the timbral
characteristics of notes. For instance, plucking close to the guitar hole produces
more mellow and sustained sounds, whereas plucking near the bridge (end of the
guitar body) produces sharper and less sustained sounds. Traube also proposed
an original method to detect the fingering point, based on the plucking point
information.
In another interesting paper, Lee [17] proposes a new method for extraction of
the excitation point of an acoustic guitar signal. Before explaining the method,
three state of the art techniques are examined in order to compare with the new
one. The techniques analyzed are matrix pencil inverse-filtering, sinusoids plus
noise inverse-filtering, and magnitude spectrum smoothing. After describing and
comparing these three techniques, the author proposes a new method, statistical
spectral interpolation, for excitation signal extraction.
Although fingering studies are not directly related with expressivity, their re-
sults may contribute to clarify and/or constrain the use of left-hand expressive
articulations. Hank Heijink and Ruud G. J. Meulenbroek [14] performed a be-
havioral study about the complexity of the left hand fingering of classical guitar.
Different audio and camera recordings of six professional guitarists playing the
same song were used to find optimal places and fingerings for the notes. Several
constraints were introduced to calculate cost functions such as minimization of
jerk, torque change, muscle-tension change, work, energy and neuromotor vari-
ance. As a result of the study, they found a significant effect on timing.
In another interesting study, [25] investigates the optimal fingering position
for a given set of notes. Their method, path difference learning, uses tablatures
and AI techniques to obtain fingering positions and transitions. Radicioni
et al. [24] also worked on finding the proper fingering position and transitions.
Specifically, they calculated the weights of the finger transitions between finger
positions by using the weights of Heijink [14]. Burns and Wanderley [4] proposed
a method to visually detect and recognize fingering gestures of the left hand of
a guitarist by using an affordable camera.
Unlike the general trend in the literature, Trajano [27] investigated right-hand
fingering. Although he analyzed the right hand, his approach has similar-
ities with left-hand studies. In his article, Trajano uses his own definitions and
cost functions to calculate the optimal selection of right hand fingers.
The first step when analyzing guitar expressivity is to identify and characterize
the way notes are played, i.e. guitar articulations. The analysis of expressive
articulations has been previously performed with image analysis techniques. Last
but not least, one of the few studies focusing on guitar expressivity is
the PhD thesis of Norton [22]. In his dissertation, Norton proposed the use of a
motion capture system by PhaseSpace Inc. to analyze guitar articulations.
3 Methodology
Articulation refers to how the pieces of something are joined together. In music,
these pieces are the notes and the different ways of executing them are called
articulations. In this paper we propose a new system that is able to determine
and classify two expressive articulations from audio files. For this purpose we
have two main modules: the extraction module and the classification module (see
Figure 1). In the extraction module, we determine the sound segments where
expressive articulations are present. The purpose of this module is to classify
audio regions as expressive articulations or not. Next, the classification module,
analyzes the regions that were identified as candidates of expressive articulations
by the extraction module, and label them as legato or glissando.
3.1 Extraction
The goal of the extraction module is to find the places where a performer played
expressive articulations. To that purpose, we analyzed a recording using several
audio analysis algorithms, and combined the information obtained from them to
take a decision.
Our approach is based on first determining the note onsets caused when pluck-
ing the strings. Next, a more fine grained analysis is performed inside the regions
delimited by two plucking onsets to determine whether an articulation may be
present. A simple representation diagram of extraction module is shown in
Figure 2.
For the analysis we used Aubio [2]. Aubio is a library designed for the anno-
tation of audio signals. The Aubio library includes four main applications: aubioonset,
aubionotes, aubiocut, and aubiopitch. Each application gives us the chance of
trying different algorithms and also tuning several other parameters. In the cur-
rent prototype we are using aubioonset for our plucking detection sub-module
and aubionotes for our pitch detection sub-module.
At the end we combine the outputs from both sub-modules and decide whether
there is an expressive articulation or not. In the next two sections, the plucking
detection sub-module and the pitch detection sub-module is described. Finally,
we explain how we combine the information provided by these two sub-modules
to determine the existence of expressive articulations.
Plucking Detection. Our first task is to determine the onsets caused by the
plucking hand. As we stated before, guitar performers can apply different artic-
ulations by using both of their hands. However, the kind of articulations that
we are investigating (legatos and glissandos) are performed by the left hand.
Although they can cause onsets, these onsets are not as powerful in terms of
both energy and harmonicity [28]. Therefore, we need an onset determination
algorithm suited to this specific characteristic.
The High Frequency Content (HFC) measure is taken across a signal spectrum
and characterizes the amount of high-frequency content in the signal: the
magnitudes of the spectral bins are added together, each magnitude weighted
by its bin position [21]. As Brossier stated, HFC
is effective with percussive onsets but less successful determining non-percussive
and legato phrases [3]. As right-hand onsets are more percussive than left-hand
onsets, HFC was the strongest candidate detection algorithm for right-hand
onsets. HFC is sensitive to abrupt onsets but not overly sensitive to the
changes of fundamental frequency caused by the left hand. This is the main
reason why we chose HFC to measure the changes in the harmonic content of
the signal.
The aubioonset tool gave us the opportunity to tune the peak-picking threshold,
which we tested on a set of hand-labeled recordings including both articulated
and non-articulated notes. We used this set as our ground truth and tuned our
parameter values accordingly, settling on a peak-picking threshold of 1.7 and a
silence threshold of −95 dB.
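For illustration, the following minimal Python sketch reproduces this plucking-detection step using aubio's Python bindings rather than the aubioonset command-line tool used in the paper. Only the two thresholds above are taken from the text; the window size, hop size, file path and function name are assumptions.

import aubio

def detect_plucking_onsets(path, win_size=2048, hop_size=512):
    # Read the recording at its native sample rate (0 = keep the file's rate).
    src = aubio.source(path, 0, hop_size)
    onset = aubio.onset("hfc", win_size, hop_size, src.samplerate)
    onset.set_threshold(1.7)       # peak-picking threshold reported above
    onset.set_silence(-95.0)       # silence threshold in dB
    onsets = []
    while True:
        samples, read = src()
        if onset(samples):                     # non-zero when an onset is detected
            onsets.append(onset.get_last_s())  # onset time in seconds
        if read < hop_size:
            break
    return onsets

# plucking_onsets = detect_plucking_onsets("recording.wav")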
An example of the onsets proposed by HFC is shown in Figure 3. In the exemplified
recording, five plucking onsets (caused by the plucking hand) are detected, shown
with vertical lines. Expressive articulations are present between some pairs of
detected onsets. As the figure shows, HFC succeeds in that it only detects the
onsets caused by the right hand.
Next, each portion between two plucking onsets is analyzed individually.
Specifically, we are interested in determining two points: the end of the attack
and the start of the release. From experimental measures, the attack end position
is taken to be 10 ms after the amplitude reaches its local maximum. The release
start position is taken to be the final point where the local amplitude is equal
to or greater than 3 percent of the local maximum. For example, Figure 4 zooms
in on the first portion of Figure 3. The first and last vertical lines are the
plucking onsets identified by the HFC algorithm; the first dashed line marks
where the attack finishes and the second dashed line marks where the release
starts.
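A small sketch of how these two points can be located inside one inter-onset region, under the heuristics just described; the envelope computation and the names are illustrative, not the authors' code.

import numpy as np

def attack_release_points(samples, sr):
    env = np.abs(samples)                       # crude amplitude envelope
    peak_idx = int(np.argmax(env))              # local maximum of the region
    attack_end = peak_idx + int(0.010 * sr)     # 10 ms after the maximum
    above = np.nonzero(env >= 0.03 * env[peak_idx])[0]  # points >= 3% of the maximum
    release_start = int(above[-1])              # last point above the 3% level
    return attack_end, release_start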
Pitch Detection. Our second task is to analyze the sound fragment between
two onsets. Since we know the onset times of the plucking hand, what we require is
another peak detection algorithm with a lower threshold, in order to capture the
changes in fundamental frequency. Specifically, if the fundamental frequency is not
constant between two onsets, we consider the possibility of an expressive
articulation to be high.
In the pitch detection module, i.e., to extract onsets and their corresponding
fundamental frequencies, we used aubionotes. In the Aubio library, both the onset
detection and the fundamental frequency estimation algorithms can be chosen from
several alternatives. For onset detection, this time we need a more sensitive
algorithm than the one used for right-hand onset detection. Thus, we used the
complex domain algorithm [8] to determine the peaks and YIN [6] for the
fundamental frequency estimation. Complex domain onset detection combines a
phase-based and an energy-based approach.
We used a window size of 2048 bins, a hop size of 512 bins, a peak-picking
threshold of 1 and a silence threshold of −95 dB. With these parameters we
obtained an output like the one shown in Figure 5. As the figure shows, the first
results were noisier than expected. There were noisy parts, especially at the
beginning of the notes, which generated false-positive peaks. For instance, in
Figure 5 many false-positive note onsets are detected in the interval from 0 to
0.2 seconds.
A careful analysis of the results showed that the false-positive peaks were
located around the frequency borders of the notes. Therefore, we propose a
lightweight solution to the problem: applying a chroma filtering to the regions
around the borders of the complex domain peaks. As shown in Figure 6, after
applying the chroma conversion the results are drastically improved.
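The sketch below illustrates the idea: YIN fundamental-frequency estimates are quantised to the nearest semitone, which removes the spurious jumps around note-frequency borders. It is an assumed reconstruction (the paper uses aubionotes, which pairs complex-domain onsets with YIN); names and any parameters not quoted above are illustrative.

import aubio
import numpy as np

def quantised_pitch_track(path, win_size=2048, hop_size=512):
    src = aubio.source(path, 0, hop_size)
    pitch_o = aubio.pitch("yin", win_size, hop_size, src.samplerate)
    pitch_o.set_unit("Hz")
    pitch_o.set_silence(-95.0)
    track = []
    while True:
        samples, read = src()
        f0 = pitch_o(samples)[0]
        if f0 > 0:
            midi = round(69 + 12 * np.log2(f0 / 440.0))   # nearest semitone
            f0 = 440.0 * 2.0 ** ((midi - 69) / 12.0)      # quantised frequency
        track.append(f0)
        if read < hop_size:
            break
    return np.array(track)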
Next, we analyzed the fragments between two onsets based on the segments
provided by the plucking detection module. Specifically, we analyzed the sound
fragment between the attack ending point and the release starting point, because
the noisiest part of a signal is the attack and the release contains unnecessary
information for pitch detection [7]. Therefore, for our analysis we take the
fragment between the attack and release parts, where pitch information is
relatively constant.
Figure 7 shows the fundamental frequency values and the right-hand onsets; the
x-axis represents time-domain bins and the y-axis frequency. The vertical lines
depict the attack and release points respectively. In the middle there is a
change in frequency that was not identified as an onset by the first module.
Although this may seem like an error, it is in fact a successful result for our
model: this phrase contains a glissando, a left-hand articulation, which was not
identified as an onset by the plucking detection module (HFC algorithm) but was
identified by the pitch detection module (complex domain algorithm). The output
of the pitch detection module for this recording is shown in Table 1.
Analysis and Annotation. After obtaining the results from the plucking detection
and pitch detection modules, the goal of the analysis and annotation module
is to determine the candidate expressive articulations. Specifically, from the
results of the pitch detection module we analyze the differences of fundamental
frequencies in the segments between the attack and release parts (provided by the
plucking detection module). For instance, in Table 1 the light gray values
represent the attack and release parts, which we did not take into account when
applying our decision algorithm.
The differences of fundamental frequencies are calculated by subtracting from
each bin the value of its preceding bin. Thus, when the fragment we are examining
is a non-articulated fragment, this operation returns 0 for all bins. In contrast,
in expressively articulated fragments some peaks will arise (see Figure 8 for an
example).
In Figure 8 there is only one peak, but in other recordings several consecutive
peaks may arise. The explanation is that the left hand also causes an onset, i.e.,
it also generates a transient part. As a result of this transient, more than one
change in fundamental frequency may be present. If those changes or peaks are
close to each other, we consider them a single peak.
We define this closeness with a pre-determined consecutiveness threshold.
Specifically, if the maximum distance between these peaks is 5 bins, we consider
them a single peak.
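A compact sketch of this decision step, using the 5-bin consecutiveness threshold stated above; the function and variable names are illustrative.

import numpy as np

def articulation_candidate(f0_segment, consecutiveness=5):
    # Differences of fundamental frequency between consecutive bins.
    diffs = np.abs(np.diff(np.asarray(f0_segment, dtype=float)))
    peak_bins = np.nonzero(diffs > 0)[0]     # all zeros => non-articulated fragment
    groups = []
    for b in peak_bins:
        if groups and b - groups[-1][-1] <= consecutiveness:
            groups[-1].append(int(b))        # close peaks merge into a single peak
        else:
            groups.append([int(b)])
    return len(groups) > 0, groups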
3.2 Classification
The classification module analyzes the regions identified by the extraction
module and labels them as legato or glissando. A diagram of the classification
module is shown in Figure 9. In this section we first describe how we selected an
appropriate descriptor to analyze the behavior of legato and glissando. Then we
explain the two new components, Models Builder and Detection.
Selecting a Descriptor. After extracting the regions containing candidate
expressive articulations, the next step was to analyze them. Because different
expressive articulations (legato vs. glissando) should present different
characteristics in terms of changes in amplitude, aperiodicity, or pitch [22], we
focused the analysis on comparing these deviations.
Specifically, we built representations of these three features (amplitude,
aperiodicity, and pitch). The representations helped us to compare data of
different lengths and densities. As stated above, we are mostly interested in
changes: changes in high frequency content, changes in fundamental frequency,
changes in amplitude, etc. Therefore, we examined the peaks in the data, because
peaks are the points where changes occur.
As an example, Figure 10 shows, from top to bottom, the amplitude evolution,
pitch evolution, and changes in aperiodicity for both legato and glissando. As the
figure shows for both examples, the changes in pitch are similar, whereas the
changes in amplitude and aperiodicity present a characteristic slope.
Thus, as a first step we concentrated on determining which descriptor could
be used. To make this decision, we built models for both aperiodicity and amplitude.
Fig. 10. From top to bottom, representations of amplitude, pitch and aperiodicity of
the examined regions
10000 to 1000) and round them, so they become 146, 146, 147, 150 and 150.
As shown, we have two peaks at 146 and two at 150. To resolve this duplicity,
we keep the one with the highest peak value. After collecting and scaling the
peak positions, the peaks are linearly connected. As shown in Figure 13, the
obtained graph is an approximation of the graph shown in Figure 12b. The linear
approximation helps the system avoid consecutive small tips and dips.
In our case all the recordings were performed at 60 bpm and all the notes in the
recordings are eighth notes; that is, each note lasts half a second and each legato
or glissando portion lasts one second. We recorded at a sampling rate of 44100 Hz
and analyzed with a hop size of 32 samples, i.e., 44100/32 ≈ 1378 analysis bins for
each one-second portion. This was therefore our upper limit. For the sake of
simplicity, we scaled our x-axis to 1000 bins.
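A minimal sketch of this envelope approximation: peak positions are scaled to the common 1000-bin axis, the highest peak is kept when scaled positions collide, and the surviving peaks are linearly connected. All names are illustrative.

import numpy as np

def envelope_approximation(peak_positions, peak_values, source_length, target_length=1000):
    scaled = np.round(np.asarray(peak_positions, float) * target_length / source_length).astype(int)
    best = {}
    for pos, val in zip(scaled, peak_values):
        if pos not in best or val > best[pos]:      # resolve duplicate positions
            best[pos] = val
    xs = sorted(best)
    if not xs:
        return np.zeros(target_length)
    ys = [best[x] for x in xs]
    grid = np.arange(target_length)
    return np.interp(grid, xs, ys)                  # linear connection between peaks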
Fig. 15. Final envelope approximation of peak histograms of legato and glissando
training sets
This time we used a threshold of 30 percent and collected the peaks with
amplitude values above this threshold. Notice that this threshold is different
from the one used in the envelope approximation. Then, we used histograms to
compute the density of the peak locations; Figure 14 shows the resulting
histograms.
After constructing the histograms, as shown in Figure 14, we used our envelope
approximation method to construct the envelopes of the legato and glissando
histogram models (see Figure 15).
2. SAX: Symbolic Aggregate Approximation. Although the histogram
envelope approximations of legato and glissando in Figure 15 are close to what we
need, they still include noisy sections. Rather than these abrupt changes (noise),
we are interested in a more general representation that reflects the changes more
smoothly. SAX (Symbolic Aggregate Approximation) [18] is a symbolic
representation used in time series analysis that provides dimensionality reduction
while preserving the properties of the curves. Moreover, the SAX representation
makes distance measurements easier. We therefore applied the SAX representation
to the histogram envelope approximations (a compact sketch is given below).
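The sketch below shows the essence of SAX as used here: a piecewise aggregate approximation (PAA) of the z-normalised envelope followed by discretisation against Gaussian breakpoints. The breakpoint table, the default alphabet size and the mapping of the paper's step size onto the SAX word length are assumptions for illustration only.

import numpy as np

# Standard SAX breakpoints for alphabet sizes 3-5.
BREAKPOINTS = {3: [-0.43, 0.43],
               4: [-0.67, 0.0, 0.67],
               5: [-0.84, -0.25, 0.25, 0.84]}

def sax(series, word_length, alphabet_size=4):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                        # z-normalise
    paa = [seg.mean() for seg in np.array_split(x, word_length)]  # PAA reduction
    cuts = BREAKPOINTS[alphabet_size]
    return "".join(chr(ord("a") + int(np.searchsorted(cuts, v))) for v in paa)

Two SAX words can then be compared with a simple symbol-wise distance.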
4 Experiments
The goal of the experiments was to test the performance of our model. Since the
different modules work independently of each other, we tested the Extraction and
Classification modules separately. After these separate studies, we combined the
results to assess the overall performance of the proposed system.
4.1 Recordings
Each scale contains 24 ascending and 24 descending notes. Each exercise contains
12 expressive articulations (the ones connected with an arc). Since we repeated
the exercise at three different positions, we obtained 36 legato and 36 glissando
examples. Notice that we also made recordings with a neutral articulation
(neither legatos nor glissandos). We presented all 72 examples to our system.
As a preliminary test with more realistic recordings, we also recorded a small
set of 5-6 note phrases. They include different articulations in random places (see
Figure 20). As shown in Table 3, each phrase includes between zero and two
expressive articulations. For instance, Phrase 3 (see Figure 20c) does not contain
any expressive articulation, whereas Phrase 4 (see Figure 20d) contains the same
notes as Phrase 3 but includes two expressive articulations: first a legato and
then an appoggiatura.
First, we analyzed the accuracy of the extraction module in identifying regions
with legatos. The hypothesis was that legatos are the easiest articulations to
detect because they are composed of two long notes. Next, we analyzed the
accuracy in identifying regions with glissandos; here the first note (the
glissando) has a short duration and may be confused with the attack.
We compared the output results with our ground-truth annotations. The output was
the number of expressive articulations detected in the sound fragment.
From the experiments (see Table 2), several conclusions can be drawn. First, as
expected, legatos are easier to detect than glissandos. Second, for non-steel
strings the melodic direction does not affect performance. Regarding steel
strings, descending legatos are more difficult to detect than ascending legatos
(90% versus 70%). This result is not surprising because the plucking action of
the left-hand fingers in descending legatos is somewhat similar to a right-hand
plucking. This difference does not appear in glissandos because the finger
movement is the same in both directions.
After testing the Extraction module, we used the same audio files (this time
only the legato and glissando examples) to test our Classification module. As
explained in Section 3.2, we performed experiments with different step sizes for
the SAX representation. From the results reported in Table 4, we observe that a
step size of 5 is the most appropriate setting. This result confirms that a higher
resolution when discretizing is not required and shows that the SAX representation
provides a powerful technique for summarizing the information about changes.
The overall performance for legato identification was 83.3% and for glissando
identification 80.5%. Notice that identification of ascending legato reached 100%
accuracy whereas descending legato achieved only 66.6%. Regarding glissando,
there was no significant difference between ascending and descending accuracy
(83.3% versus 77.7%). Finally, when considering the string type, the results
showed similar accuracy on nylon and metallic strings.
After testing the main modules separately, we studied the performance of the
whole system using the same recordings. Since a step size of 5 gave the best
results in our previous experiments, we ran these experiments only with a step
size of 5.
Since both the extraction module and the classification module introduce errors,
the combined results show a lower accuracy (see Table 5).
Recordings                   Step Size 5   Step Size 10
Ascending Legato             100.0 %       100.0 %
Descending Legato             66.6 %        72.2 %
Ascending Glissando           83.3 %        61.1 %
Descending Glissando          77.7 %        77.7 %
Legato Nylon Strings          80.0 %        86.6 %
Legato Metallic Strings       86.6 %        85.6 %
Glissando Nylon Strings       83.3 %        61.1 %
Glissando Metallic Strings    77.7 %        77.7 %
Recordings                   Accuracy
Ascending Legato              85.0 %
Descending Legato             53.6 %
Ascending Glissando           58.3 %
Descending Glissando          54.4 %
Legato Nylon Strings          68.0 %
Legato Metallic Strings       69.3 %
Glissando Nylon Strings       58.3 %
Glissando Metallic Strings    54.4 %
5 Conclusions
In this paper we presented a system that combines several state-of-the-art
analysis algorithms to identify left-hand articulations such as legatos and
glissandos. Specifically, our proposal uses HFC for plucking detection and the
complex domain and YIN algorithms for pitch detection. Combining the data coming
from these different sources, we developed a first decision mechanism, the
Extraction module, to identify regions where attack articulations may be present.
Next, the Classification module analyzes the regions annotated by the Extraction
module and tries to determine the articulation type. Our proposal is to use
aperiodicity
Acknowledgments
This work was partially funded by NEXT-CBR (TIN2009-13692-C03-01), IL4LTS
(CSIC-200450E557) and by the Generalitat de Catalunya under the grant 2009-
SGR-1434. Tan Hakan Özaslan is a PhD student of the Doctoral Program in
Information, Communication, and Audiovisual Technologies of the Universitat
Pompeu Fabra. We also want to thank the professional guitarist Mehmet Ali
Yıldırım for his contribution to the recordings.
References
1. Arcos, J.L., López de Mántaras, R., Serra, X.: Saxex: a case-based reasoning sys-
tem for generating expressive musical performances. Journal of New Music Re-
search 27(3), 194–210 (1998)
2. Brossier, P.: Automatic annotation of musical audio for interactive systems. Ph.D.
thesis, Centre for Digital music, Queen Mary University of London (2006)
3. Brossier, P., Bello, J.P., Plumbley, M.D.: Real-time temporal segmentation of note
objects in music signals. In: Proceedings of the International Computer Music
Conference, ICMC 2004 (November 2004)
4. Burns, A., Wanderley, M.: Visual methods for the retrieval of guitarist fingering.
In: NIME 2006: Proceedings of the 2006 conference on New interfaces for musical
expression, Paris, pp. 196–199 (June 2006)
5. Carlevaro, A.: Serie didactica para guitarra. vol. 4. Barry Editorial (1974)
6. de Cheveigné, A., Kawahara, H.: Yin, a fundamental frequency estimator for speech
and music. The Journal of the Acoustical Society of America 111(4), 1917–1930
(2002)
7. Dodge, C., Jerse, T.A.: Computer Music: Synthesis, Composition, and Perfor-
mance. Macmillan Library Reference (1985)
8. Duxbury, C., Bello, J., Davies, J., Sandler, M., Mark, M.: Complex domain on-
set detection for musical signals. In: Proceedings Digital Audio Effects Workshop
(2003)
9. Erkut, C., Valimaki, V., Karjalainen, M., Laurson, M.: Extraction of physical and
expressive parameters for model-based sound synthesis of the classical guitar. In:
108th AES Convention, pp. 19–22 (February 2000)
10. Gabrielsson, A.: Once again: The theme from Mozart’s piano sonata in A major
(K. 331). A comparison of five performances. In: Gabrielsson, A. (ed.) Action and
perception in rhythm and music, pp. 81–103. Royal Swedish Academy of Music,
Stockholm (1987)
11. Gabrielsson, A.: Expressive intention and performance. In: Steinberg, R. (ed.) Mu-
sic and the Mind Machine, pp. 35–47. Springer, Berlin (1995)
12. Gouyon, F., Herrera, P., Gómez, E., Cano, P., Bonada, J., Loscos, A., Amatriain,
X., Serra, X.: Content Processing of Music Audio Signals, pp. 83–160. Logos
Verlag, Berlin (2008), http://smcnetwork.org/public/S2S2BOOK1.pdf
13. Grachten, M., Arcos, J., de Mántaras, R.L.: A case based approach to expressivity-
aware tempo transformation. Machine Learning 65(2-3), 411–437 (2006)
14. Heijink, H., Meulenbroek, R.G.J.: On the complexity of classical guitar play-
ing:functional adaptations to task constraints. Journal of Motor Behavior 34(4),
339–351 (2002)
15. Johnstone, J.A.: Phrasing in piano playing. Withmark New York (1913)
16. Juslin, P.: Communicating emotion in music performance: a review and a theoret-
ical framework. In: Juslin, P., Sloboda, J. (eds.) Music and emotion: theory and
research, pp. 309–337. Oxford University Press, New York (2001)
17. Lee, N., Zhiyao, D., Smith, J.O.: Excitation signal extraction for guitar tones. In:
International Computer Music Conference, ICMC 2007 (2007)
18. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing sax: a novel symbolic rep-
resentation of time series. Data Mining and Knowledge Discovery 15(2), 107–144
(2007)
19. Lindström, E.: 5 x "Oh, My Darling Clementine": the influence of expressive
intention on music performance. Department of Psychology, Uppsala University (1992)
20. de Mantaras, R.L., Arcos, J.L.: AI and music: from composition to expressive
performance. AI Mag. 23(3), 43–57 (2002)
21. Masri, P.: Computer modeling of Sound for Transformation and Synthesis of Mu-
sical Signal. Ph.D. thesis, University of Bristol (1996)
22. Norton, J.: Motion capture to build a foundation for a computer-controlled instru-
ment by study of classical guitar performance. Ph.D. thesis, Stanford University
(September 2008)
23. Palmer, C.: Anatomy of a performance: Sources of musical expression. Music Per-
ception 13(3), 433–453 (1996)
24. Radicioni, D.P., Lombardo, V.: A constraint-based approach for annotating music
scores with gestural information. Constraints 12(4), 405–428 (2007)
25. Radisavljevic, A., Driessen, P.: Path difference learning for guitar fingering prob-
lem. In: International Computer Music Conference (ICMC 2004) (2004)
26. Sloboda, J.A.: The communication of musical metre in piano performance. Quar-
terly Journal of Experimental Psychology 35A, 377–396 (1983)
27. Trajano, E., Dahia, M., Santana, H., Ramalho, G.: Automatic discovery of right
hand fingering in guitar accompaniment. In: Proceedings of the International Com-
puter Music Conference (ICMC 2004), pp. 722–725 (2004)
28. Traube, C., Depalle, P.: Extraction of the excitation point location on a string
using weighted least-square estimation of a comb filter delay. In: Procs. of the 6th
International Conference on Digital Audio Effects, DAFx 2003 (2003)
Comparing Approaches to the Similarity of
Musical Chord Sequences
1 Introduction
In the last decades Music Information Retrieval (MIR) has evolved into a broad
research area that aims at making large repositories of digital music maintain-
able and accessible. Within MIR research two main directions can be discerned:
symbolic music retrieval and the retrieval of musical audio. The first direction
traditionally uses score-based representations to research typical retrieval prob-
lems. One of the most important and most intensively studied of these is prob-
ably the problem of determining the similarity of a specific musical feature, e.g.
melody, rhythm, etc. The second direction–musical audio retrieval–extracts fea-
tures from the audio signal and uses these features for estimating whether two
pieces of music share certain musical properties. In this paper we focus on a
the MIR community, e.g. [20,5]. Chord labeling algorithms extract symbolic
chord labels from musical audio: these labels can be matched directly using the
algorithms covered in this paper.
If you were to ask a jazz musician the third question–whether sequences
of chord descriptions are useful–he would probably agree that they are,
since working with chord descriptions is everyday practice in jazz. However, in
this paper we show, by performing a large experiment, that they are also useful
for retrieving pieces with a similar but not identical chord sequence. In this
experiment we compare two harmonic similarity measures, the Tonal Pitch Step
Distance (TPSD) [11] and the Chord Sequence Alignment System (CSAS) [12],
and test the influence on retrieval performance of different degrees of detail in
the chord description and of knowledge of the global key of a piece.
The next section gives a brief overview of the current achievements in chord
sequence similarity matching and harmonic similarity in general, Section 3
describes the data used in the experiment, and Section 4 presents the results.
Fig. 1. A plot demonstrating the comparison of two similar versions of All the Things
You Are using the TPSD. The total area between the two step functions, normalized
by the duration of the shortest song, represents the distance between both songs. A
minimal area is obtained by shifting one of the step functions cyclically.
different from the one used in [11]. The harmony grammar approach could not, at
the time of writing, compete in this experiment because in its current state it is
not yet able to parse all the songs in the dataset used.
The next section introduces the TPSD and the improvements of the implementation
used in this experiment over the one in [11]. Section 2.2 highlights the different
variants of the CSAS. The main focus of this paper is on the similarity of
sequences of chord labels, but other relevant harmony-based retrieval methods
exist: some of these are briefly reviewed in Section 2.3.
The TPSD uses Lerdahl's [17] Tonal Pitch Space (TPS) as its main musical
model. TPS is a model of tonality that fits musicological intuitions, correlates
well with empirical findings from music cognition [16], and can be used to
calculate a distance between two arbitrary chords. The TPS model can be seen as a
scoring mechanism that takes into account the number of steps on the circle of
fifths between the roots of the chords, and the amount of overlap between the
chord structures of the two chords and their relation to the global key.
The general idea behind the TPSD is to use the TPS to compare the change
of chordal distance to the tonic over time. For every chord, the TPS distance
between the chord and the key of the sequence is calculated, which results in
a step function (see Figure 1). As a consequence, information about the key
of the piece is essential. Next, the distance between two chord sequences is
defined as the minimal area between the two step functions over all possible
horizontal circular shifts. To prevent longer sequences from yielding larger
distances, the score is normalized by dividing it by the duration of the shortest
song.
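A simplified sketch of this distance, assuming both songs are already represented as per-beat TPS distances to their key (step functions): the minimal area over all cyclic shifts, normalised by the shorter duration. It uses a brute-force shift loop rather than the faster matching of [4] and, for simplicity, truncates the rolled longer function to the shorter length.

import numpy as np

def tpsd_distance(step_a, step_b):
    a, b = np.asarray(step_a, float), np.asarray(step_b, float)
    if len(a) > len(b):
        a, b = b, a                                # a is the shorter step function
    best = np.inf
    for shift in range(len(b)):                    # all cyclic shifts of the longer one
        rolled = np.roll(b, -shift)[:len(a)]
        best = min(best, float(np.sum(np.abs(a - rolled))))
    return best / len(a)                           # normalise by the shorter duration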
The TPS is an elaborate model that allows any chord in any key to be compared to
any other chord in any other key. The TPSD does not use the complete model and
only utilizes the parts that facilitate the comparison of two chords within the
same key. In the current implementation of the TPSD, time is represented in beats,
but in principle any discrete representation could be used.
The TPSD version used in this paper contains a few improvements compared
to the version used in [11]: by applying a different step-function matching
algorithm from [4], and by exploiting the fact that we use discrete time units,
which allows sorting in linear time using counting sort [6], a running time of
O(nm) is achieved, where n and m are the number of chord symbols in the two
songs. Furthermore, to be able to use the TPSD in situations where a priori key
information is not available, the TPSD is extended with a key-finding algorithm.
Key finding. The problem of finding the global key of a piece of music is called
key finding. In this study it is done on the basis of chord information only.
The rationale behind the key-finding algorithm presented here is the following:
we consider the key that minimizes the total TPS distance and best matches the
starting and ending chord to be the key of the piece.
For minimizing the total TPS distance, the TPSD key finding uses TPS-based
step functions as well. We assume that when a song matches a particular key, the
TPS distances between the chords and the tonic of the key are relatively small.
The general idea is to calculate 24 step functions for a single chord sequence, one
for each major and minor key. Subsequently, all these keys are ranked by sorting
them on the area between their TPS step function and the x-axis; the smaller
the total area, the better the key fits the piece, and the higher the rank. Often,
the key at the top rank is the correct key. However, among the false positives at
rank one we unsurprisingly find, on a regular basis, the IV, V and VI relative to
the ground-truth key1. This makes sense because, when the total of TPS distances
of the chords to C is small, the distances to F, G and Am may be small as well.
Therefore, to increase performance, an additional scoring mechanism is designed
that takes into account the IV, V and VI relative to the ground-truth key. Of
all 24 keys, the candidate key that minimizes the following sum S is considered
the key of the piece.
S = α·r(I) + r(IV) + r(V) + r(VI) + {β if the first chord matches the key} + {β if the last chord matches the key}    (1)
Here r(·) denotes the rank of the candidate key, the parameter α determines
how important the tonic is compared to other frequently occurring scale degrees,
and β controls the importance of the key matching the first and last chord. The
parameters α and β were tuned by hand; an α of 2 and a β of 4 were found to give
good results. Clearly, this simple key-finding algorithm is biased
1 The Roman numerals here represent the diatonic interval between the ground-truth
key and the predicted key.
towards western diatonic music, but for the corpus used in this paper it performs
quite well: the algorithm scores 88.8 percent correct on a subset of 500 songs
of the corpus used in the experiment below, for which we manually checked the
correctness of the ground-truth key. The algorithm takes O(n) time, where
n is the number of chord symbols, because the number of keys is constant.
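The ranking stage can be sketched as follows; tps_distance is an assumed helper implementing the TPS distance between a chord and a key, and the rank-based score S of Eq. (1) is then evaluated on top of the returned ranks.

def rank_candidate_keys(chords, durations, candidate_keys, tps_distance):
    # Area between a key's TPS step function and the x-axis:
    # the sum of (chord-to-key distance * chord duration in beats).
    def area(key):
        return sum(tps_distance(c, key) * d for c, d in zip(chords, durations))
    ranked = sorted(candidate_keys, key=area)
    return {key: rank + 1 for rank, key in enumerate(ranked)}   # rank 1 = smallest area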
Root interval step functions. For the tasks where only the chord root is used,
we use a different step-function representation (see Section 4). In these tasks
the interval between the chord root and the root note of the key defines the step
height, and the duration of the chord again defines the step length. This matching
method is very similar to the melody matching approach of Aloupis et al. [2];
note that the latter was never tested in practice. The matching and key-finding
methods do not differ from the other variants of the TPSD. Note that in all TPSD
variants, chord inversions are ignored.
string 1    –   F   B   E   A   –   D   D   G   C   C
string 2    F   F   B   A   A   A   D   D   –   C   C
operation   I   M   M   S   M   I   M   M   D   M   M
score      −1  +2  +2  −2  +2  −1  +2  +2  −1  +2  +2
Algorithms based on local alignment have been successfully adapted for melodic
similarity [21,13,15], and recently they have been used to determine harmonic
similarity [12] as well. Two steps are necessary to apply the alignment technique
to the comparison of chord progressions: choosing the representation of a chord
sequence, and choosing the scores of the elementary operations between symbols.
To take the durations of the chords into account, we represent the chords at every
beat. The algorithm therefore has a complexity of O(nm), where n and m are the
sizes of
the compared songs in beats. The score function can either be adapted to the
chosen representation or simply be binary, i.e., the score is positive (+2) if the
two chords are identical and negative (−2) otherwise. The insertion or deletion
score is set to −1.
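A minimal local-alignment sketch over beat-wise chord labels with this binary scoring (+2 match, −2 substitution, −1 insertion/deletion); it returns only the best local score, without traceback, and the names are illustrative.

def local_alignment_score(chords_a, chords_b, match=2, mismatch=-2, indel=-1):
    n, m = len(chords_a), len(chords_b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if chords_a[i - 1] == chords_b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,   # match or substitution
                          h[i - 1][j] + indel,     # deletion
                          h[i][j - 1] + indel)     # insertion
            best = max(best, h[i][j])
    return best

# Example with the chord strings of the table above:
# local_alignment_score(list("FBEADDGCC"), list("FFBAAADDCC"))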
3, 2, 5, 0, 5, 6, 1, 4, 4
If all the notes of the chords are taken into account, the TPS or Paiement
distances between the chords and the triad of the key can be used to construct
the representation. The representation is then a sequence of distances, and we
align these distances instead of the chords themselves. This representation is
very similar to the one used in the TPSD. The score functions used to compare
the resulting sequences can then be binary, or linear in the difference observed
between the values.
between successive chords, but this has been proven to be less accurate when
applied to alignment algorithms [13]. Another option is to consider a key-relative
representation, like the one described above, which is by definition transposition
invariant. However, this approach is not robust against local key changes. With an
absolute representation of chords, we use an adaptation of the local alignment
algorithm proposed in [1]. It allows an unlimited number of local transpositions
to be taken into account and can be applied to representations of chord
progressions to account for modulations.
Depending on the choice of representation and score function, several variants of
the harmonic similarity algorithm can be built. In Section 4 we explain the
different representations and scoring functions used in the different tasks of the
experiment and their effects on retrieval performance.
Table 1. A leadsheet of the song All The Things You Are. A dot represents a beat, a
vertical bar represents a bar line, and the chord labels are presented as written in
the Band-in-a-Box file.
Table 2. The distribution of the song class sizes in the Chord Sequence Corpus
special introduction or ending. The richness of the chord descriptions may also
diverge, i.e., a C7(9,13) may be written instead of a C7, and common substitutions
frequently occur. Examples of the latter are relative substitution, i.e., Am instead
of C, or tritone substitution, e.g., F#7 instead of C7. Having multiple chord
sequences describing the same song allows for setting up a cover-song finding
experiment. The title of the song is used as ground truth, and the retrieval
challenge is to find the other chord sequences representing the same song.
The distribution of the song class sizes is displayed in Table 2 and gives an
impression of the difficulty of the retrieval task. Generally, Table 2 shows that
the song classes are relatively small and that for the majority of the queries
there is only one relevant document to be found. It furthermore shows that 82.5%
of the songs are in the corpus for distraction only. The chord sequence corpus is
available to the research community on request.
We compared the TPSD and the CSAS in six retrieval tasks. For this experiment
we used the chord sequence corpus described above, which contains sequences
that clearly describe the same song. For each of these tasks the experimental
setup was identical: all songs that have two or more similar versions were used as
queries, yielding 1775 queries. For each query a ranking was created by sorting
the other songs on their TPSD and CSAS scores, and these rankings and the
runtimes of the compared algorithms were analyzed.
4.1 Tasks
The tasks, summarized in Table 3, differed in the level of chord information used
by the algorithms and in the use of a priori global key information. In tasks 1-3
no key information was presented to the algorithms; in the remaining three tasks
we used the key information stored in the Band-in-a-Box files, which was manually
checked for correctness. Tasks 1-3 and 4-6 furthermore differed in the amount of
chord detail presented to the algorithms: in tasks 1 and 4 only the root note of
the chord was available, in tasks 2 and 5 the root and the triad were available,
and in tasks 3 and 6 the complete chord as stored in the Band-in-a-Box file was
presented to the algorithms.
Table 3. The TPSD and CSAS are compared in six different retrieval tasks
The different tasks required specific variants of the tested algorithms. For
tasks 1-3 the TPSD used the TPS key-finding algorithm described in Section 2.1.
For tasks 1 and 4, involving only chord roots, a simplified variant of the TPSD
was used; for tasks 2, 3, 5 and 6 we used the regular TPSD, as described in
Section 2.1 and [11].
To measure the impact of the chord representation and substitution functions
on retrieval performance, different variants of the CSAS were built as well. In
some cases the choices made did not yield the best possible results, but they allow
the reader to understand the effects of the parameters used on retrieval
performance. The CSAS algorithms in tasks 1-3 all used an absolute representation
and the algorithms in tasks 4-6 used a key-relative representation. In tasks 4 and
5 the chords were represented as the difference in semitones between the chord
root and the root of the key of the piece, and in task 6 as Lerdahl's TPS distance
between the chord and the triad of the key (as in the TPSD). The CSAS variants in
tasks 1 and 2 used a consonance-based substitution function, and the algorithms in
tasks 4-6 used a binary substitution function. In tasks 2 and 5 a binary
substitution function for the mode was used as well: if the mode of the substituted
chords matched, no penalty was given; if they did not match, a penalty was given.
A last parameter that was varied was the use of local transpositions. The
CSAS variants applied in tasks 1 and 3 did not consider local transpositions, but
the CSAS algorithm used in task 2 did allow them (see Section 2.2 for details).
The TPSD was implemented in Java and the CSAS in C++, with a small Java program
used to parallelize the matching process. All runs were done on an Intel Xeon
quad-core CPU at 1.86 GHz with 4 GB of RAM running 32-bit Linux. Both algorithms
were parallelized to make optimal use of the multiple CPU cores.
4.2 Results
For each task and each algorithm we analyzed the rankings of all 1775 queries
with 11-point precision-recall curves and Mean Average Precision (MAP). Figure 2
displays the interpolated average precision and recall chart for the TPSD and
the CSAS for all tasks listed in Table 3. We calculated the interpolated average
precision at 11 standard recall levels.
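For reference, a small sketch of the two evaluation measures used here, computed from one query's ranked result list; "relevant" stands for the set of documents carrying the query's title, and all names are illustrative.

import numpy as np

def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)    # precision at each relevant document
    return float(np.mean(precisions)) if precisions else 0.0

def interpolated_11pt_precision(ranking, relevant):
    hits, recalls, precisions = 0, [], []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        recalls.append(hits / len(relevant))
        precisions.append(hits / i)
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.
    return [max([p for p, r in zip(precisions, recalls) if r >= level], default=0.0)
            for level in np.linspace(0.0, 1.0, 11)]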
Fig. 2. The 11-point interpolated precision and recall charts for the TPSD and the
CSAS for tasks 1–3, on the left, and 4–6 on the right
Fig. 3. The MAP and runtimes of the TPSD and the CSAS. The MAP is displayed on
the left axis and the runtimes on an exponential scale on the right axis. The
key-inferred tasks are displayed on the left side of the chart and the key-relative
tasks on the right side.
The overall retrieval performance of all algorithms on all tasks can be considered
good, but there are some large differences between tasks and between algorithms,
both in performance and in runtime. With a MAP of .70, the overall best performing
setup was the CSAS using triadic chord descriptions and a key-relative
representation (task 5). The TPSD also performs best on task 5, with a MAP of .58.
In tasks 2 and 4-6 the CSAS significantly outperforms the TPSD. On tasks 1 and 3
the TPSD outperforms the CSAS in runtime as well as retrieval performance; for
these two tasks, the results obtained by the CSAS are significantly lower because
local transpositions are not considered. These results show that taking
transpositions into account has a high impact on the quality of the retrieval
system, but also on the runtime.
The retrieval performance of the CSAS is good, but comes at a price. On
average over six of the twelve runs, the CSAS needs about 136 times as much time
to complete as the TPSD. The TPSD takes about 30 minutes to 1.5 hours to match
all 5028 pieces, while the CSAS takes about 2 to 9 days. Because the CSAS run in
task 2 took 206 hours to complete, there was not enough time to perform runs on
tasks 1 and 3 with the CSAS variant that takes local transpositions into account.
Table 4. This table shows for each pair of runs whether the mean average precision,
as displayed in Figure 3, differed significantly (+) or not (–)
On the other hand, keeping all the rich chord information seems to distract the
evaluated retrieval systems. Pruning the chord structure down to the triad can
be seen as a form of syntactical noise reduction, since the chord additions, if
they do not have a voice-leading function, have a rather arbitrary character and
just add some harmonic spice.
5 Concluding Remarks
We performed a comparison of two different chord sequence similarity measures,
the Tonal Pitch Step Distance (TPSD) and the Chord Sequence Alignment
System (CSAS), on a large, newly assembled corpus of 5028 symbolic chord
sequences. The comparison consisted of six different tasks, in which we varied the
amount of detail in the chord description and the availability of a priori key
information. The CSAS variants outperform the TPSD significantly in most cases,
but are in all cases far more costly to use. The use of a priori key information
improves performance, and using only the triad of a chord for similarity matching
gives the best results for the tested algorithms. We can therefore positively
answer the third question that we asked ourselves in the introduction–do
chord descriptions provide a useful and valid abstraction–because the experiment
presented in the previous section clearly shows that chord descriptions can
be used for retrieving harmonically related pieces.
The retrieval performance of both algorithms is good, especially considering
the size of the corpus and the relatively small class sizes (see Table 2), but
there is still room for improvement. Neither algorithm can deal with large
structural changes, e.g., added repetitions, a bridge, etc. A prior analysis of
the structure of the piece combined with partial matching could improve the
retrieval performance. Another important issue is that the compared systems
treat all chords as equally important, which is musicologically not plausible.
Considering the musical function of a chord in the local as well as the global
structure of the chord progression, as is done in [10] or with sequences of notes
in [24], might further improve the retrieval results.
With runtimes measured in days, the CSAS is a costly system. The runtimes
might be improved by using GPU programming [8], or with filtering steps using
algorithms such as BLAST [3].
The harmonic retrieval systems and experiments presented in this paper consider
only a specific form of symbolic music representation. Nevertheless, the
application of the methods presented here is not limited to symbolic music, and
audio applications are currently being investigated. The recent developments in
chord label extraction are especially promising because the output of these
methods could be matched directly with the systems presented here. The good
performance of the proposed algorithms leads us to believe that, both in the audio
and the symbolic domain, retrieval systems will benefit from chord-sequence-based
matching in the near future.
References
1. Allali, J., Ferraro, P., Hanna, P., Iliopoulos, C.S.: Local transpositions in alignment
of polyphonic musical sequences. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE
2007. LNCS, vol. 4726, pp. 26–38. Springer, Heidelberg (2007)
2. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Nuñez, Y., Rappa-
port, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic
Similarity. Computer Music Journal 30(3), 67–76 (2004)
3. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic Local Alignment
Search Tool. Journal of Molecular Biology 215, 403–410 (1990)
4. Arkin, E., Chew, L., Huttenlocher, D., Kedem, K., Mitchell, J.: An Efficiently Com-
putable Metric for Comparing Polygonal Shapes. IEEE Transactions on Pattern
Analysis and Machine Intelligence 13(3), 209–216 (1991)
5. Bello, J., Pickens, J.: A Robust Mid-Level Representation for Harmonic Content
in Music Signals. In: Proceedings of the International Symposium on Music Infor-
mation Retrieval, pp. 304–311 (2005)
6. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2001)
7. Downie, J.S.: The Music Information Retrieval Evaluation Exchange (2005–2007):
A Window into Music Information Retrieval Research. Acoustical Science and
Technology 29(4), 247–255 (2008)
8. Ferraro, P., Hanna, P., Imbert, L., Izard, T.: Accelerating Query-by-Humming on
GPU. In: Proceedings of the Tenth International Society for Music Information
Retrieval Conference (ISMIR), pp. 279–284 (2009)
9. Gannon, P.: Band-in-a-Box. PG Music (1990), http://www.pgmusic.com/ (last
viewed February 2011)
10. de Haas, W.B., Rohrmeier, M., Veltkamp, R.C., Wiering, F.: Modeling Harmonic
Similarity Using a Generative Grammar of Tonal Harmony. In: Proceedings of the
Tenth International Society for Music Information Retrieval Conference (ISMIR),
pp. 549–554 (2009)
11. de Haas, W.B., Veltkamp, R.C., Wiering, F.: Tonal Pitch Step Distance: A Simi-
larity Measure for Chord Progressions. In: Proceedings of the Ninth International
Society for Music Information Retrieval Conference (ISMIR), pp. 51–56 (2008)
12. Hanna, P., Robine, M., Rocher, T.: An Alignment Based System for Chord Se-
quence Retrieval. In: Proceedings of the 2009 Joint International Conference on
Digital Libraries, pp. 101–104. ACM, New York (2009)
13. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for
Evaluating Similarity between Monophonic Musical Sequences. Journal of New
Music Research 36(4), 267–279 (2007)
14. Harte, C., Sandler, M., Abdallah, S., Gómez, E.: Symbolic Representation of Mu-
sical Chords: A Proposed Syntax for Text Annotations. In: Proceedings of the
Sixth International Society for Music Information Retrieval Conference (ISMIR),
pp. 66–71 (2005)
15. van Kranenburg, P., Volk, A., Wiering, F., Veltkamp, R.C.: Musical Models for
Folk-Song Melody Alignment. In: Proceedings of the Tenth International Society
for Music Information Retrieval Conference (ISMIR), pp. 507–512 (2009)
16. Krumhansl, C.: Cognitive Foundations of Musical Pitch. Oxford University Press,
USA (2001)
17. Lerdahl, F.: Tonal Pitch Space. Oxford University Press, Oxford (2001)
18. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
19. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering Chord Idioms
through Beatles and Real Book Songs. In: Proceedings of the Eighth International
Society for Music Information Retrieval Conference (ISMIR), pp. 255–258 (2007)
20. Mauch, M., Noland, K., Dixon, S.: Using Musical Structure to Enhance Automatic
Chord Transcription. In: Proceedings of the Tenth International Society for Music
Information Retrieval Conference (ISMIR), pp. 231–236 (2009)
21. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the
Humanities 24(3), 161–175 (1990)
22. Paiement, J.F., Eck, D., Bengio, S.: A Probabilistic Model for Chord Progressions.
In: Proceedings of the Sixth International Conference on Music Information Re-
trieval (ISMIR), London, UK, pp. 312–319 (2005)
23. Pickens, J., Crawford, T.: Harmonic Models for Polyphonic Music Retrieval. In:
Proceedings of the Eleventh International Conference on Information and Knowl-
edge Management, pp. 430–437. ACM, New York (2002)
24. Robine, M., Hanna, P., Ferraro, P.: Music Similarity: Improvements of Edit-based
Algorithms by Considering Music Theory. In: Proceedings of the ACM SIGMM
International Workshop on Multimedia Information Retrieval (MIR), Augsburg,
Germany, pp. 135–141 (2007)
25. Smith, T., Waterman, M.: Identification of Common Molecular Subsequences.
Journal of Molecular Biology 147, 195–197 (1981)
26. Temperley, D.: The Cognition of Basic Musical Structures. MIT Press, Cambridge
(2001)
27. Uitdenbogerd, A.L.: Music Information Retrieval Technology. Ph.D. thesis, RMIT
University, Melbourne, Australia (July 2002)
Songs2See and GlobalMusic2One:
Two Applied Research Projects in Music
Information Retrieval at Fraunhofer IDMT
Fraunhofer IDMT
Ehrenbergstr. 31, 98693 Ilmenau, Germany
{dmr,grn,cano,goh,lkh,abr}@idmt.fraunhofer.de
http://www.idmt.fraunhofer.de
1 Introduction
Successful exploitation of results from basic research is an indicator of the
practical relevance of a research field. During recent years, the scientific and
commercial interest in the comparatively young research discipline called Music
Information Retrieval (MIR) has grown considerably. Stimulated by the
ever-growing availability and size of digital music catalogs and mobile media
players, MIR techniques are becoming increasingly important to aid convenient
exploration of large music collections (e.g., through recommendation engines) and
to enable entirely new forms of music consumption (e.g., through music games).
Evidently, commercial entities like online music shops, record labels and content
aggregators
have realized that these aspects can make them stand out among their competitors
and foster customer loyalty. However, the industry's willingness to fund basic
research in MIR is comparatively low. Thus, only well-described methods have
found successful application in the real world. For music recommendation and
retrieval, these are doubtlessly services based on collaborative filtering1 (CF);
for music transcription and interaction, these are successful video game titles
using monophonic pitch detection2. The two research projects in the scope of
this paper provide the opportunity to progress in core areas of MIR, but always
with a clear focus on suitability for real-world applications.
This paper is organized as follows. Each of the two projects is described in
more detail in Sec. 2 and Sec. 3, and results from the research as well as the
development perspective are reported. Finally, conclusions are given and future
directions are sketched.
2 Songs2See
Musical education of children and adolescents is an important factor in their
personal development, regardless of whether it involves learning a musical
instrument or taking music courses at school. Children, adolescents and adults
must be constantly motivated to practice and complete learning units. Traditional
forms of teaching and even current e-learning systems are often unable to provide
this motivation. On the other hand, music-based games are immensely popular [7],
[17], but they fail to develop skills which are transferable to musical
instruments [19]. Songs2See sets out to develop educational software for music
learning which provides the motivation of game playing and at the same time
develops real musical skills. Using music signal analysis as the key technology,
we want to enable students to use popular musical instruments as game controllers
for games which teach the students to play music of their own choice. This should
be possible regardless of the representation of music they possess (audio, score,
tab, chords, etc.). As a reward, the users receive immediate feedback from the
automated analysis of their rendition. The game application will provide the
students with visual and audio feedback on fine-grained details of their
performance with regard to timing (rhythm), intonation (pitch, vibrato), and
articulation (dynamics). Central to the analysis is automatic music transcription,
i.e., the extraction of a scalable symbolic representation from real-world
music recordings using specialized computer algorithms [24], [11]. Such a symbolic
representation allows simultaneous visual and audible playback to be rendered for
the students, i.e., it can be translated to traditional notation, a piano-roll view
or a dynamic animation showing the fingering on the actual instrument. The
biggest additional advantage is the possibility for students to have their
favorite song transcribed into a symbolic representation by the software. Thus,
the students can play along to actual music they like, instead of specifically
produced and edited learning pieces. In order to broaden the possibilities when
creating
1 See for example http://last.fm
2 See for example http://www.singstargame.com/
3 GlobalMusic2One
GlobalMusic2One is developing a new generation of adaptive music search engines
combining state-of-the-art methods of MIR with Web 2.0 technologies. It aims
at reaching better quality in automated music recommendation and browsing
inside global music collections. Recently, there has been growing research
interest in music outside the mainstream popular music of the so-called western
culture group [39], [16]. For well-known mainstream music, large amounts of
user-generated browsing traces, reviews, playlists and recommendations available
in different online communities can be analyzed through CF methods in order to
reveal similarities between artists, songs and albums. For novel or niche content,
one obvious solution for deriving such data is content-based similarity search.
Since the early days of MIR, the search for music items related to a specific
query song or a set of those (Query by Example) has been a consistent focus of
scientific interest, and a multitude of different approaches with varying degrees
of complexity have been proposed [32]. Another challenge is the automatic
annotation (a.k.a. "auto-tagging" [8]) of world music content. It is obvious that
the broad term "World Music" is one of the most ill-defined tags when used to lump
all "exotic genres" together. It lacks justification because this category
comprises such a huge variety of different regional styles, influences, and mutual
mixtures thereof. Retaining the strict classification paradigm for such a high
variety of musical styles inevitably limits the precision and expressiveness of a
classification system that is applied to a world-wide genre taxonomy. With
GlobalMusic2One, the user may create new categories, allowing the system to
flexibly adapt to new musical forms of expression and regional contexts. These
categories can, for example, be regional sub-genres which are defined through
exemplary songs or song snippets. This self-learning MIR framework will be
continuously expanded with precise content-based descriptors.
With automatic annotation of world music content, songs often cannot be assigned
to a single genre label. Instead, various rhythmic, melodic and harmonic
influences conflate into multi-layered mixtures. Common classifier approaches fail
due to their inherent assumption that for all song segments one dominant genre
exists and is thus retrievable.
Acknowledgments
References
1. Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre and artist classification
by analyzing improvised solo parts from musical recordings. In: Proceedings of the
Audio Mostly Conference (AMC), Piteå, Sweden (2008)
2. Abeßer, J., Bräuer, P., Lukashevich, H., Schuller, G.: Bass playing style detection
based on high-level features and pattern similarity. In: Proceedings of the 11th In-
ternational Society for Music Information Retrieval Conference (ISMIR), Utrecht,
Netherlands (2010)
3. Abeßer, J., Lukashevich, H., Dittmar, C., Bräuer, P., Krause, F.: Rule-based clas-
sification of musical genres from a global cultural background. In: Proceedings
of the 7th International Symposium on Computer Music Modeling and Retrieval
(CMMR), Malaga, Spain (2010)
4. Abeßer, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification using
bass-related high-level features and playing styles. In: Proceedings of the 10th
International Society for Music Information Retrieval Conference (ISMIR), Kobe,
Japan (2009)
5. Abeßer, J., Lukashevich, H., Schuller, G.: Feature-based extraction of plucking and
expression styles of the electric bass guitar. In: Proceedings of the IEEE Interna-
tional Conference on Acoustic, Speech, and Signal Processing (ICASSP), Dallas,
Texas, USA (2010)
6. Arndt, D., Gatzsche, G., Mehnert, M.: Symmetry model based key finding. In:
Proceedings of the 126th AES Convention, Munich, Germany (2009)
7. Barbancho, A., Barbancho, I., Tardon, L., Urdiales, C.: Automatic edition of songs
for guitar hero/frets on fire. In: Proceedings of the IEEE International Conference
on Multimedia and Expo (ICME), New York, USA (2009)
8. Bertin-Mahieux, T., Eck, D., Maillet, F., Lamere, P.: Autotagger: a model for
predicting social tags from acoustic features on large music databases. Journal of
New Music Research 37(2), 115–135 (2008)
7 See http://www.songs2see.eu
8 See http://www.globalmusic2one.net
9. Cano, E., Cheng, C.: Melody line detection and source separation in classical sax-
ophone recordings. In: Proceedings of the 12th International Conference on Digital
Audio Effects (DAFx), Como, Italy (2009)
10. Cano, E., Schuller, G., Dittmar, C.: Exploring phase information in sound source
separation applications. In: Proceedings of the 13th International Conference on
Digital Audio Effects (DAFx 2010), Graz, Austria (2010)
11. Dittmar, C., Dressler, K., Rosenbauer, K.: A toolbox for automatic transcription
of polyphonic music. In: Proceedings of the Audio Mostly Conference (AMC),
Ilmenau, Germany (2007)
12. Dittmar, C., Großmann, H., Cano, E., Grollmisch, S., Lukashevich, H., Abeßer,
J.: Songs2See and GlobalMusic2One - Two ongoing projects in Music Information
Retrieval at Fraunhofer IDMT. In: Proceedings of the 7th International Symposium
on Computer Music Modeling and Retrieval (CMMR), Malaga, Spain (2010)
13. Duan, Z., Pardo, B., Zhang, C.: Multiple fundamental frequency estimation by
modeling spectral peaks and non-peak regions. EEE Transactions on Audio,
Speech, and Language Processing (99), 1–1 (2010)
14. Fitzgerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factoriza-
tion models for musical sound source separation. Computational Intelligence and
Neuroscience (2008)
15. Gärtner, D.: Singing / rap classification of isolated vocal tracks. In: Proceedings
of the 11th International Society for Music Information Retrieval Conference (IS-
MIR), Utrecht, Netherlands (2010)
16. Gómez, E., Haro, M., Herrera, P.: Music and geography: Content description of
musical audio from different parts of the world. In: Proceedings of the 10th In-
ternational Society for Music Information Retrieval Conference (ISMIR), Kobe,
Japan (2009)
17. Grollmisch, S., Dittmar, C., Cano, E.: Songs2see: Learn to play by playing. In:
Proceedings of the 41st AES International Conference on Audio in Games, London,
UK (2011)
18. Grollmisch, S., Dittmar, C., Cano, E., Dressler, K.: Server based pitch detection
for web applications. In: Proceedings of the 41st AES International Conference on
Audio in Games, London, UK (2011)
19. Grollmisch, S., Dittmar, C., Gatzsche, G.: Implementation and evaluation of an im-
provisation based music video game. In: Proceedings of the IEEE Consumer Elec-
tronics Society’s Games Innovation Conference (IEEE GIC), London, UK (2009)
20. Gruhne, M., Schmidt, K., Dittmar, C.: Phoneme recognition on popular music.
In: 8th International Conference on Music Information Retrieval (ISMIR), Vienna,
Austria (2007)
21. Herrera, P., Sandvold, V., Gouyon, F.: Percussion-related semantic descriptors of
music audio files. In: Proceedings of the 25th International AES Conference, Lon-
don, UK (2004)
22. Kahl, M., Abeßer, J., Dittmar, C., Großmann, H.: Automatic recognition of tonal
instruments in polyphonic music from different cultural backgrounds. In: Proceed-
ings of the 36th Jahrestagung für Akustik (DAGA), Berlin, Germany (2010)
23. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals.
In: Proceedings of the 10th International Society for Music Information Retrieval
Conference (ISMIR), Kobe, Japan (2009)
Songs2See and GlobalMusic2One 271
24. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription.
Springer Science + Business Media LLC, New York (2006)
25. Lidy, T., Rauber, A., Pertusa, A., Iesta, J.M.: Improving genre classification by
combination of audio and symbolic descriptors using a transcription system. In:
Proceedings of the 8th International Conference on Music Information Retrieval
(ISMIR), Vienna, Austria (2007)
26. Lukashevich, H.: Towards quantitative measures of evaluating song segmentation.
In: Proceedings of the 9th International Conference on Music Information Retrieval
(ISMIR), Philadelphia, Pennsylvania, USA (2008)
27. Lukashevich, H.: Applying multiple kernel learning to automatic genre classifica-
tion. In: Proceedings of the 34th Annual Conference of the German Classification
Society (GfKl), Karlsruhe, Germany (2010)
28. Lukashevich, H., Abeßer, J., Dittmar, C., Großmann, H.: From multi-labeling to
multi-domain-labeling: A novel two-dimensional approach to music genre classi-
fication. In: Proceedings of the 10th International Society for Music Information
Retrieval Conference (ISMIR), Kobe, Japan (2009)
29. Mercado, P., Lukashevich, H.: Applying constrained clustering for active explo-
ration of music collections. In: Proceedings of the 1st Workshop on Music Recom-
mendation and Discovery (WOMRAD), Barcelona, Spain (2010)
30. Mercado, P., Lukashevich, H.: Feature selection in clustering with constraints: Ap-
plication to active exploration of music collections. In: Proceedings of the 9th Int.
Conference on Machine Learning and Applications (ICMLA), Washington DC,
USA (2010)
31. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of a
monaural audio signal into harmonic/percussive components by complememntary
diffusion on spectrogram. In: Proceedings of the 16th European Signal Processing
Conferenc (EUSIPCO), Lausanne, Switzerland (2008)
32. Pohle, T., Schnitzer, D., Schedl, M., Knees, P., Widmer, G.: On rhythm and gen-
eral music similarity. In: Proceedings of the 10th International Society for Music
Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
33. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and
chords in polyphonic music. Computer Music Journal 32, 72–86 (2008)
34. Sagayama, S., Takahashi, K., Kameoka, H., Nishimoto, T.: Specmurt anasylis:
A piano-roll-visualization of polyphonic music signal by deconvolution of log-
frequency spectrum. In: Proceedings of the ISCA Tutorial and Research Workshop
on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea (2004)
35. Shashanka, M., Raj, B., Smaragdis, P.: Probabilistic latent variable models as
nonnegative factorizations. Computational Intelligence and Neuroscience (2008)
36. Smaragdis, P., Mysore, G.J.: Separation by “humming”: User-guided sound extrac-
tion from monophonic mixtures. In: Proceedings of IEEE Workshop on Applica-
tions Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA
(2009)
37. Stein, M., Schubert, B.M., Gruhne, M., Gatzsche, G., Mehnert, M.: Evaluation
and comparison of audio chroma feature extraction methods. In: Proceedings of
the 126th AES Convention, Munich, Germany (2009)
38. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of
music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 53–65. Springer, Heidelberg (2010)
272 C. Dittmar et al.
39. Tzanetakis, G., Kapur, A., Schloss, W.A., Wright, M.: Computational ethnomusi-
cology. Journal of Interdisciplinary Music Studies 1(2), 1–24 (2007)
40. Uhle, C.: Automatisierte Extraktion rhythmischer Merkmale zur Anwendung in
Music Information Retrieval-Systemen. Ph.D. thesis, Ilmenau University, Ilmenau,
Germany (2008)
41. Vinyes, M., Bonada, J., Loscos, A.: Demixing commercial music productions via
human-assisted time-frequency masking. In: Proceedings of the 120th AES con-
venction, Paris, France (2006), http://www.mtg.upf.edu/files/publications/
271dd4-AES120-mvinyes-jbonada-aloscos.pdf (last viewed February 2011)
42. Völkel, T., Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre classification
of latin american music using characteristic rhythmic patterns. In: Proceedings of
the Audio Mostly Conference (AMC), Piteå, Sweden (2010)
MusicGalaxy:
A Multi-focus Zoomable Interface for
Multi-facet Exploration of Music Collections
1 Introduction
There is a lot of ongoing research in the field of music retrieval aiming to improve
retrieval results for queries posed as text, sung, hummed or by example as well
as to automatically tag and categorize songs. All these efforts facilitate scenarios
where the user is able to somehow formulate a query – either by describing the
song or by giving examples. But what if the user cannot pose a query because
the search goal is not clearly defined? E.g., he might look for background music
for a photo slide show but does not know where to start. All he knows is that he
can tell if it is the right music the moment he hears it. In such a case, exploratory
retrieval systems can help by providing an overview of the collection and letting
the user decide which regions to explore further.
When it comes to getting an overview of a music collection, neighborhood-preserving
projection techniques have become increasingly popular. Beforehand, the
objects to be projected – depending on the approach, this may be artists, albums,
tracks or any combination thereof – are analyzed to extract a set of descriptive
features. (Alternatively, feature information may also be annotated manually or
collected from external sources.) Based on these features, the objects can be
compared – or more specifically: appropriate distance- or similarity measures
can be defined. The general objective of the projection can then be paraphrased
as follows: Arrange the objects in two or three dimensions (on the display) in
such a way that neighboring objects are very similar and the similarity decreases
with increasing object distance (on the display). As the feature space of the ob-
jects to be projected usually has far more dimensions than the display space,
the projection inevitably causes some loss of information – irrespective of which
dimensionality reduction technique is applied. Consequently, this leads to a dis-
torted display of the neighborhoods such that some objects will appear closer
than they actually are (type I error), and on the other hand some objects that
are distant in the projection may in fact be neighbors in feature space (type
II error). Such neighborhood distortions are depicted in Figure 1. These “pro-
jection errors” cannot be fixed on a global scale without introducing new ones
elsewhere as the projection is already optimal w.r.t. some criteria (depending
on the technique used). In this sense, they should not be considered as errors
made by the projection technique but rather as properties of the resulting (displayed) arrangement.
When a user explores a projected collection, type I errors increase the number
of dissimilar (i.e. irrelevant) objects displayed in a region of interest. While this
might become annoying, it is much less problematic than type II errors. They
result in similar (i.e. relevant) objects being displayed away from the region of
interest – the neighborhood they actually belong to. In the worst case they could
even be off-screen if the display is limited to the currently explored region. This
way, a user could miss objects he is actually looking for.
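These two error types can be made measurable by comparing k-nearest-neighbor sets in the feature space and in the projection. The following sketch is purely illustrative (it is not part of the described system; the feature matrix and the 2-D "projection" are random placeholder data): for each object it counts false neighbors (type I) and missed neighbors (type II).

```python
import numpy as np

def knn_sets(points, k):
    """Return the set of k nearest neighbors (Euclidean) for each row of `points`."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    return [set(np.argsort(row)[:k]) for row in d]

def neighborhood_errors(features, projection, k=10):
    """Average number of type I (false) and type II (missed) neighbors per object."""
    orig = knn_sets(features, k)
    proj = knn_sets(projection, k)
    type1 = [len(p - o) for o, p in zip(orig, proj)]   # close on screen, far in feature space
    type2 = [len(o - p) for o, p in zip(orig, proj)]   # close in feature space, far on screen
    return float(np.mean(type1)), float(np.mean(type2))

# toy example with random data standing in for tracks and their projection
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))                       # hypothetical 60-dim feature vectors
P = X[:, :2] + 0.1 * rng.normal(size=(200, 2))       # a crude 2-D "projection"
print(neighborhood_errors(X, P, k=10))
```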
2 Related Work
There exists a variety of approaches that in some way give an overview of a music
collection. For the task of music discovery which is closely related to collection
exploration, a very broad survey of approaches is given in [7]. Generally, there are
several possible levels of granularity that can be supported, the most common
being: track, album, artist and genre. Though a system may cover more than
one granularity level (e.g. in [51] visualized as disc or TreeMap [41]), usually a
single one is chosen. The user-interface presented in this paper focuses on the
track level as do most of the related approaches. (However, like most of the other
techniques, it may as well be applied at other levels such as albums or artists.
All that is required is an appropriate feature representation of the objects of
interest.) Those approaches focusing on a single level can roughly be categorized
into graph-based and similarity-based overviews.
Graphs facilitate a natural navigation along relationship-edges. They are espe-
cially well-suited for the artist level as social relations can be directly visualized
(as, e.g., in the Last.fm Artist Map 1 or the Relational Artist Map RAMA [39]).
However, building a graph requires relations between the objects – either from
domain knowledge or artificially introduced. E.g., there are some graphs that use
similarity-relations obtained from external sources (such as the APIs of Last.fm 2
1 http://sixdegrees.hu/last.fm/interactive map.html
2 http://www.last.fm/api
or EchoNest 3 ) and not from an analysis of the objects themselves. Either way,
this results in a very strong dependency and may quickly become problematic
for less main-stream music where such information might not be available. This
is why a similarity-based approach is chosen here instead.
Similarity-based approaches require the objects to be represented by one or
more features. They are in general better suited for track-level overviews due to
the vast variety of content-based features that can be extracted from tracks.
For albums and artists, either some means for aggregating the features of the
individual tracks are needed or non-content-based features, e.g. extracted from
knowledge resources like MusicBrainz 4 and Wikipedia 5 or cultural meta-data
[54], have to be used. In most cases the overview is then generated using some
metric defined on these features which leads to proximity of similar objects in the
feature space. This neighborhood should be preserved in the collection overview
which usually has only two dimensions. Popular approaches for dimensionality
reduction are Self-Organizing Maps (SOMs) [17], Principal Component Analysis
(PCA) [14] and Multidimensional Scaling (MDS) techniques [18].
In the field of music information retrieval, SOMs are widely used. SOM-based
systems comprise the SOM-enhanced Jukebox (SOMeJB) [37], the Islands of
Music [35,34] and nepTune [16], the MusicMiner [29], the PlaySOM- and PocketSOM-Player [30] (the latter being a special interface for mobile devices), the
BeatlesExplorer [46] (the predecessor prototype of the system presented here),
the SoniXplorer [23,24], the Globe of Music [20] and the tabletop applications
MUSICtable [44], MarGrid [12], SongExplorer [15] and [6]. SOMs are prototype-
based and thus there has to be a way to initially generate random prototypes
and to modify them gradually when objects are assigned. This poses special
requirements regarding the underlying feature space and distance metric. More-
over, the result depends on the random initialization and the neural network
gradient descent algorithm may get stuck in a local minimum and thus not
produce an optimal result. Further, there are several parameters that need to
be tweaked according to the data set such as the learning rate, the termination
criterion for iteration, the initial network structure, and (if applicable) the rules
by which the structure should grow. However, there are also some advantages
of SOMs: Growing versions of SOMs can adapt incrementally to changes in the
data collection whereas other approaches may always need to generate a new
overview from scratch. Section 4.2 will address this point more specifically for
the approach taken here. For the interactive task at hand, which requires a real-
time response, the disadvantages of SOMs outweigh their advantages. Therefore,
the approach taken here is based on MDS.
Given a set of data points, MDS finds an embedding in the target space
that maintains their distances (or dissimilarities) as far as possible – without
having to know their actual values. This way, it is also well suited to compute a
layout for spring- or force-based approaches. PCA identifies the axes of highest variance.
3 http://developer.echonest.com
4 http://musicbrainz.org
5 http://www.wikipedia.org
Fig. 2. In SoundBite [22], a seed song and its nearest neighbors are connected by lines
3 Outline
The goal of our work is to provide a user with an interactive way of exploring
a music collection that takes into account the above-described inevitable limitations of a low-dimensional projection of a collection. Further, it should be applicable to realistic music collections containing several thousand tracks.
The approach taken can be outlined as follows:
– An overview of the collection is given, where all tracks are displayed as points
at any time. For a limited number of tracks that are chosen to be spatially
well distributed and representative, an album cover thumbnail is shown for
orientation.
– The view on the collection is generated by a neighborhood-preserving pro-
jection (e.g. MDS, SOM, PCA) from some high-dimensional feature space
onto two dimensions. I.e., in general tracks that are close in feature space
will likely appear as neighbors in the projection.
– Users can adapt the projection by choosing weights for several aspects of
music (dis-)similarity. This gives them the possibility to look at a collection
from different perspectives. (This adaptation is purely manual, i.e. the visu-
alization as described in this paper is only adaptable w.r.t. music similarity.
Techniques to further enable adaptive music similarity are, e.g., discussed in
[46,49].)
– In order to allow immediate visual feedback in case of similarity adaptation,
the projection technique needs to guarantee near real-time performance –
even for large music collections. The quality of the produced projection is
only secondary – a perfect projection that correctly preserves all distances
between all tracks is extremely unlikely anyway.
– The projection will inevitably contain distortions of the actual distances of
the tracks. Instead of trying to improve the quality of the projection method
and trying to fix heavily distorted distances, they are exploited during in-
teraction with the projection:
The user can zoom into a region of interest. The space for this region is
increased, thus allowing more details to be displayed. At the same time, the
surrounding space is compacted but not hidden from view. This way, some
context remains for orientation. To accomplish this behavior, the zoom
is based on a non-linear distortion similar to so-called “fish-eye” lenses
(a minimal sketch of such a distortion follows after this list).
At this point the original (type II) projection errors come into play: Instead
of putting a single lens focus on the region of interest, additional focuses are
introduced in regions that contain tracks similar to those in primary focus.
The resulting distortion brings original neighbors back closer to each other.
This gives the user another option for interactive exploration.
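As announced above, here is a minimal radial fish-eye distortion for illustration. It is a generic sketch, not the SpringLens implementation (which is based on a mass–spring mesh), and all parameter values are arbitrary: points near the focus are pushed apart, distances towards the periphery are compressed, and the farthest point keeps its distance from the focus.

```python
import numpy as np

def fisheye(points, focus, distortion=4.0):
    """Radial fish-eye: magnify the neighborhood of `focus`, compress the periphery.

    points: (N, 2) display coordinates; focus: (2,) lens center.
    The scaling is monotone along each ray, so ordering from the focus is preserved.
    """
    v = points - focus
    r = np.linalg.norm(v, axis=1, keepdims=True)
    r_max = max(float(r.max()), 1e-12)            # farthest point keeps its distance
    rn = r / r_max
    rn_new = (distortion + 1.0) * rn / (distortion * rn + 1.0)
    scale = np.divide(rn_new, rn, out=np.ones_like(rn), where=rn > 0)
    return focus + v * scale

# toy usage: a regular grid of "track" positions distorted around one focus point
grid = np.stack(np.meshgrid(np.linspace(0, 1, 10),
                            np.linspace(0, 1, 10)), axis=-1).reshape(-1, 2)
warped = fisheye(grid, focus=np.array([0.4, 0.6]))
print(grid[0], "->", warped[0])
```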
Figure 3 depicts the outline of the approach. The following sections cover the
underlying techniques (Section 4) and the user-interaction (Section 5) in detail.
4 Underlying Techniques
4.1 Features and Facets
The prototype system described here uses collections of music tracks. As a pre-
requisite, it is assumed that the tracks are represented by some descriptive fea-
tures that can, e.g., be extracted, manually annotated or obtained from external
sources. In the current implementation, content-based features are extracted
utilizing the capabilities of the frameworks CoMIRVA [40] and JAudio [28].
Specifically, Gaussian Mixture Models of the Mel Frequency Cepstral Coeffi-
cients (MFCCs) according to [2] and [26] and “fluctuation patterns” describing
how strong and fast beats are played within specific frequency bands [35] are
computed with CoMIRVA. JAudio is used to extract a global audio descriptor
“MARSYAS07” as described in [52]. Further, lyrics for all songs were obtained
through the web service of LyricWiki7 , filtered for stop words, stemmed and de-
scribed by document vectors with TFxIDF term weights [38]. Additional features
7 http://lyricwiki.org
Fig. 3. Outline of the approach showing the important processing steps and data structures. Top: preprocessing (offline) – feature extraction into a high-dimensional feature space, facet subspaces S1, S2, …, Sl with facet distances d1(i,j), d2(i,j), …, dl(i,j) to the m landmarks, stored in an N × m × l facet distance cuboid. Bottom: interaction with the user (online) – facet distance aggregation with user-adjustable weights, projection, lens distortion, neighborhood-based zoom, filtering and display, with screenshots of the graphical user interface.
that are currently only used for the visualization are ID3 tags (artist, album,
title, track number and year) extracted from the audio files, track play counts
obtained from a Last.fm profile, and album covers gathered through web search.
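As a rough illustration of the lyrics part of such a pipeline (stop-word filtering and TFxIDF document vectors; stemming is omitted for brevity), one could use scikit-learn as sketched below. The lyrics strings are placeholders and this is not the project's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder lyrics; in the described system these come from the LyricWiki web service
lyrics = {
    "track_01": "love me do you know I love you",
    "track_02": "walking down the lonely road again tonight",
}

# English stop-word filtering + TFxIDF weighting (stemming left out for brevity)
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(lyrics.values())   # sparse (n_tracks x n_terms)

# rows are L2-normalized by default, so the dot product is the cosine similarity
similarity = (doc_vectors @ doc_vectors.T).toarray()
print(doc_vectors.shape)
print(similarity)
```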
Distance Facets. Based on the features associated with the tracks, facets are
defined (on subspaces of the feature space) that refer to different aspects of music
(dis-)similarity. This is depicted in Figure 3 (top).
Definition 1. Given a set of features F, let S be the space determined by the
feature values for a set of tracks T. A facet f is defined by a facet distance
measure δf on a subspace Sf ⊆ S of the feature space, where δf satisfies the
following conditions for any x, y ∈ T:
– δf(x, y) ≥ 0 and δf(x, y) = 0 if and only if x = y
– δf(x, y) = δf(y, x) (symmetry)
Optionally, δf is a distance metric if it additionally obeys the triangle inequality
for any x, y, z ∈ T:
– δf(x, z) ≤ δf(x, y) + δf(y, z) (triangle inequality)
E.g., a facet “timbre” could be defined on the MFCC-based feature described in
[26] whereas a facet “text” could compare the combined information from the
features “title” and “lyrics”.
It is important to stress the difference to common faceted browsing and search
approaches that rely on a faceted classification of objects to support users in
exploration by filtering available information. Here, no such filtering by value is
applied. Instead, we employ the concept of facet distances solely to express different aspects of (dis-)similarity.
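To make Definition 1 concrete, a facet can be implemented as a plain distance function on the corresponding feature subspace. The sketch below is illustrative only (feature names and dimensionalities are assumptions): a "rhythm" facet via Euclidean distance on fluctuation-pattern vectors and a "text" facet via cosine distance on TFxIDF vectors. Both are non-negative and symmetric; cosine distance satisfies the identity condition of Definition 1 only up to collinear feature vectors.

```python
import numpy as np

def delta_rhythm(x, y):
    """Facet distance on the fluctuation-pattern subspace (Euclidean)."""
    return float(np.linalg.norm(x["fluctuation"] - y["fluctuation"]))

def delta_text(x, y):
    """Facet distance on the combined title/lyrics subspace (cosine distance)."""
    a = np.concatenate([x["title_tfidf"], x["lyrics_tfidf"]])
    b = np.concatenate([y["title_tfidf"], y["lyrics_tfidf"]])
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / denom if denom > 0 else 1.0

# two toy tracks with hypothetical feature subspaces
rng = np.random.default_rng(1)
t1 = {"fluctuation": rng.random(60), "title_tfidf": rng.random(50), "lyrics_tfidf": rng.random(500)}
t2 = {"fluctuation": rng.random(60), "title_tfidf": rng.random(50), "lyrics_tfidf": rng.random(500)}
print(delta_rhythm(t1, t2), delta_text(t1, t2))
```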
4.2 Projection
In the projection step shown in Figure 3 (bottom), the position of all tracks on
the display is computed according to their (aggregated) distances in the high-
dimensional feature space. Naturally, this projection should be neighborhood-
preserving such that tracks close to each other in feature space are also close in
the projection. We propose to use a landmark- or pivot-based Multidimensional
Scaling approach (LMDS) for the projection as described in detail in [42,43].
This is a computationally efficient approximation to classical MDS. The general
idea of this approach is as follows: A representative sample of objects – called
“landmarks” – is drawn randomly from the whole collection.8 For this landmark
sample, an embedding into low-dimensional space is computed using classical
MDS. The remaining objects can then be located within this space according to
their distances to the landmarks.
8 Alternatively, the MaxMin heuristic (greedily seeking out extreme, well-separated
landmarks) could be used – with the optional modification to replace landmarks
with a predefined probability by randomly chosen objects (similar to a mutation op-
erator in genetic programming). Neither alternative seems to produce less distorted projections, while both incur a much higher computational complexity. However, there is
possibly some room for improvement here but this is out of the scope of this paper.
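The core of such a landmark MDS step can be sketched as follows. This is a simplified, self-contained illustration (the actual MDSJ-based implementation [1] may differ in numerical details): classical MDS is applied to the landmark sample, and every other object is then placed from its squared distances to the landmarks via the pseudo-inverse of the landmark coordinates.

```python
import numpy as np

def lmds_fit(D_landmarks, n_dims=2):
    """Classical MDS on the landmark sample.

    D_landmarks: (k, k) pairwise distances between landmarks.
    Returns landmark coordinates plus the data needed to place further objects.
    """
    k = D_landmarks.shape[0]
    Delta = D_landmarks ** 2
    J = np.eye(k) - np.ones((k, k)) / k
    B = -0.5 * J @ Delta @ J                      # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:n_dims]     # largest eigenvalues first
    lam, V = eigval[order], eigvec[:, order]
    landmarks_2d = V * np.sqrt(lam)               # (k, n_dims) landmark embedding
    pinv_t = V / np.sqrt(lam)                     # pseudo-inverse transpose, (k, n_dims)
    mu = Delta.mean(axis=1)                       # mean squared distance per landmark
    return landmarks_2d, pinv_t, mu

def lmds_place(d_to_landmarks, pinv_t, mu):
    """Place an arbitrary object given its distances to the landmarks."""
    delta = d_to_landmarks ** 2
    return -0.5 * (delta - mu) @ pinv_t           # (n_dims,)

# toy usage with random "tracks" in a 20-dimensional feature space
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
landmark_idx = rng.choice(len(X), size=30, replace=False)
D_ll = np.linalg.norm(X[landmark_idx, None] - X[None, landmark_idx], axis=-1)
L2d, pinv_t, mu = lmds_fit(D_ll)
d_new = np.linalg.norm(X[0] - X[landmark_idx], axis=-1)
print(lmds_place(d_new, pinv_t, mu))
```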
However, it still allows objects to be added or removed from the data set to
some extent without the need to compute a new projection: If a new track
is added to the collection, an additional “layer” has to be appended to the
facet distance cuboid containing the facet distances of the new track with all
landmarks. The new track can then be projected according to these distances. If
a track is removed, the respective “layer” of the cuboid can be deleted. Neither
operation does further alter the projection.9 Adding or removing many tracks
may however alter the distribution of the data (and thus the covariances) in such
a way that the landmark sample may no longer be representative. In this case,
a new projection based on a modified landmark sample should be computed.
However, for the scope of this paper, a stable landmark set is assumed and this
point is left for further work.
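This bookkeeping can be sketched as follows (array names and shapes are assumptions, following the N × m × l facet distance cuboid of Fig. 3; the weighted-sum aggregation is just one possible aggregator). The resulting distance vector of a new track would then be passed to a placement step such as lmds_place from the previous sketch.

```python
import numpy as np

# assumed layout: cuboid[i, j, f] = facet distance of track i to landmark j under facet f
N, m, l = 1000, 30, 4
cuboid = np.random.default_rng(2).random((N, m, l))
facet_weights = np.array([0.4, 0.3, 0.2, 0.1])

def add_track(cuboid, new_track_facet_dists):
    """Append one 'layer' (m x l facet distances to the landmarks) for a new track."""
    return np.concatenate([cuboid, new_track_facet_dists[None]], axis=0)

def remove_track(cuboid, index):
    """Delete the layer of a removed track; the projection itself stays untouched."""
    return np.delete(cuboid, index, axis=0)

def aggregate(cuboid, weights):
    """Weighted-sum aggregation of facet distances into (N, m) track-landmark distances."""
    return cuboid @ weights

new_layer = np.random.default_rng(3).random((m, l))
cuboid = add_track(cuboid, new_layer)
d_to_landmarks = aggregate(cuboid, facet_weights)[-1]   # distances of the new track
print(d_to_landmarks.shape)                             # -> fed to lmds_place(...)
```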
chosen with a maximum of 50 cells in each dimension for the overlay mesh which
yields sufficient distortion accuracy while real-time capability is maintained. The
distorted position of the projection points is obtained by barycentric coordinate
transformation with respect to the particle points of the mesh. Additionally,
z-values are derived from the rest-lengths that are used in the visualization to
decide whether an object has to be drawn below or above another one.
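The barycentric step can be illustrated generically (this is not the SpringLens code): a projection point is expressed in barycentric coordinates with respect to its undistorted mesh triangle, and the same weights are then applied to the displaced triangle vertices.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2-D point p w.r.t. triangle (a, b, c)."""
    T = np.column_stack([b - a, c - a])
    w1, w2 = np.linalg.solve(T, p - a)
    return np.array([1.0 - w1 - w2, w1, w2])

def distort_point(p, tri_rest, tri_displaced):
    """Map p from the rest-state mesh triangle to the displaced (distorted) triangle."""
    w = barycentric(p, *tri_rest)
    return w @ np.asarray(tri_displaced)      # same weights, new vertex positions

rest = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
displaced = [np.array([0.1, 0.05]), np.array([1.3, 0.0]), np.array([0.0, 1.2])]
print(distort_point(np.array([0.25, 0.25]), rest, displaced))
```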
Nearest Neighbor Indexing. For the adaptation of the lens distortion, the
nearest neighbors of a track need to be retrieved. Here, the two major challenges
are:
1. The facet weights are not known at indexing time and thus the index can
only be built using the facet distances.
2. The choice of an appropriate indexing method for each facet depends on the
respective distance measure and the nature of the underlying features.
As the focus lies here on the visualization and not the indexing, only a very basic
approach is taken and further developments are left for future work: A limited list
of nearest neighbors is pre-computed for each track. This way, nearest neighbors
can be retrieved by simple lookup in constant time (O(1)). However, updating
the lists after a change of the facet weights is computationally expensive. While
the resulting delay of the display update is still acceptable for collections with a
few thousand tracks, it becomes infeasible for larger N.
For more efficient index structures, it may be possible to apply generic mul-
timedia indexing techniques such as space partition trees [5] or approximate ap-
proaches based on locality sensitive hashing [13] that may even be kernelized [19]
to allow for more complex distance metrics. Another option is to generate mul-
tiple nearest neighbor indexes – each for a different setting of the facet weights
– and interpolate the retrieved result lists w.r.t. the actual facet weights.
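A minimal version of the basic precomputed-list approach might look as follows (the weighted-sum aggregation of facet distances is an assumption, and the dense (N, N, l) distance array is used only for the sake of the sketch):

```python
import numpy as np

def build_neighbor_lists(facet_dists, weights, k=20):
    """facet_dists: (N, N, l) pairwise facet distances; returns (N, k) neighbor indices.

    Has to be recomputed whenever the facet weights change, which is the
    computationally expensive part mentioned above.
    """
    d = facet_dists @ weights                 # (N, N) aggregated distances
    np.fill_diagonal(d, np.inf)               # a track is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

rng = np.random.default_rng(4)
N, l = 300, 4
raw = rng.random((N, N, l))
facet_dists = (raw + raw.transpose(1, 0, 2)) / 2   # make the facet distances symmetric
weights = np.array([0.5, 0.2, 0.2, 0.1])
nn = build_neighbor_lists(facet_dists, weights)
print(nn[0])                                   # O(1) lookup for track 0
```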
4.5 Filtering
Fig. 5. Available filter modes: collapse all (top left), focus (top right), sparse (bottom
left), expand all (bottom right). The SpringLens mesh overlay is hidden.
The user can choose between different filters that decide whether a track is displayed col-
lapsed or expanded – i.e. as a star or album cover respectively. While album
covers help for orientation, the displayed stars give information about the data
distribution. Trivial filters are those displaying no album covers (collapseAll) or
all (expandAll). Apart from collapsing or expanding all tracks, it is possible to
expand only those tracks in magnified regions (i.e. with a z-level above a pre-
defined threshold) or to apply a sparser filter. The results of using these filter
modes are shown in Figure 5.
A sparser filter selects only a subset of the collection to be expanded that
is both sparse (well distributed) and representative. Representative tracks are
those with a high importance (described in Section 4.4). The first sparser version
used a Delaunay triangulation and was later substituted by a raster-based ap-
proach that produces more appealing results in terms of the spatial distribution
of displayed covers.
Originally, the set of expanded tracks was updated after any position changes
caused by the distortion overlay. However, this was considered irritating during
early user tests and the sparser strategy was changed to update only if the
projection or the displayed region changes.
Delaunay Sparser Filter. This sparser filter constructs a Delaunay triangula-
tion incrementally top-down starting with the track with the highest importance
and some virtual points at the corners of the display area. Next, the size of all
resulting triangles given by the radius of their circumcircle is compared with
a predefined threshold sizemin . If the size of a triangle exceeds this threshold,
the most important track within this triangle is chosen for display and added
as a point for the triangulation. This process continues recursively until no tri-
angle that exceeds sizemin contains any more tracks that could be added. All
tracks belonging to the triangulation are then expanded (i.e. displayed as album
thumbnail).
The Delaunay triangulation can be computed in O(n log n) and the number
of triangles is at most O(n), with n ≪ N being the number of actually displayed
album cover thumbnails. To reduce lookup time, projected points are stored
in a quadtree data structure [5] and sorted by importance within the tree’s
quadrants. A triangle’s size may change through distortion caused by the multi-
focal zoom. This change may trigger an expansion of the triangle or a removal
of the point that caused its creation originally. Both operations are propagated
recursively until all triangles meet the size condition again. Figure 3 (bottom)
shows a triangulation and the resulting display for a (distorted) projection of a
collection.
Raster Sparser Filter. The raster sparser filter divides the display into a grid
of quadratic cells. The size of the cells depends on the screen resolution and
the minimal display size of the album covers. Further, it maintains a list of the
tracks ranked by importance that is precomputed and only needs to be updated
when the importance values change. On an update, the sparser runs through
its ranked list. For each track it determines the respective grid cell. If the cell
and the surrounding cells are empty, the track is expanded and its cell blocked.
(Checking surrounding cells avoids image overlap. The necessary radius for the
surrounding can be derived from the cell and cover sizes.)
The computational complexity of this sparser approach is linear in the number
of objects to be considered but also depends on the radius of the surrounding
that needs to be checked. The latter can be reduced by using a data structure
for the raster that has O(1) look-up complexity but higher costs for insertions
which happen far less frequently. This approach further has the nice property that it handles the most important objects first and thus returns a useful result even if it is interrupted.
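A compact sketch of this raster strategy follows (grid cell size and guard radius are arbitrary values; positions are assumed to lie in the unit square):

```python
import numpy as np

def raster_sparser(positions, importance, cell=0.05, guard=1):
    """Select tracks to expand: walk the importance ranking and expand a track
    only if its grid cell and the surrounding `guard` cells are still free."""
    occupied = set()
    expanded = []
    for i in np.argsort(importance)[::-1]:            # most important track first
        cx, cy = int(positions[i, 0] // cell), int(positions[i, 1] // cell)
        neighborhood = [(cx + dx, cy + dy)
                        for dx in range(-guard, guard + 1)
                        for dy in range(-guard, guard + 1)]
        if not any(c in occupied for c in neighborhood):
            expanded.append(i)                        # show album cover for this track
            occupied.add((cx, cy))
        # if interrupted here, the most important tracks have already been handled
    return expanded

rng = np.random.default_rng(5)
pos = rng.random((400, 2))          # projected display positions in [0, 1]^2
imp = rng.random(400)               # importance scores
print(len(raster_sparser(pos, imp)))
```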
5 Interaction
While the previous section covered the underlying techniques, this section de-
scribes how users can interact with the user-interface that is built on top of
them. Panning and zooming are very common interaction techniques that can, e.g., be found in programs
for geo-data visualization or image editing that make use of the map metaphor.
Panning shifts the displayed region whereas zooming decreases or increases it.
(This does not affect the size of the thumbnails which can be controlled sepa-
rately using the PageUp and PageDn keys.) Using the keyboard, the user can pan
with the cursor keys and zoom in and out with + and – respectively. Alterna-
tively, the mouse can be used: Clicking and holding the left button while moving
the mouse pans the display. The mouse wheel controls the zoom level. If the
whole collection cannot be displayed, an overview window indicating the current
section is shown in the top left corner, otherwise it is hidden. Clicking into the
overview window centers the display around the respective point. Further, the
user can drag the section indicator around, which also results in panning.
5.2 Focusing
Fig. 6. Screenshot of the MusicGalaxy prototype with visible overview window (top left), player (bottom) and SpringLens mesh overlay
(blue). In this example, a strong album effect can be observed as for the track in primary focus, four tracks of the same album are nearest
neighbors in secondary focus.
Fig. 7. SpringLens distortion with only primary focus (left) and additional secondary
focus (right)
The first facet distance aggregation function is applied to derive the track-landmark distances from the facet distance cuboid
(Section 4.2). These distances are then used to compute the projection of the
collection. The second facet distance aggregation function is applied to identify
the nearest neighbors of a track and thus indirectly controls the secondary focus.
Changing the aggregation parameters results in a near real-time update of
the display so that the impact of the change becomes immediately visible: In
case of the parameters for the nearest neighbor search, some secondary focus
region may disappear while somewhere else a new one appears with tracks now
considered more similar. Here, the transitions are visualized smoothly due to
the underlying physical simulation of the SpringLens grid. In contrast to this, a
change of the projection similarity parameters has a more drastic impact on the
visualization possibly resulting in a complete re-arrangement of all tracks. This
is because the LMDS projection technique produces solutions that are unique
only up to translation, rotation, and reflection and thus, even a small parameter
change may, e.g., flip the visualization. As this may confuse users, one direction
of future research is to investigate how the position of the landmarks can be
constrained during the projection to produce more gradual changes.
The two facet distance aggregation functions are linked by default as it is most
natural to use the same distance measure for projection and neighbor retrieval.
However, unlinking them and using e.g. orthogonal distance measures can lead
to interesting effects: For instance, one may choose to compute the projection of the collection based solely on acoustic facets and find nearest neighbors for the secondary
focus through lyrics similarity. Such a setting would help to uncover tracks with
a similar topic that (most likely) sound very different.
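Conceptually, this amounts to keeping two independent weight vectors over the same facet distances, e.g. as in the following sketch (facet names, weights and the weighted-sum aggregator are assumptions):

```python
import numpy as np

facets = ["timbre", "rhythm", "lyrics"]            # assumed facet order
projection_weights = np.array([0.6, 0.4, 0.0])     # acoustic facets drive the layout
neighbor_weights   = np.array([0.0, 0.0, 1.0])     # lyrics drive the secondary focus

def aggregate(facet_dists, weights):
    """Weighted-sum aggregation of per-facet distances (one possible aggregator)."""
    return facet_dists @ weights

# facet_dists: distances of one track to all others under each facet (N x 3), placeholder data
facet_dists = np.random.default_rng(6).random((100, 3))
d_projection = aggregate(facet_dists, projection_weights)   # used for the projection
d_neighbors  = aggregate(facet_dists, neighbor_weights)     # used for neighbor retrieval
print(d_neighbors.argsort()[:5])    # lyrics-wise nearest neighbors of the track
```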
6 Evaluation
The development of MusicGalaxy followed a user-driven design approach [31] by
iteratively alternating between development and evaluation phases. The first pro-
totype [47] was presented at CeBIT 2010,11 a German trade fair specializing in
information technology, in early March 2010. During the fair, feedback was
collected from a total of 112 visitors aged between 16 and 63 years. The general
reception was very positive. The projection-based visualization was generally
welcomed as an alternative to common list views. However, some remarked that
additional semantics of the two display axes would greatly improve orientation.
Young visitors particularly liked the interactivity of the visualization whereas
older ones tended to have problems with this. They stated that the reason lay
in the amount of information displayed which could still be overwhelming. To
address the problem, they proposed to expand only tracks in focus, increase the
size of objects in focus (compared to the others) and hide the mesh overlay as
the focus would be already visualized by the expanded and enlarged objects. All
of these proposals have been integrated into the second prototype.
The second prototype was tested thoroughly by three testers. During these
tests, the eye movements of the users were recorded with a Tobii T60
11 http://www.cebit.de
eye-tracker that captures where and for how long the gaze of the participants rests (referred to as “fixation points”). When the adaptive SpringLens focus was used, the mouse generally followed the gaze, which scanned the border of the focus in order to decide on the direction to explore further. This resulted in a
much smoother gaze-trajectory than the one observed during usage of panning
and zooming where the gaze frequently switched between the overview window
and the objects of interest – so as not to lose orientation. This indicates that the
proposed approach is less tiring for the eyes. However, the testers criticized the
controls used to change the focus – especially having to hold the right mouse
button all the time. This led to the introduction of the focus lock mode and
several minor interface improvements in the third version of the prototype [48]
that are not explicitly covered here.
The remainder of this section describes the evaluation of the third Music-
Galaxy prototype in a user study [45] with the aim to prove that the user-
interface indeed helps during exploration. Screencasts of 30 participants
solving an exploratory retrieval task were recorded together with eye-tracking
data (again using a Tobii T60 eye-tracker) and web cam video streams. This
data was used to identify emerging search strategies among all users and to
analyze to what extent the primary and secondary focus was used. Moreover,
first-hand impressions of the usability of the interface were gathered by letting
the participants say aloud whatever they think, feel or remark as they go about
their task (think-aloud protocol).
In order to ease the evaluation, the study was not conducted with the origi-
nal MusicGalaxy user-interface prototype but with a modified version that can
handle photo collections, as depicted in Figure 8. It relies on widely used MPEG-7
visual descriptors (EdgeHistogram, ScalableColor and ColorLayout ) [27,25] to
compute the visual similarity (see [49] for further details) – replacing the origi-
nally used music features and respective similarity facets. Using photo collections
for evaluation instead of music has several advantages: It can be assured that
none of the participants knows any of the photos in advance, which could otherwise introduce some bias. With music, this would be much harder to
realize. Furthermore, similarity and relevance of photos can be assessed in an
instant. This is much harder for music tracks and requires additional time for
listening – especially if the tracks are previously unknown.
The following four questions were addressed in the user study:
Table 2. Photo collections and topics used during the user study
for adapting the facet distance aggregation functions described in Section 5.3 was
deactivated for the whole experiment.) After the completion of the last task,
the participants were asked to assess the usability of the different approaches.
Furthermore, feedback was collected pointing out, e.g., missing functionality.
Test Collections. Four image collections were used during the study. They
were drawn from a personal photo collection of the authors.13 Each collection
comprises 350 images – except the first collection (used for the introduction of
the user-interface) which only contains 250 images. All images were scaled down
to fit 600x600 pixels. For each of the collections 2 to 4, five non-overlapping topics
were chosen and the images annotated accordingly. These annotations served as
ground truth and were not shown to the participants. Table 2 shows the topics
for each collection. In total, 264 of the 1050 images belong to one of the 15 topics.
Retrieval Task. For the collections 2 to 4, the participants had to find five (or
more) representative images for each of the topics listed in Table 2. As guidance,
handouts were prepared that showed the topics – each one printed in a different
color –, an optional brief description and two or three sample images giving an
impression what to look for. Images representing a topic had to be marked with
the topic’s color. This was done by double-clicking on the thumbnail, which opened a floating dialog window presenting the image at a larger scale and allowing the participant to assign the image to a predefined topic by clicking a corresponding
button. As a result, the image was marked with the color representing the topic.
Further, the complete collection could be filtered by highlighting all thumbnails
classified to one topic. This was done by pressing the numeric key (1 to 5) for the
respective topic number. Highlighting was done by focusing a fish-eye lens on
every marked topic member and thus enlarging the corresponding thumbnails.
It was pointed out that the decision whether an image was representative for
a group was solely up to the participant and not judged otherwise. There was
no time limit for the task. However, the participants were encouraged to skip to
13 The collections and topic annotations are publicly available under
the Creative Commons Attribution-Noncommercial-Share Alike license,
http://creativecommons.org/licenses/by-nc-sa/3.0/ Please contact [email protected].
the next collection after approximately five minutes, as by then enough information would already have been collected.
6.2 Results
The user study was conducted with 30 participants – all of them graduate or
post-graduate students. Their age was between 19 and 32 years (mean 25.5)
and 40% were female. Most of the test persons (70%) were computer science
students, with half of them having a background in computer vision or user
interface design. 43% of the participants stated that they take photos on a regular
basis and 30% use software for archiving and sorting their photo collection. The
majority (77%) declared that they are open to new user interface concepts.
Table 3. Percentage of marked images (N = 914) categorized by focus region and topic
of the image in primary focus at the time of marking
Usage of Secondary Focus. For this part, we restrict ourselves to the interac-
tion with the last photo collection, where both P&Z (panning and zooming) and the lens could be used,
and the participants had had plenty of time (approximately 15 to 30 minutes
depending on the user) for practice. The question to be answered is how much the users actually made use of the secondary focus, which always contains some relevant images if the image in primary focus has a ground truth annotation.14
For each image marked by a participant, the location of the image at the time
of marking was determined. There are four possible regions: primary focus (only
the central image), extended primary focus (region covered by primary lens ex-
cept primary focus image), secondary focus and the remaining region. Further,
there are up to three cases for each region with respect to the (user-annotated or
ground truth) topic of the image in primary focus. Table 3 shows the frequencies
of the resulting eight possible cases. (Some combinations are impossible. E.g.,
the existence of a secondary focus implies some image in primary focus.) The
most interesting number is the one referring to images in secondary focus that
belong to the same topic as the primary because this is what the secondary focus
14 Ground truth annotations were never visible to the users.
is supposed to bring up. It comes close to the percentage of the primary focus
that – not surprisingly – is the highest. Ignoring the topic, (extended) primary
and secondary focus contribute almost equally, and less than 10% of the marked
images were not in focus – i.e. discovered only through P&Z.
Emerging Search Strategies. For this part we again analyze only interac-
tion with the combined interface. A small group of participants excessively used
P&Z. They increased the initial thumbnail size in order to better perceive the
depicted contents and chose to display all images as thumbnails. To reduce the
overlap of thumbnails, they operated on a deeper zoom level and therefore had
to pan a lot. The gaze data shows a tendency for systematic sequential scans
which were however difficult due to the scattered and irregular arrangement
of the thumbnails. Further, some participants occasionally marked images not
in focus because of being attracted by dominant colors (e.g. for the aquarium
topic). Another typical strategy was to quickly scan through the collection by
moving the primary focus – typically with small thumbnail size and at a zoom
level that showed most of the collection except for the outer regions. In this case the
attention was mostly at the (extended) primary focus region with the gaze scan-
ning in which direction to explore further and little to moderate attention at the
secondary focus. Occasionally, participants would freeze the focus or slow down
for some time to scan the whole display. In contrast to this rather continuous
change of the primary focus, there was a group of participants that browsed
the collection mostly by moving (in a single click) the primary focus to some
secondary focus region – much like navigating an invisible neighborhood graph.
Here, the attention was concentrated onto the secondary focus regions.
In fact, it is even possible to adapt the similarity metric used for the nearest
neighbor queries automatically to the task of finding more images of the same topic, as shown in recent experiments [49]. This opens an interesting research
direction for future work.
7 Conclusion
A common approach for exploratory retrieval scenarios is to start with an over-
view from where the user can decide which regions to explore further. The focus-
adaptive SpringLens visualization technique described in this paper addresses
the following three major problems that arise in this context:
1. Approaches that rely on dimensionality reduction techniques to project the
collection from high-dimensional feature space onto two dimensions inevitably
face projection errors: Some tracks will appear closer than they actually are
and, on the other hand, some tracks that are distant in the projection may in
fact be neighbors in the original space.
2. Displaying all tracks at once becomes infeasible for large collections because
of limited display space and the risk of overwhelming the user with the
amount of information displayed.
3. There is more than one way to look at a music collection – or more specifically
to compare two music pieces based on their features. Each user may have a
different way and a retrieval system should account for this.
The first problem is addressed by introducing a complex distortion of the vi-
sualization that adapts to the user’s current region of interest and temporarily
alleviates possible projection errors in the focused neighborhood. The amount
of displayed information can be adapted by the application of several sparser
filters. Concerning the third problem, the proposed user-interface allows users
to (manually) adapt the underlying similarity measure used to compute the ar-
rangement of the tracks in the projection of the collection. To this end, weights
can be specified that control the importance of different facets of music similarity
and further an aggregation function can be chosen to combine the facets.
Following a user-centered design approach with focus on usability, a prototype
system has been created by iteratively alternating between development and
evaluation phases. For the final evaluation, an extensive user study including
gaze analysis using an eye-tracker has been conducted with 30 participants. The
results prove that the proposed interface is helpful while at the same time being
easy and intuitive to use.
Acknowledgments
This work was supported in part by the German National Merit Foundation,
the German Research Foundation (DFG) under the project AUCOMA, and the
European Commission under FP7-ICT-2007-C FET-Open, contract no. BISON-
211898. The user study was conducted in collaboration with Christian Hentschel
who also took care of the image feature extraction. The authors would further like
to thank all testers and the participants of the study for their time and valuable
feedback for further development, Tobias Germer for sharing his ideas and code
of the original SpringLens approach [9], Sebastian Loose who has put a lot of
work into the development of the filter and zoom components, the developers of
CoMIRVA [40] and JAudio [28] for providing their feature extractor code and
George Tzanetakis for providing insight into his MIREX ’07 submission [52].
The Landmark MDS algorithm has been partly implemented using the MDSJ
library [1].
References
1. Algorithmics Group: MDSJ: Java library for multidimensional scaling (version 0.2),
University of Konstanz (2009)
2. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky?
Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
3. Baumann, S., Halloran, J.: An ecological approach to multimodal subjective music
similarity perception. In: Proc. of 1st Conf. on Interdisciplinary Musicology (CIM
2004), Graz, Austria (April 2004)
4. Cano, P., Kaltenbrunner, M., Gouyon, F., Batlle, E.: On the use of fastmap for
audio retrieval and browsing. In: Proc. of the 3rd Int. Conf. on Music Information
Retrieval (ISMIR 2002) (2002)
5. De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational geom-
etry: algorithms and applications. Springer, New York (2008)
6. Diakopoulos, D., Vallis, O., Hochenbaum, J., Murphy, J., Kapur, A.: 21st century
electronica: MIR techniques for classification and performance. In: Proc. of the 10th
Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 465–469 (2009)
7. Donaldson, J., Lamere, P.: Using visualizations for music discovery. Tutorial at the
10th Int. Conf. on Music Information Retrieval (ISMIR 2009) (October 2009)
8. Gasser, M., Flexer, A.: Fm4 soundpark: Audio-based music recommendation in
everyday use. In: Proc. of the 6th Sound and Music Computing Conference (SMC
2009), Porto, Portugal (2009)
9. Germer, T., Götzelmann, T., Spindler, M., Strothotte, T.: Springlens: Distributed
nonlinear magnifications. In: Eurographics 2006 - Short Papers, pp. 123–126. Eu-
rographics Association, Aire-la-Ville (2006)
10. Gleich, M.R.D., Zhukov, L., Lang, K.: The World of Music: SDP layout of high
dimensional data. In: Info Vis 2005 (2005)
11. van Gulik, R., Vignoli, F.: Visual playlist generation on the artist map. In: Proc.
of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 520–523
(2005)
12. Hitchner, S., Murdoch, J., Tzanetakis, G.: Music browsing using a tabletop display.
In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp.
175–176 (2007)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the
curse of dimensionality. In: Proc. of the 13th ACM Symposium on Theory of Com-
puting (STOC 1998), pp. 604–613. ACM, New York (1998)
14. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)
15. Julia, C.F., Jorda, S.: SongExplorer: a tabletop application for exploring large
collections of songs. In: Proc. of the 10th Int. Conf. on Music Information Retrieval
(ISMIR 2009), pp. 675–680 (2009)
16. Knees, P., Pohle, T., Schedl, M., Widmer, G.: Exploring Music Collections in Vir-
tual Landscapes. IEEE MultiMedia 14(3), 46–54 (2007)
17. Kohonen, T.: Self-organized formation of topologically correct feature maps. Bio-
logical Cybernetics 43(1), 59–69 (1982)
18. Kruskal, J., Wish, M.: Multidimensional Scaling. Sage, Thousand Oaks (1986)
19. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image
search. In: Proc. 12th Int. Conf. on Computer Vision (ICCV 2009) (2009)
20. Leitich, S., Topf, M.: Globe of music - music library visualization using geosom.
In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp.
167–170 (2007)
21. Lillie, A.S.: MusicBox: Navigating the space of your music. Master’s thesis, MIT
(2008)
22. Lloyd, S.: Automatic Playlist Generation and Music Library Visualisation with
Timbral Similarity Measures. Master’s thesis, Queen Mary University of London
(2009)
23. Lübbers, D.: SoniXplorer: Combining visualization and auralization for content-
based exploration of music collections. In: Proc. of the 6th Int. Conf. on Music
Information Retrieval (ISMIR 2005), pp. 590–593 (2005)
24. Lübbers, D., Jarke, M.: Adaptive multimodal exploration of music collections. In:
Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp.
195–200 (2009)
25. Lux, M.: Caliph & emir: Mpeg-7 photo annotation and retrieval. In: Proc. of the
17th ACM Int. Conf. on Multimedia (MM 2009), pp. 925–926. ACM, New York
(2009)
26. Mandel, M., Ellis, D.: Song-level features and support vector machines for music
classification. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR
2005), pp. 594–599 (2005)
27. Martinez, J., Koenen, R., Pereira, F.: MPEG-7: The generic multimedia content
description standard, part 1. IEEE MultiMedia 9(2), 78–87 (2002)
28. McEnnis, D., McKay, C., Fujinaga, I., Depalle, P.: jAudio: A feature extraction
library. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR
2005), pp. 600–603 (2005)
29. Mörchen, F., Ultsch, A., Nöcker, M., Stamm, C.: Databionic visualization of music
collections according to perceptual distance. In: Proc. of the 6th Int. Conf. on
Music Information Retrieval (ISMIR 2005), pp. 396–403 (2005)
30. Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and PocketSOMPlayer, al-
ternative interfaces to large music collections. In: Proc. of the 6th Int. Conf. on
Music Information Retrieval (ISMIR 2005), pp. 618–623 (2005)
31. Nielsen, J.: Usability engineering. In: Tucker, A.B. (ed.) The Computer Science
and Engineering Handbook, pp. 1440–1460. CRC Press, Boca Raton (1997)
32. Nürnberger, A., Klose, A.: Improving clustering and visualization of multimedia
data using interactive user feedback. In: Proc. of the 9th Int. Conf. on Information
Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU
2002), pp. 993–999 (2002)
33. Oliver, N., Kreger-Stickles, L.: PAPA: Physiology and purpose-aware automatic
playlist generation. In: Proc. of the 7th Int. Conf. on Music Information Retrieval
(ISMIR 2006) (2006)
34. Pampalk, E., Dixon, S., Widmer, G.: Exploring music collections by browsing dif-
ferent views. In: Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR
2003), pp. 201–208 (2003)
35. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization
of music archives. In: Proc. of the 10th ACM Int. Conf. on Multimedia (MULTI-
MEDIA 2002), pp. 570–579. ACM Press, New York (2002)
36. Pauws, S., Eggen, B.: PATS: Realization and user evaluation of an automatic
playlist generator. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval
(ISMIR 2002) (2002)
37. Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic models and self-
organizing maps to create a hierarchical structuring of music by musical styles. In:
Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
38. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval.
Information Processing & Management 24(5), 513–523 (1988)
39. Sarmento, L., Gouyon, F., Costa, B., Oliveira, E.: Visualizing networks of music
artists with RAMA. In: Proc. of the Int. Conf. on Web Information Systems and
Technologies, Lisbon (2009)
40. Schedl, M.: The CoMIRVA Toolkit for Visualizing Music-Related Data. Technical
report, Johannes Kepler University Linz (June 2006)
41. Shneiderman, B.: Tree visualization with tree-maps: 2-d space-filling approach.
ACM Trans. Graph 11(1), 92–99 (1992)
42. de Silva, V., Tenenbaum, J.: Sparse multidimensional scaling using landmark
points. Tech. rep., Stanford University (2004)
43. de Silva, V., Tenenbaum, J.B.: Global versus local methods in nonlinear dimensionality reduction. In: Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 705–712 (2002)
44. Stavness, I., Gluck, J., Vilhan, L., Fels, S.S.: The mUSICtable: A map-based ubiq-
uitous system for social interaction with a digital music collection. In: Kishino,
F., Kitamura, Y., Kato, H., Nagata, N. (eds.) ICEC 2005. LNCS, vol. 3711, pp.
291–302. Springer, Heidelberg (2005)
45. Stober, S., Hentschel, C., Nürnberger, A.: Evaluation of adaptive springlens - a
multi-focus interface for exploring multimedia collections. In: Proc. of the 6th
Nordic Conference on Human-Computer Interaction (NordiCHI 2010), Reykjavik,
Iceland (October 2010)
46. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of
music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 53–65. Springer, Heidelberg (2010)
47. Stober, S., Nürnberger, A.: A multi-focus zoomable interface for multi-facet ex-
ploration of music collections. In: Proc. of the 7th Int. Symposium on Computer
Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 339–354 (June
2010)
48. Stober, S., Nürnberger, A.: MusicGalaxy - an adaptive user-interface for ex-
ploratory music retrieval. In: Proc. of the 7th Sound and Music Computing Con-
ference (SMC 2010), Barcelona, Spain, pp. 382–389 (July 2010)
49. Stober, S., Nürnberger, A.: Similarity adaptation in an exploratory retrieval sce-
nario. In: Detyniecki, M., Knees, P., Nürnberger, A., Schedl, M., Stober, S. (eds.)
Post-Proceedings of the 8th International Workshop on Adaptive Multimedia Re-
trieval (AMR 2010), Linz, Austria (2010)
50. Stober, S., Steinbrecher, M., Nürnberger, A.: A survey on the acceptance of listen-
ing context logging for MIR applications. In: Baumann, S., Burred, J.J., Nürnberger,
A., Stober, S. (eds.) Proc. of the 3rd Int. Workshop on Learning the Semantics of
Audio Signals (LSAS), Graz, Austria, pp. 45–57 (December 2009)
51. Torrens, M., Hertzog, P., Arcos, J.L.: Visualizing and exploring personal music
libraries. In: Proc. of the 5th Int. Conf. on Music Information Retrieval (ISMIR
2004) (2004)
52. Tzanetakis, G.: Marsyas submission to MIREX 2007. In: Proc. of the 8th Int. Conf.
on Music Information Retrieval (ISMIR 2007) (2007)
53. Vignoli, F., Pauws, S.: A music retrieval system based on user driven similarity
and its evaluation. In: Proc. of the 6th Int. Conf. on Music Information Retrieval
(ISMIR 2005), pp. 272–279 (2005)
54. Whitman, B., Ellis, D.: Automatic record reviews. In: Proc. of the 5th Int. Conf.
on Music Information Retrieval (ISMIR 2004) (2004)
55. Williams, C.K.I.: On a connection between kernel pca and metric multidimensional
scaling. Machine Learning 46(1-3), 11–19 (2002)
56. Wolter, K., Bastuck, C., Gärtner, D.: Adaptive user modeling for content-based
music retrieval. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 40–52. Springer, Heidelberg (2010)
A Database Approach to Symbolic Music
Content Management
1 Introduction
The presence of music on the web has grown exponentially over the past decade.
Music comes in multiple representations (audio files, MIDI files, printable music scores, ...)
and is easily accessible through numerous platforms. Given the availability of several
compact formats, the main representation is by means of audio files, which
provide immediate access to music content and are easily spread, sampled and
listened to. However, extracting structured information from an audio file is a
difficult (if not impossible) task, since it is subject to the subjectivity of in-
terpretation. On the other hand, symbolic music representation, usually derived
from musical scores, enables exploitation scenarios different from what audio files
may offer. The very detailed and unambiguous description of the music content
is of high interest for communities of music professionals, such as musicologists,
music publishers, or professional musicians. New online communities of users
are arising, with an interest in a more in-depth study of music than what average
music lovers may look for.
The interpretation of music content information (structure of musical
pieces, tonality, harmonic progressions, ...) combined with meta-data (historic
and geographic context, author and composer names, ...) is a matter of human
expertise. Specific music analysis tools have been developed by music profes-
sionals for centuries, and should now be scaled up in order to provide
scientific and efficient analysis of large collections of scores.
Another fundamental need of such online communities, this time shared with
more traditional platforms, is the ability to share content and knowledge, as
well as to annotate, compare and correct all this available data. This Web 2.0
space with user-generated content helps improve and accelerate research, with
the added bonus of making sources available to a larger audience that would
otherwise remain confidential. Questions of copyright, security and controlled
contributions are part of the issues raised by such social networks.
To summarize, a platform designed to be used mainly, but not only, by music
professionals should offer classic features such as browsing and rendering, but
also the ability to upload new content, annotate scores, search by content
(exact and similarity search), and several tools for music content manipulation
and analysis.
Most of the proposals devoted so far to analysis methods or similarity searches
on symbolic music focus on the accuracy and/or relevance of the result, and
implicitly assume that these procedures apply to a small collection [2,3,13,16].
While useful, this approach gives rise to several issues when the collection consists
of thousands of scores with heterogeneous descriptions.
A first issue is related to software engineering and architectural concerns. A
large score digital library provides several services to many different users or
applications. Consistency, reliability, and security concerns call for the definition
of a single consistent data management interface for these services. In particu-
lar, one can hardly envisage the publication of ad-hoc search procedures that
merely expose the collection of methods and algorithms developed for each spe-
cific retrieval task. The multiplication of these services would quickly overwhelm
external users. Worse, the combination of these functions, which is typically a
difficult matter, would be left to external applications. In complex systems, the
ability to fluently compose data manipulation operators is key to both
expressive power and computational efficiency.
Therefore, on-line communities dealing with scores and/or musical content are
often limited either by the size of their corpus or the range of possible operations,
with only one publicized strong feature. Examples of on-line communities include
Mutopia [26], MelodicMatch [27] or Musipedia [28]. Wikifonia [29] offers a wider
range of services, allowing registered users to publish and edit sheet music. One
can also cite the OMRSYS platform described in [8].
A second issue pertains to scalability. With the ongoing progress in digiti-
zation, optical recognition and user content generation, we must be ready to
face a significant growth in the volume of music content that must be handled
by institutions, libraries, or publishers. Optimizing accesses to large datasets is
a delicate matter which involves many technical aspects that embrace physical
storage, indexing and algorithmic strategies. Such techniques are usually sup-
ported by a specialized data management system which releases applications
from the burden of low-level and intricate implementation concerns.
To our knowledge, no system currently exists that is able to handle large heterogeneous
Music Digital Libraries while smoothly combining data manipulation operators.
The HumDrum toolkit is a widely used automated musicological
analysis tool [14,22], but its representation remains at a low level. A HumDrum-based
system lacks flexibility and depends too heavily on how files are stored, which makes
the development of indexing or optimization techniques difficult. Another
possible approach would be a system based on MusicXML, an XML-based file
format [12,24]. It has recently been suggested that XQuery may be used over Mu-
sicXML for music queries [11], but XQuery is a general-purpose query language
that hardly adapts to the specifics of symbolic music manipulation.
Our objective in this paper is to lay the groundwork for a score management sys-
tem with all the features of a Digital Scores Library combined with content
manipulation operators. Among other things, a crucial component of such a sys-
tem is a logical data model specifically designed for symbolic music management,
together with its associated query language. Our approach is based on the idea that the
management of structured scores corresponds, at the core level, to a limited set
of fundamental operations that can be defined and implemented once and for all. We
also take into account the fact that the wide range of user needs calls for the
ability to associate these operations with user-defined functions at early steps of
the query evaluation process. Modeling the invariant operators and combining
them with user-defined operations is the main goal of our design effort. Among
numerous advantages, this allows the definition of a stable and robust query()
service which does not need ad-hoc extensions as new requirements arrive.
We do not claim (yet) that our model and its implementation will scale easily,
but a high-level representation such as ours is a prerequisite for the flexibility
required by such future optimizations.
Section 3.2 describes in further detail the Neuma platform, a Digital Score
Library [30] devoted to large collections of monodic and polyphonic music from
the French Modern Era (16th-18th centuries). One of the central pieces of
the architecture is the data model that we present in this paper. The language
described in Section 5 offers a generic mechanism to search and transform music
notation.
The rest of this paper first discusses related work (Section 2). Section 3
presents the motivation and the context of our work. Section 4 then exposes the
formal foundations of our model. Section 6 concludes the paper.
2 Related Work
The past decade has witnessed a growing interest in techniques for representing,
indexing and searching (by content) music documents. The domain is commonly
termed “Music Information Retrieval” (MIR) although it covers many aspects
beyond the mere process of retrieving documents. We refer the reader to [19]
for an introduction. Systems can manipulate music either as audio files or in
symbolic form. The symbolic representation offers a structured representation
which is well suited for content-based accesses, sophisticated manipulations, and
analysis [13].
An early attempt to represent scores as structured files and to develop search
and analysis functions is the HumDrum format. Both the representation and
the procedures are low-level (text files, Unix commands), which makes them dif-
ficult to integrate into complex applications. Recent works try to overcome these
limitations [22,16]. Musipedia proposes several kinds of interfaces to search the
database by content. MelodicMatch is a similar piece of software that analyses music through
pattern recognition, enabling searches for musical phrases in one or more pieces.
MelodicMatch can search for melodies, rhythms and lyrics in MusicXML files.
The computation of similarity between music fragments is a central issue in
MIR systems [10]. Most proposals focus on comparisons of melodic profiles.
Because music is subject to many small variations, approximate search is in or-
der, and the problem is actually that of finding nearest neighbors to a given
pattern. Many techniques have been investigated, varying in the
melodic encoding and the similarity measure; see [9,4,1,7] for some recent pro-
posals. The Dynamic Time Warping (DTW) distance is a well-known
measure popular in speech recognition [21,20]. It allows the non-linear mapping of one
signal to another by minimizing the distance between the two. The DTW dis-
tance is usually chosen over the less flexible Euclidean distance for time series
alignment [5]. The DTW computation is rather slow, but recent works show that
it can be efficiently indexed [25,15].
We are not aware of any general approach to model and query music nota-
tion. A possible approach would be to use XQuery over MusicXML documents,
as suggested in [11]. XQuery is a general-purpose query language, and its use for
music scores yields complicated expressions that hardly adapt to the specifics
of the object representation (e.g., temporal sequences). We believe that a ded-
icated language is both more natural and more efficient. The temporal function
approach outlined here can be related to time series management [17].
3 Architecture
3.1 Approach Overview
Figure 1 outlines the main components of a Score Management System built
around our data model. Basically, the architecture is that of a standard DBMS.
The Neuma platform is meant to interact with remote web applications whose
local databases store corpus-specific information. The purpose of Neuma
is to manage all music content information and to leave contextual data (author,
composer, date of publication, ...) to the client application.
To send a new document, an application calls the register() service. So far
only MusicXML documents can be used to exchange musical descriptions, but
any format could be used provided the corresponding mapping function is in
place. Since MusicXML is widely used, it is sufficient for now. The mapping
function extracts a representation of the music content of the document which
complies with our data model.
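To make the idea of the mapping function concrete, the sketch below (our own illustration, not the Neuma code) extracts a flat list of (MIDI pitch, duration) events from a MusicXML file using Python's standard library; function and variable names are hypothetical, and real MusicXML handling (chords, ties, voices, per-part divisions) is considerably richer.

import xml.etree.ElementTree as ET

STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def musicxml_to_events(path):
    # Hypothetical mapping sketch: one (MIDI pitch, duration) pair per <note>.
    # Chords, ties and per-part divisions are ignored here.
    events = []
    for note in ET.parse(path).getroot().iter("note"):
        pitch, duration = note.find("pitch"), note.find("duration")
        if pitch is None or duration is None:
            continue  # skip rests and malformed notes
        step = pitch.findtext("step")
        octave = int(pitch.findtext("octave"))
        alter = int(pitch.findtext("alter") or 0)
        midi = 12 * (octave + 1) + STEP_TO_SEMITONE[step] + alter
        events.append((midi, int(duration.text)))
    return events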
To publish a score (whether it is a score stored in the database or a modified
one), the render() service is called. The render() service is based on the Lilypond
package. The generator takes an instance of our model as input and converts it
into a Lilypond file. The importance of a unified data model appears clearly in
such an example: the render() service is based on the model, making it rather
easy to visualize a transformed score, whereas it would be much more difficult to
do so if it were instead solely based on the document format.
A large collection of scores would be useless without an appropriate query()
service allowing reliable search by content. As explained before, the Neuma dig-
ital library stores all music content (originating from different collections, po-
tentially with heterogeneous descriptions) in its repository and leaves descriptive
contextual data specific to collections in local databases. Regardless of their
original collection, music contents comply with our data model so that they can be
queried accordingly. Several query types are offered: exact, transposed, with or
without rhythm, or contour, which only takes into account the shape of the input
melody. The query() service combines content search with descriptive data. A
virtual keyboard is provided to enter music content, and search fields can be
filled to address the local databases.
The Neuma platform also provides an annotate() service. The annotations
are a great way to enrich the digital library and make sure it keeps growing and
improving. In order to use the annotate() service, one first selects part of a score
(a set of elements of the score) and enters information about this portion. There
are different kinds of annotations: free text (useful for performance indications),
or pre-selected terms from an ontology (for identifying music fragments). Anno-
tations can be queried alongside the other criteria previously mentioned.
Note that the time domain is shared by the vocal part and the piano part.
For the same score, one could have made the different choice of a schema where
vocal and piano are represented in the same time series, of type TS([vocal ×
polyMusic]).
The domain vocal adds the lyrics domain to the classic music domain:
dom_vocals = dom_pitch × dom_rythm × dom_lyrics.
We will now define two sets of operators, gathered into two algebras: the
(extended) relational algebra and the time series algebra.
Projection, π. We want the vocal parts from the Score schema without the
piano part. We project the piano out:
π_vocals(Score)
Product, ×. Consider a collection of duets, split into the individual vocal parts
of male and female singers, with the following schemas.
Note that the time domain is implicitly shared. In itself, the product does not
have much interest, but together with the selection operator it becomes the join
operator ⋈. In the previous example, we should not blindly associate any male
and female vocal parts, but only the ones sharing the same Id.
The time series equivalent of the null attribute is the empty time series, for
which each event is ⊥. Beyond classical relational operators, we introduce an
emptiness test ∅? that operates on voices and is modeled as: ∅?(s) = false if
∀t, s(t) = ⊥, and ∅?(s) = true otherwise. The emptiness test can be introduced in the selection
formulas of the σ operator.
Consider once more the relation Score. We want to select all scores featuring
the word 'Ave' in the lyrics part of V. We need a user function m: lyrics →
{⊥, ⊤} such that m(l) = ⊤ if l contains 'Ave', and ⊥ otherwise. Lyrics not containing 'Ave' are trans-
formed into the "empty" time series t → ⊥, ∀t. The algebraic expression then
applies m to the lyrics voice and keeps the scores whose resulting time series is not empty.
We now present the operators of the time series algebra Alg(TS) = (◦, ⊕, A). Each
operator takes one or more time series as input and produces a time series;
operating in closed form this way, operators can be composed. They allow, in
particular: the alteration of the time domain in order to focus on specific instants
(external composition); the application of a user function to one or more time series to form
a new one (addition); and the windowing of time series fragments for matching
purposes (aggregation).
In what follows, we take an instance s of the Score schema to run several
examples.
The external composition ◦ composes a time series s with an internal temporal
function l. Assume our score s has two movements, and we only want the second
one. Let shift_n be an element of the family of shift functions, parametrized by a
constant n ∈ N. For any t ∈ T, s ◦ shift_n(t) = s(t + n). In other words, s ◦ shift_n is
the time series extracted from s where the first n events are ignored. We compose
s with the shift_L function, where L is the length of the first movement, and the
resulting time series s ◦ shift_L is our result.
Imagine now that we want only the first note of every measure. Assuming
we are in 4/4 and the time unit is a sixteenth note, we compose s with warp_16. The
time series s ◦ warp_16 is the time series where only one event out of sixteen is
considered, hence the first note of every measure.
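A minimal Python sketch may make the external composition operator concrete. Voices are modeled as functions from a discrete time domain to values, with None playing the role of ⊥; the names and the toy voice below are ours, not part of the paper's implementation.

def compose(s, gamma):
    # External composition: (s ∘ γ)(t) = s(γ(t)).
    return lambda t: s(gamma(t))

def shift(n):
    # shift_n(t) = t + n: the composed series ignores the first n events.
    return lambda t: t + n

def warp(k):
    # warp_k(t) = k·t: the composed series keeps one event out of k.
    return lambda t: k * t

pitches = [60, 62, 64, 65, 67, 69, 71, 72]                    # a toy voice
s = lambda t: pitches[t] if 0 <= t < len(pitches) else None   # None stands for ⊥

second_part = compose(s, shift(4))   # drops the first 4 events
downbeats = compose(s, warp(4))      # one event out of four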
We now give an example of the addition operator ⊕. Let dom_pitch be the
domain of all musical notes and dom_int the domain of all musical intervals. We
can define an operation from dom_pitch × dom_pitch to dom_int, called the harm
operator, which takes two notes as input and computes the interval between
them. Given two time series each representing a vocal part, for instance
V1 = soprano and V2 = alto, we can define the time series
V1 ⊕_harm V2
of the harmonic progression (i.e., the sequence of intervals realized by the jux-
taposition of the two voices).
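The addition operator can be sketched in the same style, applying a user function pointwise to two synchronized voices. The harm function below simply returns the signed interval in semitones, an illustrative choice rather than the paper's exact definition.

def add(s1, s2, f):
    # Addition operator ⊕_f: pointwise application of a user function.
    return lambda t: None if s1(t) is None or s2(t) is None else f(s1(t), s2(t))

def harm(p1, p2):
    # Interval (in semitones) between two simultaneous pitches.
    return p2 - p1

soprano = lambda t: [72, 74, 76][t] if 0 <= t < 3 else None
alto = lambda t: [67, 69, 72][t] if 0 <= t < 3 else None

harmonic_progression = add(soprano, alto, harm)
print([harmonic_progression(t) for t in range(3)])   # [-5, -5, -4]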
Last, we present the aggregation mechanism A. A typical operation that we
cannot yet obtain is "windowing" which, at each instant, considers a local
part of a voice s and derives a value from this restricted view. A canonical
example is pattern matching: so far we have no generic way to compare a pattern
P with all subsequences of a time series s. The intuitive way to do pattern
matching is to build all subsequences from s and compare P with each of them.
[Figure: illustration of the derivation step, associating to each instant τ a local window of the voice, for instance from τ − 10 to τ + 10]
The aggregation step takes the family of time series obtained with the deriva-
tion step and applies a user function to all of them. For our ongoing example,
this translates into applying the user function DTW_P (which computes the DTW
distance between the input time series and the pattern P) to the derived family
d_shift(s). We denote this two-step procedure by A_{DTW_P, Shift}(s).
Consider, as a complete example, a query asking for all scores where the Dynamic
Time Warping (DTW) distance between the pattern P and a voice V is less
than 5. First, we compute the DTW distance between P and the voice V1 of a
score, at each instant, with the following Alg(TS) expression:
e = A_{dtw_P, Shift}(V1),
where dtw_P is the function computing the DTW distance between a given time
series and the pattern P. Expression e defines a time series that gives, at each
instant α, the DTW distance between P and the sub-series of V1 that begins at
α. A selection (from Alg(TS)) keeps the values of e below 5, all others being set
to ⊥. Let ψ be the formula that expresses this condition. Then:
e' = σ_ψ(e).
Finally, expression e' is applied to all the scores in the Score relation with the Π
operator. An emptiness test can be used to eliminate those for which the DTW
is always higher than 5 (hence, e' is empty).
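The following sketch gives one possible reading of this derivation/aggregation pipeline with the Shift family and a DTW aggregation function, working on finite lists rather than formal time series; the fixed window length equal to the pattern length is an assumption made for illustration.

import numpy as np

def dtw(a, b):
    # Classic O(|a|·|b|) dynamic time warping distance.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def aggregate_dtw_shift(voice, pattern):
    # A_{dtw_P, Shift}(voice): at each instant t, the DTW distance between
    # the pattern and the sub-series of the voice starting at t.
    w = len(pattern)
    return [dtw(voice[t:t + w], pattern) for t in range(len(voice) - w + 1)]

voice = [60, 62, 64, 65, 67, 65, 64, 62, 60]
pattern = [64, 65, 67]
e = aggregate_dtw_shift(voice, pattern)
matches = [t for t, d in enumerate(e) if d < 5]   # the selection σ_ψ of the example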
5.1 Overview
The language is implemented in OCaml. Time series are represented by lists of
arrays of elements, where elements can be integers, floats, strings, booleans, or any
previously defined type. This way we allow ourselves to synchronize voices of
different types. The only restriction is that we do not allow an element to be a
time series. We define the time series type ts_t as a pair (string * element),
where the string is the name of the voice, allowing fast access when searching for a
specific voice.
The from clause should list at least one table from the database; the alias
is optional.
The let clause is optional. There can be as many let clauses as desired. Voices
modified in a let clause come from attributes present in the tables listed in the from
clause.
The construct clause should list at least one attribute, either from one of the
tables listed in the from clause or a modified attribute from a let clause.
The where clause is optional. It consists of a list of predicates connected by
logical operators (And, Or, Not).
Query analysis. The expression entered by the user (either a query or a user-
defined function) is analyzed by the system.
The system first verifies the query's syntax. The query is then turned into an abstract
syntax tree in which each node is an algebraic operation (from bottom to top):
– Projection
algebraic notation: Π_A(R), where A is a set of attributes and R a relation
syntactic equivalent: construct
Example: the expression
Π_{id,voice}(σ_{composer='Faure'}(Psalms))
is equivalent to
from Score
construct id, voice
where composer='Faure'
– Selection
algebraic notation: σ_F(V), where F is a formula and V a set of voices of a
time series S
syntactic equivalent: where S->V ... contains
Example: the expression
Π_{id, Π_{pitch,rythm}(voice)}(σ_{Π_{lyrics}(voice) ⊃ 'Heureux les hommes', composer='Faure'}(Psalms))
is equivalent to
from Score
construct id, voice->(pitch,rythm)
where voice->lyrics contains 'Heureux les hommes'
and composer = 'Faure'
– Addition
• the expression
Π_{pitch(voice) ⊕_{transpose(1)}}(Psalms)
is equivalent to
from Psalms
let $transpose := map(transpose(1), voice->pitch)
construct $transpose
• the expression
Π_{trumpet(voice) ⊕_{harm} clarinet(voice)}(Duets)
is equivalent to
from Duets
let $harmonic_progression :=
map2(harm, trumpet->pitch, clarinet->pitch)
construct $harmonic_progression
– Composition
algebraic notation: S ◦ γ, where γ is an internal temporal function and S a
time series
syntactic equivalent: comp(S, γ)
– Aggregation - derivation
algebraic notation: A_{λ,Γ}(S), where S is a time series, Γ is a family of internal
time functions and λ is an aggregation function
syntactic equivalent: derive(S, Γ, λ)
The family of internal time functions Γ is a mapping from the time domain
into the set of internal time functions. Precisely, for each instant n, Γ(n) = γ,
an internal time function. The two most commonly used families of time functions,
Shift and Warp, are provided.
Example: the expression
Π_{id, A_{dtw(P), Shift}(voice)}(Psalm)
is equivalent to
from Psalm
let $dtwVal := derive(voice, Shift, dtw(P))
construct id, $dtwVal
References
1. Allan, H., Müllensiefen, D., Wiggins, G.A.: Methodological Considerations in
Studies of Musical Similarity. In: Proc. Intl. Society for Music Information Retrieval
(ISMIR) (2007)
2. Anglade, A., Dixon, S.: Characterisation of Harmony with Inductive Logic
Programming. In: Proc. Intl. Society for Music Information Retrieval (ISMIR)
(2008)
3. Anglade, A., Dixon, S.: Towards Logic-based Representations of Musical Har-
mony for Classification Retrieval and Knowledge Discovery. In: MML (2008)
4. Berman, T., Downie, J., Berman, B.: Beyond Error Tolerance: Finding Thematic
Similarities in Music Digital Libraries. In: Proc. European. Conf. on Digital Li-
braries, pp. 463–466 (2006)
5. Berndt, D., Clifford, J.: Using dynamic time warping to find patterns in time series.
In: AAAI Workshop on Knowledge Discovery in Databases, pp. 229–248 (1994)
6. Brockwell, P.J., Davis, R.: Introduction to Time Series and forecasting. Springer,
Heidelberg (1996)
7. Cameron, J., Downie, J.S., Ehmann, A.F.: Human Similarity Judgments: Impli-
cations for the Design of Formal Evaluations. In: Proc. Intl. Society for Music
Information Retrieval, ISMIR (2007)
8. Capela, A., Rebelo, A., Guedes, C.: Integrated recognition system for music scores.
In: Proc. of the 2008 International Computer Music Conference (2008)
9. Downie, J., Nelson, M.: Evaluation of a simple and effective music information
retrieval method. In: Proc. ACM Symp. on Information Retrieval (2000)
10. Downie, J.S.: Music Information Retrieval. Annual review of Information Science
and Technology 37, 295–340 (2003)
11. Ganseman, J., Scheunders, P., D’haes, W.: Using XQuery on MusicXML Databases
for Musicological Analysis. In: Proc. Intl. Society for Music Information Retrieval,
ISMIR (2008)
12. Good, M.: MusicXML in practice: issues in translation and analysis. In: Proc. 1st
Internationl Conference on Musical Applications Using XML, pp. 47–54 (2002)
13. Haus, G., Longari, M., Pollastri, E.: A Score-Driven Approach to Music Information
Retrieval. Journal of American Society for Information Science and Technology 55,
1045–1052 (2004)
14. Huron, D.: Music information processing using the HumDrum toolkit: Concepts,
examples and lessons. Computer Music Journal 26, 11–26 (2002)
15. Keogh, E.J., Ratanamahatana, C.A.: Exact Indexing of Dynamic Time Warping.
Knowl. Inf. Syst. 7(3), 358–386 (2003)
16. Knopke, I. : The Perlhumdrum and Perllilypond Toolkits for Symbolic Music Infor-
mation Retrieval. In: Proc. Intl. Society for Music Information Retrieval (ISMIR)
(2008)
17. Lee, J.Y., Elmasri, R.: An EER-Based Conceptual Model and Query Language for
Time-Series Data. In: Proc. Intl.Conf. on Conceptual Modeling, pp. 21–34 (1998)
18. Lerner, A., Shasha, D.: AQuery: Query language for ordered data, optimiza-
tion techniques and experiments. In: Proc. of the 29th VLDB Conference, Berlin,
Germany (2003)
19. Müller, M.: Information Retrieval for Music and Motion. Springer, Heidelberg
(2004)
20. Rabiner, L., Rosenberg, A., Levinson, S.: Considerations in dynamic time warping
algorithms for discrete word recognition. IEEE Trans. Acoustics, Speech and Signal
Proc. ASSP-26, 575–582 (1978)
21. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken
word recognition. IEEE Trans. Acoustics, Speech and Signal Proc. ASSP-26, 43–
49 (1978)
22. Sapp, C.S.: Online Database of Scores in the Humdrum File Format. In: Proc. Intl.
Society for Music Information Retrieval, ISMIR (2005)
23. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey Of Music Information Retrieval
Systems. In: Proc. Intl. Society for Music Information Retrieval, ISMIR (2005)
24. Viglianti, R.: MusicXML : An XML based approach to automatic musicological
analysis. In: Conference Abstracts of the Digital Humanities (2007)
25. Zhu, Y., Shasha, D.: Warping Indexes with Envelope Transforms for Query by
Humming. In: Proc. ACM SIGMOD Symp. on the Management of Data, pp. 181–
192 (2003)
26. Mutopia, http://www.mutopiaproject.org (last viewed February 2011)
27. Melodicmatch, http://www.melodicmatch.com (last viewed February 2011)
28. Musipedia, http://www.musipedia.org (last viewed February 2011)
29. Wikifonia, http://www.wikifonia.org (last viewed February 2011)
30. Neuma, http://neuma.fr (last viewed February 2011)
Error-Tolerant Content-Based Music-Retrieval
with Mathematical Morphology
University of Helsinki
Department of Computer Science
[email protected]
{mika.laitinen,kjell.lemstrom,juho.vikman}@helsinki.fi
http://www.cs.helsinki.fi
1 Introduction
The snowballing amount of multimedia data and the number of databases publicly available
for anyone to explore and query have made the conventional text-based query
approach insufficient. To effectively query these databases in the digital era,
content-based methods tailored to the specific media have to be available.
In this paper we study the applicability of a mathematical framework for
retrieving music in symbolic, polyphonic music databases in a content-based
fashion. More specifically, we harness the mathematical morphology methodol-
ogy for locating approximate occurrences of a given musical query pattern in
a larger music database. To this end, we represent music symbolically using
the well-known piano-roll representation (see Fig. 1(b)) and cast it into a two-
dimensional binary image. The representation used resembles that of a previ-
ously used technique based on point-pattern matching [14,11,12,10]; the applied
methods themselves, however, are very different. The advantage of using our
novel approach is that it enables more flexible matching for polyphonic music,
allowing local jittering on both time and pitch values of the notes. This has been
problematic to achieve with the polyphonic methods based on the point-pattern
Fig. 1. (a) The first two measures of Bach’s Invention 1. (b) The same polyphonic
melody cast into a 2-D binary image. (c) A query pattern image with one extra note
and various time and pitch displacements. (d) The resulting image after a blur rank
order filtering operation, showing us the potential matches.
matching. Moreover, our approach provides the user with an intuitive, visual way
of defining the allowed approximations for a query in hand. In [8], Karvonen and
Lemström suggested the use of this framework for music retrieval purposes. We
extend and complement their ideas, introduce and implement new algorithms,
and carry out experiments to show their efficiency and effectiveness.
The motivation to use symbolic methods is twofold. Firstly, there is a mul-
titude of symbolic music databases where audio methods are naturally not of
use. In addition, the symbolic methods allow for distributed matching, i.e., oc-
currences of a query pattern are allowed to be distributed across the instruments
(voices) or to be hidden in some other way in the matching fragments of the poly-
phonic database. The corresponding symbolic and audio files may be aligned by
using mapping tools [7] in order to be able to play back the matching part in an
audio form.
2 Background
2.1 Related Work
Let us denote by P + f the translation of P by a vector f, i.e., vector f is added
to each of the m components of P separately: P + f = p1 + f, p2 + f, ..., pm + f.
Problem AP1 can then be expressed as the search for a subset I of T such that
P + f ≈ I for some f and some similarity relation ≈; in the original P1 setting,
the relation ≈ is replaced by the equality relation =. It is noteworthy that
the mathematical translation operation corresponds to two musically distinct
phenomena: a vertical move corresponds to transposition, while a horizontal move
corresponds to aligning the pattern and the database time-wise.
In [15], Wiggins et al. showed how to solve P1 and P2 in O(mn log(mn)) time.
First, the translations that map the maximal number of the m points of P to some
points of T (of n points) are collected. Then the set of such translation
vectors is sorted lexicographically, and finally the most frequent translation vector
is reported. If the reported vector f appears m times, it is also an occurrence for
P1. With a careful implementation of the sorting routine, the running time can be
improved to O(mn log m) [14]. For P1, one can use a faster algorithm working in
O(n) expected time and O(m) space [14].
In [5], Clifford et al. showed that problem P2 is 3SUM-hard, which means that it
is unlikely that one could find an algorithm for the problem with a subquadratic
running time.
Methods based on mathematical morphology (MM) process images in a very
similar way to conventional image filters. However, the focus in MM-based
methods is often on extracting attributes and geometrically meaningful data
from images, as opposed to generating filtered versions of images.
In MM, sets are used to represent objects in an image. In binary images, the
sets are members of the 2-D integer space Z². The two fundamental morpholog-
ical operations, dilation and erosion, are non-linear neighbourhood operations
on two sets. They are based on the Minkowski addition and subtraction [6]. Out
of the two sets, the typically smaller one is called the structuring element (SE).
Dilation performs a maximum on the SE, which has a growing effect on the
target set, while erosion performs a minimum on the SE and causes the target set
to shrink. Dilation can be used to fill gaps in an image, for instance, connecting
the breaks in letters in a badly scanned image of a book page. Erosion can be
used, for example, for removing salt-and-pepper type noise. One way to define
dilation is
A ⊕ B = {f ∈ Z² | (B̂ + f) ∩ A ≠ ∅},  (1)
where A is the target image, B is the SE, and B̂ its reflection (or rotation by
180 degrees). Accordingly, erosion can be written as
A ⊖ B = {f ∈ Z² | (B + f) ⊆ A}.  (2)
Erosion itself can be used for pattern matching: foreground pixels in the
resulting image mark the locations of the matches. Any shape, however, can be
found in an image filled with foreground. If the background also needs to match,
erosion has to be applied separately to the negations of the image and of the
structuring element. Intersecting these two erosions leads to the desired result.
This procedure is commonly known as the hit-or-miss transform or hit-miss
transform (HMT):
HMT(A, B1, B2) = (A ⊖ B1) ∩ (A^C ⊖ B2),  (3)
where B1 and B2 are the foreground and background parts of the query.
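For concreteness, erosion-based matching and the HMT on binary piano-roll images can be sketched with SciPy's morphology routines. This is our own illustration, not the authors' implementation; match positions are reported relative to the centre of the query image.

import numpy as np
from scipy import ndimage

def erosion_match(database, query):
    # A ⊖ B: foreground pixels mark positions where every query note is present.
    return ndimage.binary_erosion(database, structure=query)

def hit_or_miss_match(database, query):
    # HMT: query foreground must hit foreground, query background must hit background.
    return ndimage.binary_hit_or_miss(database, structure1=query, structure2=~query)

db = np.zeros((16, 32), dtype=bool)           # toy pitch-time image
db[5, 3] = db[7, 4] = db[9, 5] = True
q = np.zeros((5, 3), dtype=bool)              # a three-note query pattern
q[0, 0] = q[2, 1] = q[4, 2] = True
print(np.argwhere(erosion_match(db, q)))      # [[7 4]]: centre of the only occurrence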
3 Algorithms
In [8] Karvonen and Lemström introduced four algorithms based on the mathe-
matical morphology framework and gave their MATLAB implementations. Our
closer examination revealed common principles behind the four algorithms; three
of them were virtually identical to each other.
The principles on which the two algorithms introduced below rely are explained
by Bloomberg and Maragos [2]. With HMT as the main means of generalizing
erosion, they present three more generalizations, which can be combined in various ways. They
also name a few of the combinations. Although we can find some use for HMT,
its benefit is not significant in our case. Two of the other tricks, however, proved to be
handy, and particularly their combination, which is not mentioned by Bloomberg
and Maragos.
We start with erosion as the basic pattern matching operation. The problem
with erosion is its lack of flexibility: every note must match and no jittering is
tolerated. Performing the plain erosion solves problem P1. We present two ways
to gain flexibility.
– Allow partial matches. This is achieved by moving from P1 to P2.
– Handle jittering. This is achieved by moving from P1 to AP1.
Out of the pointset problems, only AP2 now remains unconsidered. It can be
solved, however, by combining the two tricks above. We will next explain how
these improvements can be implemented. First, we concentrate on the pointset
representation, then we will deal with line segments.
BHMT(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [(A^C ⊕ R2) ⊖ B2],  (4)
where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. The technique is
also applicable to plain erosion. We choose this method for jitter toleration
and call it blur erosion:
A ⊖_b (B, R) = (A ⊕ R) ⊖ B.  (5)
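Blur erosion itself is a two-line composition of standard operations. The sketch below (ours, under the rectangular-SE setting discussed next) dilates the database with a small jitter SE and then erodes it with the query; the default blur of one semitone and two time columns is an arbitrary illustrative choice.

import numpy as np
from scipy import ndimage

def blur_erosion(database, query, pitch_blur=1, time_blur=2):
    # (A ⊕ R) ⊖ B with a rectangular blur SE R of size (2·pitch_blur+1) x (2·time_blur+1).
    R = np.ones((2 * pitch_blur + 1, 2 * time_blur + 1), dtype=bool)
    blurred = ndimage.binary_dilation(database, structure=R)
    return ndimage.binary_erosion(blurred, structure=query)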
The shape of the preprocessing dilation SE does not have to be a disc. In our
case, where the dimensions under consideration are time and pitch, a natural
setting comprises user-specified thresholds for the two dimensions. This leads us to
rectangular SEs with efficient implementations. In practice, dilation is useful
in the time dimension, but applying it in the pitch dimension often results
in false (positive) matches. Instead, a blur of just one semitone is very useful,
because queries often contain pitch quantization errors.
[Fig. 2: relations between erosion, the hit-miss transform, the rank order filter, blur erosion, and their combinations (hit-miss rank order filter, blur hit-miss transform, blur rank order filter, blur hit-miss rank order filter)]
To be able to solve AP2, we combine these two. In order to correctly solve AP2, the
dilation has to be applied to the database image. With a blurred ROF, a speed-up
can be obtained (at the cost of false positive matches) by dilating the query
pattern instead of the database image. If there is no need to adjust the dilation
SE, the blur can be applied to the database in a preprocessing phase. Note also
that if both the query and the database were dilated, the distance between the
query elements and the corresponding database elements would grow, which
would gradually decrease the overlapping area.
Figure 2 illustrates the relations between the discussed methods. Our interest
is in blur erosion and blur ROF (underlined in the figure), because they can be
used to solve the approximate problems AP1 and AP2.
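A blurred rank order filter can be sketched as counting, at each translation, how many query notes land on a (blurred) database pixel and thresholding that count: alpha = 1 reduces to blur erosion, while alpha < 1 allows the partial matches needed for AP2. The FFT-based correlation and the exact threshold below are implementation choices of ours, not the authors' code.

import numpy as np
from scipy import ndimage, signal

def blur_rank_order_filter(database, query, alpha=0.75, pitch_blur=1, time_blur=2):
    R = np.ones((2 * pitch_blur + 1, 2 * time_blur + 1), dtype=bool)
    blurred = ndimage.binary_dilation(database, structure=R).astype(float)
    # Cross-correlation counts coinciding foreground pixels at each offset.
    counts = signal.fftconvolve(blurred, query[::-1, ::-1].astype(float), mode="valid")
    return counts >= alpha * query.sum() - 0.5    # -0.5 absorbs floating point noise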
BHMT*(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [A^C ⊖ (B2 ⊖ R2)],  (6)
where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. If B2 is the complement
of B1, we can write B = B1 and use the form
BHMT*(A, B, R1, R2) = [(A ⊕ R1) ⊖ B] ∩ [A^C ⊖ (B ⊕ R2)^C].  (7)
4 Experiments
The algorithms presented in this paper set new standards for finding approxima-
tive occurrences of a query pattern from a given database. There are no rivaling
algorithms in this sense, so we are not able to fairly compare the performance
of our algorithms to any existing algorithm. However, to give the reader a sense
of the real-life performance of these approximative algorithms, we compare the
running times of these to the existing, nonapproximative algorithms. Essentially
this means that we are comparing the performance of the algorithms being able
to solve AP1-AP3 to the ones that can solve only P1, P2 and P3 [14].
Fig. 3. The effect of changing the time resolution a) on blur erosion (on the left) and
b) on blur correlation (on the right)
[Figures: running times as a function of pattern size (notes) and of database size (thousands of notes), comparing P1, P2, MSM, Blur Erosion and Blur Correlation]
Experiments were run on both pointset and line segment types. For the experiments on the effects of varying
pattern sizes, we randomly selected 16 pieces out of the whole database, each
containing 16,000 notes. Five distinct queries were randomly chosen, and the me-
dian of their execution times was reported. When experimenting with varying
database sizes, we chose a pattern size of 128 notes.
The size of the images is also a major concern for the performance of our
algorithms. We represent the pitch dimension with 128 pixels, since the MIDI
pitch value range consists of 128 possible values. The time dimension, however,
poses additional problems: it is not intuitively clear what makes a good
time resolution. If we use too many pixel columns per second, the performance
of our algorithms is slowed down significantly. On the flip side of the coin,
not using enough pixels per second would result in a loss of information, as we
would no longer be able to distinguish separate notes in rapid passages. Before
running the actual experiments, we therefore decided to experiment on finding a suitable
time resolution efficiency-wise.
We tested the effect of increasing time resolution on both blur erosion and blur
correlation; the results can be seen in Figure 3. With blur erosion, we can
see a clear difference between the pointset representation and the line segment
representation: in the line segment case, the running time of blur erosion seems
to grow quadratically with the time resolution, while in the
pointset case the growth rate is clearly slower. This can be explained
by the fact that the execution time of erosion depends on the size of the query
foreground. In the pointset case, we still only mark the starting point of each
note, so only the SEs require extra space. In the line segment case, however, the
growth is clearly linear.
In the case of blur correlation, there seems to be nearly no effect whether the
input is in pointset or line segment form. The pointset and line segment curves
in the blur correlation figure coincide, so we depicted only one of them.
Looking at the results, one can note that the time usage of blur erosion begins
to grow quickly around 12 to 16 pixel columns per second. Considering that we
do not want the performance of our algorithms to suffer too much, and the fact
that we are deliberately discarding some nuances of information by blurring,
we were encouraged to set the time resolution as low as 12 pixel columns per
second. This time resolution was used in the further experiments.
[Figures: running times as a function of pattern size (notes) and of database size (thousands of notes), comparing P3, Blur Erosion and Blur Correlation]
Fig. 7. The subject used as a search pattern and the first approximate match
Fig. 8. A match found by blur erosion (a). An exact match found by both P1 and blur
erosion (b). This entry has too much variation even for blur erosion (c).
Our experiments also confirmed our claim that blur correlation handles jittering
better than P3. We were expecting this, and Figure 6 illustrates an idiomatic
case where P3 will not find a match but blur correlation will. In this case we
have a query pattern that is an excerpt of the database, with the distinction that
some of the notes have been tilted either time-wise or pitch-wise. Additionally,
one note has been split into two. Blur correlation finds a perfect match in this
case, whereas P3 cannot, unless the threshold for the total common length is
exceeded. We were expecting this kind of result, since intuitively P3 cannot
handle this kind of jittering as well as the morphological algorithms do.
Fig. 9. Some more developed imitations of the theme with their proportions of exactly
matching notes
In some of these imitations only about half of the notes match the original form of the theme exactly (see Figure 9).
Nevertheless, these imitations are fairly easily recognized visually and audibly.
Our blur-erosion algorithm found them all.
5 Conclusions
In this paper, we have combined existing image processing methods based on
mathematical morphology to construct a collection of new pattern matching
algorithms for symbolic music represented as binary images. Our aim was to gain
an improved error tolerance over the existing pointset-based and line-segment-
based algorithms introduced for related problems.
Our algorithms solve three existing music retrieval problems, P1, P2 and P3.
Our basic algorithm based on erosion solves the exact matching problem P1. To
successfully solve the other two, we needed to relax the requirement of exact
matches, which we did by applying a rank order filtering technique. Using this
relaxation technique, we can solve both the partial pointset matching problem
P2, and also the line segment matching problem P3. By introducing blurring in
the form of a preprocessing dilation, the error tolerance of these morphological al-
gorithms can be improved. That way the algorithms are able to tolerate jittering
in both the time and the pitch dimension.
Compared to the solutions of the non-approximate problems, our new
algorithms tend to be somewhat slower. However, they are still comparable
performance-wise, and actually even faster in some cases. As the most impor-
tant novelty of our algorithms is the added error tolerance given by blurring, we
think that the slowdown is rather restrained compared to the added usability
of the algorithms. We expect our error-tolerant methods to give better results
in real-world applications when compared to the rival algorithms. As future
work, we plan on researching and setting up a relevant ground truth, since without
a ground truth we cannot adequately measure the precision and recall of the
algorithms. Other future work could include investigating the use of greyscale
morphology to introduce more fine-grained control over approximation.
Acknowledgements
This work was partially supported by the Academy of Finland, grants #108547,
#118653, #129909 and #218156.
References
1. Barrera Hernández, A.: Finding an o(n2 log n) algorithm is sometimes hard. In:
Proceedings of the 8th Canadian Conference on Computational Geometry, pp.
289–294. Carleton University Press, Ottawa (1996)
2. Bloomberg, D., Maragos, P.: Generalized hit-miss operators with applications to
document image analysis. In: SPIE Conference on Image Algebra and Morpholog-
ical Image Processing, pp. 116–128 (1990)
3. Bloomberg, D., Vincent, L.: Pattern matching using the blur hit-miss transform.
Journal of Electronic Imaging 9(2), 140–150 (2000)
4. Clausen, M., Engelbrecht, R., Meyer, D., Schmitz, J.: Proms: A web-based tool for
searching in polyphonic music. In: Proceedings of the International Symposium on
Music Information Retrieval (ISMIR 2000), Plymouth, MA (October 2000)
5. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A
fast, randomised, maximal subset matching algorithm for document-level music
retrieval. In: Proceedings of the 7th International Conference on Music Informa-
tion Retrieval (ISMIR 2006), Victoria, BC, Canada, pp. 150–155 (2006)
6. Heijmans, H.: Mathematical morphology: A modern approach in image processing
based on algebra and geometry. SIAM Review 37(1), 1–36 (1995)
7. Hu, N., Dannenberg, R., Tzanetakis, G.: Polyphonic audio matching and alignment
for music retrieval. In: Proc. IEEE WASPAA, pp. 185–188 (2003)
8. Karvonen, M., Lemström, K.: Using mathematical morphology for geometric music
information retrieval. In: International Workshop on Machine Learning and Music
(MML 2008), Helsinki, Finland (2008)
9. Lemström, K.: Towards more robust geometric content-based music retrieval. In:
Proceedings of the 11th International Society for Music Information Retrieval Con-
ference (ISMIR 2010), Utrecht, pp. 577–582 (2010)
10. Lemström, K., Mikkilä, N., Mäkinen, V.: Filtering methods for content-based
retrieval on indexed symbolic music databases. Journal of Information Re-
trieval 13(1), 1–21 (2010)
11. Lubiw, A., Tanur, L.: Pattern matching in polyphonic music as a weighted geo-
metric translation problem. In: Proceedings of the 5th International Conference on
Music Information Retrieval (ISMIR 2004), Barcelona, pp. 289–296 (2004)
12. Romming, C., Selfridge-Field, E.: Algorithms for polyphonic music retrieval: The
hausdorff metric and geometric hashing. In: Proceedings of the 8th International
Conference on Music Information Retrieval (ISMIR 2007), Vienna, Austria (2007)
13. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Utrecht
University, Netherlands (2007)
14. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric algorithms for transposition
invariant content-based music retrieval. In: Proceedings of the 4th International
Conference on Music Information Retrieval (ISMIR 2003), Baltimore, MA, pp.
193–199 (2003)
15. Wiggins, G.A., Lemström, K., Meredith, D.: SIA(M)ESE: An algorithm for trans-
position invariant, polyphonic content-based music retrieval. In: Proceedings of the
3rd International Conference on Music Information Retrieval (ISMIR 2002), Paris,
France, pp. 283–284 (2002)
Melodic Similarity through Shape Similarity
1 Introduction
The problem of Symbolic Melodic Similarity, where musical pieces similar to a query
should be retrieved, has been approached from very different points of view [24][6].
Some techniques are based on string representations of music and edit distance
algorithms to measure the similarity between two pieces [17]. Later work has extended
this approach with other dynamic programming algorithms to compute global or
local alignments between the two musical pieces [19][11][12]. Other methods rely on
music representations based on n-grams [25][8][2], and yet others represent
music pieces as geometric objects, using different techniques to calculate the melodic
similarity based on the geometric similarity of the two objects. Some of these
geometric methods represent music pieces as sets of points in the pitch-time plane,
and then compute geometric similarities between these sets [26][23][7]. Others
represent music pieces as orthogonal polynomial chains crossing the set of pitch-time
points, and then measure the similarity as the minimum area between the two chains
[30][1][15].
In this paper we present a new model to compare melodic pieces. We adapted the
local alignment approach to work with n-grams instead of single notes, and the
corresponding substitution score function between n-grams was also adapted to take
into consideration a new geometric representation of musical sequences. In this
geometric representation, we model music pieces as curves in the pitch-time plane,
and compare them in terms of their shape similarity.
In the next section we outline several problems that a symbolic music retrieval
system should address, and then we discuss the general solutions given in the
literature to these requirements. Next, we introduce our geometric representation
model, which compares two musical pieces by their shape, and see how this model
addresses the requirements discussed. In section 5 we describe how we have
implemented our model, and in section 6 we evaluate it with the training and
evaluation test collections used in the MIREX 2005 Symbolic Melodic Similarity task
(for short, we will refer to these collections as Train05 and Eval05) [10][21][28].
Finally, we finish with conclusions and lines for further research. An appendix reports
more evaluation results at the end.
metadata or a simple traverse through the sequence. We argue that users without such
a strong musical background will be interested in the recognition of a certain pitch
contour, and such cases are much more troublesome because some measure of
melodic similarity has to be calculated. This is the case of query by humming
applications.
The tonality degrees used in both cases are the same, but the resultant notes are
not. Nonetheless, one would consider the second melody a version of the first one,
because they are the same in terms of pitch contour. Therefore, they should be
considered the same by a retrieval system, which should also consider possible
modulations where the key changes somewhere throughout the song.
Indeed, if this piece were played with a flute, only one voice could be performed,
even if some streaming effect were produced by changing tempo and timbre for two
voices to be perceived by a listener [16]. Therefore, a query containing only one voice
should match this piece in case that voice is similar enough to any of the three
marked in the figure.
whole score is played twice as fast, at 224 crotchets per minute. These two changes
result in exactly the same actual time.
On the other hand, it might also be considered a tempo of 56 crotchets per minute
and notes with half the duration. Moreover, the tempo can change somewhere in the
middle of the melody, and therefore change the actual time of each note afterwards.
Therefore, actual note lengths cannot be considered as the only horizontal measure,
because these three pieces would sound the same to any listener.
Even though the melodic perception does actually change, the rhythm does not,
and neither does the pitch contour. Therefore, they should be considered as virtually
the same, maybe with some degree of dissimilarity based on the tempo variation.
Variations like these are common and they should be considered as well, just
like the Pitch Variation problem, allowing approximate matches instead of just exact
ones.
Although the time signature of a performance is useful for other purposes such as
pattern search or score alignment, it seems to us that it should not be considered at all
when comparing two pieces melodically.
According to the Tempo Equivalence problem, actual time should be considered
rather than score time, since it would probably be easier for a regular user to provide
actual rhythm information. On the other hand, the Duration Equality problem requires
the score time to be used instead. Thus, it seems that both measures have to be taken
into account. The actual time is valuable for most users without a musical
background, while the score time might be more valuable for people who do have it.
However, when facing the Duration Variation problem it seems necessary to use
some sort of timeless model. The solution could be to compare both actual and score
time [11], or to use relative differences between notes, in this case with the ratio
between two notes’ durations [8]. Other approaches use a rhythmical framework to
represent note durations as multiples of a base score duration [2][19][23], which does
not meet the Tempo Equivalence problem and hence is not time scale invariant.
The same thing happens with the horizontal requirements: the Tempo Equivalence and
Duration Equality problems can be solved analytically, because they imply just a
linear transformation in the time dimension. For example, if the melody at the top of
Fig. 6 is defined with curve C(t) and the one in Fig. 7 is denoted with curve D(t), it
can easily be proved that C(2t) = D(t). Moreover, the Duration Variation problem
could be addressed analytically like the Pitch Variation problem, and the Time
Signature Equivalence problem is not an issue because the shape of the curve is
independent of the time signature.
Having musical pieces represented with curves, each one of them could be defined
with a polynomial of the form C(t) = a_n·t^n + a_{n-1}·t^{n-1} + … + a_1·t + a_0. The first derivative of
this polynomial measures how much the shape of the curve is changing at a particular
point in time (i.e., how the song changes). To measure the change of one curve with
respect to another, the area between the first derivatives could be used.
Note that a shift in pitch would mean just a shift in the a_0 term. As it turns out,
when calculating the first derivative of the curves this term is canceled, which is why
the vertical requirements are met: shifts in pitch are not reflected in the shape of the
curve, so they are not reflected in the first derivative either. Therefore, this
representation is transposition invariant.
The song is actually defined by the first derivative of its interpolating curve, C'(t).
The dissimilarity between two songs, say C(t) and D(t), would be defined as the area
between their first derivatives, measured with the integral over the absolute value of
their difference:
diff(C, D) = ∫ |C'(t) − D'(t)| dt
The representation with orthogonal polynomial chains also led to measuring
dissimilarity as the area between the curves [30][1]. However, such a representation
is not directly transposition invariant unless it uses pitch intervals instead of absolute
pitch values, and a more complex algorithm is needed to overcome this problem [15].
As orthogonal chains are not differentiable, this would be the indirect equivalent of
calculating the first derivative as we do.
This dissimilarity measurement based on the area between curves turns out to be a
metric function, because it has the following properties:
• Non-negativity, diff(C, D) ≥ 0: because the absolute value is never negative.
• Identity of indiscernibles, diff(C, D) = 0 ⇔ C = D: because, with the absolute
value, the only way to have no difference is with the exact same curve¹.
• Symmetry, diff(C, D) = diff(D, C): again, because the integral is over the
absolute value of the difference.
• Triangle inequality, diff(C, E) ≤ diff(C, D) + diff(D, E):
Therefore, many indexing and retrieval techniques, like vantage objects[4], could be
exploited if using this metric.
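As our own illustration of why the metric property matters (the helper names and the precomputed-distance layout are assumptions, not the indexing scheme of [4]), a vantage-object filter can use the triangle inequality to discard candidates without computing the full dissimilarity:

# Prune candidate songs using precomputed distances to a few vantage songs.
# By the triangle inequality, |diff(q, v) - diff(s, v)| is a lower bound on
# diff(q, s); if that bound already exceeds the search radius, s can be skipped.
def vantage_filter(query, songs, vantages, diff, radius):
    q_to_v = [diff(query, v) for v in vantages]
    candidates = []
    for song, song_to_v in songs:  # song_to_v: precomputed distances to the vantage songs
        lower_bound = max(abs(qv - sv) for qv, sv in zip(q_to_v, song_to_v))
        if lower_bound <= radius:  # cannot be pruned; verify with the real metric later
            candidates.append(song)
    return candidates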
The next issue to address is the interpolation method to use. The standard Lagrange
interpolation method, though simple, is known to suffer from Runge's phenomenon [3]:
as the number of points increases, the interpolating curve oscillates heavily, especially
at the beginning and the end of the curve. As such, a curve would be very different
from another one having just one more point at the end; their shapes would differ and
the dissimilarity metric would report a difference even though the two curves are
practically identical. Moreover, a very small difference in one of the points could
translate into an extreme variation in the overall curve, which would make it virtually
impossible to handle the Pitch and Duration Variation problems properly (see top of
Fig. 11).
¹ Actually, this means that the first derivatives are the same; the actual curves could still be
shifted. Nonetheless, this is the behavior we want.
5 Implementation
Geometric representations of music pieces are very intuitive, but they are not
necessarily easy to implement. We could follow the approach of moving one curve
towards the other looking for the minimum area between them [1][15]. However, this
approach is very sensitive to small differences in the middle of a song, such as
repeated notes: if a single note were added or removed from a melody, it would be
impossible to fully match the original melody from that note to the end. Instead, we
follow a dynamic programming approach to find an alignment between the two
melodies [19].
Various approaches for melodic similarity have applied editing distance algorithms
upon textual representations of musical sequences that assign one character to each
interval or each n-gram [8]. This dissimilarity measure has been improved in recent
years, and sequence alignment algorithms have proved to perform better than simple
editing distance algorithms [11][12]. Next, we describe the representation and
alignment method we use.
To practically apply our model, we followed a basic n-gram approach, where each n-
gram represents one span of the spline. The pitch of each note was represented as the
relative difference to the pitch of the first note in the n-gram, and the duration was
represented as the ratio to the duration of the whole n-gram. For example, an n-gram
of length 4 with absolute pitches 〈74, 81, 72, 76〉 and absolute durations 〈240, 480,
240, 720〉, would be modeled as 〈81-74, 72-74, 76-74〉 = 〈7, -2, 2〉 in terms of pitch
and 〈240, 480, 240, 720〉⁄1680 = 〈0.1429, 0.2857, 0.1429, 0.4286〉 in terms of
duration. Note that the first note is omitted in the pitch representation as it is always 0.
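A minimal sketch of this representation (our own helper, with assumed names) reproduces the example above:

# Relative pitch (with respect to the first note) and relative duration
# (with respect to the total n-gram duration) of an n-gram.
def ngram_representation(pitches, durations):
    total = sum(durations)
    rel_pitch = [p - pitches[0] for p in pitches[1:]]  # first value is always 0, so omitted
    rel_duration = [d / total for d in durations]
    return rel_pitch, rel_duration

print(ngram_representation([74, 81, 72, 76], [240, 480, 240, 720]))
# -> ([7, -2, 2], [0.1428..., 0.2857..., 0.1428..., 0.4285...])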
This representation is transposition invariant because a melody shifted in the pitch
dimension maintains the same relative pitch intervals. It is also time scale invariant
because the durations are expressed as their relative duration within the span, and so
they remain the same in the face of tempo and actual or score duration changes. This
is of particular interest for query by humming applications and unquantized pieces, as
small variations in duration would have negligible effects on the ratios.
We used Uniform B-Splines as the interpolation method [3]. This results in a
parametric polynomial function for each n-gram. In particular, an n-gram of length kn
results in a polynomial of degree kn-1 for the pitch dimension and a polynomial of
degree kn-1 for the time dimension. Because the actual representation uses the first
derivatives, each polynomial is actually of degree kn-2.
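The degrees stated here can be checked with a small sketch (our own illustration; the uniform knot vector and the use of scipy are assumptions, not necessarily the authors' implementation):

# For an n-gram of length kn, a uniform B-spline of degree kn-1 over the kn
# control points consists of a single polynomial span; its derivative has
# degree kn-2.
import numpy as np
from scipy.interpolate import BSpline

kn = 4
pitch_ctrl = np.array([0.0, 7.0, -2.0, 2.0])   # relative pitches, first note included as 0
knots = np.arange(2 * kn, dtype=float)          # uniform knot vector of length kn + (kn-1) + 1
spline = BSpline(knots, pitch_ctrl, k=kn - 1)   # degree kn-1 = 3
d_spline = spline.derivative()                  # degree kn-2 = 2
print(spline.k, d_spline.k)                     # -> 3 2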
We used the Smith-Waterman local alignment algorithm [20], with the two sequences
of overlapping spans as input, defined as in (2). Therefore, the input symbols to the
alignment algorithm are actually the parametric pitch and time functions of a span,
based on the above representation of n-grams. The edit operations we define for the
Smith-Waterman algorithm are as follows:
• Insertion: s(-, c). Adding a span c is penalized with the score −diff(c, ɸ(c)).
• Deletion: s(c, -). Deleting a span c is penalized with the score −diff(c, ɸ(c)).
• Substitution: s(c, d). Substituting a span c with d is penalized with −diff(c, d).
• Match: s(c, c). Matching a span c is rewarded with the score 2(kp μp + kt μt).
where ɸ(•) returns the null n-gram of • (i.e. an n-gram equal to • but with all pitch
intervals set to 0), and μp and μt are the mean differences calculated by diffp and difft
respectively over a random sample of 100,000 pairs of n-grams sampled from the set
of incipits in the Train05 collection.
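A condensed sketch of the alignment step under these operations follows (the function and parameter names span_diff, null_of and match_score are ours; in the actual system the input symbols are the span representations described above and the scores are the ones just defined):

# Smith-Waterman local alignment over two sequences of spans.
def smith_waterman(seq_a, seq_b, span_diff, null_of, match_score):
    n, m = len(seq_a), len(seq_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # DP matrix, zero-initialized
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, d = seq_a[i - 1], seq_b[j - 1]
            if c == d:                                        # match
                diag = H[i - 1][j - 1] + match_score
            else:                                             # substitution
                diag = H[i - 1][j - 1] - span_diff(c, d)
            up = H[i - 1][j] - span_diff(c, null_of(c))       # deletion
            left = H[i][j - 1] - span_diff(d, null_of(d))     # insertion
            H[i][j] = max(0.0, diag, up, left)
            best = max(best, H[i][j])
    return best  # score of the best local alignment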
We also normalized the dissimilarity scores returned by difft. From the results in
Table 1 it can be seen that pitch dissimilarity scores are between 5 and 7 times larger
than time dissimilarity scores. Therefore, the choice of kp and kt does not intuitively
reflect the actual weight given to the pitch and time dimensions. For instance, the
selection of kt=0.25, chosen in studies like [11], would result in an actual weight
between 0.05 and 0.0357. To avoid this effect, we normalized every time dissimilarity
score by multiplying it by a factor λ = μp / μt. As such, the score of the match operation
is actually defined as s(c, c) = 2μp(kp + kt), and the dissimilarity function defined in (3)
is actually calculated as diff(c, d) = kp diffp(c, d) + λ kt difft(c, d).
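In code form, this normalization amounts to the following small sketch (our own naming):

# Combined dissimilarity with the time component rescaled by lambda = mu_p / mu_t,
# and the corresponding match reward.
def combined_diff(diff_p, diff_t, k_p, k_t, mu_p, mu_t):
    lam = mu_p / mu_t
    return k_p * diff_p + lam * k_t * diff_t

def match_reward(k_p, k_t, mu_p):
    return 2 * mu_p * (k_p + k_t)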
6 Experimental Results²
We evaluated the model proposed with the Train05 and Eval05 test collections used
in the MIREX 2005 Symbolic Melodic Similarity Task [21][10], measuring the mean
Average Dynamic Recall score across queries [22]. Both collections consist of about
580 incipits and 11 queries each, with their corresponding ground truths. Each ground
truth is a list of all incipits similar to each query, according to a panel of experts, and
with groups of incipits considered equally similar to the query.
However, we have recently shown that these lists have inconsistencies whereby
incipits judged as equally similar by the experts are not in the same similarity group
and vice versa [28]. All these inconsistencies result in a very permissive evaluation
where a system could return incipits not similar to the query and still be rewarded for
it. Thus, results reported with these lists are actually overestimated, by as much as
12% in the case of the MIREX 2005 evaluation. We have proposed alternatives to
arrange the similarity groups for each query, proving that the new arrangements are
significantly more consistent than the original one, leading to a more robust
evaluation. The most consistent ground truth lists were those called Any-1 [28].
Therefore, we will use these Any-1 ground truth lists from this point on to evaluate
our model, as they offer more reliable results. Nonetheless, all results are reported in
an appendix as if using the original ground truths employed in MIREX 2005, called
All-2, for the sake of comparison with previous results.
² All system outputs and ground truth lists used in this paper can be downloaded from
http://julian-urbano.info/publications/
First, we calculated the mean dissimilarity scores μp and μt for each n-gram length kn,
according to diffp and difft over a random sample of 100,000 pairs of n-grams. Table 1
lists the results. As mentioned, the pitch dissimilarity scores are between 5 and 7
times larger than the time dissimilarity scores, suggesting the use of the normalization
factor λ defined above.
Table 1. Mean and standard deviation of the diffp and difft functions applied upon a random
sample of 100,000 pairs of n-grams of different sizes
kn μp σp μt σt λ = μp / μt
3 2.8082 1.6406 0.5653 0.6074 4.9676
4 2.5019 1.6873 0.494 0.5417 5.0646
5 2.2901 1.4568 0.4325 0.458 5.2950
6 2.1347 1.4278 0.3799 0.3897 5.6191
7 2.0223 1.3303 0.2863 0.2908 7.0636
There also appears to be a negative correlation between the n-gram length and the
dissimilarity scores. This is caused by the degree of the polynomials defining the
splines: high-degree polynomials fit the points more smoothly than low-degree ones.
Polynomials of low degree tend to wiggle more, and so their derivatives are more
pronounced and lead to larger areas between curves.
6.2 Evaluation with the Train05 Test Collection, Any-1 Ground Truth Lists
The experimental design results in 55 trials for the 5 different levels of kn and the 11
different levels of kt. All these trials were performed with the Train05 test collection,
ground truths aggregated with the Any-1 function [28]. Table 2 shows the results.
In general, large n-grams tend to perform worse. This could probably be explained
by the fact that large n-grams define the splines with smoother functions, and the
differences in shape may be too small to discriminate perceptually relevant musical
differences. However, kn=3 seems to be the exception (see Fig. 12). This is probably
caused by the extremely low degree of the derivative polynomials. N-grams of length
kn=3 result in splines defined with polynomials of degree 2, which are then
differentiated and result in polynomials of degree 1. That is, they are just straight
lines, and so a small difference in shape can turn into a relatively large dissimilarity
score when measuring the area.
Overall, kn=4 and kn=5 seem to perform the best, although kn=4 is more stable
across levels of kt. In fact, kn=4 and kt=0.6 obtain the best score, 0.7215. This result
agrees with other studies where n-grams of length 4 and 5 were also found to perform
better [8].
Table 2. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.
kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.6961 0.7067 0.7107 0.7106 0.7102 0.7109 0.7148 0.711 0.7089 0.7045 0.6962
4 0.7046 0.7126 0.7153 0.7147 0.7133 0.72 0.7215 0.7202 0.7128 0.7136 0.709
5 0.7093 0.7125 0.7191 0.72 0.7173 0.7108 0.704 0.6978 0.6963 0.6973 0.6866
6 0.714 0.7132 0.7115 0.7088 0.7008 0.693 0.6915 0.6874 0.682 0.6765 0.6763
7 0.6823 0.6867 0.6806 0.6747 0.6538 0.6544 0.6529 0.6517 0.6484 0.6465 0.6432
Fig. 12. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1
Moreover, this combination of parameters obtains a mean ADR score of
0.8039 when evaluated with the original All-2 ground truths (see Appendix). This is
the best score ever reported for this collection.
6.3 Evaluation with the Eval05 Test Collection, Any-1 Ground Truth Lists
In a fair evaluation scenario, we would use the previous experiment to train our
system and choose the values of kn and kt that seem to perform the best (in particular,
kn=4 and kt=0.6). Then, the system would be run and evaluated with a different
collection to assess the external validity of the results and try to avoid overfitting to
the training collection. For the sake of completeness, here we show the results for all
55 combinations of the parameters with the Eval05 test collection used in MIREX
2005, again aggregated with the Any-1 function [28]. Table 3 shows the results.
Unlike the previous experiment with the Train05 test collection, in this case the
variation across levels of kt is smaller (the mean standard deviation is twice as large
in Train05), indicating that the use of the time dimension does not provide better
results overall (see Fig. 13). This is probably caused by the particular queries in each
collection: seven of the eleven queries in Train05 start with long rests, while this
happens for only three of the eleven queries in Eval05.
Table 3. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.
kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.6522 0.6601 0.6646 0.6612 0.664 0.6539 0.6566 0.6576 0.6591 0.6606 0.662
4 0.653 0.653 0.6567 0.6616 0.6629 0.6633 0.6617 0.6569 0.65 0.663 0.6531
5 0.6413 0.6367 0.6327 0.6303 0.6284 0.6328 0.6478 0.6461 0.6419 0.6414 0.6478
6 0.6269 0.6251 0.6225 0.6168 0.6216 0.6284 0.6255 0.6192 0.6173 0.6144 0.6243
7 0.5958 0.623 0.6189 0.6163 0.6162 0.6192 0.6215 0.6174 0.6148 0.6112 0.6106
Fig. 13. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1
In our model, rests are ignored and their duration is added to the next note, so the
effect of the time dimension is larger when the queries themselves contain rests.
Likewise, large n-grams tend to perform worse. In this case though, n-grams of
length kn=3 and kn=4 perform the best. The most effective combination is kn=3 and
kt=0.2, with a mean ADR score of 0.6646. However, kn=4 and kt=0.5 is very close,
with a mean ADR score of 0.6633. Therefore, based on the results of the previous
experiment and the results in this one, we believe that kn=4 and kt∈[0.5, 0.6] are the
best parameters overall.
It is also important to note that none of the 55 combinations we ran results in a mean
ADR score below 0.594, which was the highest score achieved in the actual
MIREX 2005 evaluation with the Any-1 ground truths [28]. Therefore, our systems
would have ranked first had they participated.
References
1. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Nuñez, Y., Rappaport, D.,
Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic Similarity.
Computer Music Journal 30(3), 67–76 (2006)
2. Bainbridge, D., Dewsnip, M., Witten, I.H.: Searching Digital Music Libraries. Information
Processing and Management 41(1), 41–56 (2005)
3. de Boor, C.: A Practical Guide to Splines. Springer, Heidelberg (2001)
4. Bozkaya, T., Ozsoyoglu, M.: Indexing Large Metric Spaces for Similarity Search Queries.
ACM Transactions on Database Systems 24(3), 361–404 (1999)
5. Byrd, D., Crawford, T.: Problems of Music Information Retrieval in the Real World.
Information Processing and Management 38(2), 249–272 (2002)
6. Casey, M.A., Veltkamp, R.C., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-
Based Music Information Retrieval: Current Directions and Future Challenges.
Proceedings of the IEEE 96(4), 668–695 (2008)
7. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A Fast,
Randomised, Maximal Subset Matching Algorithm for Document-Level Music Retrieval.
In: International Conference on Music Information Retrieval, pp. 150–155 (2006)
8. Doraisamy, S., Rüger, S.: Robust Polyphonic Music Retrieval with N-grams. Journal of
Intelligent Systems 21(1), 53–70 (2003)
9. Downie, J.S.: The Scientific Evaluation of Music Information Retrieval Systems:
Foundations and Future. Computer Music Journal 28(2), 12–23 (2004)
10. Downie, J.S., West, K., Ehmann, A.F., Vincent, E.: The 2005 Music Information Retrieval
Evaluation Exchange (MIREX 2005): Preliminary Overview. In: International Conference
on Music Information Retrieval, pp. 320–323 (2005)
11. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for Evaluating
Similarity Between Monophonic Musical Sequences. Journal of New Music
Research 36(4), 267–279 (2007)
12. Hanna, P., Robine, M., Ferraro, P., Allali, J.: Improvements of Alignment Algorithms for
Polyphonic Music Retrieval. In: International Symposium on Computer Music Modeling
and Retrieval, pp. 244–251 (2008)
13. Isaacson, E.U.: Music IR for Music Theory. In: The MIR/MDL Evaluation Project White
paper Collection, 2nd edn., pp. 23–26 (2002)
14. Kilian, J., Hoos, H.H.: Voice Separation — A Local Optimisation Approach. In:
International Symposium on Music Information Retrieval, pp. 39–46 (2002)
15. Lin, H.-J., Wu, H.-H.: Efficient Geometric Measure of Music Similarity. Information
Processing Letters 109(2), 116–120 (2008)
16. McAdams, S., Bregman, A.S.: Hearing Musical Streams. In: Roads, C., Strawn, J. (eds.)
Foundations of Computer Music, pp. 658–598. The MIT Press, Cambridge (1985)
17. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the
Humanities 24(3), 161–175 (1990)
18. Selfridge-Field, E.: Conceptual and Representational Issues in Melodic Comparison.
Computing in Musicology 11, 3–64 (1998)
19. Smith, L.A., McNab, R.J., Witten, I.H.: Sequence-Based Melodic Comparison: A
Dynamic Programming Approach. Computing in Musicology 11, 101–117 (1998)
20. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal
of Molecular Biology 147(1), 195–197 (1981)
21. Typke, R., den Hoed, M., de Nooijer, J., Wiering, F., Veltkamp, R.C.: A Ground Truth for
Half a Million Musical Incipits. Journal of Digital Information Management 3(1), 34–39
(2005)
22. Typke, R., Veltkamp, R.C., Wiering, F.: A Measure for Evaluating Retrieval Techniques
based on Partially Ordered Ground Truth Lists. In: IEEE International Conference on
Multimedia and Expo., pp. 1793–1796 (2006)
23. Typke, R., Veltkamp, R.C., Wiering, F.: Searching Notated Polyphonic Music Using
Transportation Distances. In: ACM International Conference on Multimedia, pp. 128–135
(2004)
24. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey of Music Information Retrieval
Systems. In: International Conference on Music Information Retrieval, pp. 153–160 (2005)
25. Uitdenbogerd, A., Zobel, J.: Melodic Matching Techniques for Large Music Databases. In:
ACM International Conference on Multimedia, pp. 57–66 (1999)
26. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric Algorithms for Transposition
Invariant Content-Based Music Retrieval. In: International Conference on Music
Information Retrieval, pp. 193–199 (2003)
27. Urbano, J., Lloréns, J., Morato, J., Sánchez-Cuadrado, S.: MIREX 2010 Symbolic Melodic
Similarity: Local Alignment with Geometric Representations. Music Information Retrieval
Evaluation eXchange (2010)
28. Urbano, J., Marrero, M., Martín, D., Lloréns, J.: Improving the Generation of Ground
Truths based on Partially Ordered Lists. In: International Society for Music Information
Retrieval Conference, pp. 285–290 (2010)
29. Urbano, J., Morato, J., Marrero, M., Martín, D.: Crowdsourcing Preference Judgments for
Evaluation of Music Similarity Tasks. In: ACM SIGIR Workshop on Crowdsourcing for
Search Evaluation, pp. 9–16 (2010)
30. Ó Maidín, D.: A Geometrical Algorithm for Melodic Difference. Computing in
Musicology 11, 65–72 (1998)
Appendix

Table 4. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.
kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.7743 0.7793 0.788 0.7899 0.7893 0.791 0.7936 0.7864 0.7824 0.777 0.7686
4 0.7836 0.7899 0.7913 0.7955 0.7946 0.8012 0.8039 0.8007 0.791 0.7919 0.7841
5 0.7844 0.7867 0.7937 0.7951 0.7944 0.7872 0.7799 0.7736 0.7692 0.7716 0.7605
6 0.7885 0.7842 0.7891 0.7851 0.7784 0.7682 0.7658 0.762 0.7572 0.7439 0.7388
7 0.7598 0.7573 0.7466 0.7409 0.7186 0.7205 0.7184 0.7168 0.711 0.7075 0.6997
Table 5. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.
kn kt=0 kt=0.1 kt=0.2 kt=0.3 kt=0.4 kt=0.5 kt=0.6 kt=0.7 kt=0.8 kt=0.9 kt=1
3 0.7185 0.714 0.7147 0.7116 0.712 0.7024 0.7056 0.7067 0.708 0.7078 0.7048
4 0.7242 0.7268 0.7291 0.7316 0.7279 0.7282 0.7263 0.7215 0.7002 0.7108 0.7032
5 0.7114 0.7108 0.6988 0.6958 0.6942 0.6986 0.7109 0.7054 0.6959 0.6886 0.6914
6 0.708 0.7025 0.6887 0.6693 0.6701 0.6743 0.6727 0.6652 0.6612 0.6561 0.6636
7 0.6548 0.6832 0.6818 0.6735 0.6614 0.6594 0.6604 0.6552 0.6525 0.6484 0.6499
It can also be observed that the results would again be overestimated by as much as
11% in the case of Train05 and as much as 13% in Eval05, in contrast with the
maximum 12% observed with the systems that participated in the actual MIREX 2005
evaluation.
Content-Based Music Discovery
Dirk Schönfuß
1 Introduction
The way music is consumed today has been changed dramatically by its increasing
availability in digital form. Online music shops have replaced traditional stores and
music collections are increasingly kept on electronic storage systems and mobile
devices instead of physical media on a shelf. It has become much faster and more
comfortable to find and acquire music of a known artist. At the same time it has
become more difficult to find one’s way in the enormous range of music that is
offered commercially, find music according to one’s taste or even manage one’s own
collection.
Young people today have music collections with an average size of 8,159 tracks [1],
and the iTunes music store now offers more than 10 million tracks for sale. Long-tail
sales are low, which is illustrated by the fact that only 1% of the catalog tracks generate
80% of sales [2][3]. A similar effect can also be seen in the usage of private music
collections. According to our own studies, only a few people actively work with
manual playlists, because they consider this too time-consuming or they have simply
forgotten which music they actually possess.
This is where music recommendation technology comes in: The similarity of two
songs can be mathematically calculated based on their musical attributes. Thus, for
each song a ranked list of similar songs from the catalogue can be generated.
Editorial, user-generated or content-based data derived directly from the audio signal
can be used.
Editorial data allows a very thorough description of the musical content, but this
manual process is expensive and time-consuming and will only ever cover a
small percentage of the available music. Recommendations based on user data have
become very popular through vendors such as Amazon (“People who bought this item
also bought …”) or Last.FM. However, this approach suffers from a cold-start
problem and from its strong focus on popular content.
Signal-based recommenders are not affected by popularity, sales rank or user
activity. They extract low-level features directly from the audio signal. This also
offers additional advantages such as being able to work without a server connection
and being able to process any music file even if it has not been published anywhere.
However, signal-based technology alone misses the socio-cultural aspect which is not
present in the audio signal and it also cannot address current trends or lyrics content.
Fig. 1. The mufin music recommender combines audio features inside a song model and
semantic musical attributes using a music ontology. Additionally, visualization coordinates for
the mufin vision sound galaxy are generated during the music analysis process.
Mufin’s complete music analysis process is fully automated. The technology has
already proven its practical application in usage scenarios with more than 9 million
tracks. Mufin's technology is available for different platforms including Linux,
MacOS X, Windows and mobile platforms. Additionally, it can also be used via web
services.
2 Mufin Vision
Common, text-based attributes such as title or artist are not suitable for keeping track
of a large music collection, especially if the user is not familiar with every item in the
collection. Songs which belong together from a sound perspective may appear very
far apart in lists sorted by metadata. Additionally, only a limited number of
songs will fit onto the screen, preventing the user from actually getting an overview of
his collection.
mufin vision has been developed with the goal of offering easy access to music
collections. Even if there are thousands of songs, the user can easily find his way
around the collection since he can learn where to find music with a certain
characteristic. By looking at the concentration of dots in an area he can immediately
assess the distribution of the collection and zoom into a section to get a closer look.
The mufin vision 3D sound galaxy displays each song as a dot in a coordinate
system. The x, y and z axes, as well as the size and color of the dots, can be assigned to
different musical criteria such as tempo, mood, instrumentation or type of singing
voice; even metadata such as release date or song duration can be used. Using the axis
configuration, the user can explore his collection the way he wants and make relations
between different songs visible. As a result, it becomes much easier to find music
fitting a mood or occasion.
Mufin vision premiered in the mufin player PC application but it can also be used
on the web and even on mobile devices. The latest version of the mufin player 1.5
allows the user to control mufin vision using a multi-touch display.
Fig. 2. Both songs are by the same artist. However, “Brothers in arms” is a very calm ballad
with sparse instrumentation while “Sultans of swing” is a rather powerful song with a fuller
sound spectrum. The mufin vision sound galaxy reflects that difference since it works on song
level instead of an artist or genre level.
Fig. 3. The figure displays a playlist in which the entries are connected by lines. One can see
that although the songs may be similar as a whole, their musical attributes vary over the course
of the playlist.
3 Further Work
The mufin player PC application offers a database view of the user’s music collection
including filtering, searching and sorting mechanisms. However, instead of only using
metadata such as artist or title for sorting, the mufin player can also sort any list by
similarity to a selected seed song.
Additionally, the mufin player offers an online storage space for a user’s music
collection. This protects the user against data loss and allows him to stream his music
online and listen to it from anywhere in the world.
Furthermore, mufin works together with the German National Library in order to
establish a workflow for the protection of our cultural heritage. The main contribution
of mufin is the fully automatic annotation of the music content and the provision of
descriptive tags for the library’s ontology. Based on technology by mufin and its
partners, a semantic multimedia search demonstration was presented at IBC 2009 in
Amsterdam.
References
1. Bahanovich, D., Collopy, D.: Music Experience and Behaviour in Young People.
University of Hertfordshire, UK (2009)
2. Celma, O.: Music Recommendation and Discovery in the Long Tail. PhD-Thesis,
Universitat Pompeu Fabra, Spain (2008)
3. Nielsen Soundscan: State of the Industry (2007), http://www.narm.com/
2008Conv/StateoftheIndustry.pdf (July 22, 2009)
Author Index