Concatenative Sound Synthesis - The Early Years

Diemo Schwarz
Ircam – Centre Pompidou
1, place Igor-Stravinsky, 75003 Paris, France
http://www.ircam.fr/anasyn/schwarz http://concatenative.net
[email protected]
now the first commercial products available (3.2.4, 3.2.5), and, last but not least, ICMC 2004 saw the first musical pieces using concatenative synthesis (3.4.5).

Section 4 finally gives some of the most urgent problems to be tackled for the further development of concatenative synthesis.

1.2. Technical Overview

Any concatenative synthesis system performs the tasks illustrated in figure 2, sometimes implicitly. This list of tasks will serve later for our taxonomy of systems in section 3.
[Figure 1. Hypothesis of high-level synthesis: the relations between the score and the produced sound in the case of performing an instrument, and the synthesis target and the unit descriptors in the case of concatenative data-driven synthesis, shown on their respective levels of representation of musical information (knowledge level and signal level).]

[Figure 2. The tasks of a concatenative synthesis system (diagram): analysis of the recorded sound by segmentation, descriptor extraction, and temporal modeling into a database of sound units; selection by lookup of units matching the target; transformation and concatenation for synthesis.]
automatic methods, but can also be given as external metadata, or be supplied by the user, e.g. for categorical descriptors or for subjective perceptual descriptors (e.g. a “glassiness” value or an “anxiousness” level could be manually attributed to units).

For the time-varying dynamic descriptors, temporal modeling reduces the evolution of the descriptor value over the unit to a fixed-size vector of values characterizing this evolution. Usually, only the mean value is used, but some systems go further and store range, slope, min, max, attack, release, modulation, and spectrum of the descriptor curve.
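To make the modeling step concrete, here is a minimal sketch in Python of how a sampled descriptor curve could be condensed into such a fixed-size characterisation vector (all names are our own illustration, not taken from any of the systems discussed):

```python
import numpy as np

def temporal_model(curve: np.ndarray) -> np.ndarray:
    """Condense a sampled descriptor curve (one value per analysis frame)
    into a fixed-size vector: mean, range, slope, min, max.  A sketch;
    real systems may additionally store attack, release, modulation,
    and the spectrum of the curve."""
    t = np.arange(len(curve))
    # linear trend of the descriptor over the unit
    slope = np.polyfit(t, curve, 1)[0] if len(curve) > 1 else 0.0
    return np.array([curve.mean(),
                     curve.max() - curve.min(),  # range
                     slope,
                     curve.min(),
                     curve.max()])
```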
1.2.2. Database

Source file references, units, unit descriptors, and the relationships between them are stored in a database. The subset of the database that is preselected for one particular synthesis is called the corpus. Often, the database is implicitly constituted by a collection of files. More rarely, a (relational or other) database management system is used, which can run locally or on a server. Internet sound databases with direct access to sounds and descriptors are beginning to make their appearance, e.g. with the freesound project (see section 4.3).
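Purely as an illustration of the entities such a database holds (no schema is prescribed by any of the systems discussed; all names below are invented), the stored relationships could be sketched as:

```python
from dataclasses import dataclass, field

@dataclass
class SourceFile:
    """A reference to an analysed sound file."""
    path: str
    sample_rate: int

@dataclass
class Unit:
    """A segment of a source file together with its unit descriptors."""
    source: SourceFile
    start: float     # seconds into the source file
    duration: float
    descriptors: dict[str, float] = field(default_factory=dict)

def make_corpus(units, predicate):
    """The corpus is the subset of database units preselected for one
    particular synthesis."""
    return [u for u in units if predicate(u)]
```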
1.2.3. Target

The target is specified as a sequence of target units with their desired descriptor characteristics. Usually, only a subset of the available database descriptors is given. The unspecified descriptors do not influence the selection directly, but can, however, be used to stipulate continuity via the concatenation distance (see section 1.2.6 below). The target can either be generated from a symbolic score (expressed e.g. in notes or directly in segments plus descriptors), or analysed from an audio score (using the same segmentation and analysis methods as for the source sounds).

1.2.4. Selection

[…] the last selected unit and all matching candidates for the following unit is considered. Both local possibilities can be seen as a simplified form of the path-search unit selection algorithm, which still uses the same framework of distance functions, presented in its most general formulation in the following.

1.2.5. Target Distance

The target distance $C^t$ corresponds to the perceptual similarity of the database unit $u_i$ to the target unit $t_\tau$. It is given as a sum of $p$ weighted individual descriptor distance functions $C_k^t$ as:

$$C^t(u_i, t_\tau) = \sum_{k=1}^{p} w_k^t \, C_k^t(u_i, t_\tau) \qquad (1)$$

To favour the selection of units out of the same context in the database as in the target, the context distance $C^x$ considers a sliding context in a range of $r$ units around the current unit, with weights $w_j^x$ decreasing with distance $j$:

$$C^x(u_i, t_\tau) = \sum_{j=-r}^{r} w_j^x \, C^t(u_{i+j}, t_{\tau+j}) \qquad (2)$$

Mostly, a Euclidean distance normalised by the standard deviation is used, and $r$ is zero. Some descriptors need specialised distance functions. Symbolic descriptors, e.g. phoneme class, require a lookup table of distances.

1.2.6. Concatenation Distance

The concatenation distance $C^c$ expresses the discontinuity introduced by concatenating the units $u_i$ and $u_j$ from the database. It is given by a weighted sum of $q$ descriptor concatenation distance functions $C_k^c$:

$$C^c(u_i, u_j) = \sum_{k=1}^{q} w_k^c \, C_k^c(u_i, u_j) \qquad (3)$$
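A direct, illustrative transcription of equations (1)–(3), assuming units and targets are represented as dictionaries of scalar descriptor values and using the normalised difference mentioned above as the per-descriptor distance (all names are ours):

```python
def target_distance(unit, target, weights, sigma):
    """Equation (1): weighted sum over descriptors k of the distances
    C_k^t, here |u_k - t_k| normalised by the corpus-wide standard
    deviation sigma[k] of descriptor k."""
    return sum(w * abs(unit[k] - target[k]) / sigma[k]
               for k, w in weights.items())

def context_distance(units, targets, i, tau, weights, sigma, ctx_weights):
    """Equation (2): sliding context around the current position;
    ctx_weights maps the offset j (from -r to r) to a weight w_j^x
    decreasing with |j|."""
    return sum(wj * target_distance(units[i + j], targets[tau + j],
                                    weights, sigma)
               for j, wj in ctx_weights.items()
               if 0 <= i + j < len(units) and 0 <= tau + j < len(targets))

def concatenation_distance(u, v, weights, sigma):
    """Equation (3): weighted sum of per-descriptor concatenation
    distance functions C_k^c, with the same normalised difference
    standing in for each C_k^c."""
    return sum(w * abs(u[k] - v[k]) / sigma[k]
               for k, w in weights.items())
```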
1.2.8. Unit Selection by Constraint Solving

Applying the formalism of constraint satisfaction to unit selection makes it possible to express, in a flexible way, musical desiderata in addition to the target match, such as avoiding the repetition of units, or excluding certain units from the selection. It was first proposed for music program generation by Pachet, Roy, and Cazaly (2000), see section 2.4, and for data-driven concatenative musical synthesis by Zils and Pachet (2001) in the Musical Mosaicing system described in section 3.7.3.

It is based on the adaptive local search algorithm described in detail in (Codognet & Diaz, 2001; Truchet, Assayag, & Codognet, 2001), which runs iteratively until a satisfactory result is achieved or a certain number of iterations is reached. Constraints are here given by an error function, which allows us to easily express the unit selection algorithm as a constraint satisfaction problem (CSP) using the target and concatenation distances between units.
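A deliberately simplified sketch of such an iterative local search (not the actual algorithm of Codognet and Diaz; the error function and all names are left abstract and hypothetical):

```python
import random

def local_search_selection(candidates, cost, max_iter=1000, good_enough=0.0):
    """Start from a random unit sequence; repeatedly pick a position and
    replace its unit by the candidate that lowers the overall error most,
    until the result is satisfactory or the iteration budget is spent.
    `candidates[i]` is the list of database units allowed at position i;
    `cost(seq)` is the error function combining target and concatenation
    distances plus any additional constraints (e.g. penalising repeats)."""
    seq = [random.choice(c) for c in candidates]
    for _ in range(max_iter):
        if cost(seq) <= good_enough:
            break
        i = random.randrange(len(seq))
        seq[i] = min(candidates[i],
                     key=lambda u: cost(seq[:i] + [u] + seq[i + 1:]))
    return seq
```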
1.2.9. Synthesis

The final waveform synthesis is done by concatenation of the selected units with a short cross-fade, possibly applying transformations, for instance altering pitch or loudness. Depending on the application, the selected units are placed at the times given by the target (musical or rhythmic synthesis), or are concatenated with their natural duration (free synthesis, speech or texture synthesis).
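For illustration, a minimal sketch of concatenation with a linear cross-fade (real systems would rather use equal-power fades and apply the transformations before joining; `fade` and all names are our own):

```python
import numpy as np

def concatenate(units, fade=64):
    """Join the selected units' waveforms end to end with a linear
    cross-fade of `fade` samples (each unit is assumed to be longer
    than the fade)."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        # overlap the tail of the output with the head of the next unit
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out
```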
2. RELATED TOPICS

Concatenative synthesis is at the intersection of many fields of research, such as music information retrieval (MIR), database technology, real-time and interactive methods, digital signal processing (DSP), sound synthesis models, musical modeling, classification, and perception.

We could see concatenative synthesis as one of three variants of content-based retrieval, depending on what is queried and how it is used. When just one sound is queried, we are in the realm of descriptor- or similarity-based sound selection. Superposing retrieved sounds to satisfy a certain outcome is the topic of automatic orchestration tools (Hummel, 2005). Finally, sequencing retrieved sound snippets is our topic of concatenative synthesis.

Other closely related research topics, which share many of the same basic questions and problems, are given in the following.

2.1. Speech Synthesis

Research in musical synthesis is heavily influenced by research in speech synthesis, which can be said to be roughly 10 years ahead. Concatenative unit selection speech synthesis from large databases (Hunt & Black, 1996) is used in a great number of Text-to-Speech systems for waveform generation (Prudon, 2003). Its introduction resulted in a considerable gain in quality of the synthesized speech over rule-based parametric synthesis systems in terms of naturalness and intelligibility. Unit selection algorithms attempt to estimate the appropriateness of a particular database speech unit using linguistic features predicted from a given text to be synthesized. The units can be of any length (non-uniform unit selection), from sub-phonemes to whole phrases, and are not limited to diphones or triphones.

Although concatenative sound synthesis is quite similar to concatenative speech synthesis and shares many concepts and methods, the two have different goals. Even a very rough comparison between musical and speech synthesis brings to mind some profound differences, which make the transfer of concatenative data-driven synthesis techniques from speech to music non-trivial:

• Speech is a priori clustered into phonemes. A musical analogue for this phonemic identity is pitch classes, which are applicable for tonal music, but in general, no a priori clustering can be presupposed.

• In speech, the time position of synthesized units is intrinsically given by the required duration of the selected units. In music, precise time-points have to be hit if we want to keep the rhythm.

• In speech synthesis, intelligibility and naturalness are of prime interest, and the synthesised speech is often limited to a “normal” informative mode. Musical creation, however, is based on artistic principles, uses many modes of expressivity, and needs to experiment. Therefore, creative and interactive use of the system should be possible with any database of sounds, any descriptors, and a flexible expression of the target for the selection.

2.2. Singing Voice Synthesis

Concatenative singing voice synthesis occupies an intermediate position between speech and sound synthesis, although the methods used are most often closer to speech synthesis (concatenative speech synthesis techniques are directly used for singing voice synthesis in Burcas (http://www.ling.lu.se/persons/Marcusu/music/burcas) and Flinger (http://www.cslu.ogi.edu/tts/flinger), and abused in http://www.silexcreations.com/melissa), with the limitation of specifically recorded fixed inventories, such as in the Lyricos system (Macon et al., 1997a, 1997b), the work by Lomax (1996), and the recent system developed by Bonada et al. (2001). There is one notable exception (Meron, 1999), where an automatically constituted large unit database is used.

See (Rodet, 2002) for an up-to-date overview of current research in singing voice synthesis, which is out of the scope of this article.

This recent spread of data-driven singing voice synthesis methods based on unit selection follows their success in speech synthesis, and lets us anticipate a coming leap in quality and naturalness of the singing voice. Regarding the argument of rule-based vs. data-driven singing voice synthesis, Rodet (2002) notes that:

    Clearly, the units intrinsically contain the influence of an implicit set of rules applied by the singer with all his training, talent and musical skill. The unit selection and concatenation method is thus a way to replace a large and complicated set of rules by implicit rules from the best performers, and it is often called a data-driven concatenative synthesis.
2.3. Content-Based Processing

Content-based processing is a new paradigm in digital audio processing that is based on symbolic or high-level manipulations of elements of a sound, rather than on signal processing alone (Amatriain et al., 2003). Lindsay, Parkes, and Fitzgerald (2003) propose context-sensitive effects that are more aware of the structure of the sound than current systems, by utilising content descriptions such as those enabled by MPEG-7 (Thom, Purnhagen, Pfeiffer, & MPEG Audio Subgroup, 1999; Hunter, 1999). Jehan (2004) works on object-segmentation and perception-based description of audio material and then performs manipulations of the audio in terms of its musical structure. The Song Sampler (Aucouturier, Pachet, & Hanappe, 2004) is a system which automatically samples parts of a song and assigns them to the keys of a MIDI keyboard to be played with by a user.

2.4. Music Selection

The larger problem of music selection from a catalog shares some aspects with selection-based sound synthesis. Here, the user wants to select a sequence of songs (a compilation or playlist) according to his taste and a desired evolution of high-level features from one song to the next, e.g. increasing tempo and perceived energy. The problem is well described in (Pachet et al., 2000), and an innovative solution based on constraint satisfaction is proposed, which ultimately inspired the use of constraints for sound synthesis in (Zils & Pachet, 2001), see section 3.7.3.

Other music retrieval systems approach the problematic of selection: the Musescape music browser (Tzanetakis, 2003) works with an intuitive and space-saving interface by specifying high-level musical descriptors (tempo, genre, year) on sliders. The system then selects in real time musical excerpts that match the desired descriptors.

3. TAXONOMY

Approaches to musical sound synthesis that are somehow data-driven and concatenative can be found throughout history. The earlier uses are usually not identified as such, but the brief discussion in this section argues that they can be seen as instances of fixed-inventory or manual concatenative synthesis. I hope to show that all these approaches are very closely related to, or can sometimes even be seen as a special case of, the general formulation of concatenative synthesis in section 1.2.

Table 1 lists in chronological order all the methods for concatenative musical sound synthesis that will be discussed in the following, proposing several properties for comparison. We can order these methods according to two main aspects, which combined indicate the level of “data-drivenness” of a method. They form the axes of the diagram in figure 3, the abscissa indicating the degree of structuredness of information obtained by analysis of the source sounds and the metadata, and the ordinate the degree of automation of the selection. Further aspects expressed in the diagram are the inclusion of concatenation quality in the selection, and real-time capabilities.

Groups of similar approaches emanate clearly from this diagram; they will be discussed in the following seven sub-sections, going from left to right and bottom to top through the diagram.

3.1. Group 1: Manual Approaches

These historical approaches to musical composition use selection by hand with completely subjective manual analysis (Musique Concrète, Plunderphonics) or based on given tempo and character analysis (phrase sampling). It is worth noting that these approaches are the only ones described here that aim, besides sequencing, also at layering the selected sounds.

For musical sound synthesis (leaving aside the existing attempts at collage-type sonic creations), we’ll start by shedding a little light on some preceding synthesis techniques, starting from the very beginning, when recorded sound became available for manipulation:

3.1.1. Musique Concrète and Early Electronic Music (1948)

Going very far back, and extending the term far beyond reason, “concatenative” synthesis started with the invention of the first usable recording devices in the 1940s: the phonograph and, from 1950, the magnetic tape recorder (Battier, 2001, 2003). The tape cutting and splicing techniques were advanced to a point that different types of diagonal cuts were applied to control the character of the concatenation (from an abrupt transition to a more or less smooth cross-fade).

3.1.2. Pierre Schaeffer

The Groupe de Recherche Musicale (GRM) of Pierre Schaeffer used for the first time recorded segments of sound to create their pieces of Musique Concrète. In the seminal work Traité des Objets Musicaux (Schaeffer, 1966), explained in (Chion, 1995), Schaeffer defines the notion of the sound object, which is not so far from what is here called a unit: a sound object is a clearly delimited segment in a source recording, and is the basic unit of composition (Schaeffer & Reibel, 1967, 1998). Moreover, Schaeffer strove to base his theory of sound analysis on objectively, albeit manually, observable characteristics, the écoute réduite (reduced listening) (GRAM, 1996), which corresponds to a standardised descriptor set of the perceptible qualities of mass, grain, duration, matter, volume, and so on.
Table 1. Methods for concatenative musical sound synthesis, in chronological order.

Group | Name (Author) | Year | Type | Application | Inventory | Units | Segmentation | Descriptors | Selection | Concatenation | Real-time
1 | Musique Concrète (Schaeffer) | 1948 | art | composition | open | heterogeneous | manual | manual | manual | manual | no
2 | Digital Sampling | 1980 | sound | high-level | fixed | notes/any | manual | manual | fixed mapping | no | yes
1 | Phrase Sampling | 1990 | art | composition | open | phrases | manual | musical | manual | no | no
2 | Granular Synthesis | 1990 | sound | free | open | homogeneous | fixed time | – | manual | no | yes
1 | Plunderphonics (Oswald) | 1993 | art | composition | open | heterogeneous | manual | manual | manual | manual | no
7 | Caterpillar (Schwarz) | 2000 | research | high-level | open | heterogeneous | alignment | high-level | global | yes | no
7 | Musaicing (Pachet et al.) | 2001 | research | resynthesis | open | homogeneous | blind | low-level | constraints | no | no
4 | Soundmosaic (Hazel) | 2001 | application | resynthesis | open | homogeneous | fixed | signal match | local | no | no
4 | Soundscapes (Hoskinson et al.) | 2001 | application | texture | open | homogeneous | automatic | signal match | local | yes | no
3 | La Légende des siècles (Pasquet) | 2002 | sound | resynthesis | open | frames | blind | spectral match | spectral | no | yes
4 | Granuloop (Xiang) | 2002 | rhythm | free | open | beats | beat | spectral match | local | yes | yes
5 | MoSievius (Lazier and Cook) | 2003 | research | free | open | homogeneous | blind | low-level | local | no | yes
5 | Musescape (Tzanetakis) | 2003 | research | music selection | open | homogeneous | blind | high-level | local | no | yes
6 | MPEG-7 Audio Mosaics (Casey and Lindsay) | 2003 | research | resynthesis | open | homogeneous | on-the-fly | low-level | local | no | yes
3 | Sound Clustering Synthesis (Kobayashi) | 2003 | research | resynthesis | open | frames | fixed | low-level | spectral | no | no
4 | Directed Soundtrack Synthesis (Cardle et al.) | 2003 | research | texture | open | heterogeneous | automatic | low-level | constraints | yes | no
2 | Let them sing it for you (Bünger) | 2003 | web art | high-level | fixed | words | manual | semantic | direct | no | no
6 | Network Auralization for Gnutella (Freeman) | 2003 | software art | high-level | open | homogeneous | blind | context-dependent | local | no | yes
3 | Input driven resynthesis (Puckette) | 2004 | research | resynthesis | open | frames | fixed | low-level | local | yes | yes
4 | MATConcat (Sturm) | 2004 | research | resynthesis | open | homogeneous | fixed | low-level | local | no | no
2 | Synful (Lindemann) | 2004 | commercial | high-level | fixed | note parts | manual | high-level | lookahead | yes | yes
6 | SoundSpotter (Casey) | 2005 | research | resynthesis | open | homogeneous | on-the-fly | morphological | local | no | yes
7 | Audio Analogies (Simon et al.) | 2005 | research | high-level | open | notes/dinotes | manual | pitch | global | yes | no
7 | Ringomatic (Aucouturier et al.) | 2005 | research | high-level | open | drum bars | automatic | high-level | global | yes | yes
5 | frelia (Momeni and Mandel) | 2005 | installation | free | open | homogeneous | none | high-level+abstract | local | no | yes
5 | CataRT (Schwarz) | 2005 | sound | free | open | heterogeneous | alignment/blind | high-level | local | no | yes
2 | Vienna Symphonic Library Instruments | 2006 | commercial | high-level | fixed | note parts | manual | high-level | lookahead | yes | yes
6 | iTunes Signature Maker (Freeman) | 2006 | software art | high-level | open | homogeneous | blind | context-dependent | local | no | yes
3.1.3. Karlheinz Stockhausen

Schaeffer (1966) also relates Karlheinz Stockhausen’s desire to cut a tape into millimeter-sized pieces to recompose them, the notorious Étude des 1000 collants (study with one thousand pieces) of 1952. The piece (actually simply called Étude) was composed according to a score generated by a series for pitch, duration, dynamics, and timbral content, for a corpus of recordings of hammered piano strings, transposed and cropped to their steady sustained part (Manion, 1992).

3.1.4. John Cage

John Cage’s Williams Mix (1953) is a composition for 8 magnetic tapes that prescribes a corpus of about 600 recordings in 6 categories (e.g. city sounds, country sounds, electronic sounds), and how they are to be ordered and spliced together (Cage, 1962).

3.1.5. Iannis Xenakis

[…] are here expressed as half-bar pieces of a score, stochastically selected from an (implicit) corpus according to pitch group, dynamics, and density (DiScipio, 2005).

3.1.6. Phrase Sampling (1990s)

In commercial, mostly electronic, dance music, a large part of the musical material comes from specially constituted sampling CDs, containing rhythmic loops and short bass or melodic phrases. These phrases are generally labeled and grouped by tempo and sometimes characterised by mood or atmosphere. As the available CDs, aimed at professional music producers, number in the tens of thousands, each containing hundreds of samples, a large part of the work still consists in listening to the CDs and selecting suitable material that is then placed on a rhythmic grid, effectively constituting the base of a new song by concatenation of preexisting musical phrases.
[Figure 3. Comparison of musical sound synthesis methods according to selection (manual, fixed, targeted, automatic) and analysis (manual, frame spectrum similarity, segmental similarity, high-level descriptors), with use of concatenation quality (bold) and real-time capabilities (italics).]
3.1.7. Plunderphonics (1993)

[…]

    Plundered are over a thousand pop stars from the past 10 years. [...] It starts with rapmillisyllables and progresses through the material according to tempo (which has an interesting relationship with genre).

    Oswald (1993)

Cutler (1994) gives an extensive account of Oswald’s and related work throughout art history and addresses the incapability of copyright laws to handle this form of musical composition.

3.2. Group 2: Fixed Mapping

Here, the selection is performed by a predetermined mapping from a fixed inventory with no analysis at all (granular synthesis), manual analysis (Let them sing it for you), some analysis in class and pitch (digital sampling), or a more flexible rule-based mapping that takes care of selecting the appropriate transitions from the last selected unit to the next in order to obtain a good concatenation (Synful, Vienna Symphonic Library).

3.2.1. Digital Sampling (1980s)

In the widest reasonable sense of the term, digital sampling synthesisers, or samplers for short, which appeared at the beginning of the 1980s, were the first “concatenative” sound synthesis devices. A sampler is a device that can digitally record sounds and play them back, applying transposition, volume changes, and filters. Usually the recorded sound would be a note from an acoustic instrument, which is then mapped to the sampler’s keyboard. Multisampling uses several notes of different pitches, also played with different dynamics, to better capture the timbral variations of the acoustic instrument (Roads, 1996). Modern software samplers can use several gigabytes of sound data, which makes samplers clearly a data-driven fixed-inventory synthesis system, with the sound database analysed by instrument class, playing style, pitch, and dynamics, and the selection being reduced to a fixed mapping of MIDI note and velocity to a sample, without paying attention to the context of the notes played before, i.e. no consideration of concatenation quality. (For instance, Nemesys, the makers of Gigasampler (http://www.nemesysmusic.com), pride themselves on having sampled every note of a grand piano in every possible dynamic, resulting in a 1 GB sound set.)

3.2.2. Granular Synthesis (1990s)

Granular synthesis (Roads, 1988, 2001) takes short snippets out of a sound file, called grains, at an arbitrary rate. These grains are played back with a possibly changed pitch, envelope, and volume. The position and length of the snippets are controlled interactively, allowing the user to scan through the sound file at any speed.

Granular synthesis is rudimentarily data-driven, but there is no analysis, the unit size is determined arbitrarily, and the selection is limited to choosing the position in one single sound file. However, its concept of exploring a sound interactively could be combined with a pre-analysis of the data and thus be enriched by a targeted selection and the resulting control over the output sound characteristics, i.e. picking the grains that satisfy the wanted sound characteristics, as described for the free synthesis application in section 1.1.
3.2.3. Let them sing it for you (2003)

A fun web art project and application of not-quite-CSS is this site (http://www.sr.se/sing) (Bünger, 2003), where a text given by a user is synthesised by looking up each word in a hand-constituted database of snippets of pop songs in which that word is sung, each word being represented once. The database is extended upon users’ requests for new words. At the time of writing, it counted about 2000 units.

3.2.5. Vienna Symphonic Library (2006)

The Vienna Symphonic Library (http://www.vsl.co.at) is a huge collection (550 GB) of samples of all classical instruments in all playing styles and moods, including single notes, groups of notes, and transitions. Their so-called performance detection algorithms offer the possibility to automatically analyse a MIDI performance input and to select samples appropriate for the given transition and context in a real-time instrument plugin.

3.3. Group 3: Spectral Frame Similarity

This subclass of data-driven synthesis uses as units short-time signal frames that are matched to the target by a spectral similarity analysis (Input Driven Resynthesis) or additionally with a partially stochastic selection (La Légende des siècles, Sound Clustering). Here, forced by the short unit length, the selection must take care of the local context by stipulating certain continuity constraints, because otherwise FFT-frame salad would result.

3.3.1. La Légende des siècles (2002)

La Légende des siècles is a theatre piece performed at the Comédie Française, using real-time transformation of readings of Victor Hugo. One of these effects, developed by Olivier Pasquet, uses a data-driven synthesis method inspired by CSS: prerecorded audio is analysed off-line frame by frame according to the descriptors energy and pitch. Each FFT frame is then stored in a dictionary and clustered using the statistics program R (http://www.r-project.org). During the performance, this dictionary of FFT frames is used with an inverse FFT and overlap–add to resynthesize sound according to a target specification of pitch and energy. The continuity of the resynthesized frames is assured by a Hidden Markov Model trained on the succession of FFT-frame classes in the recordings.

3.3.3. Input Driven Resynthesis (2004)

This project (Puckette, 2004) starts from a database of FFT frames from one week of radio recording, analysed for loudness and 10 spectral bands as descriptors. The recording then forms a trajectory through the descriptor space, mapped to a hypersphere. Phase vocoder overlap–add resynthesis is controlled in real time by audio input that is analysed for the same descriptors, and the selection algorithm tries to follow a part of the database’s trajectory whenever possible, limiting jumps.

3.4. Group 4: Segmental Similarity

This group’s units are homogeneous segments that are locally selected by stochastic methods (Soundscapes and Textures, Granuloop), or matched to a target by segment similarity analysis on low-level signal processing descriptors (Soundmosaic, Directed Soundtracks, MATConcat).

3.4.1. Soundscapes and Texture Resynthesis (2001)

The Soundscapes project (http://www.cs.ubc.ca/∼reynald/applet/Scramble.html) (Hoskinson & Pai, 2001) generates endless but never repeating soundscapes from a recording, for installations. This means keeping the texture of the original sound file while being able to play it for an arbitrarily long time. The segmentation into synthesis units is performed by a wavelet analysis for good join points. A similar aim and approach is described in (Dubnov, Bar-Joseph, El-Yaniv, Lischinski, & Werman,
2002). This generative approach means that the synthesis target, too, is generated on the fly, driven by the original structure of the recording.

3.4.2. Soundmosaic (2001)

Soundmosaic (Hazel, 2001) constructs an approximation of one sound out of small pieces of varying size from other sounds (called tiles). For version 1.0 of Soundmosaic, the selection of the best source tile uses a direct match of the normalised waveform (Manhattan distance). Version 1.1 introduced as distance metric the correlation between normalised tiles (the dot product of the vectors over the product of their magnitudes). Concatenation quality is not yet included in the selection.

3.4.3. Granuloop (2002)

The data-driven probabilistic drum loop rearranger Granuloop (http://crca.ucsd.edu/∼pxiang/research.htm) (Xiang, 2002) is a patch for Pure Data (http://puredata.info) which constructs transition probabilities between 16th notes from a corpus of four drum loops. These transitions then serve to exchange segments in order to create variation, either autonomously or with user interaction.

The transition probabilities (i.e. the concatenation distances) are derived from loudness and spectral similarity computations, in order to favour continuity.

3.4.4. Directed Soundtrack Synthesis (2003)

Audio and user directed sound synthesis (Cardle, Brooks, & Robinson, 2003; Cardle, 2004) is aimed at the production of soundtracks for video by replacing existing soundtracks with sounds from a different audio source in small chunks similar in sound texture. It introduces user-definable constraints in the form of large-scale properties of the sound texture, e.g. preferred audio clips that shall appear at a certain moment. For the unconstrained parts of the synthesis, a Hidden Markov Model based on the statistics of transition probabilities between spectrally similar sound segments is left running freely in generative mode, much like the approach of Hoskinson and Pai (2001) described in section 3.4.1.

A slightly different approach is taken by Cano et al. (2004), where a sound atmosphere library is queried with a search term. The resulting sounds, plus other semantically related sounds, are then laid out in time for further editing. Here, we have no segmentation but a layering of the selected sounds according to exclusion rules and heuristics.

3.4.5. MATConcat (2004)

The MATConcat system (http://www.mat.ucsb.edu/∼b.sturm/sand/VLDCMCaR/VLDCMCaR.html) (Sturm, 2004a, 2004b) is an open source application in Matlab to explore concatenative synthesis. For the moment, units are homogeneous large windows taken out of the database sounds. The descriptors used are pitch, loudness, zero crossing rate, spectral centroid, spectral drop-off, and harmonicity, and selection is a match of descriptor values within a certain range of the target. The application offers many choices of how to handle the case of a non-match (leave a hole, continue the previously selected unit, pick a random unit), and through the use of a large window function on the grains the result sounds pleasingly smooth, which amounts to a squaring of the circle for concatenative synthesis. MATConcat is the first system used to compose two electroacoustic musical works, premiered at ICMC 2004: Concatenative Variations of a Passage by Mahler, and Dedication to George Crumb, American Composer.
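A sketch of this kind of range-based selection (our own simplification in Python, not MATConcat’s Matlab code), including one of the non-match strategies mentioned above:

```python
import random

def range_match(corpus, target, tolerance):
    """Select a unit whose descriptor values all lie within a given
    tolerance of the target's values; on a non-match, fall back to one
    of the strategies above (here: pick a random unit).  Units and the
    target are dicts of descriptor values."""
    matches = [u for u in corpus
               if all(abs(u[k] - v) <= tolerance[k]
                      for k, v in target.items())]
    return random.choice(matches) if matches else random.choice(corpus)
```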
3.5. Group 5: Descriptor Analysis with Direct Selection in Real Time

This group uses descriptor analysis with a direct local real-time selection of heterogeneous units, without caring for concatenation quality. The local target is given according to a subset of the same descriptors in real time (MoSievius, Musescape (see section 2.4), CataRT, frelia).

3.5.1. MoSievius (2003)

The MoSievius system (http://soundlab.cs.princeton.edu/research/mosievius) (Lazier & Cook, 2003) is an encouraging first attempt to apply unit selection to real-time performance-oriented synthesis with direct intuitive control.

The system is based on sound segments placed in a loop: according to user-controlled ranges for some descriptors, a segment is played when its descriptor values lie within the ranges. The descriptor set used contains voicing, energy, spectral flux, spectral centroid, and instrument class. This method of content-based retrieval is called Sound Sieve and is similar to the Musescape system (Tzanetakis, 2003) for music selection (see section 2.4).

3.5.2. CataRT (2005)

The ICMC 2005 workshop on Audio Mosaicing: Feature-Driven Audio Editing/Synthesis saw the presentation of the first prototype of a real-time concatenative synthesiser (Schwarz, 2005) called CataRT, loosely based on Caterpillar. It implements the application of free synthesis as interactive exploration of sound databases (section 1.1) and is in its present state rather close to directed, data-driven granular synthesis (section 3.2.2).

In CataRT, the units in the chosen corpus are laid out in a Euclidean descriptor space, made up of pitch, loudness, spectral characteristics, modulation, etc. A (usually 2-dimensional) projection of this space serves as the user interface that displays the units’ positions and allows the user to move a cursor. The units closest to the cursor’s position are selected and played at an arbitrary rate. CataRT is implemented as a Max/MSP (http://www.cycling74.com) patch using the FTM and Gabor extensions (http://www.ircam.fr/ftm) (Schnell, Borghesi, Schwarz, Bevilacqua, & Müller, 2005; Schnell & Schwarz, 2005). The sound and descriptor data can be loaded from SDIF files (see section 4.3) containing MPEG-7 descriptors, or can
be calculated on the fly. It is then stored in FTM data structures in memory. An interface to the Caterpillar database, to the freesound repository (see section 4.3), and to other sound databases is planned.
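The selection principle just described can be sketched in a few lines (a hypothetical illustration, not CataRT’s actual FTM/Max/MSP implementation):

```python
import numpy as np

def nearest_units(positions, cursor, n=1):
    """Return the indices of the n units whose coordinates in the
    (projected) descriptor space lie closest to the cursor position.
    `positions` is an (N, d) array of unit coordinates, `cursor` a
    d-vector."""
    dists = np.linalg.norm(positions - cursor, axis=1)
    return np.argsort(dists)[:n]
```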
3.5.3. frelia (2005)

The interactive installation frelia (http://ali.corpuselectronica.com/projects/frelia/frelia.html) by Ali Momeni and Robin Mandel uses sets of uncut sounds from the freesound repository (see section 4.3), chosen by the textual description given by the sound creator. The sounds are laid out in two dimensions for the user to choose from, according to the two principal components of freesound’s descriptor space of about 170 dimensions, calculated by the AudioClas library (http://audioclas.iua.upf.edu).

3.6. Group 6: High-Level Descriptors for Targeted or Stochastic Selection

Here, high-level musical or contextual descriptors are used for targeted or stochastic local selection (MPEG-7 Audio Mosaics, SoundSpotter, NAG), without specific handling of concatenation quality.

3.6.1. MPEG-7 Audio Mosaics (2003)

In the introductory tutorial at the DAFx 2003 conference (http://www.elec.qmul.ac.uk/dafx03), titled Sound replacement, beat unmixing and audio mosaics: Content-based audio processing with MPEG-7, Michael Casey and Adam Lindsay showed what they called “creative abuse” of MPEG-7: audio mosaics based on pop songs, calculated by finding the best matching snippets of one Beatles song to reconstitute another one. The match was calculated from the MPEG-7 low-level descriptors, but no measure of concatenation quality was included in the selection.

3.6.2. Network Auralization for Gnutella (2003)

Jason Freeman’s N.A.G. software (Freeman, 2003) selects snippets of music downloaded from the Gnutella p2p network according to descriptors such as the search term, network bandwidth, etc., and makes a collage out of them by concatenation.

The descriptors used here are partly content-dependent, like the metadata accessed by the search term, and partly context-dependent, i.e. changing from one selection to the next, like the network characteristics.

A similar approach is taken in the forthcoming iTunes Signature Maker (http://www.jasonfreeman.net/itsm), which creates a short sonic signature from an iTunes music collection as a collage according to descriptors like play count, rating, and last play date, which are again context-dependent descriptors.

3.6.3. SoundSpotter (2004)

Casey’s system, implemented in Pure Data on a PostgreSQL (http://www.postgresql.org) database, performs real-time resynthesis of an audio target from an arbitrary-size database by matching of strings of 8 “sound lexemes”, which are basic spectro-temporal constituents of sound. Casey reports that about 60 lexemes are enough to describe, in their various temporal combinations, any sound. By hashing and standard database indexation techniques, highly efficient lookup is possible, even on very large sound databases. Casey (2005) claims that one petabyte or 3000 years of audio can be searched in less than half a second.

3.7. Group 7: Descriptor Analysis with Fully Automatic High-Level Unit Selection

This last group uses descriptor analysis with fully automatic global high-level unit selection and concatenation, either by path-search unit selection (Caterpillar, Audio Analogies) or by real-time constraint-solving unit selection (Musical Mosaicing, Ringomatic).

3.7.1. Caterpillar (2000)

Caterpillar, first proposed in (Schwarz, 2000, 2003a, 2003b) and described in detail in (Schwarz, 2004), performs data-driven concatenative musical sound synthesis from large heterogeneous sound databases.

Units are segmented by automatic alignment of music with its score (Orio & Schwarz, 2001) for instrument corpora, and by blind segmentation for free and re-synthesis. In the former case, the solo instrument recordings are split into seminote units, which can then be recombined to dinotes, analogous to diphones from speech synthesis. The unit boundaries are thus usually within the sustain phase and as such in a stable part of the notes, where concatenation can take place with the least discontinuity. The descriptors are based on the MPEG-7 low-level descriptor set, plus descriptors derived from the score and the sound class. The low-level descriptors are condensed to unit descriptors by modeling of their temporal evolution over the unit (mean value, slope, spectrum, etc.). The database is implemented using the relational SQL database management system PostgreSQL for added reliability and flexibility.

The unit selection algorithm is of the path-search type (see section 1.2.7), where a Viterbi algorithm finds the globally optimal sequence of database units that best match the given synthesis target units using two cost functions: the target cost expresses the similarity of a target unit to the database units by weighted Euclidean distance, including a context around the target, and the concatenation cost predicts the quality of the join of two database units by join-point continuity of selected descriptors.

Unit corpora of violin sounds, environmental noises, and speech have been built and used for a variety of sound examples of high-level synthesis and resynthesis of audio.
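The path-search principle can be illustrated by the following sketch of Viterbi-style dynamic programming over candidate units (an illustration of the general technique, not Caterpillar’s actual code; the cost functions and unit representation are left abstract):

```python
import numpy as np

def viterbi_selection(candidates, target_cost, concat_cost):
    """For each target position tau, `candidates[tau]` is a list of
    database units; find the sequence minimising the summed target and
    concatenation costs."""
    T = len(candidates)
    # cumulative cost per candidate at each position, plus backpointers
    cum = [np.array([target_cost(u, 0) for u in candidates[0]])]
    back = []
    for tau in range(1, T):
        prev = cum[-1]
        cur, bp = [], []
        for u in candidates[tau]:
            trans = np.array([concat_cost(v, u) for v in candidates[tau - 1]])
            j = int(np.argmin(prev + trans))
            cur.append(prev[j] + trans[j] + target_cost(u, tau))
            bp.append(j)
        cum.append(np.array(cur))
        back.append(bp)
    # trace back the globally optimal path
    path = [int(np.argmin(cum[-1]))]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return [candidates[tau][i] for tau, i in enumerate(path)]
```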
Casey’s system, implemented in Pure Data on a Post-
GreSQL 25 database, performs real-time resynthesis of an 3.7.2. Talkapillar (2003)
21
22
http://ali.corpuselectronica.com/projects/frelia/frelia.html The derived project Talkapillar (Kärki, 2003) adapted the
http://audioclas.iua.upf.edu
23 http://www.elec.qmul.ac.uk/dafx03 Caterpillar system for text-to-speech synthesis using spe-
24 http://www.jasonfreeman.net/itsm cialised phonetic and phonologic descriptors. One of its
25 http://www.postgresql.org applications is to recreate the voice of a defunct eminent
10
writer to read one of his texts for which no recordings exist. The goal here is different from fully automatic text-to-speech synthesis: the highest speech quality is needed (concerning both sound and expressiveness), and manual refinement is allowed.

The role of Talkapillar is to give the highest possible automatic support for human decisions and synthesis control, and to select a number of well-matching units in a very large base (obtained by automatic alignment) according to high-level linguistic descriptors, which reliably predict the low-level acoustic characteristics of the speech units from their grammatical and prosodic context, and emotional and expressive descriptors (Beller, 2004, 2005).

In a further development, this system now allows hybrid concatenation between music and speech by mixing speech and music target specifications and databases, and is applicable to descriptor-driven or context-sensitive voice effects (Beller, Schwarz, Hueber, & Rodet, 2005); examples can be heard on http://www.ircam.fr/anasyn/concat.

3.7.3. Musical Mosaicing (2001)

Musical Mosaicing, or Musaicing (Zils & Pachet, 2001), performs a kind of automated remix of songs. It is aimed at a sound database of pop music, selecting pre-analysed homogeneous snippets of songs and reassembling them. Its great innovation was to formulate unit selection as a constraint solving problem (CSP). The set of descriptors used for the selection is: mean pitch (by zero crossing rate), loudness, percussivity, and timbre (by spectral distribution). Work on adding more descriptors has picked up again with (Zils & Pachet, 2003, 2004) (see also section 4.2) and is further advanced in section 3.7.4.

3.7.4. Ringomatic (2005)

The work of Musical Mosaicing (section 3.7.3) is adapted to real-time interactive high-level selection of bars of drum recordings in the recent Ringomatic system (Aucouturier & Pachet, 2005). The constraint solving problem (CSP) of Zils and Pachet (2001) is reformulated for the real-time case, where the next bar of drums from a database of recordings of drum playing has to be selected according to local matching constraints and global continuity constraints holding on the previously selected bars.

The local match is defined by four drum-specific descriptors derived by the EDS system (see section 4.2): perceptive energy, onset density, presence of drums, presence of cymbals. Interaction takes place by analysing a MIDI performance and mapping its energy, density, and mean pitch to target drum descriptors. The local constraints derived from these are then balanced against the continuity constraints to choose between reactivity and autonomy of the generated drum accompaniment.

3.7.5. Audio Analogies (2005)

Expressive instrument synthesis from MIDI (trumpet in the examples) is the aim of this project by researchers from the University of Washington and Microsoft Research (Simon, Basu, Salesin, & Agrawala, 2005), achieved by selecting note units by pitch from a sound base constituted by just one solo recording from a practice CD. The result sounds very convincing because of the good quality of the manual segmentation, the globally optimal selection using the Viterbi algorithm as in (Schwarz, 2000), and transformations with a PSOLA algorithm to perfectly attain the target pitch and duration of each unit.

An interesting point is that the style and expression of the song chosen as sound base are clearly perceivable in the synthesis result.

4. REMAINING PROBLEMS

This section gives a (necessarily incomplete) selection of the most urgent or interesting problems to work on.

4.1. Segmentation

The segmentation of the source sounds that are to form the database is fundamental, because it defines the unit base and thus the whole synthesis output. While phone or note units are clearly defined and automatically segmentable, even more so when the corresponding text or score is available, other source material is less easy to segment. For general sound events, automatic segmentation into sound objects in the Schaefferian sense is only at its beginning (Hoskinson & Pai, 2001; Cardle et al., 2003; Jehan, 2004). Also, segmentation of music (used e.g. by Zils and Pachet) is harder to do right because of the complexity of the material. Finally, the best solution would be not to start from a fixed segmentation, but to be able to choose the units’ segments on the fly. However, this means that the unit descriptors’ temporal modeling also has to be recalculated accordingly (see section 1.2.1), which poses hard problems for efficiency; a possible solution is the scale tree in (de Cheveigné, 2002).

4.2. Descriptors

Better descriptors are needed for more musical use of concatenative synthesis, and for more efficient use for sound synthesis.

Definitely needed is a descriptor for the percussiveness of a unit. In (Tzanetakis, Essl, & Cook, 2002), this question is answered for musical excerpts, by calibrating automatically extracted descriptors for beat strength against perceptive measurements.

An interesting approach to the definition of new descriptors is the Extractor Discovery System (EDS) (Zils & Pachet, 2003, 2004): here, a genetic algorithm evolves a formula using standard DSP and mathematical building blocks, whose fitness is then rated using a cross-validation database with data labeled by users. This method was successfully applied to the problem of finding an algorithm to calculate the perceived intensity of music.

4.2.1. Musical Descriptors

The recent progress in the establishment of a standard score representation format, with MusicXML as the most promising candidate, means that we can soon overcome
the limitations of MIDI and make use of the entire information from the score, when available and linked to the units by alignment. This means performing unit selection on a higher level, exploiting musical context information from the score, such as dynamics (crescendo, diminuendo), and better describing the units (e.g. we’d know which units are trills, which ones bear an accent, etc.). We can already now derive musical descriptors from an analysis of the score, such as:

Harmony: A unit’s chord or chord class, and a measure of consonance/dissonance, can serve as powerful high-level musical descriptors that are easy to specify as a target, e.g. in MIDI.

Rhythm: Position in the measure and the relative weight or accent of the note apply mainly to percussive sounds. This information can partially be derived from the score but should be complemented by beat tracking that analyses the signal for the properties of the percussion sounds.

Musical Structure: Future descriptors that express the position or function of a unit within the musical structure of a piece will make accessible for selection the subtle nuances that performers install in the music. This further develops the concept of high-level synthesis (see section 1.1) by giving context information about the musical function of a unit in the piece, such that the selection can choose units that fulfill the same function. For speech synthesis, this technique has had a surprisingly large effect on naturalness (Prudon, 2003).

4.2.2. Evaluation of Descriptor Salience

Advanced standard descriptor sets like MPEG-7 propose tens of descriptors, whose temporal evolution can then be characterised by several parameters. This enormous number of parameters that could be used for selection of course carries incredible redundancies. However, as concatenative synthesis is to be used for musical applications, one cannot know in advance which descriptors will be useful. The aim is to give maximum flexibility to the composer using the system. Most applications only use a very small subset of these descriptors.

For the more precisely defined applications, a systematic evaluation of which descriptors are the most useful for synthesis would be welcome, similar to the automatic choice of descriptors for instrument classification in (Livshin, Peeters, & Rodet, 2003).

An important open research question is how to map the descriptors we can automatically extract from the sound data to a perceptive similarity space that allows us to obtain distances between units.

4.3. Database and Intellectual Property

The databases used for concatenative synthesis are generally rather small, e.g. 1h 30 in Caterpillar. In speech synthesis, 10 hours are needed for only one mode of speech!

Standard descriptor formats and APIs are not so far away, with MPEG-7 and the SDIF Sound Description Interchange Format (http://www.ircam.fr/sdif) (Wright, Chaudhary, Freed, Khoury, & Wessel, 1999; Schwarz & Wright, 2000). A common database API would greatly enhance the possibilities of exchange, but it is probably still too early to define it.

Finally, concatenative synthesis from existing song material evokes tough legal questions of intellectual property, sampling, and citation practices, as evoked by Oswald (1999), Cutler (1994), and Sturm (2006) in this issue, and summarised by John Oswald in (Cutler, 1994) as follows:

    If creativity is a field, copyright is the fence.

A welcome initiative is the freesound project (http://iua-freesound.upf.es), a collaboratively built up online database of samples under licensing terms less restrictive than the standard copyright, as provided by the Creative Commons (http://creativecommons.org) family of licenses. Now imagine a transparent net access from a concatenative synthesis system to this sound database, with unit descriptors already calculated — an endless supply of fresh sound material. (The license type of each unit should be part of the descriptor set, such that a composer could, e.g., select only units with a license permitting commercial use, if she wants to sell the composition.)

4.4. Data-Driven Optimisation of Unit Selection

It should be possible to exploit the data in the database to analyse the natural behaviour of an underlying instrument or sound generation process, which enables us to better predict what is natural in synthesis. The following points are developed in more detail in (Schwarz, 2004).

4.4.1. Learning Distances from the Data

Knowledge about similarity or distance between high-level symbolic descriptors can be obtained from the database by an acoustic distance function and classification. For speech, with its regular and homogeneous phone units, this is relatively clear (Macon, Cronk, & Wouters, 1998), but for music, the acoustic distance is the first problem: how do we compare different pitches, or units of completely different origins and durations?

4.4.2. Learning Concatenation from the Data

A corpus of recordings of instrumental performances or any other sound generating process can be exploited to learn the concatenation distance function from the data by statistical analysis of pairs of consecutive units in the database. The set of each unit’s descriptors defines a point in a high-dimensional descriptor space D. The natural concatenation with the consecutive unit defines a vector to that unit’s point in D. The question is now whether, given any pair of points in D, we can obtain from this vector field a measure of the degree to which the two associated units concatenate as if they were consecutive.

The problem of modeling a high-dimensional vector field becomes easier if we restrict the field to clusters of units in a corpus and calculate the distances between all pairs of cluster centres. This provides us with a concatenation distance matrix between clusters that can be used as a fast lookup table for unit selection. This allows
us also to use the database for synthesis by modeling the probabilities of going from one cluster of units to the next. This model would prefer, in synthesis, the typical articulations taking place in the database source, or, when left running freely, would generate a sequence of units that recreates the texture of the source sounds.
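As an illustration of this cluster-based simplification (all names hypothetical), transition counts between the cluster labels of consecutive units in the corpus yield such a lookup table, which can also drive free-running synthesis:

```python
import numpy as np

def cluster_transition_model(labels, n_clusters):
    """Count how often a unit of cluster a is followed by a unit of
    cluster b in the source recordings, and normalise the counts into
    a transition probability matrix."""
    counts = np.zeros((n_clusters, n_clusters))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)
```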
4.4.3. Learning Weights from the Data
Finally, there is a large corpus of literature about auto- 5. CONCLUSION
matically obtaining the weights for the distance functions
by search in the weight-space with resynthesis of natu- What we tried to show in this article is that many ap-
ral recordings for speech synthesis (Hunt & Black, 1996; proaches pick up the general idea of data-driven concate-
Macon et al., 1998). A performance optimised method, native synthesis, or part of it, to achieve interesting re-
applied to singing voice synthesis, is described in (Meron, sults, without knowing about the other work in the field.
1999), and an application in Talkapillar is described in To foster exchange of ideas and experience and help the
(Lannes, 2005). fledgling community, a mailinglist [email protected] has
All these data-driven methods depend on an acoustic been created, accessible from (Schwarz, 2006). This site
or perceptual distance measure that can tell us when two also hosts the online version of this survey of research and
sounds “sound the same”. Again, for speech this might musical systems using concatenation which is continually
be relatively clear, but for music, this is itself a subject of updated.
research in musical perception and cognition. Professional and multi-media sound synthesis devices
or software show a natural drive to make use of the ad-
vanced mass storage capacities available today, and of the
4.5. Real-Time Interactive Selection easily available large amount of digital content. We can
Using concatenative synthesis in real-time allows interac- foresee this type of applications hitting a natural limit of
tive browsing of a sound database. The obvious interac- manageability of the amount of data. Only automatic sup-
tion model of a trajectory through the descriptor space port of the data-driven composition process will be able to
presents the problems of its sparse and uneven popula- surpass this limit and make the whole wealth of musical
tion. A more appropriate model might be that of navi- material accessible to the musician.
gation through a graph of clusters of units. However, a Where is concatenative sound synthesis now? The mu-
good mix of generative and user-driven behaviour of the sical applications of CSS are just starting to become con-
system has to be found. 31 vincing (Sturm 2004a, see section 3.4.5), and real-time
Globally optimal unit selection algoritms, that take explorative synthesis is around the corner (Schwarz 2005,
care of concatenation quality such as Viterbi path search see section 3.5.2). For high-level synthesis, we stand at
or constraint satisfaction, are inherently non real-time. the same position speech synthesis stood 10 years ago,
Real-time synthesis could partially make up for this by with yet too small databases, and many open research
allowing transformation of the selected units. This intro- questions. The first commercial application (Lindemann
duces the need for defining a transformation cost that pre- 2001, see section 3.2.4) is comparable to the early fixed-
dicts the loss of sound quality introduced by this. inventory diphone speech synthesisers, but its expressivity
Real-time synthesis also places more stress on the effi- and real-time capabilities are much more advanced than
ciency of the selection algorithm, which can be augmented that.
through clustering of the unit database or use of optimised Data-driven synthesis is now more feasible than ever
multi-dimensional indices (D’haes, Dyck, & Rodet, 2002, with the arrival of large sound database schemes. They
2003; Roy, Aucouturier, Pachet, & Beurivé, 2005). How- finally promise to provide large sound corpora in stan-
ever, also in the non real-time case, faster algorithms allow dardised description. It is this constellation that provided
for more experimentation and for more parameters to be the basis for great advancements in speech research: the
explored. existence of large speech databases allowed corpus-based
linguistics to enhance linguistic knowledge and the per-
formance of speech tools.
4.6. Synthesis

The commonly used simple crossfade concatenation is enough for the first steps of concatenative sound synthesis. Eventually, one would have to apply the findings from speech synthesis about reducing discontinuities (Prudon, 2003) or the recent work by Osaka (2005), or use advanced signal models like additive sinusoidal plus noise, or PSOLA. This leads to parametric concatenation, where units are joined in the parameter domain of the signal model before resynthesis.
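A minimal sketch of that simple crossfade (Python with NumPy; the fade length and linear ramp are illustrative choices, not taken from any of the systems surveyed):

import numpy as np

def concatenate_crossfade(a, b, fade=256):
    """Join two unit waveforms with a linear crossfade of `fade`
    samples: the tail of `a` fades out while the head of `b` fades in.
    This masks small amplitude discontinuities, but not the timbral or
    phase mismatches addressed by the more advanced methods above."""
    ramp = np.linspace(0.0, 1.0, fade)
    overlap = a[-fade:] * (1.0 - ramp) + b[:fade] * ramp
    return np.concatenate([a[:-fade], overlap, b[fade:]])

# Two toy "units": 100 ms sine segments at different frequencies.
sr = 44100
t = np.arange(sr // 10) / sr
out = concatenate_crossfade(np.sin(2 * np.pi * 220 * t),
                            np.sin(2 * np.pi * 330 * t))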
5. CONCLUSION

A companion web page has been created, accessible from (Schwarz, 2006). This site also hosts the online version of this survey of research and musical systems using concatenation, and is continually updated.

Professional and multi-media sound synthesis devices or software show a natural drive to make use of the advanced mass storage capacities available today, and of the easily available large amount of digital content. We can foresee this type of application hitting a natural limit of manageability of the amount of data. Only automatic support of the data-driven composition process will be able to surpass this limit and make the whole wealth of musical material accessible to the musician.

Where is concatenative sound synthesis now? The musical applications of CSS are just starting to become convincing (Sturm, 2004a; see section 3.4.5), and real-time explorative synthesis is around the corner (Schwarz, 2005; see section 3.5.2). For high-level synthesis, we stand at the same position speech synthesis stood 10 years ago, with databases that are still too small and many open research questions. The first commercial application (Lindemann, 2001; see section 3.2.4) is comparable to the early fixed-inventory diphone speech synthesisers, but its expressivity and real-time capabilities are much more advanced.

Data-driven synthesis is now more feasible than ever with the arrival of large sound database schemes, which finally promise to provide large sound corpora in standardised description. It is this constellation that provided the basis for great advancements in speech research: the existence of large speech databases allowed corpus-based linguistics to enhance linguistic knowledge and the performance of speech tools.

Where will concatenative sound synthesis be in a few years' time? To answer this question, we can sneak a look at where speech synthesis is today: after 15 years of research, text-to-speech synthesis has become a technology mature to the extent that all recent commercial speech synthesis systems are concatenative. This success is also due to database sizes of up to 10 hours of speech, a size we have not yet reached for musical synthesis.

The hypothesis of high-level symbolic synthesis explained in section 1.1 proved true for speech synthesis, when the database is large enough (Prudon, 2003). However, this database size is needed to adequately synthesise just one "instrument", the human voice, in just one "neutral" expression. What we set out for with data-driven concatenative sound synthesis is synthesising a multitude of instruments and sound processes, each with its idiosyncratic behaviour. Moreover, research on multi-emotion or expressive speech synthesis, something we cannot do without for music, is still at its beginning.
6. ACKNOWLEDGEMENTS

Thanks go to Matt Wright, Jean-Philippe Lambert, and Arshia Cont for pointing out interesting sites that (ab)use CSS, to Bob Sturm for the discussions and the beautiful music, to Mikhail Malt for sharing his profound knowledge of the history of electronic music, to all the authors of the research mentioned here for their interesting work in the emerging field of concatenative synthesis, and to Adam Lindsay for bringing people of this field together.
References

Amatriain, X., Bonada, J., Loscos, A., Arcos, J., & Verfaille, V. (2003). Content-based transformations. Journal of New Music Research, 32(1), 95–114.
Aucouturier, J.-J., & Pachet, F. (2005). Ringomatic: A Real-Time Interactive Drummer Using Constraint-Satisfaction and Drum Sound Descriptors. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR) (pp. 412–419). London, UK.
Aucouturier, J.-J., Pachet, F., & Hanappe, P. (2004). From sound sampling to song sampling. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR). Barcelona, Spain.
Battier, M. (2001). Laboratori. In J.-J. Nattiez (Ed.), Enciclopedia della musica (Vol. I, pp. 404–419). Milan: Einaudi.
Battier, M. (2003). Laboratoires. In J.-J. Nattiez (Ed.), Musiques. Une encyclopédie pour le XXIe siècle (Vol. I, Musiques du XXe siècle, pp. 558–574). Paris: Actes Sud, Cité de la musique.
Beller, G. (2004). Un synthétiseur vocal par sélection d'unités. Rapport de stage DEA ATIAM, Ircam – Centre Pompidou, Paris, France.
Beller, G. (2005). La musicalité de la voix parlée. Maîtrise de musique, Université Paris 8, Paris, France.
Beller, G., Schwarz, D., Hueber, T., & Rodet, X. (2005). A hybrid concatenative synthesis system on the intersection of music and speech. In Journées d'Informatique Musicale (JIM) (pp. 41–45). MSH Paris Nord, St. Denis, France.
Bonada, J., Celma, O., Loscos, A., Ortola, J., Serra, X., Yoshioka, Y., Kayama, H., Hisaminato, Y., & Kenmochi, H. (2001). Singing voice synthesis combining excitation plus resonance and sinusoidal plus residual models. In Proceedings of the International Computer Music Conference (ICMC). Havana, Cuba.
Bünger, E. (2003). Let Them Sing It For You. Web page. (http://www.sr.se/sing, http://www.erikbunger.com/)
Cage, J. (1962). Werkverzeichnis. New York: Edition Peters.
Cano, P., Fabig, L., Gouyon, F., Koppenberger, M., Loscos, A., & Barbosa, A. (2004). Semi-automatic ambiance generation. In Proceedings of the 7th International Conference on Digital Audio Effects (DAFx). Naples, Italy.
Cardle, M. (2004). Automated Sound Editing (Tech. Rep.). Cambridge, UK: University of Cambridge, Computer Laboratory.
Cardle, M., Brooks, S., & Robinson, P. (2003). Audio and user directed sound synthesis. In Proceedings of the International Computer Music Conference (ICMC). Singapore.
Casey, M. (2005). Acoustic Lexemes for Real-Time Audio Mosaicing [Workshop]. In A. T. Lindsay (Ed.), Audio Mosaicing: Feature-Driven Audio Editing/Synthesis. Barcelona, Spain: International Computer Music Conference (ICMC) workshop. (http://www.icmc2005.org/index.php?selectedPage=120)
Chion, M. (1995). Guide des objets sonores. Paris, France: Buchet/Chastel.
Codognet, P., & Diaz, D. (2001). Yet another local search method for constraint solving. In AAAI Symposium. North Falmouth, Massachusetts.
Cutler, C. (1994). Plunderphonia. Musicworks, 60(Fall), 6–19.
de Cheveigné, A. (2002). Scalable metadata for search, sonification and display. In International Conference on Auditory Display (ICAD 2002) (pp. 279–284). Kyoto, Japan.
D'haes, W., Dyck, D. van, & Rodet, X. (2002). An efficient branch and bound search algorithm for computing k nearest neighbors in a multidimensional vector space. In IEEE Advanced Concepts for Intelligent Vision Systems (ACIVS). Gent, Belgium.
D'haes, W., Dyck, D. van, & Rodet, X. (2003). PCA-based branch and bound search algorithms for computing K nearest neighbors. Pattern Recognition Letters, 24(9–10), 1437–1451.
Di Scipio, A. (2005). Formalization and Intuition in Analogique A et B. In Proceedings of the International Symposium Iannis Xenakis (pp. 95–108). Athens, Greece.
Dubnov, S., Bar-Joseph, Z., El-Yaniv, R., Lischinski, D., & Werman, M. (2002). Synthesis of audio sound textures by learning and resampling of wavelet trees. IEEE Computer Graphics and Applications, 22(4), 38–48.
Forney, G. D. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61, 268–278.
Freeman, J. (2003). Network Auralization for Gnutella. Web page. (http://turbulence.org/Works/freeman, http://www.jasonfreeman.net/Catalog/electronic/nag.html)
GRAM (Ed.). (1996). Dictionnaire des arts médiatiques. Groupe de recherche en arts médiatiques, Université du Québec à Montréal. (http://www.comm.uqam.ca/~GRAM)
Hazel, S. (2001). Soundmosaic. Web page. (http://thalassocracy.org/soundmosaic)
Hoskinson, R., & Pai, D. (2001). Manipulation and resynthesis with natural grains. In Proceedings of the International Computer Music Conference (ICMC). Havana, Cuba.
Hummel, T. A. (2005). Simulation of Human Voice Timbre by Orchestration of Acoustic Music Instruments. In Proceedings of the International Computer Music Conference (ICMC). Barcelona, Spain: ICMA.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 373–376). Atlanta, GA.
Hunter, J. (1999). MPEG-7 Behind the Scenes. D-Lib Magazine, 5(9). (http://www.dlib.org/)
Jehan, T. (2004). Event-Synchronous Music Analysis/Synthesis. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). Naples, Italy.
Kärki, O. (2003). Système talkapillar. Unpublished master's thesis (rapport de stage), EFREI, Ircam – Centre Pompidou, Paris, France.
Kobayashi, R. (2003). Sound clustering synthesis using spectral data. In Proceedings of the International Computer Music Conference (ICMC). Singapore.
Lannes, Y. (2005). Synthèse de la parole par concaténation d'unités (Mastère Recherche Signal, Image, Acoustique, Optimisation). Université Toulouse III Paul Sabatier.
Lazier, A., & Cook, P. (2003). MOSIEVIUS: Feature driven interactive audio mosaicing. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx) (pp. 312–317). London, UK.
Lindemann, E. (2001, November). Musical synthesizer capable of expressive phrasing. US Patent 6,316,710.
Lindsay, A. T., Parkes, A. P., & Fitzgerald, R. A. (2003). Description-driven context-sensitive effects. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). London, UK.
Livshin, A., Peeters, G., & Rodet, X. (2003). Studies and improvements in automatic classification of musical sound samples. In Proceedings of the International Computer Music Conference (ICMC). Singapore.
Lomax, K. (1996). The development of a singing synthesiser. In 3èmes Journées d'Informatique Musicale (JIM). Île de Tatihou, Lower Normandy, France.
Macon, M., Jensen-Link, L., Oliverio, J., Clements, M. A., & George, E. B. (1997a). A singing voice synthesis system based on sinusoidal modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 435–438). Munich, Germany.
Macon, M., Jensen-Link, L., Oliverio, J., Clements, M. A., & George, E. B. (1997b). Concatenation-Based MIDI-to-Singing Voice Synthesis. In 103rd Meeting of the Audio Engineering Society. New York.
Macon, M. W., Cronk, A. E., & Wouters, J. (1998). Generalization and discrimination in tree-structured unit selection. In Proceedings of the 3rd ESCA/COCOSDA International Speech Synthesis Workshop. Jenolan Caves, Australia.
Manion, M. (1992). From Tape Loops to Midi: Karlheinz Stockhausen's Forty Years of Electronic Music. Online article. (http://www.stockhausen.org/tape_loops.html)
Meron, Y. (1999). High quality singing synthesis using the selection-based synthesis scheme. Unpublished doctoral dissertation, University of Tokyo.
Orio, N., & Schwarz, D. (2001). Alignment of Monophonic and Polyphonic Music to a Score. In Proceedings of the International Computer Music Conference (ICMC). Havana, Cuba.
Osaka, N. (2005). Concatenation and stretch/squeeze of musical instrumental sound using sound morphing. In Proceedings of the International Computer Music Conference (ICMC). Barcelona, Spain.
Oswald, J. (1993). Plexure. CD. (http://plunderphonics.com/xhtml/xdiscography.html#plexure)
Oswald, J. (1999). Plunderphonics. Web page. (http://www.plunderphonics.com)
Pachet, F., Roy, P., & Cazaly, D. (2000). A combinatorial approach to content-based music selection. IEEE MultiMedia, 7(1), 44–51.
Prudon, R. (2003). A selection/concatenation TTS synthesis system. Unpublished doctoral dissertation, LIMSI, Université Paris XI, Orsay, France.
Puckette, M. (2004). Low-Dimensional Parameter Mapping Using Spectral Envelopes. In Proceedings of the International Computer Music Conference (ICMC) (pp. 406–408). Miami, Florida.
Roads, C. (1988). Introduction to granular synthesis. Computer Music Journal, 12(2), 11–13.
Roads, C. (1996). The computer music tutorial (pp. 117–124). Cambridge, Massachusetts: MIT Press.
Roads, C. (2001). Microsound. Cambridge, Massachusetts: MIT Press.
Rodet, X. (2002). Synthesis and processing of the singing voice. In Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA). Leuven, Belgium.
Roy, P., Aucouturier, J.-J., Pachet, F., & Beurivé, A. (2005). Exploiting the Tradeoff Between Precision and CPU-time to Speed up Nearest Neighbor Search. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR). London, UK.
Schaeffer, P. (1966). Traité des objets musicaux (1st ed.). Paris, France: Éditions du Seuil.
Schaeffer, P., & Reibel, G. (1967). Solfège de l'objet sonore. Paris, France: ORTF. (Reedited as Schaeffer & Reibel, 1998)
Schaeffer, P., & Reibel, G. (1998). Solfège de l'objet sonore. Paris, France: INA Publications–GRM. (Reedition on 3 CDs with booklet of Schaeffer & Reibel, 1967)
Schnell, N., Borghesi, R., Schwarz, D., Bevilacqua, F., & Müller, R. (2005). FTM—Complex Data Structures for Max. In Proceedings of the International Computer Music Conference (ICMC). Barcelona, Spain.
Schnell, N., & Schwarz, D. (2005). Gabor, Multi-Representation Real-Time Analysis/Synthesis. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). Madrid, Spain.
Schwarz, D. (2000). A System for Data-Driven Concatenative Sound Synthesis. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx) (pp. 97–102). Verona, Italy.
Schwarz, D. (2003a). New Developments in Data-Driven Concatenative Sound Synthesis. In Proceedings of the International Computer Music Conference (ICMC) (pp. 443–446). Singapore.
Schwarz, D. (2003b). The CATERPILLAR System for Data-Driven Concatenative Sound Synthesis. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx) (pp. 135–140). London, UK.
Schwarz, D. (2004). Data-driven concatenative sound synthesis. Thèse de doctorat, Université Paris 6 – Pierre et Marie Curie, Paris.
Schwarz, D. (2005). Recent Advances in Musical Concatenative Sound Synthesis at Ircam [Workshop]. In A. T. Lindsay (Ed.), Audio Mosaicing: Feature-Driven Audio Editing/Synthesis. Barcelona, Spain: International Computer Music Conference (ICMC) workshop. (http://www.icmc2005.org/index.php?selectedPage=120)
Schwarz, D. (2006). Caterpillar. Web page. (http://recherche.ircam.fr/anasyn/schwarz/thesis)
Schwarz, D., & Wright, M. (2000). Extensions and Applications of the SDIF Sound Description Interchange Format. In Proceedings of the International Computer Music Conference (ICMC) (pp. 481–484). Berlin, Germany.
Simon, I., Basu, S., Salesin, D., & Agrawala, M. (2005). Audio analogies: Creating new music from an existing performance by concatenative synthesis. In Proceedings of the International Computer Music Conference (ICMC). Barcelona, Spain.
Sturm, B. L. (2004a). MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB. In Proceedings of the International Computer Music Conference (ICMC). Miami, Florida.
Sturm, B. L. (2004b). MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). Naples, Italy.
Sturm, B. L. (2006). Concatenative sound synthesis and intellectual property: An analysis of the legal issues surrounding the synthesis of novel sounds from copyright-protected work. Journal of New Music Research, 35(1), 23–34. (Special Issue on Audio Mosaicing)
Thom, D., Purnhagen, H., Pfeiffer, S., & the MPEG Audio Subgroup. (1999, December). MPEG Audio FAQ. Web page. Maui. (International Organisation for Standardisation, Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11, N3084, Coding of Moving Pictures and Audio, http://www.tnt.uni-hannover.de/project/mpeg/audio/faq)
Truchet, C., Assayag, G., & Codognet, P. (2001). Visual and adaptive constraint programming in music. In Proceedings of the International Computer Music Conference (ICMC). Havana, Cuba.
Tzanetakis, G. (2003). MUSESCAPE: An interactive content-aware music browser. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). London, UK.
Tzanetakis, G., Essl, G., & Cook, P. (2002). Human Perception and Computer Extraction of Musical Beat Strength. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx) (pp. 257–261). Hamburg, Germany.
Vinet, H. (2003). The representation levels of music information. In Computer Music Modeling and Retrieval (CMMR). Montpellier, France.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260–269.
Wright, M., Chaudhary, A., Freed, A., Khoury, S., & Wessel, D. (1999). Audio Applications of the Sound Description Interchange Format Standard. In AES 107th Convention Preprint. New York, USA.
Xiang, P. (2002). A new scheme for real-time loop music production based on granular similarity and probability control. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx) (pp. 89–92). Hamburg, Germany.
Zils, A., & Pachet, F. (2001). Musical Mosaicing. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). Limerick, Ireland.
Zils, A., & Pachet, F. (2003). Extracting automatically the perceived intensity of music titles. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx). London, UK.
Zils, A., & Pachet, F. (2004). Automatic extraction of music descriptors from acoustic signals using EDS. In Proceedings of the 116th AES Convention. Atlanta, GA, USA.