0% found this document useful (0 votes)

40 views17 pages

Concatenative Sound Synthesis - The Early Years Diemo Schwarz

Uploaded by

thenlundberg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

40 views17 pages

Concatenative Sound Synthesis - The Early Years Diemo Schwarz

Uploaded by

thenlundberg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

Concatenative Sound Synthesis: The Early Years

Diemo Schwarz

To cite this version:

Diemo Schwarz. Concatenative Sound Synthesis: The Early Years. Journal of New Music Research,
Taylor & Francis (Routledge), 2006, 35 (1), pp.3-22. �hal-01161361�

HAL Id: hal-01161361

https://hal.archives-ouvertes.fr/hal-01161361
Submitted on 8 Jun 2015

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est

archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents
entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non,
lished or not. The documents may come from émanant des établissements d’enseignement et de
teaching and research institutions in France or recherche français ou étrangers, des laboratoires
abroad, or from public or private research centers. publics ou privés.
CONCATENATIVE SOUND SYNTHESIS: THE EARLY YEARS

Diemo Schwarz
Ircam – Centre Pompidou
1, place Igor-Stravinsky, 75003 Paris, France
http://www.ircam.fr/anasyn/schwarz http://concatenative.net
[email protected]

ABSTRACT Concatenative sound synthesis (CSS) methods use a

large database of source sounds, segmented into units,
Concatenative sound synthesis is a promising method and a unit selection algorithm that finds the sequence of
of musical sound synthesis with a steady stream of work units that match best the sound or phrase to be synthe-
and publications for over five years now. This article of- sised, called the target. The selection is performed ac-
fers a comparative survey and taxonomy of the many dif- cording to the descriptors of the units, which are charac-
ferent approaches to concatenative synthesis throughout teristics extracted from the source sounds, or higher level
the history of electronic music, starting in the 1950s, even descriptors attributed to them. The selected units can then
if they weren’t known as such at their time, up to the recent be transformed to fully match the target specification, and
surge of contemporary methods. Concatenative sound are concatenated. However, if the database is sufficiently
synthesis methods use a large database of source sounds, large, the probability is high that a matching unit will be
segmented into units, and a unit selection algorithm that found, so the need to apply transformations, which always
finds the units that match best the sound or musical phrase degrade sound quality, is reduced. The units can be non-
to be synthesised, called the target. The selection is per- uniform (heterogeneous), i.e. they can comprise a sound
formed according to the descriptors of the units. These snippet, an instrument note, up to a whole phrase. Most
are characteristics extracted from the source sounds, e.g. often, however, a homogeneous size and type of units is
pitch, or attributed to them, e.g. instrument class. The used, and sometimes a unit is just a short time window of
selected units are then transformed to fully match the tar- the signal used in conjunction with spectral analysis and
get specification, and concatenated. However, if the da- overlap-add synthesis.
tabase is sufficiently large, the probability is high that a Usual sound synthesis methods are based on a model
matching unit will be found, so the need to apply trans- of the sound signal. It is very difficult to build a model
formations is reduced. The most urgent and interesting that would realistically generate all the fine details of the
problems for further work on concatenative synthesis are sound. Concatenative synthesis, on the contrary, by us-
listed concerning segmentation, descriptors, efficiency, le- ing actual recordings, preserves entirely these details. For
gality, data mining, and real time interaction. Finally, the example, very naturally sounding transitions can be syn-
conclusion tries to provide some insight into the current thesized, since unit selection is aware of the context of
and future state of concatenative synthesis research. 1 the database units. In this data-driven approach, instead
of supplying rules constructed by careful thinking as in a
1. INTRODUCTION rule-based approach, the rules are induced from the data
itself. Findings in other domains, e.g. speech recogni-
When technology advances and is easily accessible, cre- tion, corroborate the general superiority of data-driven ap-
ation progresses, too, driven by the new possibilities that proaches. Concatenative synthesis can be more or less
are open to be explored. For musical creation, we have data-driven; more is advantageous because the informa-
seen such surges of creativity throughout history, for ex- tion contained in the many sound examples in the data-
ample with the first easily usable recording devices in the base can be exploited. This will be the main criterion for
1940s, with widespread diffusion of electronic synthesiz- the taxonomy of approaches to concatenative synthesis in
ers from the 1970s, and with the availability of real-time section 3.
interactive digital processing tools at the end of the 1990s.
The next relevant technology advance is already here, Concatenative synthesis sprung up independently in
widespread diffusion just around the corner, and waiting multiple places and is a complex method that needs many
to be exploited for creative use: Large databases of sound, different concepts working together, thus much work on
with a pertinent description of their contents, ready for only one single aspect fails to relate to the whole. In this
content-based retrieval. These databases want to be ex- article, we try to acknowledge this young field of musical
ploited for musical sound synthesis, and concatenative sound synthesis that has been identified as such only five
synthesis looks like the natural candidate to do so. years ago. Many fields and topics of research intervene,
examples of which are given in section 2.
1 This is a preprint of an article whose final and definitive form has
Development has accelerated over the past few years
been published in the Journal of New Music Research vol. 35 num.
1, March 2006 [copyright Taylor & Francis]. Journal of New Music as can be seen in the presentation and comparison of the
Research is available online at: http://journalsonline.tandf.co.uk different approaches and systems in section 3. There are

1
now the first commercial products available (3.2.4, 3.2.5), 1.2. Technical Overview
and, last but not least, ICMC 2004 saw the first musical
pieces using concatenative synthesis (3.4.5). Any concatenative synthesis system performs the tasks il-
lustrated in figure 2, sometimes implicitly. This list of
Section 4 finally gives some of the most urgent prob-
tasks will serve later for our taxonomy of systems in sec-
lems to be tackled for the further development of concate-
tion 3.
native synthesis.

1.1. Applications Source Sounds Audio Score Symbolic Score

The current work on concatenative synthesis focuses on

four main applications:
Analysis
High Level Instrument Synthesis Because concatena-
tive synthesis is aware of the context of the database as
well as the target units, it can synthesise natural sounding
transitions by selecting units from matching contexts. In- Database Target

formation attributed to the source sounds can be exploited

for unit selection, which allows high-level control of syn-
thesis, where the fine details lacking in the target spec-
Unit Selection
ification are filled in by the units in the database. This
hypothesis is illustrated in figure 1. Sound
Segmentation
Raw Descriptors
Score Target Unit Descriptors
Symbolic level Synthesis
Sound File References

Association of
Information Musician Selection Association of
(Knowledge level) Information
(a) General structure.

Recorded Unit
Signal level Sound Descriptors

Sound Lookup
Figure 1. Hypothesis of high level synthesis: The rela-
tions between the score and the produced sound in the
case of performing an instrument, and the synthesis tar- Segmentation Descriptors Transformation
get and the unit descriptors in the case of concatenative
data-driven synthesis are shown on their respective level
of representation of musical information. 3 Temporal Modeling Concatenation

Resynthesis of audio with sounds from the database: A

sound or phrase is taken as the audio score, which is resyn- (b) Analysis component. (c) Synthesis component.
thesized with the sequence of units best matching its de-
scriptors, e.g., with the same pitch, amplitude, and/or tim- Figure 2. Data flow model of a concatenative synthe-
bre characteristics. sis system, rounded boxes representing data, rectangular
This is often referred to as audio mosaicing, since it boxes components, and arrows flow of data.
tries to reconstitute a given larger entity from many small
parts as in the recently popular photo mosaics.
Texture and ambience synthesis is used for installations 1.2.1. Analysis
or film production. It aims at generating soundtracks
from sound libraries or preexisting ambience recordings, The source sound files are segmented into units and anal-
or extending soundscape recordings for an arbitrarily long ysed to express their characteristics with sound descrip-
time, regenerating the character and flow but at the same tors. Segmentation can be by automatic alignment of mu-
time being able to control larger scale parameters. sic with its score for instrument corpora, by blind segmen-
tation according to transients or spectral change, or arbi-
Free synthesis from heterogeneous sound databases of- trary grain segmentation for free and re-synthesis, or can
fers a sound composer efficient control of the result by us- happen on-the-fly.
ing perceptually meaningful descriptors to specify a target The descriptors can be of type categorical (a class
as a multi-dimensional curve in the descriptor space. If the membership), static (a constant text or numerical value
selection happens in real-time, this allows to browse and for a unit), or dynamic (varying over the duration of a
explore a corpus of sounds interactively. unit), and from one of the following classes: category
3 According to Vinet (2003), we can classify digital musical represen- (e.g. instrument), signal, symbolic, score, perceptual,
tations into the physical level, the signal level, the symbolic level, and spectral, harmonic, or segment descriptors (the latter serve
the knowledge level. for bookkeeping). Descriptors are usually analysed by au-

2
tomatic methods, but can also be given as external meta- the last selected unit and all matching candidates for the
data, or be supplied by the user, e.g. for categorical de- following unit is considered. Both local possibilities can
scriptors or for subjective perceptual descriptors. (e.g. a be seen as a simplified form of the path search unit se-
“glassiness” value or “anxiousness” level could be manu- lection algorithm, which still uses the same framework of
ally attributed to units). distance functions, presented in its most general formula-
For the time-varying dynamic descriptors, temporal tion in the following
modeling reduces the evolution of the descriptor value
over the unit to a fixed-size vector of values characterizing 1.2.5. Target Distance
this evolution. Usually, only the mean value is used, but
some systems go further and store range, slope, min, max, The target distance C t corresponds to the perceptual sim-
attack, release, modulation, and spectrum of the descriptor ilarity of the database unit ui to the target unit tτ . It is
curve. given as a sum of p weighted individual descriptor dis-
tance functions Ckt as:
1.2.2. Database X
p
C t (ui , tτ ) = wkt Ckt (ui , tτ ) (1)
Source file references, units, unit descriptors, and the re- k=1
lationships between them are stored in a database. The To favour the selection of units out of the same context
subset of the database that is preselected for one partic- in the database as in the target, the context distance C x
ular synthesis is called the corpus. Often, the database considers a sliding context in a range of r units around the
is implicitly constituted by a collection of files. More current unit with weights wj decreasing with distance j.
rarely, a (relational or other) database management sys-
tem is used, which can run locally or on a server. Internet X
r

sound databases with direct access to sounds and descrip- C x (ui , tτ ) = wjx C t (ui+j , tτ +j ) (2)
tors 4 are beginning to make their appearance, e.g. with j=−r

the freesound project (see section 4.3). Mostly, a Euclidean distance normalised by the standard
deviation is used and r is zero. Some descriptors need
1.2.3. Target specialised distance functions. Symbolic descriptors, e.g.
phoneme class, require a lookup table of distances.
The target is specified as a sequence of target units with
their desired descriptor characteristics. Usually, only a 1.2.6. Concatenation Distance
subset of the available database descriptors is given. The
unspecified descriptors do not influence the selection di- The concatenation distance C c expresses the discontinuity
rectly, but can, however, be used to stipulate continuity via introduced by concatenating the units ui and uj from the
the concatenation distance (see section 1.2.6 below). The database. It is given by a weighted sum of q descriptor
target can either be generated from a symbolic score (ex- concatenation distance functions Ckc :
pressed e.g. in notes or directly in segments plus descrip- X
q
tors), or analysed from an audio score (using the same seg- C c (ui , uj ) = wkc Ckc (ui , uj ) (3)
mentation and analysis methods as for the source sounds). k=1

The distance depends on the unit type: concatenating an

1.2.4. Selection attack unit allows discontinuities in pitch and energy, a
sustain unit does not. Consecutive units in the database
The unit selection algorithm is crucial as it contains all
have a concatenation distance of zero. Thus, if a whole
the “intelligence” of data-driven concatenative synthesis.
phrase matching the target is present in the database, it
Units are selected from the database that match best the
will be selected in its entirety.
given sequence of target units and descriptors according
to a distance function (section 1.2.5) and a concatenation
quality function (section 1.2.6). The selection can be lo- 1.2.7. The Path Search Unit Selection Algorithm
cal (the best match for each target unit is found individu- This unit selection algorithm is based on the standard path
ally), global (the sequence with the least total distance is search algorithm used in speech synthesis, first proposed
found), or iterative (by a search algorithm that approaches by Hunt and Black (1996). It has been adapted to the
the globally optimal selection until a maximum number of specificities of musical sound synthesis for the first time
search steps is reached). by Schwarz (2000) in the Caterpillar system described in
Two different classes of algorithms can be found in the section 3.7.1.
approaches described in this article: path-search unit se- The unit database can be seen as a fully connected state
lection (section 1.2.7), and unit selection based on a con- transition network through which the unit selection algo-
straint solving approach (section 1.2.8). rithm has to find the least costly path that constitutes the
Most often, however, a simple local search for the best target. Using the weighted extended target distance w t C x
matching unit is used without taking care of the context. as the state occupancy cost, and the weighted concatena-
In some real-time approaches, the local context between tion distance w c C c as the transition cost, the optimal path
4 This excludes the many existing web collections of sounds accessed can be efficiently found by a Viterbi algorithm (Viterbi,
by a search term found in the title, e.g. http://sound-effects-library. 1967; Forney, 1973). A detailed formulation of the algo-
com. rithm is given by Schwarz (2004).

3
1.2.8. Unit Selection by Constraint Solving systems in terms of naturalness and intelligibility. Unit se-
lection algorithms attempt to estimate the appropriateness
Applying the formalism of constraint satisfaction to unit of a particular database speech unit using linguistic fea-
selection permits to express musical desiderata additional tures predicted from a given text to be synthesized. The
to the target match in a flexible way, such as to avoid re- units can be of any length (non-uniform unit selection),
peating units, or not to use a certain unit for the selection. from sub-phonemes to whole phrases, and are not limited
It has been first proposed for music program generation by to diphones or triphones.
Pachet, Roy, and Cazaly (2000), see section 2.4, and for Although concatenative sound synthesis is quite sim-
data-driven concatenative musical synthesis by Zils and ilar to concatenative speech synthesis and shares many
Pachet (2001) in the Musical Mosaicing system described concepts and methods, both have different goals. Even
in section 3.7.3. from a very rough comparison between musical and
It is based on the adaptive local search algorithm de- speech synthesis, some profound differences spring to
scribed in detail in (Codognet & Diaz, 2001; Truchet, As- mind, which make the application of concatenative data-
sayag, & Codognet, 2001), which runs iteratively until a driven synthesis techniques from speech to music non-
satisfactory result is achieved or a certain number of it- trivial:
erations is reached. Constraints are here given by an er-
ror function, which allows us to easily express the unit • Speech is a-priori clustered into phonemes. A mu-
selection algorithm as a constraint satisfaction problem sical analogue for this phonemic identity are pitch
(CSP) using the target and concatenation distances be- classes which are applicable for tonal music, but in
tween units. general, no a-priori clustering can be presupposed.
• In speech, the time position of synthesized units is
1.2.9. Synthesis intrinsically given by the required duration of the se-
lected units. In music, precise time-points have to be
The final waveform synthesis is done by concatenation hit when we want to keep the rhythm.
of selected units with a short cross-fade, possibly apply-
ing transformations, for instance altering pitch or loud- • In speech synthesis, intelligibility and naturalness are
ness. Depending on the application, the selected units are of prime interest, and the synthesised speech is of-
placed at the times given by the target (musical or rhyth- ten limited to “normal” informative mode. However,
mic synthesis), or are concatenated with their natural du- musical creation is based on artistic principles, uses
ration (free synthesis, speech or texture synthesis). many modes of expressivity, and needs to experi-
ment. Therefore, creative and interactive use of the
system should be possible by using any database of
2. RELATED TOPICS sounds, any descriptors, and a flexible expression of
the target for the selection.
Concatenative synthesis is at the intersection of many
fields of research, such as music information retrieval
(MIR), database technology, real-time and interactive 2.2. Singing Voice Synthesis
methods, digital signal processing (DSP), sound synthe-
sis models, musical modeling, classification, perception. Concatenative singing voice synthesis occupies an inter-
We could see concatenative synthesis as one of three mediate position between speech and sound synthesis,
variants of content-based retrieval, depending on what whereas the used methods are most often closer to speech
is queried and how it is used. When just one sound is synthesis, 5 with the limitation of fixed inventories specif-
queried, we are in the realm of descriptor- or similarity- ically recorded, such as the Lyricos system (Macon et al.,
based sound selection. Superposing retrieved sounds to 1997a, 1997b), the work by Lomax (1996), and the recent
satisfy a certain outcome is the topic of automatic orches- system developed by Bonada et al. (2001). There is one
tration tools (Hummel, 2005). Finally, sequencing re- notable exception (Meron, 1999), where an automatically
trieved sound snippets is our topic of concatenative syn- constituted large unit database is used.
thesis. See (Rodet, 2002) for an up-to-date overview of current
Other closely related research topics are given in the research in singing voice synthesis, which is out of the
following, that share many of the basic questions and scope of this article.
problems. This recent spread of data-driven singing voice synthe-
sis methods based on unit selection follows their success
in speech synthesis, and lets us anticipate a coming leap
2.1. Speech Synthesis in quality and naturalness of the singing voice. Regarding
the argument of rule-based vs. data-driven singing voice
Research in musical synthesis is heavily influenced by
synthesis, Rodet (2002) notes that:
research in speech synthesis, which can be said to be
roughly 10 years ahead. Concatenative unit selection Clearly, the units intrinsically contain the influ-
speech synthesis from large databases (Hunt & Black, ence of an implicit set of rules applied by the
1996) is used in a great number of Text-to-Speech sys- 5 Concatenative speech synthesis techniques are directly used for
tems for waveform generation (Prudon, 2003). Its intro-
singing voice synthesis in Burcas (http://www.ling.lu.se/persons/
duction resulted in a considerable gain in quality of the Marcusu/music/burcas), Flinger (http://www.cslu.ogi.edu/tts/
synthesized speech over rule-based parametric synthesis flinger), and abused in http://www.silexcreations.com/melissa.

4
singer with all his training, talent and musi- comparison. We can order these methods according to
cal skill. The unit selection and concatenation two main aspects, which combined indicate the level of
method is thus a way to replace a large and “data-drivenness” of a method. They form the axes of the
complicated set of rules by implicit rules from diagram in figure 3, the abscissa indicating the degree of
the best performers, and it is often called a data- structuredness of information obtained by analysis of the
driven concatenative synthesis. source sounds and the metadata, and the ordinate the de-
gree of automation of the selection. Further aspects ex-
pressed in the diagram are the inclusion of concatenation
2.3. Content-Based Processing quality in the selection, and real-time capabilities.
Groups of similar approaches emanate clearly from
Content-based processing is a new paradigm in digital au- this diagram that will be discussed in the following seven
dio processing that is based on symbolic or high-level ma- sub-sections, going from left to right and bottom to top
nipulations of elements of a sound, rather than using sig- through the diagram.
nal processing alone (Amatriain et al., 2003). Lindsay,
Parkes, and Fitzgerald (2003) propose context-sensitive
effects that are more aware of the structure of the sound
3.1. Group 1: Manual Approaches
than current systems by utilising content descriptions such
as those enabled by MPEG-7 (Thom, Purnhagen, Pfeif- These historical approaches to musical composition use
fer, & MPEG Audio Subgroup, 1999; Hunter, 1999). Je- selection by hand with completely subjective manual anal-
han (2004) works on object-segmentation and perception- ysis (Musique Concrète, Plunderphonics) or based on
based description of audio material and then performs given tempo and character analysis (phrase sampling). It
manipulations of the audio in terms of its musical struc- is to note that these approaches are the only ones described
ture. The Song Sampler (Aucouturier, Pachet, & Hanappe, here that aim, besides sequencing, also at layering the se-
2004) is a system which automatically samples parts of lected sounds.
a song, assigns it to the keys of a MIDI-keyboard to be
For musical sound synthesis (leaving aside the existing
played with by a user.
attempts for collage type sonic creations), we’ll start by
shedding a little light on some preceding synthesis tech-
2.4. Music Selection niques, starting from the very beginning when recorded
sound became available for manipulation:
The larger problem of music selection from a catalog has
some related aspects with selection-based sound synthe-
sis. Here, the user wants to select a sequence of songs (a
3.1.1. Musique Concrète and Early Electronic Music
compilation or playlist) according to his taste and a de-
(1948)
sired evolution of high-level features from one song to
the next, e.g. augmenting tempo and percieved energy.
Going very far back, and extending the term far beyond
The problem is well described in (Pachet et al., 2000),
reason, “concatenative” synthesis started with the inven-
and an innovative solution based on constraint satisfac-
tion of the first usable recording devices in the 1940’s: the
tion is proposed, which ultimately inspired the use of con-
phonograph and, from 1950, the magnetic tape recorder
straints for sound synthesis in (Zils & Pachet, 2001), see
(Battier, 2001, 2003). The tape cutting and splicing tech-
section 3.7.3.
niques were advanced to a point that different types of di-
Other music retrieval systems approach the problem- agonal cuts were applied to control the character of the
atic of selection: The Musescape music browser (Tzane- concatenation (from an abrupt transition to a more or less
takis, 2003) works with an intuitive and space-saving smooth cross-fade).
interface by specifying high-level musical descriptors
(tempo, genre, year) on sliders. The system then selects
in real time musical excerpts that match the desired de-
3.1.2. Pierre Schaeffer
scriptors.
The Groupe de Recherche Musicale (GRM) of Pierre
3. TAXONOMY Schaeffer used for the first time recorded segments of
sound to create their pieces of Musique Concrète. In
Approaches to musical sound synthesis that are somehow the seminal work Traité des Objets Musicaux (Schaeffer,
data-driven and concatenative can be found throughout 1966), explained in (Chion, 1995), Schaeffer defines the
history. The earlier uses are usually not identified as such, notion of sound object, which is not so far from what is
but the brief discussion in this section argues that they can here called unit: A sound object is a clearly delimited seg-
be seen as instances of fixed inventory or manual concate- ment in a source recording, and is the basic unit of compo-
native synthesis. I hope to show that all these approaches sition (Schaeffer & Reibel, 1967, 1998). Moreover, Scha-
are very closely related to, or can sometimes even be seen effer strove to base his theory of sound analysis on ob-
as a special case of the general formulation of concatena- jectively, albeit manually, observable characteristics, the
tive synthesis in section 1.2. écoute réduite (narrow listening) (GRAM, 1996), which
Table 1 lists in chronological order all the methods for corresponds to a standardised descriptor set of the percep-
concatenative musical sound synthesis that will be dis- tible qualities of mass, grain, duration, matter, volume,
cussed in the following, proposing several properties for and so on.

5
Group, Name (Author) Year Type Application Inventory Units Segmentation Descriptors Selection Concatenation Real-time

1 Musique Concrete (Schaeffer) 1948 art composition open heterogeneous manual manual manual manual no

2 Digital Sampling 1980 sound high-level fixed notes/any manual manual fixed mapping no yes

1 Phrase Sampling 1990 art composition open phrases manual musical manual no no

2 Granular Synthesis 1990 sound free open homogeneous fixed time manual no yes

1 Plunderphonics (Oswald) 1993 art composition open heterogeneous manual manual manual manual no

7 Caterpillar (Schwarz) 2000 research high-level open heterogeneous alignment high-level global yes no

7 Musaicing (Pachet et al.) 2001 research resynthesis open homogeneous blind low-level constraints no no

4 Soundmosaic (Hazel) 2001 application resynthesis open homogeneous fixed signal match local no no

4 Soundscapes (Hoskinson et al.) 2001 application texture open homogeneous automatic signal match local yes no

3 La Légende des siècles (Pasquet) 2002 sound resynthesis open frames blind spectral match spectral no yes

4 Granuloop (Xiang) 2002 rhythm free open beats beat spectral match local yes yes

5 MoSievius (Lazier and Cook) 2003 research free open homogeneous blind low-level local no yes

5 Musescape (Tzanetakis) 2003 research music selection open homogeneous blind high-level local no yes

6 MPEG-7 Audio Mosaics (Casey and Lindsay) 2003 research resynthesis open homogeneous on-the-fly low-level local no yes

3 Sound Clustering Synthesis (Kobayashi) 2003 research resynthesis open frames fixed low-level spectral no no

4 Directed Soundtrack Synthesis (Cardle et al.) 2003 research texture open heterogeneous automatic low-level constraints yes no

2 Let them sing it for you (Bunger) 2003 web art high-level fixed words manual semantic direct no no

6 Network Auralization for Gnutella (Freeman) 2003 software art high-level open homogeneous blind context-dependent local no yes

3 Input driven resynthesis (Puckette) 2004 research resynthesis open frames fixed low-level local yes yes

4 Matconcat (Sturm) 2004 research resynthesis open homogeneous fixed low-level local no no

2 Synful (Lindemann) 2004 commercial high-level fixed note parts manual high-level lookahead yes yes

6 SoundSpotter (Casey) 2005 research resynthesis open homogeneous on-the-fly morphological local no yes

7 Audio Analogies (Simon et al.) 2005 research high-level open notes/dinotes manual pitch global yes no

7 Ringomatic (Aucouturier et al.) 2005 research high-level open drum bars automatic high-level global yes yes

5 frelia (Momeni and Mandel) 2005 installation free open homogeneous none high-level+abstract local no yes

5 CataRT (Schwarz) 2005 sound free open heterogeneous alignment/blind high-level local no yes

2 Vienna Symphonic Library Instruments 2006 commercial high-level fixed note parts manual high-level lookahead yes yes

6 iTunes Signature Maker (Freeman) 2006 software art high-level open homogeneous blind context-dependent local no yes

Table 1. Comparison of concatenative synthesis work in chronological order

3.1.3. Karlheinz Stockhausen are here expressed as half bar pieces of a score, stochasti-
cally selected from an (implicit) corpus according to pitch
Schaeffer (1966) also relates Karlheinz Stockhausen’s de-
group, dynamics, and density (DiScipio, 2005).
sire to cut a tape into millimeter-sized pieces to recom-
pose them, the notorious Étude des 1000 collants (study
with one thousand pieces) of 1952. The piece (actually 3.1.6. Phrase Sampling (1990’s)
simply called Étude) was composed according to a score
generated by a series for pitch, duration, dynamics, and In commercial, mostly electronic, dance music, a large
timbral content, for a corpus of recordings of hammered part of the musical material comes from specially consti-
piano strings, transposed and cropped to their steady sus- tuted sampling CDs, containing rhythmic loops and short
tained part (Manion, 1992). bass or melodic phrases. These phrases are generally la-
beled and grouped by tempo and sometimes characterised
by mood or atmosphere. As the available CDs, aimed at
3.1.4. John Cage
professional music producers, number in the tens of thou-
John Cage’s Williams Mix (1953) is a composition for 8 sands, each containing hundreds of samples, a large part
magnetic tapes that prescribes a corpus of about 600 of the work still consists in listening to the CDs and se-
recordings in 6 categories (e.g. city sounds, country lecting suitable material that is then placed on a rhythmic
sounds, electronic sounds), and how they are to be ordered grid, effectively constituting the base of a new song by
and spliced together (Cage, 1962). 6 7 concatenation of preexisting musical phrases.

3.1.5. Iannis Xenakis 3.1.7. Plunderphonics (1993)

In Iannis Xenakis’ Analogique A et B (1958/1959) the Plunderphonics (Oswald, 1999) is John Oswald’s artistic
electronic part B is composed of cut and spliced pieces of project of cutting up recorded music. One outstanding ex-
tape, selected according to a stochastic process. The or- ample, Plexure, is made up from thousands of snippets
chestral part A is supposed to be an analogue to B, where from a decade of pop songs, selected and assembled by
these “units” are realised by acoustic instruments. They hand. The sound base was manually labeled with musi-
6 http://www.medienkunstnetz.de/works/williams-mix cal genre and tempo, which were the descriptors used to
7 http://www.johncage.info/workscage/williamsmix.html guide the selection:

6
Selection
Caterpillar
Musical Mosaicing
automatic
Audio Analogies
Ringomatic

MATConcat
MPEG−7 Audio Mosaics
targeted Input driven resynthesis Directed Soundtracks
Soundspotter
Soundmosaicing

La legende des siecles Granuloop

stochastic Soundclustering Soundscapes N.A.G.

CataRT
Let them sing it for you frelia
fixed Granular Synthesis MoSievius
Sound Sampling Musescape
Synful
Phrase Sampling VSL
manual Plunderphonics
Musique Concrete
Analysis
manual frame spectrum segmental high−level
similarity similarity descriptors

Figure 3. Comparison of musical sound synthesis methods according to selection and analysis, use of concatenation
quality (bold), and real-time capabilities (italics)

Plundered are over a thousand pop stars from played with different dynamics, to better capture the tim-
the past 10 years. [...] It starts with rapmillisy- bral variations of the acoustic instrument (Roads, 1996).
lables and progresses through the material ac- Modern software samplers can use several gigabytes of
cording to tempo (which has an interesting re- sound data 8 which makes samplers clearly a data-driven
lationship with genre). fixed-inventory synthesis system, with the sound database
analysed by instrument class, playing style, pitch, and dy-
Oswald (1993) namics, and the selection being reduced to a fixed map-
ping of MIDI-note and velocity to a sample, without pay-
ing attention to the context of the notes played before, i.e.
Cutler (1994) gives an extensive account of Oswald’s
no consideration of concatenation quality.
and related work throughout art history and addresses the
issue of the incapability of copyright laws to handle this
form of musical composition.
3.2.2. Granular Synthesis (1990’s)
3.2. Group 2: Fixed Mapping
Granular synthesis (Roads, 1988, 2001) takes short snip-
Here, the selection is performed by a predetermined map- pets out of a sound file called grains, at an arbitrary rate.
ping from a fixed inventory with no analysis at all (granu- These grains are played back with a possibly changed
lar synthesis), manual analysis (Let them sing it for you), pitch, envelope, and volume. The position and length of
some analysis in class and pitch (digital sampling), or a the snippets are controlled interactively, allowing to scan
more flexible rule-based mapping that takes care of select- through the soundfile, in any speed.
ing the appropriate transitions from the last selected unit Granular synthesis is rudimentarily data-driven, but
to the next in order to obtain a good concatenation (Synful, there is no analysis, the unit size is determined arbitrar-
Vienna Symphonic Library). ily, and the selection is limited to choosing the position in
one single sound file. However, its concept of exploring a
3.2.1. Digital Sampling (1980’s) sound interactively could be combined with a pre-analysis
of the data and thus enriched by a targeted selection and
In the widest reasonable sense of the term, digital sam- the resulting control over the output sound characteristics,
pling synthesisers or samplers for short, which appeared i.e. where to pick the grains that satisfy the wanted sound
at the beginning of the 1980’s, were the first “concatena- characteristics, as described in the free synthesis applica-
tive” sound synthesis devices. A sampler is a device that tion in section 1.1.
can digitally record sounds and play them back, apply-
ing transposition, volume changes, and filters. Usually the 8 For instance, Nemesys, the makers of Gigasampler, 9 pride them-
recorded sound would be a note from an acoustic instru- selves to have sampled every note of a grand piano in every possible
ment, that is then mapped to the sampler’s keyboard. Mul- dynamic, resulting in a 1 GB sound set.
tisampling uses several notes of different pitches, and also 9 http://www.nemesysmusic.com

7
3.2.3. Let them sing it for you (2003) frame-by-frame according to the descriptors energy and
pitch. Each FFT frame is then stored in a dictionary and
A fun web art project and application of not-quite-CSS is
is clustered using the statistics program R 13 . During the
this site 10 (Bünger, 2003), where a text given by a user is
performance, this dictionary of FFT-frames is used with
synthesised by looking up each word in a hand constituted
an inverse FFT and overlap-add to resynthesize sound ac-
monorepresented database of snippets of pop songs where
cording to a target specification of pitch and energy. The
that word is sung. The database is extended by user’s re-
continuity of the resynthesized frames is assured by a
quest for a new word. At the time of writing, it counted
Hidden Markov Model trained on the succession of FFT-
about 2000 units.
frame classes in the recordings.

3.2.4. Synful (2004)

3.3.2. Sound Clustering Synthesis (2003)
The first commercial application using some ideas of
CSS is the the Synful software synthesiser 11 (Lindemann, Kobayashi (2003) resynthesises a target given by a clas-
2001), based on the technology of Reconstructive Phrase sical music piece from a pre-analysed and pre-clustered
Modeling, which aims at the reconstitution of expressive sound base using a vector-based direct spectral match
solo instrument performances from MIDI input. Real in- function. Resynthesis is done FFT-frame-wise, conserv-
strument recordings are segmented into a database of at- ing the association of consecutive frame clusters, i.e. the
tack, sustain, release, and transition units of varying sub- current frame to be synthesised will be similar to the cur-
types. The real-time MIDI input is converted by rules rent target frame, and the transition from one frame to the
to a synthesis target that is then satisfied by selecting the next will be similar to one occuring in the sound data, in
closest units of the appropriate type according to a simple the same context. This leads to a good approximation of
pitch and loudness distance function. Synthesis is heav- the synthesised sound with the target, and a high consis-
ily using transformation of pitch, loudness, and duration, tency in the development of the synthesised sound. Note
favoured by the hybrid waveform, spectral, and sinusoidal that this does not necessarily mean a high spectral conti-
representation of the database units. nuity, since also transitions from a note release to an attack
Synful is more on the side of a rule-based sampler than frame are captured by the pairwise association of database
CSS, with its fixed inventory and limited set of descrip- frame clusters.
tors, but fulfills the application of high-level instrument
synthesis (section 1.1) impressively well. 3.3.3. Input Driven Resynthesis (2004)

3.2.5. Vienna Symphonic Library (2006) This project (Puckette, 2004) starts from a database of
FFT frames from one week of radio recording, analysed
The Vienna Symphonic Library 12 is a huge collection for loudness and 10 spectral bands as descriptors. The
(550 GB) of samples of all classical instruments in all recording then forms a trajectory through the descriptor
playing styles and moods, and including single notes, space mapped to a hypersphere. Phase vocoder overlap–
groups of notes, and transitions. Their so-called perfor- add resynthesis is controlled in real-time by audio input
mance detection algorithms offer the possibility to auto- that is analysed for the same descriptors, and the selection
matically analyse a MIDI performance input and to select algorithm tries to follow a part of the database’s trajectory
samples appropriate for the given transition and context in whenever possible, limiting jumps.
a real-time instrument plugin.
3.4. Group 4: Segmental Similarity
3.3. Group 3: Spectral Frame Similarity
This group’s units are homogeneous segments that are lo-
This subclass of data-driven synthesis uses as units short- cally selected by stochastic methods (Soundscapes and
time signal frames that are matched to the target by a spec- Textures, Granuloop), or matched to a target by segment
tral similarity analysis (Input Driven Resynthesis) or addi- similarity analysis on low-level signal processing descrip-
tionally with a partially stochastic selection (La Légende tors (Soundmosaicing, Directed Soundtracks, MATCon-
des siècles, Sound Clustering). Here, forced by the short cat).
unit length, the selection must take care of the local con-
text by stipulating certain continuity constraints, because
otherwise FFT-frame salad would result. 3.4.1. Soundscapes and Texture Resynthesis (2001)
The Soundscapes 14 project (Hoskinson & Pai, 2001) gen-
3.3.1. La Légende des siècles (2002) erates endless but never repeating soundscapes from a
La Légende des siècles is a theatre piece performed at recording for installations. This means keeping the tex-
the Comédie Française, using real-time transformation on ture of the original sound file, while being able to play it
readings of Victor Hugo. One of these effects, developed for an arbitrarily long time. The segmentation into syn-
by Olivier Pasquet, uses a data-driven synthesis method thesis units is performed by a Wavelet analysis for good
inspired by CSS: Prerecorded audio is analysed off-line join points. A similar aim and approach is described in
(Dubnov, Bar-Joseph, El-Yaniv, Lischinski, & Werman,
10 http://www.sr.se/sing
11 http://www.synful.com 13 http://www.r-project.org
12 http://www.vsl.co.at 14 http://www.cs.ubc.ca/∼reynald/applet/Scramble.html

8
2002). This generative approach means that also the syn- spectral centroid, spectral drop-off, and harmonicity, and
thesis target is generated on the fly, driven by the original selection is a match of descriptor values within a certain
structure of the recording. range of the target. The application offers many choices of
how to handle the case of a non-match (leave a hole, con-
3.4.2. Soundmosaic (2001) tinue the previously selected unit, pick a random unit), and
through the use of a large window function on the grains,
Soundmosaic (Hazel, 2001) constructs an approximation the result sounds pleasingly smooth, which amounts to a
of one sound out of small pieces of varying size from other squaring of the circle for concatenative synthesis. MAT-
sounds (called tiles). For version 1.0 of Soundmosaic, the Concat is the first system used to compose two electroa-
selection of the best source tile uses a direct match of the coustic musical works, premiered at ICMC 2004: Con-
normalised waveform (Manhatten distance). Version 1.1 catenative Variations of a Passage by Mahler, and Dedi-
introduced as distance metric the correlation between nor- cation to George Crumb, American Composer.
malized tiles (the dot product of the vectors over the prod-
uct of their magnitudes). Concatenation quality is not yet 3.5. Group 5: Descriptor analysis with direct selection
included in the selection. in real time
This group uses descriptor analysis with a direct local real-
3.4.3. Granuloop (2002)
time selection of heterogeneous units, without caring for
The data-driven probabilistic drum loop rearranger Gran- concatenation quality. The local target is given according
uloop 15 (Xiang, 2002) is a patch for Pure Data 16 , which to a subset of the same descriptors in real time (MoSievius,
constructs transition probabilities between 16th notes from Musescape (see section 2.4), CataRT, frelia).
a corpus of four drum loops. These transitions then serve
to exchange segments in order to create variation, either 3.5.1. MoSievius (2003)
autonomously or with user interaction.
The transition probabilities (i.e. the concatenation dis- The MoSievius system 18 (Lazier & Cook, 2003) is an en-
tances) are analysed by loudness and spectral similarity couraging first attempt to apply unit selection to real-time
computation, in order to favour continuity. performance-oriented synthesis with direct intuitive con-
trol.
The system is based on sound segments placed in a
3.4.4. Directed Soundtrack Synthesis (2003)
loop: According to user controlled ranges for some de-
Audio and user directed sound synthesis (Cardle, Brooks, scriptors, a segment is played when its descriptor val-
& Robinson, 2003; Cardle, 2004) is aimed at the pro- ues lie within the ranges. The descriptor set used con-
duction of soundtracks in video by replacing existing tains voicing, energy, spectral flux, spectral centroid, in-
soundtracks with sounds from a different audio source in strument class. This method of content-based retrieval is
small chunks similar in sound texture. It introduces user- called Sound Sieve and is similar to the Musescape system
definable constraints in the form of large-scale properties (Tzanetakis, 2003) for music selection (see section 2.4).
of the sound texture, e.g. preferred audio clips that shall
appear at a certain moment. For the unconstrained parts of 3.5.2. CataRT (2005)
the synthesis, a Hidden Markov Model based on the statis-
tics of transition probabilities between spectrally similar The ICMC 2005 workshop on Audio Mosaicing: Feature-
sound segments is left running freely in generative mode, Driven Audio Editing/Synthesis saw the presentation of
much similar to the approach of Hoskinson and Pai (2001) the first prototype of a real-time concatenative synthesiser
described in section 3.4.1. (Schwarz, 2005) called CataRT, loosely based on Cater-
A slighly different approach is taken by Cano et al. pillar. It implements the application of free synthesis as
(2004), where a sound atmosphere library is queried with interactive exploration of sound databases (section 1.1)
a search term. The resulting sounds, plus other semanti- and is in its present state rather close to directed, data-
cally related sounds, are then laid out in time for further driven granular synthesis (section 3.2.2).
editing. Here, we have no segmentation but a layering In CataRT, the units in the chosen corpus are laid out
of the selected sounds according to exclusion rules and in a Euclidean descriptor space, made up of pitch, loud-
heuristics. ness, spectral characteristics, modulation, etc. A (usually
2-dimensional) projection of this space serves as the user
interface that displays the units’ positions and allows to
3.4.5. MATConcat (2004)
move a cursor. The units closest to the cursor’s position
The MATConcat system 17 (Sturm, 2004a, 2004b), is an are selected and played at an arbitrary rate. CataRT is im-
open source application in Matlab to explore concatena- plemented as a Max/MSP 19 patch using the FTM and Ga-
tive synthesis. For the moment, units are homogeneous bor extensions 20 (Schnell, Borghesi, Schwarz, Bevilac-
large windows taken out of the database sounds. The qua, & Müller, 2005; Schnell & Schwarz, 2005). The
descriptors used are pitch, loudness, zero crossing rate, sound and descriptor data can be loaded from SDIF files
(see section 4.3) containing MPEG-7 descriptors, or can
15 http://crca.ucsd.edu/∼pxiang/research.htm
16 18 http://soundlab.cs.princeton.edu/research/mosievius
http://puredata.info
17 http://www.mat.ucsb.edu/∼b.sturm/sand/VLDCMCaR/ 19 http://www.cycling74.com
VLDCMCaR.html 20 http://www.ircam.fr/ftm

9
be calculated on-the-fly. It is then stored in FTM data audio target from an arbitrary-size database by matching
structures in memory. An interface to the Caterpillar da- of strings of 8 “sound lexemes”, which are basic spectro-
tabase, to the freesound repository (see section 4.3), and temporal constituents of sound. Casey reports that about
other sound databases is planned. 60 lexemes are enough to describe, in their various tem-
poral combinations, any sound. By hashing and standard
3.5.3. Frelia (2005) database indexation techniques, highly efficient lookup
is possible, even on very large sound databases. Casey
The interactive installation frelia 21 by Ali Momeni and (2005) claims that one petabyte or 3000 years of audio
Robin Mandel uses sets of uncut sounds from the free- can be searched in less than half a second.
sound repository (see section 4.3) chosen by the textual
description given by the sound creator. The sounds are
laid out on two dimensions for the user to choose accord- 3.7. Group 7: Descriptor analysis with fully automatic
ing to the two principal components of freesound’s de- high-level unit selection
scriptor space of about 170 dimensions calculated by the
This last group uses descriptor analysis with fully auto-
AudioClas 22 library.
matic global high-level unit selection and concatenation
by path-search unit selection (Caterpillar, Audio Analo-
3.6. Group 6: High-level descriptors for targeted or gies) or by real-time constraint solving unit selection (Mu-
stochastic selection sical Mosaicing, Ringomatic).
Here, high-level musical or contextual descriptors are
used for targeted or stochastic local selection (MPEG-7 3.7.1. Caterpillar (2000)
Audio Mosaics, Soundspotter, NAG) without specific han-
dling of concatenation quality. Caterpillar, first proposed in (Schwarz, 2000, 2003a,
2003b) and described in detail in (Schwarz, 2004), per-
3.6.1. MPEG-7 Audio Mosaics (2003) forms data-driven concatenative musical sound synthesis
from large heterogeneous sound databases.
In the introductory tutorial at the DAFx 2003 confer- Units are segmented by automatic alignment of music
ence 23 titled Sound replacement, beat unmixing and with its score (Orio & Schwarz, 2001) for instrument cor-
audio mosaics: Content-based audio processing with pora, and by blind segmentation for free and re-synthesis.
MPEG-7, Michael Casey and Adam Lindsay showed what In the former case, the solo instrument recordings are
they called “creative abuse” of MPEG-7: audio mosaics split into seminote units, which can then be recombined
based on pop songs, calculated by finding the best match- to dinotes, analogous to diphones from speech synthesis.
ing snippets of one Beatles song, to reconstitute another The unit boundaries are thus usually within the sustain
one. The match was calculated from the MPEG-7 low- phase and as such in a stable part of the notes, where con-
level descriptors, but no measure of concatenation quality catenation can take place with the least discontinuity. The
was included in the selection. descriptors are based on the MPEG-7 low-level descrip-
tor set, plus descriptors derived from the score and the
3.6.2. Network Auralization for Gnutella (2003) sound class. The low-level descriptors are condensed to
unit descriptors by modeling of their temporal evolution
Jason Freeman’s N.A.G. software (Freeman, 2003) selects over the unit (mean value, slope, spectrum, etc.) The da-
snippets of music downloaded from the Gnutella p2p net- tabase is implemented using the relational SQL database
work according to the descriptors search term, network management system PostGreSQL for added reliability and
bandwidth, etc. and makes a collage out of them by con- flexibility.
catenation. The unit selection algorithm is of the path-search type
The descriptors used here are partly content-dependent (see section 1.2.7) where a Viterbi algorithm finds the
like the metadata accessed by the search term, and partly globally optimal sequence of database units that best
context-dependent, i.e. changing from one selection to the match the given synthesis target units using two cost func-
next, like the network characteristics. tions: The target cost expresses the similarity of a target
A similar approach is taken in the forthcoming iTunes unit to the database units by weighted Euclidean distance,
Signature Maker 24 , which creates a short sonic signature including a context around the target, and the concatena-
from an iTunes music collection as a collage according to tion cost predicts the quality of the join of two database
descriptors like play count, rating, last play date, which units by join-point continuity of selected descriptors.
are again context-dependent descriptors. Unit corpora of violin sounds, environmental noises,
and speech have been built and used for a variety of sound
3.6.3. SoundSpotter (2004) examples of high-level synthesis and resynthesis of audio.
Casey’s system, implemented in Pure Data on a Post-
GreSQL 25 database, performs real-time resynthesis of an 3.7.2. Talkapillar (2003)
21
22
http://ali.corpuselectronica.com/projects/frelia/frelia.html The derived project Talkapillar (Kärki, 2003) adapted the
http://audioclas.iua.upf.edu
23 http://www.elec.qmul.ac.uk/dafx03 Caterpillar system for text-to-speech synthesis using spe-
24 http://www.jasonfreeman.net/itsm cialised phonetic and phonologic descriptors. One of its
25 http://www.postgresql.org applications is to recreate the voice of a defunct eminent

10
writer to read one of his texts for which no recordings ex- achieved by selecting note units by pitch from a sound
ist. The goal here is different from fully automatic text-to- base constituted by just one solo recording from a prac-
speech synthesis: highest speech quality is needed (con- tice CD. The result sounds very convincing because of the
cerning both sound and expressiveness), manual refine- good quality of the manual segmentation, the globally op-
ment is allowed. timal selection using the Viterbi algorithm as in (Schwarz,
The role of Talkapillar is to give the highest possible 2000), and transformations with a PSOLA algorithm to
automatic support for human decisions and synthesis con- perfectly attain the target pitch and the duration of each
trol, and to select a number of well matching units in a unit.
very large base (obtained by automatic alignment) accord- An interesting point is that the style and the expression
ing to high level linguistic descriptors, which reliably pre- of the song chosen as sound base is clearly perceivable in
dict the low-level acoustic characteristics of the speech the synthesis result.
units from their grammatical and prosodic context, and
emotional and expressive descriptors (Beller, 2004, 2005). 4. REMAINING PROBLEMS
In a further development, this system now allows hy-
brid concatenation between music and speech by mix- This section gives a (necessarily incomplete) selection of
ing speech and music target specifications and databases, the most urgent or interesting problems to work on.
and is applicable to descriptor-driven or context-sensitive
voice effects (Beller, Schwarz, Hueber, & Rodet,
4.1. Segmentation
2005). 26
The segmentation of the source sounds that are to form
3.7.3. Musical Mosaicing (2001) the database is fundamental because it defines the unit
base and thus the whole synthesis output. While phone
Musical Mosaicing, or Musaicing (Zils & Pachet, 2001), or note units are clearly defined and automatically seg-
performs a kind of automated remix of songs. It is aimed mentable, even more so when the corresponding text or
at a sound database of pop music, selecting pre-analysed score is available, other source material is less easy to seg-
homogeneous snippets of songs and reassembling them. ment. For general sound events, automatic segmentation
Its great innovation was to formulate unit selection as into sound objects in the Schaefferian sense is only at its
a constraint solving problem (CSP). The set of descrip- beginning (Hoskinson & Pai, 2001; Cardle et al., 2003;
tors used for the selection is: mean pitch (by zero cross- Jehan, 2004). Also, segmentation of music (used e.g. by
ing rate), loudness, percussivity, timbre (by spectral dis- Zils and Pachet) is harder to do right because of the com-
tribution). Work on adding more descriptors has picked plexity of the material. Finally, the best solution would be
up again with (Zils & Pachet, 2003, 2004) (see also sec- not to have a fixed segmentation to start from, but to be
tion 4.2) and is further advanced in section 3.7.4. able to choose the unit’s segments on the fly. However,
this means that also the unit descriptors’ temporal model-
3.7.4. Ringomatic (2005) ing has to be recalculated accordingly (see section 1.2.1),
which poses hard problems for efficiency, a possible solu-
The work of Musical Mosaicing (section 3.7.3) is adapted
tion for which is the scale tree in (de Cheveigné, 2002).
to real-time interactive high level selection of bars of drum
recordings in the recent Ringomatic system (Aucouturier
& Pachet, 2005). The constraint solving problem (CSP) 4.2. Descriptors
of Zils and Pachet (2001) is reformulated for the real-
Better descriptors are needed for more musical use of con-
time case, where the next bar of drums from a database of
catenative synthesis, and more efficient use for sound syn-
recordings of drum playing has to be selected according
thesis.
to local matching constraints and global continuity con-
Definitely needed is a descriptor for percussiveness of
straints holding on the previously selected bars.
a unit. In (Tzanetakis, Essl, & Cook, 2002), this ques-
The local match is defined by four drum-specific de-
tion is answered for musical excerpts, by calibrating au-
scriptors derived by the EDS system (see section 4.2): per-
tomatically extracted descriptors for the beat strength to
ceptive energy, onset density, presence of drums, presence
perceptive measurements.
of cymbals. Interaction takes place by analysing a MIDI
An interesting approach to the definition of new de-
performance and mapping its energy, density and mean
scriptors is the Extractor Discovery System (EDS) (Zils &
pitch to target drum descriptors. The local constraints
Pachet, 2003, 2004): Here, a genetic algorithm evolves
derived from these are then balanced with the continuity
a formula using standard DSP and mathematical building
constraints to choose between reactivity and autonomy of
blocks whose fitness is then rated using a cross validation
the generated drum accompaniment.
database with data labeled by users. This method was suc-
cessfully applied to the problem of finding an algorithm to
3.7.5. Audio Analogies (2005) calculate the perceived intensity of music.
Expressive instrument synthesis from MIDI (trumpet in
the examples) is the aim of this project by researchers 4.2.1. Musical Descriptors
from the University of Washington and Microsoft Re-
The recent progress in the establishment of a standard
search (Simon, Basu, Salesin, & Agrawala, 2005),
score representation format with MusicXML as the most
26 Examples can be heard on http://www.ircam.fr/anasyn/concat promising candidate, means that we can soon overcome

11
the limitations of MIDI and make use of the entire in- & Wessel, 1999; Schwarz & Wright, 2000). A common
formation from the score, when available and linked to database API would greatly enhance the possibilities of
the units by alignment. This means performing unit se- exchange, but it is probably still too early to define it.
lection on a higher level, exploiting musical context in- Finally, concatenative synthesis from existing song ma-
formation from the score, such as dynamics (crescendo, terial evokes tough legal questions of intellectual prop-
diminuendo), and better describing the units (e.g. we’d erty, sampling and citation practices as evoked by Oswald
know which units are trills, which ones bear an accent, (1999), Cutler (1994), and Sturm (2006) in this issue, and
etc). We can already now derive musical descriptors from summarised by John Oswald in (Cutler, 1994) as follows:
an analysis of the score, such as:
If creativity is a field, copyright is the fence.
Harmony A unit’s chord or chord class, and a measure
of consonance/dissonance can serve as powerful high- A welcome initiative is the freesound project, 28 a col-
level musical descriptors, that are easy to specify as a tar- laboratively built up online database of samples under li-
get, e.g. in MIDI. censing terms less restrictive than the standard copyright,
Rhythm Position in the measure, relative weight or ac- as provided by the Creative Commons 29 family of li-
cent of the note applies mainly to percussive sounds. This censes. Now imagine a transparent net access from a con-
information can partially be derived from the score but catenative synthesis system to this sound database, with
should be complemented by beat tracking that analyses unit descriptors already calculated 30 — an endless sup-
the signal for the properties of the percussion sounds. ply of fresh sound material.
Musical Structure Future descriptors that express the
position or function of a unit within the musical structure 4.4. Data-Driven Optimisation of Unit Selection
of a piece will make accessible for selection the subtle nu- It should be possible to exploit the data in the database to
ances that performers install in the music. This further de- analyse the natural behaviour of an underlying instrument
velops the concept of high-level synthesis (see section 1.1) or sound generation process, which enables us to better
by giving context information about the musical function predict what is natural in synthesis. The following points
of a unit in the piece, such that the selection can choose are developed in more detail in (Schwarz, 2004).
units that fulfill the same function. For speech synthesis,
this technique has had a surprisingly large effect on natu-
4.4.1. Learning Distances from the Data
ralness (Prudon, 2003).
Knowledge about similarity or distance between high-
4.2.2. Evaluation of Descriptor Salience level symbolic descriptors can be obtained from the da-
tabase by an acoustic distance function, and classifica-
Advanced standard descriptor sets like MPEG-7 propose tion. For speech, with the regular and homogeneous phone
tens of descriptors, whose temporal evolution can then be units, this is relatively clear (Macon, Cronk, & Wouters,
characterised in several parameters. This enormous num- 1998), but for music, the acoustic distance is the first prob-
ber of parameters that could be used for selection carries lem: How do we compare different pitches, how units of
of course incredible redundancies. However, as concate- completely different origins and durations?
native synthesis is to be used for musical applications, one
can not know in advance, which descriptors will be useful. 4.4.2. Learning Concatenation from the Data
The aim is to give maximum flexibility to the composer
using the system. Most applications only use a very small A corpus of recordings of instrumental performances or
subset of these descriptors. any other sound generating process can be exploited to
For the more precisely defined applications, a system- learn the concatenation distance function from the data by
atic evaluation of which descriptors are the most useful statistical analysis of pairs of consecutive units in the da-
for synthesis, would be welcome, similar to the auto- tabase. The set of each unit’s descriptors defines a point in
matic choice of descriptors for instrument classification a high-dimensional descriptor space D. The natural con-
in (Livshin, Peeters, & Rodet, 2003). catenation with the consecutive unit defines a vector to
An important open research question is how to map the that unit’s point in D. The question is now if, given any
descriptors we can automatically extract from the sound pair of points in D, we can obtain from this vector field a
data to a perceptive similarity space that allows us to ob- measure to what degree the two associated units concate-
tain distances between units. nate like if they were consecutive.
The problem of modeling a high-dimensional vector
field becomes easier if we restrict the field to clusters of
4.3. Database and Intellectual Property
units in a corpus and calculate the distances between all
The databases used for concatenative synthesis are gen- pairs of cluster centres. This will provide us with a con-
erally rather small, e.g. 1h 30 in Caterpillar. In speech catenation distance matrix between clusters that can be
synthesis, 10h are needed for only one mode of speech! used as a fast lookup table for unit selection. This allows
Standard descriptor formats and APIs are not so far 28 http://iua-freesound.upf.es
away with MPEG-7 and the SDIF Sound Description In- 29 http://creativecommons.org
terchange Format 27 (Wright, Chaudhary, Freed, Khoury, 30 The license type of each unit should be part of the descriptor set,

such that a composer could, e.g. only select units with a license permit-
27 http://www.ircam.fr/sdif ting commercial use, if she wants to sell the composition.

12
us also to use the database for synthesis by modeling the the units are stored as synthesis parameters that are easier
probabilities to go from one cluster of units to the next. to concatenate before resynthesising.
This model would prefer, in synthesis, the typical articu- Going further, hybrid concatenation of units using dif-
lations taking place in the database source, or, when left ferent signal models promises clear advantages: each type
running freely, would generate a sequence of units that of unit (transient, steady state, noise) could be represented
recreates the texture of the source sounds. in the most appropriate way for transformations of pitch
and duration.
4.4.3. Learning Weights from the Data
Finally, there is a large corpus of literature about auto- 5. CONCLUSION
matically obtaining the weights for the distance functions
by search in the weight-space with resynthesis of natu- What we tried to show in this article is that many ap-
ral recordings for speech synthesis (Hunt & Black, 1996; proaches pick up the general idea of data-driven concate-
Macon et al., 1998). A performance optimised method, native synthesis, or part of it, to achieve interesting re-
applied to singing voice synthesis, is described in (Meron, sults, without knowing about the other work in the field.
1999), and an application in Talkapillar is described in To foster exchange of ideas and experience and help the
(Lannes, 2005). fledgling community, a mailinglist [email protected] has
All these data-driven methods depend on an acoustic been created, accessible from (Schwarz, 2006). This site
or perceptual distance measure that can tell us when two also hosts the online version of this survey of research and
sounds “sound the same”. Again, for speech this might musical systems using concatenation which is continually
be relatively clear, but for music, this is itself a subject of updated.
research in musical perception and cognition. Professional and multi-media sound synthesis devices
or software show a natural drive to make use of the ad-
vanced mass storage capacities available today, and of the
4.5. Real-Time Interactive Selection easily available large amount of digital content. We can
Using concatenative synthesis in real-time allows interac- foresee this type of applications hitting a natural limit of
tive browsing of a sound database. The obvious interac- manageability of the amount of data. Only automatic sup-
tion model of a trajectory through the descriptor space port of the data-driven composition process will be able to
presents the problems of its sparse and uneven popula- surpass this limit and make the whole wealth of musical
tion. A more appropriate model might be that of navi- material accessible to the musician.
gation through a graph of clusters of units. However, a Where is concatenative sound synthesis now? The mu-
good mix of generative and user-driven behaviour of the sical applications of CSS are just starting to become con-
system has to be found. 31 vincing (Sturm 2004a, see section 3.4.5), and real-time
Globally optimal unit selection algoritms, that take explorative synthesis is around the corner (Schwarz 2005,
care of concatenation quality such as Viterbi path search see section 3.5.2). For high-level synthesis, we stand at
or constraint satisfaction, are inherently non real-time. the same position speech synthesis stood 10 years ago,
Real-time synthesis could partially make up for this by with yet too small databases, and many open research
allowing transformation of the selected units. This intro- questions. The first commercial application (Lindemann
duces the need for defining a transformation cost that pre- 2001, see section 3.2.4) is comparable to the early fixed-
dicts the loss of sound quality introduced by this. inventory diphone speech synthesisers, but its expressivity
Real-time synthesis also places more stress on the effi- and real-time capabilities are much more advanced than
ciency of the selection algorithm, which can be augmented that.
through clustering of the unit database or use of optimised Data-driven synthesis is now more feasible than ever
multi-dimensional indices (D’haes, Dyck, & Rodet, 2002, with the arrival of large sound database schemes. They
2003; Roy, Aucouturier, Pachet, & Beurivé, 2005). How- finally promise to provide large sound corpora in stan-
ever, also in the non real-time case, faster algorithms allow dardised description. It is this constellation that provided
for more experimentation and for more parameters to be the basis for great advancements in speech research: the
explored. existence of large speech databases allowed corpus-based
linguistics to enhance linguistic knowledge and the per-
formance of speech tools.
4.6. Synthesis
Where will concatenative sound synthesis be in a few
The commonly used simple crossfade concatenation is year’s time? To answer this question, we can sneak a look
enough for the first steps of concatenative sound synthe- at where speech synthesis is today: Text-to-speech synthe-
sis. Eventually, one would have to apply the findings from sis has, after 15 years of research, now become a technol-
speech synthesis about reducing discontinuities (Prudon, ogy mature to the extent that all recent commercial speech
2003) or the recent work by Osaka (2005), or use ad- synthesis systems are concatenative. This success is also
vanced signal models like additive sinusoidal plus noise, due to the database size of up to 10 hours of speech, a size
or PSOLA. This leads to parametric concatenation, where we did not yet reach for musical synthesis.
31 For instance, one particular difficulty is that in real-time synthesis,
The hypothesis of high level symbolic synthesis ex-
plained in section 1.1 proved true for speech synthesis,
the duration of a target unit is not known in advance, so that the system
must be capable of generating a pleasing stream of database units as long when the database is large enough (Prudon, 2003). How-
as there is no user input. ever, this database size is needed to adequately synthesise

13
just one “instrument” — the human voice — in just one Cage, J. (1962). Werkverzeichnis. New York: Edition
“neutral” expression. What we set out for with data-driven Peters.
concatenative sound synthesis is synthesising a multitude Cano, P., Fabig, L., Gouyon, F., Koppenberger, M.,
of instruments and sound processes, each with its idiosyn- Loscos, A., & Barbosa, A. (2004). Semi-
cratic behaviour. Moreover, research is still at its be- automatic ambiance generation. In Proceedings of
ginning on multi-emotion or expressive speech synthesis, 7th international conference on digital audio ef-
something we can’t do without for music. fects. Naples, Italy.
Cardle, M. (2004). Automated Sound Editing (Tech.
Rep.). University of Cambridge, UK: Computer
6. ACKNOWLEDGEMENTS
Laboratory.
Cardle, M., Brooks, S., & Robinson, P. (2003). Au-
Thanks go to Matt Wright, Jean-Philippe Lambert, and
dio and user directed sound synthesis. In Proceed-
Arshia Cont for pointing out interesting sites that (ab)use
ings of the international computer music conference
CSS, to Bob Sturm for the discussions and the beautiful
(icmc). Singapore.
music, to Mikhail Malt for sharing his profound knowl-
edge of the history of electronic music, to all the authors Casey, M. (2005). Acoustic Lexemes for Real-Time
of the research mentioned here for their interesting work Audio Mosaicing [Workshop]. In A. T. Lind-
in the emerging field of concatenative synthesis, and to say (Ed.), Audio Mosaicing: Feature-Driven Audio
Adam Lindsay for bringing people of this field together. Editing/Synthesis. Barcelona, Spain: International
Computer Music Conference (ICMC) workshop.
(http://www.icmc2005.org/index.php?selectedPage=120
Chion, M. (1995). Guide des objets sonores. Paris,
References France: Buchet/Chastel.
Codognet, P., & Diaz, D. (2001). Yet another local search
Amatriain, X., Bonada, J., Loscos, A., Arcos, J., & Ver-
method for constraint solving. In AAAI Symposium.
faille, V. (2003). Content-based transformations.
North Falmouth, Massachusetts.
Journal of New Music Research, 32(1), 95–114.
Aucouturier, J.-J., & Pachet, F. (2005). Ringomatic: A Cutler, C. (1994). Plunderphonia. Musicworks, 60(Fall),
Real-Time Interactive Drummer Using Constraint- 6–19.
Satisfaction and Drum Sound Descriptors. In Pro- de Cheveigné, A. (2002). Scalable metadata for search,
ceedings of the International Symposium on Music sonification and display. In International Confer-
Information Retrieval (ISMIR) (pp. 412–419). Lon- ence on Auditory Display (ICAD 2002) (pp. 279–
don, UK. 284). Kyoto, Japan.
Aucouturier, J.-J., Pachet, F., & Hanappe, P. (2004). From D’haes, W., Dyck, D. van, & Rodet, X. (2002). An effi-
sound sampling to song sampling. In Proceedings cient branch and bound seach algorithm for com-
of the international symposium on music informa- puting k nearest neighbors in a multidimensional
tion retrieval (ISMIR). Barcelona, Spain. vector space. In Ieee advanced concepts for intelli-
Battier, M. (2001). Laboratori. In J.-J. Nattiez (Ed.), Enci- gent vision systems (acivs). Gent, Belgium.
clopedia della musica (Vol. I, pp. 404–419). Milan: D’haes, W., Dyck, D. van, & Rodet, X. (2003). PCA-
Einaudi. based branch and bound search algorithms for com-
Battier, M. (2003). Laboratoires. In J.-J. Nattiez (Ed.), puting K nearest neighbors. Pattern Recognition
Musiques. Une encyclopédie pour le XXIe siècle Letters, 24(9–10), 1437-1451.
(Vols. I, Musiques du XXe siècle, pp. 558–574). DiScipio, A. (2005). Formalization and Intuition in
Paris: Actes Sud, Cité de la musique. Analogique A et B. In Proceedings of the inter-
Beller, G. (2004). Un synthétiseur vocal par sélection national symposium iannis xenakis (pp. 95–108).
d’unités. Rapport de stage DEA ATIAM, Ircam – Athens, Greece.
Centre Pompidou, Paris, France. Dubnov, S., Bar-Joseph, Z., El-Yaniv, R., Lischinski,
Beller, G. (2005). La musicalité de la voix parlée. D., & Werman, M. (2002). Synthesis of au-
Maitrise de musique, Université Paris 8, Paris, dio sound textures by learning and resampling of
France. wavelet trees. IEEE Computer Graphics and Appli-
Beller, G., Schwarz, D., Hueber, T., & Rodet, X. (2005). cations, 22(4), 38–48.
A hybrid concatenative synthesis system on the Forney. (1973). The Viterbi algorithm. Proceedings of the
intersection of music and speech. In Journées IEEE, 61, 268–278.
d’Informatique Musicale (JIM) (pp. 41–45). MSH Freeman, J. (2003). Network Auralization for Gnutella.
Paris Nord, St. Denis, France. Web page. (http://turbulence.org/Works/freeman http://
Bonada, J., Celma, O., Loscos, A., Ortola, J., Serra, X., www.jasonfreeman.net/Catalog/electronic/nag.html
Yoshioka, Y., Kayama, H., Hisaminato, Y., & Ken- GRAM (Ed.). (1996). Dictionnaire des arts médiatiques.
mochi, H. (2001). Singing voice synthesis com- Groupe de recherche en arts médiatiques, Univer-
bining excitation plus resonance and sinusoidal plus sité du Québec à Montréal. (http://www.comm.uqam.
residual models. In Proceedings of the international ca/∼GRAM
computer music conference (icmc). Havana, Cuba. Hazel, S. (2001). Soundmosaic. web page.
Bünger, E. (2003). Let Them Sing It For You. Web page. (http://thalassocracy.org/soundmosaic)
(http://www.sr.se/sing http://www.erikbunger.com/ Hoskinson, R., & Pai, D. (2001). Manipulation and

14
resynthesis with natural grains. In Proceedings of Macon, M. W., Cronk, A. E., & Wouters, J. (1998).
the International Computer Music Conference Generalization and discrimination in
(ICMC). Havana, Cuba. tree-structured unit selection. In Proceedings of the
Hummel, T. A. (2005). Simulation of Human Voice 3rd esca/cocosda international speech synthesis
Timbre by Orchestration of Acoustic Music workshop. Jenolan Caves, Australia.
Instruments. In Proceedings of the International Manion, M. (1992). From Tape Loops to Midi: Karlheinz
Computer Music Conference (ICMC). Barcelona, Stockhausen’s Forty Years of Electronic Music.
Spain: ICMA. Online article.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a (http://www.stockhausen.org/tape loops.html)
concatenative speech synthesis system using a Meron, Y. (1999). High quality singing synthesis using
large speech database. In Proceedings of the IEEE the selection-based synthesis scheme. Unpublished
international conference on acoustics, speech, and doctoral dissertation, University of Tokyo.
signal processing (ICASSP) (pp. 373–376). Orio, N., & Schwarz, D. (2001). Alignment of
Atlanta, GA. Monophonic and Polyphonic Music to a Score. In
Hunter, J. (1999). MPEG7 Behind the Scenes. D-Lib Proceedings of the International Computer Music
Magazine, 5(9). (http://www.dlib.org/) Conference (ICMC). Havana, Cuba.
Jehan, T. (2004). Event-Synchronous Music Osaka, N. (2005). Concatenation and stretch/squeeze of
Analysis/Synthesis. In Proceedings of the musical instrumental sound using sound morphing.
COST-G6 Conference on Digital Audio Effects In Proceedings of the International Computer
(DAFx). Naples, Italy. Music Conference (ICMC). Barcelona, Spain.
Kärki, O. (2003). Système talkapillar. Unpublished Oswald, J. (1993). Plexure. CD. (http:
master’s thesis, EFREI, Ircam – Centre Pompidou, //plunderphonics.com/xhtml/xdiscography.html\#plexure
Paris, France. (Rapport de stage) Oswald, J. (1999). Plunderphonics. web page.
Kobayashi, R. (2003). Sound clustering synthesis using (http://www.plunderphonics.com
spectral data. In Proceedings of the International Pachet, F., Roy, P., & Cazaly, D. (2000). A combinatorial
Computer Music Conference (ICMC). Singapore. approach to content-based music selection. IEEE
Lannes, Y. (2005). Synthèse de la parole par MultiMedia, 7(1), 44–51.
concaténation d’unités (Mastère Recherche Signal, Prudon, R. (2003). A selection/concatenation TTS
Image, Acoustique, Optimisation). Université synthesis system. Unpublished doctoral
Toulouse III Paul Sabatier. dissertation, LIMSI, Université Paris XI, Orsay,
Lazier, A., & Cook, P. (2003). MOSIEVIUS: Feature France.
driven interactive audio mosaicing. In Proceedings Puckette, M. (2004). Low-Dimensional Parameter
of the COST-G6 Conference on Digital Audio Mapping Using Spectral Envelopes. In
Effects (DAFx) (pp. 312–317). London, UK. Proceedings of the International Computer Music
Lindemann, E. (2001, November). Musical synthesizer Conference (ICMC) (pp. 406–408). Miami,
capable of expressive phrasing [United States Florida.
Patent]. US Patent 6,316,710. Roads, C. (1988). Introduction to granular synthesis.
Lindsay, A. T., Parkes, A. P., & Fitzgerald, R. A. (2003). Computer Music Journal, 12(2), 11–13.
Description-driven context-sensitive effects. In Roads, C. (1996). The computer music tutorial. In (pp.
Proceedings of the COST-G6 Conference on 117–124). Cambridge, Massachusetts: MIT Press.
Digital Audio Effects (DAFx). London, UK. Roads, C. (2001). Microsound. Cambridge, Mass: MIT
Livshin, A., Peeters, G., & Rodet, X. (2003). Studies and Press.
improvements in automatic classification of Rodet, X. (2002). Synthesis and processing of the
musical sound samples. In Proceedings of the singing voice. In Proceedings of the 1st ieee
international computer music conference (icmc). benelux workshop on model based processing and
Singapore. coding of audio (mpca). Leuven, Belgium.
Lomax, K. (1996). The development of a singing Roy, P., Aucouturier, J.-J., Pachet, F., & Beurivé, A.
synthesiser. In 3èmes journees d’informatique (2005). Exploiting the Tradeoff Between Precision
musicale (jim). Ile de Tatihou, Lower Normandy, and CPU-time to Speed up Nearest Neighbor
France. Search. In Proceedings of the international
Macon, M., Jensen-Link, L., Oliverio, J., Clements, symposium on music information retrieval
M. A., & George, E. B. (1997a). A singing voice (ISMIR). London, UK.
synthesis system based on sinusoidal modeling. In Schaeffer, P. (1966). Traité des objets musicaux (1st
Proceedings of the IEEE international conference ed.). Paris, France: Éditions du Seuil.
on acoustics, speech, and signal processing Schaeffer, P., & Reibel, G. (1967). Solfège de l’objet
(ICASSP) (pp. 435–438). Munich, Germany. sonore. Paris, France: ORTF. (Reedited as
Macon, M., Jensen-Link, L., Oliverio, J., Clements, (Schaeffer & Reibel, 1998))
M. A., & George, E. B. (1997b). Schaeffer, P., & Reibel, G. (1998). Solfège de l’objet
Concatenation-Based MIDI-to-Singing Voice sonore. Paris, France: INA Publications–GRM.
Synthesis. In 103rd meeting of the audio (Reedition on 3 CDs with booklet of (Schaeffer &
engineering society. New York. Reibel, 1967))

15
Schnell, N., Borghesi, R., Schwarz, D., Bevilacqua, F., & Thom, D., Purnhagen, H., Pfeiffer, S., & MPEG
Müller, R. (2005). FTM—Complex Data Audio Subgroup, the. (1999, December). MPEG
Structures for Max. In Proceedings of the Audio FAQ. web page. Maui. (International
International Computer Music Conference Organisation for Standardisation, Organisation
(ICMC). Barcelona, Spain. Internationale de Normalisation, ISO/IEC
Schnell, N., & Schwarz, D. (2005). Gabor, JTC1/SC29/WG11, N3084, Coding of Moving
Multi-Representation Real-Time Pictures and Audio,
Analysis/Synthesis. In Proceedings of the http://www.tnt.uni-hannover.de/project/mpeg/audio/faq)
COST-G6 Conference on Digital Audio Effects Truchet, C., Assayag, G., & Codognet, P. (2001). Visual
(DAFx). Madrid, Spain. and adaptive constraint programming in music. In
Schwarz, D. (2000). A System for Data-Driven Proceedings of the International Computer Music
Concatenative Sound Synthesis. In Proceedings of Conference (ICMC). Havana, Cuba.
the COST-G6 Conference on Digital Audio Effects Tzanetakis, G. (2003). MUSESCAPE: An interactive
(DAFx) (pp. 97–102). Verona, Italy. content-aware music browser. In Proceedings of
Schwarz, D. (2003a). New Developments in Data-Driven the COST-G6 Conference on Digital Audio Effects
Concatenative Sound Synthesis. In Proceedings of (DAFx). London, UK.
the International Computer Music Conference Tzanetakis, G., Essl, G., & Cook, P. (2002). Human
(ICMC) (pp. 443–446). Singapore. Perception and Computer Extraction of Musical
Schwarz, D. (2003b). The C ATERPILLAR System for Beat Strength. In Proceedings of the COST-G6
Data-Driven Concatenative Sound Synthesis. In Conference on Digital Audio Effects (DAFx) (pp.
Proceedings of the COST-G6 Conference on 257–261). Hamburg, Germany.
Digital Audio Effects (DAFx) (pp. 135–140). Vinet, H. (2003). The representation levels of music
London, UK. information. In Computer music modeling and
Schwarz, D. (2004). Data-driven concatenative sound retrieval (CMMR). Montpellier, France.
synthesis. Thèse de doctorat, Université Paris 6 – Viterbi, A. J. (1967). Error bounds for convolutional
Pierre et Marie Curie, Paris. codes and an asymptotically optimal decoding
algorithm. IEEE Transactions on Information
Schwarz, D. (2005). Recent Advances in Musical
Theory, IT-13, 260–269.
Concatenative Sound Synthesis at Ircam
Wright, M., Chaudhary, A., Freed, A., Khoury, S., &
[Workshop]. In A. T. Lindsay (Ed.), Audio
Wessel, D. (1999). Audio Applications of the
Mosaicing: Feature-Driven Audio
Sound Description Interchange Format Standard.
Editing/Synthesis. Barcelona, Spain: International
In AES 107th convention preprint. New York, USA.
Computer Music Conference (ICMC) workshop.
Xiang, P. (2002). A new scheme for real-time loop music
(http://www.icmc2005.org/index.php?selectedPage=120
production based on granular similarity and
Schwarz, D. (2006). Caterpillar. Web page.
probability control. In Digital audio effects (dafx)
(http://recherche.ircam.fr/anasyn/schwarz/thesis
(pp. 89–92). Hamburg, Germany.
Schwarz, D., & Wright, M. (2000). Extensions and Zils, A., & Pachet, F. (2001). Musical Mosaicing. In
Applications of the SDIF Sound Description Proceedings of the COST-G6 Conference on
Interchange Format. In Proceedings of the Digital Audio Effects (DAFx). Limerick, Ireland.
International Computer Music Conference (ICMC) Zils, A., & Pachet, F. (2003). Extracting automatically
(pp. 481–484). Berlin, Germany. () the perceived intensity of music titles. In
Simon, I., Basu, S., Salesin, D., & Agrawala, M. (2005). Proceedings of the COST-G6 Conference on
Audio analogies: Creating new music from an Digital Audio Effects (DAFx). London, UK.
existing performance by concatenative synthesis. Zils, A., & Pachet, F. (2004). Automatic extraction of
In Proceedings of the International Computer music descriptors from acoustic signals using EDS.
Music Conference (ICMC). Barcelona, Spain. In Proceedings of the 116th AES Convention.
Sturm, B. L. (2004a). MATConcat: An Application for Atlanta, GA, USA.
Exploring Concatenative Sound Synthesis Using
MATLAB. In Proceedings of the International
Computer Music Conference (ICMC). Miami,
Florida.
Sturm, B. L. (2004b). MATConcat: An Application for
Exploring Concatenative Sound Synthesis Using
MATLAB. In Proceedings of the COST-G6
Conference on Digital Audio Effects (DAFx).
Naples, Italy.
Sturm, B. L. (2006). Concatenative sound synthesis and
intellectual property: An analysis of the legal
issues surrounding the synthesis of novel sounds
from copyright-protected work. Journal of New
Music Research, 35(1), 23–34. (Special Issue on
Audio Mosaicing)

Brigance Record Book I ASSIGN A STUDENT A GRADE LEVEL PDF
100% (2)
Brigance Record Book I ASSIGN A STUDENT A GRADE LEVEL PDF
41 pages
Current Research in Concatenative Sound Synthesis
No ratings yet
Current Research in Concatenative Sound Synthesis
5 pages
Data-Driven Concatenative Sound Synthesis
No ratings yet
Data-Driven Concatenative Sound Synthesis
293 pages
A System For Data-Driven Concatenative Sound Synthesis
No ratings yet
A System For Data-Driven Concatenative Sound Synthesis
6 pages
Real-Time Corpus-Based Concatenative Synthesis With Catart: Diemo Schwarz, GR Egory Beller, Bruno Verbrugghe, Sam Britton
No ratings yet
Real-Time Corpus-Based Concatenative Synthesis With Catart: Diemo Schwarz, GR Egory Beller, Bruno Verbrugghe, Sam Britton
7 pages
Principles and Applications of Interactive Corpus-Based Concatenative Synthesis
No ratings yet
Principles and Applications of Interactive Corpus-Based Concatenative Synthesis
10 pages
Using 5 Ms Segments in Concatenative Speech Synthesis: Toshio Hirai Seiichi Tenpaku
No ratings yet
Using 5 Ms Segments in Concatenative Speech Synthesis: Toshio Hirai Seiichi Tenpaku
6 pages
Introsounds 2 2
No ratings yet
Introsounds 2 2
33 pages
From Sound Modeling To Analysis-Synthesis of Sounds: 1 Scope of The Article
No ratings yet
From Sound Modeling To Analysis-Synthesis of Sounds: 1 Scope of The Article
9 pages
The MIT Press Is Collaborating With JSTOR To Digitize, Preserve and Extend Access To Computer Music Journal
No ratings yet
The MIT Press Is Collaborating With JSTOR To Digitize, Preserve and Extend Access To Computer Music Journal
18 pages
Music Database Retrieval Based On Spectral Similarity.
No ratings yet
Music Database Retrieval Based On Spectral Similarity.
9 pages
Computer Audition: Fundamentals and Applications
From Everand
Computer Audition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Bhaashika: Telugu Tts System: Dr. K.V.N.Sunitha
No ratings yet
Bhaashika: Telugu Tts System: Dr. K.V.N.Sunitha
9 pages
Digital Sound Synthesis by Physical Modelling
No ratings yet
Digital Sound Synthesis by Physical Modelling
12 pages
Sound Synthesis Methods
100% (1)
Sound Synthesis Methods
8 pages
TOPOMSS Ch1
No ratings yet
TOPOMSS Ch1
23 pages
2013-1-00615-IF Bab1001
No ratings yet
2013-1-00615-IF Bab1001
4 pages
Chapter 2
No ratings yet
Chapter 2
29 pages
A Corpus-Based Concatenative Speech Synthesis System For Turkish
No ratings yet
A Corpus-Based Concatenative Speech Synthesis System For Turkish
15 pages
Klapuri - 2006 - Introduction To Music Transcription
100% (1)
Klapuri - 2006 - Introduction To Music Transcription
28 pages
First Project Guidelines: Digital Music Synthesis: Instructor: Yolanda Campos
No ratings yet
First Project Guidelines: Digital Music Synthesis: Instructor: Yolanda Campos
2 pages
Synthesis by Parametric Design: Abstract. As A Composer I Started My Research Aiming at Developing A
No ratings yet
Synthesis by Parametric Design: Abstract. As A Composer I Started My Research Aiming at Developing A
7 pages
Untitled
No ratings yet
Untitled
269 pages
A Music Data Mining and Retrieval Primer: Dan Berger Dberger@cs - Ucr.edu May 27, 2003
No ratings yet
A Music Data Mining and Retrieval Primer: Dan Berger Dberger@cs - Ucr.edu May 27, 2003
6 pages
Ar Pluck Kamp, KCPS, Icps, Ifn, Imeth (Iparm1, Iparm2)
No ratings yet
Ar Pluck Kamp, KCPS, Icps, Ifn, Imeth (Iparm1, Iparm2)
11 pages
A Comparative Study in Automatic Recognition of Broadcast Audio
No ratings yet
A Comparative Study in Automatic Recognition of Broadcast Audio
4 pages
Festival Hindi Pxc3893287
No ratings yet
Festival Hindi Pxc3893287
6 pages
05 SymbolicControl
No ratings yet
05 SymbolicControl
5 pages
Project Report: Call For Your Symphony
No ratings yet
Project Report: Call For Your Symphony
15 pages
Sonic Art - Recipes and Reasonings
100% (1)
Sonic Art - Recipes and Reasonings
130 pages
Workshop Notes-Physical Modeling Sound Synthesis Using Finite Difference Schemes
No ratings yet
Workshop Notes-Physical Modeling Sound Synthesis Using Finite Difference Schemes
21 pages
Sound Synthesis and Manipulation: Sources
100% (1)
Sound Synthesis and Manipulation: Sources
10 pages
Musical Applications of Digital Synthesi
No ratings yet
Musical Applications of Digital Synthesi
284 pages
Qiaozhan Gao Report ReportFinal
No ratings yet
Qiaozhan Gao Report ReportFinal
6 pages
Introduction
No ratings yet
Introduction
3 pages
1 What Is A Digital Audio Effect?: Daniel Arfib
No ratings yet
1 What Is A Digital Audio Effect?: Daniel Arfib
4 pages
V3S3-6 JamalPriceReport
No ratings yet
V3S3-6 JamalPriceReport
10 pages
RISSET An Introductory Catalogue of Computer Sy
No ratings yet
RISSET An Introductory Catalogue of Computer Sy
120 pages
Issues in The Development of The Next Ge
No ratings yet
Issues in The Development of The Next Ge
4 pages
Searching Musical Audio Datasets by A Batch of Multi-Variant Tracks
No ratings yet
Searching Musical Audio Datasets by A Batch of Multi-Variant Tracks
7 pages
AI-Based Vocal Judging Application
No ratings yet
AI-Based Vocal Judging Application
8 pages
Music Score Alignment and Computer Accompaniment: Roger B. Dannenberg and Christopher Raphael
100% (1)
Music Score Alignment and Computer Accompaniment: Roger B. Dannenberg and Christopher Raphael
8 pages
Sound Classification
No ratings yet
Sound Classification
5 pages
13 Notes - Concatenative Synthesis 1
No ratings yet
13 Notes - Concatenative Synthesis 1
28 pages
Davies Hugh - A History of Sampling
No ratings yet
Davies Hugh - A History of Sampling
10 pages
A Comprehensive Survey On Deep Music Generation
No ratings yet
A Comprehensive Survey On Deep Music Generation
96 pages
VAGGIONE ArticulatingMicrotime 1996
100% (1)
VAGGIONE ArticulatingMicrotime 1996
7 pages
Ji Yang Luo Survey Symbolic Music Generation
No ratings yet
Ji Yang Luo Survey Symbolic Music Generation
39 pages
Compositional Approach To The Control of Computer-Based Sound Synthesis
No ratings yet
Compositional Approach To The Control of Computer-Based Sound Synthesis
17 pages
Audio Synthesis
100% (1)
Audio Synthesis
98 pages
Analysis and Synthesis of Speech Using Matlab
No ratings yet
Analysis and Synthesis of Speech Using Matlab
10 pages
Chap 5 Audio Dbms
No ratings yet
Chap 5 Audio Dbms
16 pages
Making Music With MATLAB
No ratings yet
Making Music With MATLAB
4 pages
GUIProject
No ratings yet
GUIProject
9 pages
Modelling of Audio Effects For Vocal and Music Synthesis in Real Time
No ratings yet
Modelling of Audio Effects For Vocal and Music Synthesis in Real Time
4 pages
Automatic Transcription of Simple Polyphonic Music:: Robust Front End Processing
100% (1)
Automatic Transcription of Simple Polyphonic Music:: Robust Front End Processing
11 pages
Reconocimiento de Voz - MATLAB
No ratings yet
Reconocimiento de Voz - MATLAB
5 pages
Sound Processing in Openmusic PDF
100% (1)
Sound Processing in Openmusic PDF
6 pages
Text-to-Speech Systems and Algorithms: Definitive Reference for Developers and Engineers
From Everand
Text-to-Speech Systems and Algorithms: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Materials and Acoustics Handbook
From Everand
Materials and Acoustics Handbook
Michel Bruneau
No ratings yet
Time-Frequency Domain for Segmentation and Classification of Non-stationary Signals: The Stockwell Transform Applied on Bio-signals and Electric Signals
From Everand
Time-Frequency Domain for Segmentation and Classification of Non-stationary Signals: The Stockwell Transform Applied on Bio-signals and Electric Signals
Ali Moukadem
No ratings yet
Jie Yang, Zhong Xin, Quan (Sophia) He, Kenneth Corscadden, Haibo Niu T
No ratings yet
Jie Yang, Zhong Xin, Quan (Sophia) He, Kenneth Corscadden, Haibo Niu T
21 pages
DSS41A05
No ratings yet
DSS41A05
7 pages
Diagnostic Trouble Code Chart: Hint: When The Air Conditioning System Function Properly, DTC B1400/00 Is Output
No ratings yet
Diagnostic Trouble Code Chart: Hint: When The Air Conditioning System Function Properly, DTC B1400/00 Is Output
3 pages
Silicon NPN Epitaxial General Purpose Amplifier: Features
No ratings yet
Silicon NPN Epitaxial General Purpose Amplifier: Features
6 pages
COMP1001 LAB5.ipynb
No ratings yet
COMP1001 LAB5.ipynb
4 pages
3-4 Gas Laws Int - Reader - Study - Guide PDF
No ratings yet
3-4 Gas Laws Int - Reader - Study - Guide PDF
6 pages
MDP400 Open-Frame and U-Chassis :: ROAL Living Energy
No ratings yet
MDP400 Open-Frame and U-Chassis :: ROAL Living Energy
2 pages
Numerical Simulation of Silicon Heterojunction Solar Cells Featuring Metal Oxides As Carrier-Selective Contacts
No ratings yet
Numerical Simulation of Silicon Heterojunction Solar Cells Featuring Metal Oxides As Carrier-Selective Contacts
9 pages
Merih Instruction BUS Door
No ratings yet
Merih Instruction BUS Door
6 pages
Nagpur City
0% (1)
Nagpur City
2 pages
Exercise 1: Descriptive Statistics Practice Exercises
No ratings yet
Exercise 1: Descriptive Statistics Practice Exercises
6 pages
History of Computing
No ratings yet
History of Computing
3 pages
Topic: The Determinants of Profitability of Bangladeshi Commercial Banks
No ratings yet
Topic: The Determinants of Profitability of Bangladeshi Commercial Banks
50 pages
Chapter 3 Stacks
No ratings yet
Chapter 3 Stacks
28 pages
Problem Solving Assignment 1 PDF
No ratings yet
Problem Solving Assignment 1 PDF
5 pages
Iub Port Available Bandwidth Utilizing Ratio PDF
No ratings yet
Iub Port Available Bandwidth Utilizing Ratio PDF
2 pages
Dynamic Behavior of Materials, Volume 1: Leslie E. Lamberson Editor
No ratings yet
Dynamic Behavior of Materials, Volume 1: Leslie E. Lamberson Editor
218 pages
Physics 1.11 Pressure
No ratings yet
Physics 1.11 Pressure
67 pages
Shape Vocabulary Word Mat
No ratings yet
Shape Vocabulary Word Mat
4 pages
Regression
No ratings yet
Regression
49 pages
Reteach Multiples - Worksheet Given by The Teacher
No ratings yet
Reteach Multiples - Worksheet Given by The Teacher
1 page
Earth Science Reviewer
No ratings yet
Earth Science Reviewer
13 pages
Easy View 2 Manual en
No ratings yet
Easy View 2 Manual en
25 pages
Ma3151 Matrices and Calculus Two Mark Questions 2
No ratings yet
Ma3151 Matrices and Calculus Two Mark Questions 2
14 pages
3500 C175 C280 AftertreatmentCEM T4 Marine A and I
100% (1)
3500 C175 C280 AftertreatmentCEM T4 Marine A and I
121 pages
Journal of Alloys and Compounds
No ratings yet
Journal of Alloys and Compounds
8 pages
IEEEXplore Published Paper
No ratings yet
IEEEXplore Published Paper
8 pages
Summary of Lectures 02 Vector Spaces
No ratings yet
Summary of Lectures 02 Vector Spaces
3 pages
Computer Project
No ratings yet
Computer Project
59 pages

Concatenative Sound Synthesis - The Early Years Diemo Schwarz

Uploaded by

Concatenative Sound Synthesis - The Early Years Diemo Schwarz

Uploaded by

Concatenative Sound Synthesis: The Early Years

To cite this version:

HAL Id: hal-01161361

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est

ABSTRACT Concatenative sound synthesis (CSS) methods use a

1.1. Applications Source Sounds Audio Score Symbolic Score

The current work on concatenative synthesis focuses on

formation attributed to the source sounds can be exploited

Resynthesis of audio with sounds from the database: A

The distance depends on the unit type: concatenating an

Table 1. Comparison of concatenative synthesis work in chronological order

3.1.5. Iannis Xenakis 3.1.7. Plunderphonics (1993)

La legende des siecles Granuloop

3.2.4. Synful (2004)

You might also like