Sound modeling: signal-based approaches
2.1 Introduction
The sound produced by acoustic musical instruments is caused by the physical vibration of a resonating
structure. This vibration can be described by signals that correspond to the time-evolution of
the acoustic pressure associated with it. The fact that the sound can be characterized by a set of signals
suggests quite naturally that some computing equipment could be successfully employed for generating
sounds, for either the imitation of acoustic instruments or the creation of new sounds with novel timbral
properties.
A wide variety of sound synthesis algorithms is currently available either commercially or in the
literature. Each one of them exhibits some peculiar characteristics that could make it preferable to others,
depending on goals and needs. Technological progress has made enormous steps forward in the past few
years as far as the computational power that can be made available at low cost is concerned. At the
same time, sound synthesis methods have become more and more computationally efficient, and user
interfaces have become increasingly friendly. As a consequence, musicians can nowadays access a wide
collection of synthesis techniques (all available at low cost in their full functionality), and concentrate on
their timbral properties.
Each sound synthesis algorithm can be thought of as a digital model for the sound itself. Though
this observation may seem quite obvious, its meaning for sound synthesis is not so straightforward. As a
matter of fact, modeling sounds is much more than just generating them, as a digital model can be used
for representing and generating a whole class of sounds, depending on the choice of control parameters.
The idea of associating a class of sounds to a digital sound model is in complete accordance with the way
we tend to classify natural musical instruments according to their sound generation mechanism. For ex-
ample, strings and woodwinds are normally seen as timbral classes of acoustic instruments characterized
by their sound generation mechanism. It should be quite clear that the degree of compactness of a class
of sounds is determined, on one hand, by the sensitivity of the digital model to parameter variations and,
on the other hand, by the amount of control that is necessary for obtaining a certain desired sound. As an
extreme example we may think of a situation in which a musician is required to generate sounds sample
by sample, while the task of the computing equipment is just that of playing the samples. In this case
the control signal is represented by the sound itself, therefore the class of sounds that can be produced
is unlimited but the instrument is impossible for a musician to control and play. An opposite extreme
situation is that in which the synthesis technique is actually the model of an acoustic musical instrument.
In this case the class of sounds that can be produced is much more limited (it is characteristic of the
mechanism that is being modeled by the algorithm), but the degree of difficulty involved in generating
the control parameters is quite modest, as it corresponds to physical parameters that have an intuitive
counterpart in the experience of the musician.
An interesting conclusion that could already be drawn in light of what has been stated above is that the
compactness of the class of sounds associated to a sound synthesis algorithm is somehow in contrast
with the “playability” of the algorithm itself. One should remember that “playability” is of crucial
importance for the success of a specific sound synthesis algorithm: in order for a sound synthesis
algorithm to be suitable for musical purposes, the musician needs intuitive and easy access to its
control parameters during both the sound design process and the performance. Such requirements often
represent the reason why a certain synthesis technique is preferred to others.
Some considerations on control parameters are now in order. Varying the control parameters of a
sound synthesis algorithm can serve several purposes, the first one of which is certainly that of exploring a
sound space, i.e. producing all the different sounds that belong to the class characterized by the algorithm
itself. This very traditional way of using control parameters would nowadays be largely insufficient
by itself. As a matter of fact, with the progress in the computational devices that are currently being
employed for musical purposes, the musician’s needs have turned more and more toward problems of
timbral dynamics. For example, timbral differences between soft (dark) and loud (brilliant) tones are
usually obtained through appropriate parameter control. Timbral expression parameters tend to operate
at a note-level time-scale. As such, they can be suitably treated as signals characterized by a rather slow
rate.
Another reason for the importance of time-variations in the algorithm parameters is that the musician
needs to control the musical expression while playing. For example, staccato, legato, vibrato etc. need to
be obtained through parameter control. Such parameter variations operate at a phrase-level time-scale.
Because of that, they can be suitably treated as sequences of symbolic events characterized by a very slow
rate.
In conclusion, control parameters are signals characterized by their own time-scales. Control signals
for timbral dynamics are best described as discrete-time signals with a slow sampling rate, while controls
for musical expression are best described by streams of asynchronous symbolic events. As a consequence,
the generation of control signals can once again be seen as a problem of signal synthesis.
Finding a mathematical model that faithfully imitates a real sound is an extremely difficult task. If an
existing reference sound is available, however, it is always possible to reproduce it through recording.
Such a method, though simple in its principle, is widely adopted by digital sampling instruments or
An overall transposition of a factor L/M can be obtained by reading the table with the signal

ϕ[n] = ⌊(L/M) n⌋,   n = 0, . . . , M − 1.   (2.2)
As an example, if 3M = 2L the sound will be transposed upwards by a perfect fifth (i.e. by a ratio
of 3/2). However, ϕ need not be a ramp that rises linearly in time. If ϕ is still a monotonic signal
but changes its slope over time, the transposition will become time-dependent.
Although the pitch shifting outlined above is simple and straightforward to implement, it has to
be noted that substantial pitch variations are generally not very satisfactory as a temporal waveform
compression or expansion results in unnatural timbral modifications, which is exactly what happens when
the playing speed is changed in a tape recorder. Satisfactory quality and timbral similarity between the
original tone and the transposed one can be obtained only if small pitch variations (e.g. a few semitones)
are performed. As an example, sampling a piano with a reasonable quality along the entire instrumental
extension requires that many notes are stored (e.g. three for each octave). In this way, notes that have not
been sampled can be obtained from the available ones through transposition of at most two semitones.
M-2.1
Import a .wav file of a single instrument tone. Scale it (compress and expand) to different extents and listen to the
new sounds. Up to what scaling ratio are the results acceptable?
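A minimal sketch of this experiment (the file name and the transposition ratio are illustrative): resampling
implements the table read-out of Eq. (2.2), transposing the tone by L/M while scaling its duration by M/L.

[x, Fs] = audioread('tone.wav');  % hypothetical input file
x = x(:,1);                       % keep one channel
L = 3; M = 2;                     % L/M = 3/2: up a perfect fifth, duration scaled by 2/3
y = resample(x, M, L);            % interpolate/decimate, then play at the original rate
soundsc(y, Fs);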
Often it is desired to vary the sound also as a function of other parameters, the most important being
the intensity. To this purpose it is not sufficient to change the sound amplitude by a simple multiplication;
it is necessary to modify the timbre of the sound. In general louder sounds are characterized by a sharper
attack and by a brighter spectrum. One technique is to use a unique sound prototype (e.g.
a tone played fortissimo) and then obtain the other intensities by simple spectral processing, such as low-pass
filtering. A different and more effective solution is to use a set of different sound prototypes, recorded
at different intensities (e.g. tones played fortissimo, mezzo forte, pianissimo), and then obtain the
other dynamic values by interpolation and further processing.
This technique is thus characterized by high computational efficiency and high imitation quality, but
by low flexibility for sounds not initially included in the repertoire or not easily obtainable with simple
transformations. There is a trade-off of memory size with sound fidelity.
In order to employ the memory efficiently, often the sustain part of the tone is not entirely stored but
only a part (or a few significant parts), and in the synthesis this part is repeated (looping). Naturally the
repeated part should not be too short, to avoid a static character of the resulting sound. For example, to
lengthen the duration of a note, first the attack is reproduced without modification, then the sustain part
is cyclically repeated, with possible cross interpolation among the different selected parts, and finally the
stored release part of the sound is reproduced. Notice that if we want to avoid artefacts in cycling, particular
care should be devoted to choosing the beginning and ending points of the loop. Normally an
integer number of periods is used for looping, starting at a null value, to avoid amplitude or phase
discontinuities, which are very annoying. To this purpose it may be necessary to
process the recorded samples by slightly changing the phases of the partials.
M-2.2
Import a .wav file of a single instrument tone. Find the stationary (sustain) part, isolate a section, and perform the
looping operation. Listen to the results, and listen to the artifacts when the looped section does not start/end at
zero-crossings.
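A minimal looping sketch along these lines (file name and loop points are illustrative; ideally the loop
should span an integer number of periods and start/end at zero-crossings):

[x, Fs] = audioread('tone.wav');                % hypothetical input file
x = x(:,1)';
lstart = round(0.5*Fs); lend = round(0.7*Fs);   % illustrative loop points in the sustain
loop = x(lstart:lend);
y = [x(1:lstart-1), repmat(loop, 1, 5), x(lend+1:end)];  % attack, repeated sustain, release
soundsc(y, Fs);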
If we want a less static sustain, it is possible to identify some different and significant sound
segments, and during the synthesis interpolate (cross-fade) among subsequent segments. In this case the
temporal evolution of the tone can be more faithfully reproduced.
where wa [n] is an analysis window and Sa is the analysis hop-size, i.e. the time-lag (in samples) between
one analysis frame and the following one. If the window wa is N samples long, then the block size, i.e.
the length of each frame xm[n], will be N. In order for the signal segments to actually overlap, the
inequality Sa ≤ N must hold. When Sa = N the segments are exactly juxtaposed with no
overlap.
Given the above signal segmentation, time-domain overlap-add (OLA) methods construct an output
signal y[n] = Σ_m ym[n], where the segments ym are modified versions of the input segments xm. As
an example, they can be obtained by modifying Sa in Eq. (2.3), or by repeating/removing some of the
input segments xm. In the absence of modifications, this procedure reduces to the identity (y[n] ≡ x[n])
if the overlapped and added analysis windows wa sum to unity:
y[n] = Σ_m xm[n] = Σ_m x[n] wa[n − mSa] = x[n]   ⇔   Awa[n] ≜ Σ_m wa[n − mSa] ≡ 1.   (2.4)
If this condition does not hold, then the function Awa acts on the reconstructed signal as a periodic
amplitude modulation envelope, with period Sa . Depending on the application and on the window length
N , this amplitude modulation can introduce audible artifacts. This kind of frame rate distortion can be
seen in the frequency domain as a series of sidebands with spacing Fs /Sa in a spectrogram of the output
signal. In fact, one may prove that the condition Awa ≡ 1 is equivalent to the condition W (ejωk ) = 0 at
all harmonics of the frame rate Fs /Sa . Figure 2.1 illustrates an example of signal reconstruction using a
triangular window, which satisfies the condition of Eq. (2.4).
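As a quick check of the sum-to-unity condition of Eq. (2.4), the following sketch overlaps and adds
shifted triangular windows with Sa = N/2; away from the signal edges the sum Awa[n] is constant.

N = 64; Sa = N/2;                        % window length and analysis hop-size
wa = bartlett(N+1)'; wa = wa(1:N);       % "periodic" triangular window
A = zeros(1, 10*Sa + N);
for m = 0:9
    A(m*Sa+(1:N)) = A(m*Sa+(1:N)) + wa;  % overlap-add the shifted windows
end
plot(A);                                 % equal to 1 except near the borders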
A widely studied effect is time-stretching, i.e. contraction or expansion of the duration of an audio
signal. Time-stretching algorithms are useful in a number of applications: think about wavetable synthesis,
post-synchronization of audio and video, speech technology at large, and so on. A time-stretching algorithm
should ideally shorten or lengthen a sound file composed of Ntot samples to a new desired length
N′tot = αNtot, where α is the stretching factor. Note that a mere resampling of the sound signal does not
provide the desired result, since it has the side-effect of transposing the sound: in this context resampling
is the digital equivalent of playing the tape at a different speed.
What one really wants is a scaling of the perceived timing attributes without affecting the perceived
frequency attributes. More precisely, we want the time-scaled version of the audio signal to be perceived
as the same sequence of acoustic events as the original signal, only distributed on a compressed/expanded
time pattern. As an example, a time-stretching algorithm applied to a speech signal should change the
speaking rate without altering the pitch.
Time-domain OLA techniques are one possible approach to time-stretching effects. The basic OLA
algorithm described above can be adapted to this problem by defining an analysis hop size Sa and a
synthesis hop size Ss = αSa, scaled by the stretching factor, that will be applied to the output. An input
signal x[n] is then segmented into frames xk [n], each taken every Sa samples. The output signal y[n]
is produced by reassembling the same frames xk [n], each added to the preceding one every Ss samples.
However this repositioning of the input segments with respect to each other destroys the original phase
relationships, and constructs the output signal by interpolating between these misaligned segments. This
causes pitch period discontinuities and distortions that can produce heavily audible artifacts in the output
signal.
of the output signal y[n] and the incoming analysis frame xk+1 [n]. More precisely, xk+1 [n] is pasted
to the output y[n] starting from sample kSs + mk , where mk is a small discrete time-lag that optimizes
the alignment between y and xk (see Fig. 2.2). Note that mk can in general be positive or negative,
although for clarity we have used a positive mk in Fig. 2.2.
When the optimal time-lag mk is found, a linear crossfade is used within the overlap window, in
order to obtain a gradual transition from the last portion of y to the first portion of xk . Then the last
samples of xk are pasted into y. If we assume that the overlap window at the kth SOLA step is Lk
samples long, then the algorithmic step computes the new frame of the output y as
y[kSs + j] = { (1 − v[j]) y[kSs + j] + v[j] xk[j]   for mk ≤ j ≤ Lk
             { xk[j]                                for Lk + 1 ≤ j ≤ N        (2.6)
where v[j] is a linear smoothing function that realizes the crossfade between the two segments. The effect
of Eq. (2.6) is a local replication or suppression of waveform periods (depending on the value of α), that
eventually results in an output signal y[n] with approximately the same spectral properties as the input
x[n], and an altered temporal evolution.
At least three techniques are commonly used in order to find the optimal value for the discrete time
lag mk at each algorithmic step k:
2. Computation of the maximum cross-correlation rk(m) in a neighborhood of the sample kSs. Let
M be the width of such a neighborhood, and let yMk[i] = y[kSs + i] and xMk[i] = xk+1[i], for
i = 0, . . . , M − 1. Then the cross-correlation rk(m) is computed as

rk[m] ≜ Σ_{i=0}^{M−m−1} yMk[i] · xMk[i + m],   m = −M + 1, . . . , M − 1.   (2.7)
3. Computation of the maximum normalized cross-correlation, where every value taken from the
cross-correlation signal is normalized by dividing it by the product of the frame energies.
The latter technique is conceptually preferable, but the second one is often used for efficiency reasons.
M-2.3
Write a function that realizes the time-stretching SOLA algorithm through segment cross-correlation.
M-2.3 Solution
function y = sola_timestretch(x,N,Sa,alpha,L)
%N: block length; Sa: analysis hop-size; alpha: stretch factor; L: overlap int.
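% A possible body for the function (a sketch under the assumptions below,
% not necessarily the original solution): Ss = alpha*Sa < N, so that
% consecutive frames always overlap, and only lags 0..L-1 are searched.
x  = x(:).';                          % work on a row vector
Ss = round(alpha*Sa);                 % synthesis hop-size
nf = floor((length(x)-N)/Sa);         % number of analysis frames after the first
y  = x(1:N);                          % first frame copied unchanged
for k = 1:nf
    xk = x(k*Sa+(1:N));               % incoming analysis frame x_k
    best_m = 0; best_r = -inf;
    for m = 0:L-1                     % search the lag maximizing the correlation
        ov = min(length(y) - (k*Ss+m), N);    % overlap length for this lag
        if ov <= 0, break; end
        r = y(k*Ss+m+(1:ov)) * xk(1:ov)';     % (unnormalized) cross-correlation
        if r > best_r, best_r = r; best_m = m; end
    end
    start = k*Ss + best_m;
    ov = min(max(length(y) - start, 0), N);   % samples of y overlapping with xk
    v  = linspace(0, 1, ov);                  % linear crossfade ramp
    y(start+(1:ov)) = (1-v).*y(start+(1:ov)) + v.*xk(1:ov);
    y = [y, xk(ov+1:end)];                    % append the non-overlapping tail
end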
ñ1 = n1,
ñk+1 = ñk + P(ni(k,α)),   k ≥ 1,   (2.8)
where i(k, α) is the index of the input pitch marks that minimizes the distance | αni − ñk |. Intuitively,
this means that at the time instant ñk the time-stretched signal must have the same pitch possessed by
the original signal at time ni , with ñk ∼ αni .
Once the set {ñk }k has been determined in this way, for every k the segment xi(k,α) [n] is overlapped
and added at the point ñk . The algorithm is visualized in Fig. 2.3: note that with a stretching factor α > 1
(time expansion, as in Fig. 2.3) some segments will be repeated, or equivalently the function i(k, α) will
take identical values for some consecutive values of k. Similarly, when α < 1 (time compression) some
segments are discarded in the resynthesis.
The main advantage of the PSOLA algorithm with respect to SOLA is that it allows for a better align-
ment of segments, by exploiting information about pitch instead of using a simple cross-correlation. On
the other hand it has a higher complexity, especially because of the pitch estimation procedure. Moreover,
noticeable artifacts still appear for very small or large stretching factors. One problem is that when α is
very large, identical segments will be repeated several times, thus giving an unnatural character to the
sound. A second more general problem is that the OLA algorithms examined here stretch an input signal
uniformly, including possible transients, which instead should be preserved.
M-2.4
Write a function that realizes the time-stretching PSOLA synthesis algorithm, given a vector of input pitch marks
ni .
M-2.4 Solution
function y=psola_timestretch(x,nis,alpha)
%nis: pitch marks; alpha: stretch factor;
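% A possible body for the function (a sketch, not necessarily the original
% solution): nis holds strictly increasing pitch-mark sample indices, and each
% grain is a two-period, Hann-windowed segment centred on an input pitch mark.
x  = x(:).';
P  = diff(nis);                              % local pitch periods P(n_i)
y  = zeros(1, ceil(alpha*length(x)) + 2*max(P));
nt = nis(1);                                 % first output pitch mark: n~_1 = n_1
while nt < alpha*nis(end-1)
    [~, i] = min(abs(alpha*nis(1:end-1) - nt));   % index i(k,alpha) of Eq. (2.8)
    Li  = P(i);
    seg = x(max(nis(i)-Li,1) : min(nis(i)+Li,length(x)));
    seg = seg .* hann(length(seg)).';        % two-period Hann-windowed grain
    idx = nt - Li + (0:length(seg)-1);       % overlap-add the grain, centred on n~_k
    ok  = (idx >= 1);
    y(idx(ok)) = y(idx(ok)) + seg(ok);
    nt  = nt + Li;                           % n~_{k+1} = n~_k + P(n_i(k,alpha))
end
y = y(1:ceil(alpha*length(x)));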
Clearly we have omitted the most difficult part of the PSOLA approach, i.e. an algorithm that deter-
mines the input pitch marks (the vector nis in our code).
2.2.3.1 Gaborets
The initial idea of granular synthesis can be traced back to the work of the Hungarian physicist Dennis
Gabor, which was aimed at pinpointing the physical and mathematical ideas needed to understand what
a time-frequency spectrum is. He considered sound as a sum of elementary Gaussian functions that have
been shifted in time and frequency. Gabor considered these elementary functions as acoustic quanta,
the basic constituents of a sound. These works have been rich in implications and have been the starting
point for studying time-frequency representations and wavelet theory.
The usual Gabor expansion on a rectangular time-frequency lattice of a signal x(t) can be expressed
as a linear combination of “grains” gmk (t), that are shifted and modulated versions of a synthesis window
w(t)
x(t) = Σ_m Σ_k amk gmk(t),   with   gmk(t) = w(t − mαT) e^{jkβΩt}.   (2.9)
Other names for these grains, or acoustic quanta, are gaborets, or Gabor functions, or Gabor atoms.
In Gabor's formulation w(t) is a gaussian window of the form

w(t) = (1/(σ√(2π))) e^{−t²/(2σ²)},   (2.10)
where σ is the standard deviation of the gaussian. An important property of this function is that it is
possibly the only smooth, nonzero function, known in closed form, that is transformed to itself in the
Fourier domain:
F{w}(ω) = e^{−ω²/(2(1/σ)²)}.   (2.11)
Therefore the grain gmk has a gaussian shape both in time and in frequency: it is a gaussian bell in the
time-frequency domain. As a particular case of the uncertainty principle, the width of the time-domain
gaussian is inversely proportional to that of the frequency-domain gaussian, as shown by Eq. (2.11).
Historically, the use of granular synthesis in musical applications has been classified into two main
approaches. The first one is based on the use of sampled sounds to construct grains, while the second
one is based on the use of abstract, entirely synthetic grains.
2 The Greek composer Iannis Xenakis is generally acknowledged to have provided the first formulation of granular synthesis
in his compositions Analogique A et B (1958-59).
Figure 2.4: Representation of granular synthesis where grains derived from different sources are ran-
domly mixed.
The term sound granulation is typically used to identify the first of the two above mentioned approaches.
Complex waveforms, extracted from real sounds or described by spectra, are organized in succession
with partial overlap in time. In this way, it is possible both to reproduce real sounds accurately and to
modify them in their dynamic characteristics. The original sound x[n] used to create the grains may have
been previously recorded, or may be processed in real-time. In this respect granular synthesis can be
viewed as an OLA technique in which segments xm [n] of a sound signal x[n] represent the grains, and
are processed both in time and frequency before being reassembled.
The time instant nm indicates the point where the windowing starts and the segment is extracted; the
length Lm of the window determines the amount of signal extracted; the shape of the window wm [n]
should provide an amplitude envelope that ensures fade-in and fade-out at the border of the grain and
affects the frequency content of the grain.
The length Lm is a critical parameter. Long grains tend to maintain the timbre identity of the portion
of the input signal, while short ones acquire a pulse-like quality and frequency information is no longer
perceivable. When the grain is long, the window has a flat top and is used only to fade in and
fade out the borders of the segment. Typical grain lengths are in the range 5 − 100 ms.
The choice and organization of the grain frequencies is also very important; moreover, in granular synthesis
the proper timing organization of the grains is essential to avoid artifacts produced by discontinuities.
This often makes the control quite difficult.
Notice that it is possible to extract grains from different sounds, to create hybrid textures, e.g. evolv-
ing from one texture to another. A schematic visualization of this approach is given in Fig. 2.4.
The second of the two above mentioned approaches is based on grains that consist of synthetic waveforms
whose amplitude envelope is a short Gaussian function. Most typically, frequency-modulated gaussian
functions are used in order to localize the energy both in the frequency and in the time domain, in the line of Gabor's formulation.
Figure 2.5: Example of a synthetic grain waveform, with frequency ωk = 2π · 500 rad/s and standard
deviation σ = 0.2.
where the index m refers to the grain position in time, while the index k refers to its position in frequency.
The window wmk [n] is a gaussian window shifted by nm samples, and ωk (in rad/s) is the frequency of
the grain. A plot of a grain constructed in this way is provided in Fig. 2.5.
M-2.5
Write a function that computes a gaussian grain given its length, frequency, and standard deviation.
M-2.5 Solution
function yg = grain(L,f,sigma)
global Fs;
yg=(gausswin(L,1/sigma))'.*cos(2*pi*f*(1:L)/Fs);
The sound synthesized from these grains is again constructed according to Gabor formulation: it
is an assemblage of grains scattered on the frequency-time plane in the form of “clouds”. The general
synthesis expression is given by

s[n] = Σ_m Σ_k amk · gmk[n],   (2.14)
where amk is the amplitude coefficient of the corresponding grain. Every grain contributes to the total
sound energy around the point (nm , ωk ) of the time frequency plane.
In order to implement the above synthesis equation, the simplest approach amounts to defining a constant
density of grains in time, so that there is a constant time-step between the generation of a grain and
the following one. This approach is often termed synchronous granular synthesis.
M-2.6
Write a script that realizes a synchronous granular synthesis scheme.
M-2.6 Solution
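% Illustrative settings (assumed values, not from the original text) for the
% control variables used by the script below:
global Fs; Fs = 22050;        % sample rate, also used by the grain function
slength   = 2;                % sound length (s)
gdens     = 50;               % grain density (grains per second)
totgrains = slength*gdens;    % total number of grains
glength   = 0.03;             % mean grain length (s)
lrange    = 0.02;             % grain length randomization range (s)
gfreq     = 800;              % mean grain frequency (Hz)
frange    = 400;              % frequency randomization range (Hz)
gsigma    = 0.3;              % mean grain standard deviation
sigrange  = 0.2;              % st.dev. randomization range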
y=zeros(1,(slength+2*glength)*Fs);
for m=0:totgrains-1
f = gfreq +(rand(1)-.5)*frange; % grain frequency
t = round(Fs*(glength+(rand(1)-.5)*lrange)); % grain length (samples)
d = gsigma +(rand(1)-.5)*sigrange; %grain st.dev.
yg = grain(t,f,d); %construct grain
frame = round(m*Fs/gdens) +(1:t); %frame to be written (shifts with m)
y(frame) = y(frame) + yg; %add current grain to sound
end
We have realized the synthesis in the time domain. Equivalent results could be obtained by working
in the frequency domain.
However the most used type of granular synthesis is asynchronous granular synthesis, where grains
are irregularly distributed in the time-frequency plane, e.g. they are scattered onto a mask that delim-
its specific portions of the time-frequency-amplitude space. This results in time-varying “clouds” of
micro-sounds, or sonic textures, that can simulate even natural noisy sounds in which general statistical
properties are more important than the exact sound evolution. Typical examples include the sound of
numerous small objects (e.g., rice or sand) falling onto a resonating surface (e.g., a metal plate), or rain
sounds composed by the accumulation of a large amount of water droplet micro-sounds, or even scratch-
ing/cracking sounds made by the accumulation of thousands of complex micro-sounds not necessarily
deterministic. In general we can expect these types of sounds to occur in the real world when they are
the result of multiple realizations of the same event or the same phenomenon.
Musical composers have tried to evaluate the effects of different control parameters from an aesthetic
point of view. Grain duration affects the sonic texture: short duration (few samples) produces a noisy,
particulate disintegration effect; medium duration (tens of ms) produces fluttering, warbling, gurgling;
longer durations (hundreds of ms) produce aperiodic tremolo, jittering spatial position. When the grains
are distributed on a large frequency region, the texture has a massive character, while when the band
is quite narrow, the result is a pitched sound. Sparse densities (e.g. 5 grains per second) give rise to a
pointillistic texture.
In recent years, synthesis methods conceptually similar to granular techniques have received new impetus
due to the availability of ever larger databases of sounds. Various definitions are used in the literature,
including concatenative synthesis, audio mosaicing, and musaicing (neologism from music and
mosaicing). All works in this direction share the general idea that a target sound can be approximated by
samples taken from a pre-existing corpus of sounds.
Besides granular synthesis, the closest relative of this idea is in the field of speech synthesis:
concatenative speech synthesis started to develop in the early sixties and is currently the most used
synthesis approach in text-to-speech systems. In short, written text is automatically segmented into
elementary phonetic units that are subsequently synthesized using a large database of sampled speech
sounds. These components are pieced together to obtain a synthesis of the text.
The central point is how to properly describe both the target sound and the sounds in the database, in
order to define measures of similarity. We need high-level sound descriptors. Sounds in the database
can be segmented into units (e.g. an instrument sound can be subdivided into attack, sustain, and release
portions), and some kind of unit selection algorithm has to be realized that finds the sequence of units that
match best the target sound or phrase to be synthesized. The selection will be performed according to the
descriptors of the units. The selected units can then be transformed to fully match the target specification,
and are concatenated.
1. Any desired modification is applied to the spectra (e.g. multiplying by a filter frequency response
function), and modified frame spectra Ym(e^{jωk}) are obtained.
2. Windowed segments ym[n] of the modified signal y[n] are obtained by computing the inverse DFT
(IDFT) of the frames Ym.
3. The output is reconstructed by overlapping-adding the windowed segments: y[n] = Σ_m ym[n].
In their most general formulation OLA methods utilize a synthesis window ws that can in general be
different from the analysis window wa . In this case the second step of the procedure outlined above is
modified as follows:
2. Windowed segments ym [n] of the modified signal y[n] are obtained by (a) computing the inverse
DFT (IDFT) of the frames Ym , (b) dividing by the analysis window wa (assuming that it is non-
zero for all samples), and (c) multiplying by the synthesis window.
This approach provides greater flexibility than the previous one: the analysis window wa can be
chosen only on the basis of its time-frequency resolution properties, but need not satisfy the “sum-
to-unity” condition Awa ≡ 1. On the other hand, the synthesis window ws is only used to cross-fade
between signal segments, therefore one should only ensure that Aws ≡ 1. We will see in section 2.4.1 an
application of this technique to frequency-domain implementation of additive synthesis.
Many digital sound effects can be obtained by employing OLA techniques. As an example, a robotization
effect can be obtained by putting zero phase values on every FFT before reconstruction: the effect
applies a fixed pitch onto a sound and moreover, as it forces the sound to be periodic, many erratic and
random variations are converted into “robotic” sounds. Other effects are obtained by imposing a random
phase on a time-frequency representation, with different behaviors depending on the block length N : if
N is large (e.g. N = 2048 with Fs = 44.1 kHz), the magnitude will represent the behavior of the par-
tials quite well and changes in phase will produce an uncertainty over the frequency; if N is small (e.g.
N = 64 with Fs = 44.1 kHz), the spectral envelope will be enhanced and this will lead to a whispering
effect.
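A minimal sketch of the robotization effect along these lines (parameters are illustrative; with a Hann
window and 75% overlap the squared windows sum to a constant, so the overlap-add reconstruction is
artifact-free up to a fixed gain):

[x, Fs] = audioread('voice.wav');      % hypothetical input file
x = x(:,1).';
N = 1024; S = N/4;                     % block length and hop-size (75% overlap)
w = hann(N, 'periodic').';
y = zeros(1, length(x)+N);
for m = 0:floor((length(x)-N)/S)
    X = fft(x(m*S+(1:N)).*w);
    yframe = real(ifft(abs(X)));       % zero-phase spectrum: keep magnitude only
    y(m*S+(1:N)) = y(m*S+(1:N)) + yframe.*w;   % overlap-add with synthesis window
end
soundsc(y, Fs);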
The term deterministic signal means in general any signal that is not noise. The class of deterministic
signals that we consider here is restricted to sums of sinusoidal components with varying amplitude and
frequency. For pitched sounds in particular, spectral energy is mainly concentrated at a few discrete
(slowly time-varying) frequencies fk . These frequency lines correspond to different sinusoidal compo-
nents called partials. The amplitude ak of each partial is not constant and its time-variation is critical
for timbre characterization. If there is a good degree of correlation among the frequency and amplitude
variations of different partials, these are perceived as fused to give a unique sound with its timbre identity.
Amplitude and frequency variations can be noticed e.g. in sound attacks: some partials that are
relevant in the attack can disappear in the stationary part. In general, the frequencies can have arbitrary
distributions: for quasi-periodic sounds the frequencies are approximately harmonic components (integer
multiples of a common fundamental frequency), while for non-harmonic sounds (such as that of a bell)
they have non-integer ratios.
The deterministic part of a sound signal can be represented by the sinusoidal model, which assumes
that the sound can be modeled as a sum of sinusoidal oscillators whose amplitude ak (t) and frequency
fk (t) are slowly time-varying:
ss(t) = Σ_k ak(t) cos(ϕk(t)),
ϕk(t) = 2π ∫_0^t fk(τ) dτ + ϕk(0),   (2.15)
where ϕk(0) represents an initial phase value. These equations have great generality and can be used to
faithfully reproduce many types of sound, especially in a “synthesis-by-analysis” framework (that we
discuss in Sec. 2.4.2 below). If the sound is almost periodic, the frequencies of partials are approxi-
mately multiples of the fundamental frequency f0 , i.e. fk (t) ≃ k f0 (t). In this sense Eqs. (2.16) are
a generalization of the Fourier theorem to quasi-periodic sounds. Moreover the model is also capable
of representing aperiodic and inharmonic sounds, as long as their spectral energy is concentrated near
discrete frequencies (spectral lines).
As already noted, one limitation of the sinusoidal model is that it discards completely the noisy
components that are always present in real signals. Another drawback of Eq. (2.16) is that it needs an
extremely large number of control parameters: for each note that we want to reproduce, we need to
provide the amplitude and frequency envelopes for all the partials. Moreover, the envelopes for a single
note are not fixed, but depend in general on the intensity.
On the other hand, additive synthesis provides a very intuitive sound representation, and this is one
of the reasons why it has been one of the earliest popular synthesis techniques in computer music.3
Moreover, sound transformations performed on the parameters of the additive representation (e.g., time-
scale modifications) are perceptually very robust.
Figure 2.6: Sum of sinusoidal oscillators with time-varying amplitudes and frequencies.
amplitude and the instantaneous angular frequency of a particular partial are obtained by linear interpo-
lation, as discussed there. Figure 2.6 provides a block diagram of such a time-domain implementation.
M-2.7
Use the sinusoidal oscillator realized in Chapter Fundamentals of digital audio processing to synthesize a sum of two
sinusoids.
M-2.7 Solution
global Fs; global SpF; %global variables: sample rate, samples-per-frame
Fs=22050;
framelength=0.01; %frame length (in s)
SpF=round(Fs*framelength); %samples per frame
The sinusoidal oscillator controlled in frequency and amplitude is the fundamental building block for
time-domain implementations of additive synthesis. Here we employ it to look at the beating phe-
nomenon. We use two oscillators, of which one has constant frequency while the second is given a
slowly increasing frequency envelope. Figure 2.7 shows the f1, f2 control signals and the amplitude
envelope of the resulting sound signal: note the beating effect.
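A self-contained sketch of this example, not relying on the sinosc oscillator (frequency values are
illustrative, chosen to roughly match Fig. 2.7):

dur = 20; nframes = round(dur/framelength);
f1 = 250*ones(1,nframes);              % constant frequency (Hz)
f2 = linspace(200, 300, nframes);      % slowly increasing frequency (Hz)
s  = zeros(1, nframes*SpF);
ph1 = 0; ph2 = 0; n = 0:SpF-1;
for m = 1:nframes
    s((m-1)*SpF+(1:SpF)) = cos(ph1 + 2*pi*f1(m)*n/Fs) + cos(ph2 + 2*pi*f2(m)*n/Fs);
    ph1 = ph1 + 2*pi*f1(m)*SpF/Fs;     % carry the phases across frame boundaries
    ph2 = ph2 + 2*pi*f2(m)*SpF/Fs;
end
soundsc(s, Fs);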
In alternative to the time-domain approach, a very efficient implementation of additive synthesis can
be developed in the frequency domain, using the inverse FFT. As we have seen in Chapter Fundamentals of
digital audio processing, the DFT of a windowed sinusoid is the DFT of the window, centered at the frequency
of the sinusoid, and multiplied by a complex number whose magnitude and phase are the magnitude and
phase of the sine wave:
s[n] = a cos(2πf0 n/Fs + ϕ)   ⇒   F[w · s](f) = a e^{jϕ} W(f − f0).   (2.18)
If the window W (f ) has a sufficiently high sidelobe attenuation, the sinusoid can be generated in the
spectral domain by calculating the samples in the main lobe of the window transform, with the appro-
priate magnitude, frequency and phase values. One can then synthesize as many sinusoids as desired,
Figure 2.7: Beating effect: (a) frequency envelopes (f1 dashed line, f2 solid line) and (b) envelope of
the resulting signal.
by adding a corresponding number of main lobes in the Fourier domain and performing a single IFFT to
obtain the corresponding time-domain signal in a frame.
By an overlap-add process one then obtains the time-varying characteristics of the sound. Note
however that, in order for the signal reconstruction to be free of artifacts, the overlap-add procedure must
be carried out using a window with the property that its shifted copies overlap and add to give a constant.
A particularly simple and effective window that satisfies this property is the triangular window.
The FFT-based approach can be convenient with respect to time-domain techniques when a very
high number of sinusoidal components must be reproduced: the reason is that the computational costs
of this implementation are largely dominated by the cost of the IFFT, which does not depend on the
number of components. On the other hand, this approach is less flexible than the traditional oscillator
bank implementation, especially for the instantaneous control of frequency and magnitude. Note also
that the instantaneous phases are not preserved using this method. A final remark concerns the FFT
size: in general one wants to have a high frame rate, so that frequencies and magnitudes need not to
be interpolated inside a frame. At the same time, large FFT sizes are desirable in order to achieve
good frequency resolution and separation of the sinusoidal components. As in every short-time based
process, one has to find a trade-off between time and frequency resolution.
M-2.8
Assume that two matrices sinan freqs and sinan amps have been created from analysis of a real sound.
These matrices contain frequency and amplitude envelopes of sinusoidal partials of the analyzed sound. Write a
function that resynthesizes the sound.
M-2.8 Solution
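A minimal oscillator-bank sketch (assuming one row per partial and one column per analysis frame,
SpF samples per frame as in M-2.7, and linear amplitude values):

function y = resynth(sinan_freqs, sinan_amps)
global Fs; global SpF;
[npart, nframes] = size(sinan_freqs);
t = (0:nframes*SpF-1)/SpF;                % audio-rate time axis, in frames
y = zeros(1, nframes*SpF);
for k = 1:npart
    a  = interp1(0:nframes-1, sinan_amps(k,:),  t, 'linear', 'extrap');
    f  = interp1(0:nframes-1, sinan_freqs(k,:), t, 'linear', 'extrap');
    ph = 2*pi*cumsum(f)/Fs;               % running phase from the frequency envelope
    y  = y + a.*cos(ph);                  % add the k-th partial
end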
Figure 2.8: Fourier analysis of a saxophone tone: (a) frequency envelopes and (b) amplitude envelopes
of the sinusoidal partials, as functions of time.
Figure 2.9: Block diagram of the sinusoid tracking process, where s[n] is the analyzed sound signal and
ak , fk are the estimated amplitude and frequency of the kth partial in the current analysis frame.
For inharmonic sounds the size should be set according to the minimum frequency difference that exists
between partials.
The question is now how to perform automatic detection and tracking of the spectral peaks that
correspond to sinusoidal components. In the next section we present the main guidelines of a general
analysis framework, which is summarized in Fig. 2.9. First, the FFT of a sound frame is computed
according to the above discussion. Next, the prominent spectral peaks are detected and incorporated into
partial trajectories. If the sound is pseudo-harmonic, a pitch detection step can improve the analysis by
providing information about the fundamental frequency information, and can also be used to choose the
size of the analysis window.
Such a scheme is only one of the possible approaches that can be used to attack the problem. Hidden
Markov Models (HMMs) are another one: an HMM can optimize groups of peak trajectories according
to given criteria, such as frequency continuity. This type of approach might be very valuable for tracking
partials in polyphonic sounds and complex inharmonic tones.
frequency is actually present, it can be exploited in two ways. First, it helps the tracking of partials.
Second, the size of the analysis window can be set according to the estimated pitch in order to keep
the number of periods-per-frame constant, therefore achieving the best possible time-frequency trade-off
(this is an example of a pitch-synchronous analysis). There are many possible pitch detection strategies,
which will be presented in Chapter From audio to content.
The third step in Fig. 2.9 is a peak continuation algorithm. The basic idea is that a set of “guides”
advance in time and follow appropriate frequency peaks, forming trajectories out of them. A guide is
therefore an abstract entity which is used by the algorithm to create the trajectories, and the trajectories
are the actual result of the peak continuation process. The guides are turned on, advanced, and finally
turned off during the continuation algorithm, and their instantaneous state (frequency and magnitude) is
continuously updated during the process. If the analyzed sound is harmonic and a fundamental has been
estimated, then the guides are created at the beginning of the analysis, with frequencies set according
to the estimated harmonic series. When no harmonic structure can be estimated, each guide is created
when the first available peak is found. In the successive analysis frames, the guides modify their status
depending on the last peak values. This past information is particularly relevant when the sound is not
harmonic, or when the harmonics are not locked to each other and we cannot rely on the fundamental as
a strong reference for all the harmonics.
The main guidelines to construct a peak continuation algorithm can be summarized as follows. A
peak is assigned to the guide that is closest to it and that is within an assigned frequency deviation. If a
guide does not find a match, the corresponding trajectory can be turned off, and if a continuation peak
is not found for a given amount of time the guide is killed. New guides and trajectories can be created
starting from peaks of the current frame that have high magnitude and are not “claimed” by any of the
existing trajectories. After a certain number of analysis frames, the algorithm can look at the trajectories
created so far and adopt corrections: in particular, short trajectories can be deleted, and small gaps in
longer trajectories can be filled by interpolating between the values of the gap edges.
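The core matching step described above can be sketched as follows (a minimal illustration; guides and
peaks are represented simply by their current frequencies, and all names are illustrative):

function assign = match_peaks_to_guides(guide_freqs, peak_freqs, max_dev)
assign = zeros(size(peak_freqs));            % 0 = peak left unclaimed
taken  = false(size(guide_freqs));           % each guide claims at most one peak
for p = 1:length(peak_freqs)
    [dev, g] = min(abs(guide_freqs - peak_freqs(p)) + 1e9*taken);
    if dev <= max_dev
        assign(p) = g;                       % peak p continues the trajectory of guide g
        taken(g)  = true;
    end
end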
One final refinement to this process can be added by noting that the sound attack is usually highly
non-stationary and noisy, and the peak search is consequently difficult in this part. Therefore it is cus-
tomary to perform the whole procedure backwards in time, starting from the end of the sound (which is
usually a more stable part). When the attack is reached, a lot of relevant information has already been
gained and non-relevant peaks can be evaluated and/or rejected.
where ss [n] represents the deterministic part of the sound and has already been modeled with Eq. (2.16),
while e[n] represents the stochastic component and is modeled separately from ss [n].
Figure 2.10: Block diagram of the stochastic analysis and modeling process, where s[n] is the analyzed
sound signal and ak , fk , ϕk are the estimated amplitude, frequency, and phase of the kth partial in the
current analysis frame.
The most straightforward approach to estimation of the stochastic component is through subtraction of
the deterministic component from the original signal. Subtraction can be performed either in the time do-
main or in the frequency domain. Time domain subtraction must be done while preserving the phases of
the original sound, and instantaneous phase preservation can be computationally very expensive. On the
other hand, frequency-domain subtraction does not require phase preservation. However, time-domain
subtraction provides much better results, and is usually favored despite the higher computational costs.
For this reason we choose to examine time-domain subtraction in the remainder of this section. Fig-
ure 2.10 provides a block diagram.
Suppose that the deterministic component has been estimated in a given analysis frame, using for
instance the general scheme described in section 2.4.2 (note however that in this case the analysis should
be improved in order to provide estimates of the instantaneous phases as well). Then the first step in
the subtraction process is the time-domain resynthesis of the deterministic component with the estimated
parameters. This should be done by properly interpolating amplitude, frequency, and phase values in
order to avoid artifacts in the resynthesized signal. The actual subtraction can be performed as

e[n] = w[n] (s[n] − d[n]),   (2.20)
where s[n] is the original sound signal and d[n] is the re-synthesized deterministic part. The difference
(s − d) is multiplied by an analysis window w of size N , which deserves some discussion.
We have seen in 2.4.2 that high frequency resolution is needed for the deterministic part, and for this
reason long analysis windows are used for its estimation. On the other hand, good time resolution is more
important for the stochastic part of the signal, especially in sound attacks, while frequency resolution is
not a major issue for noise analysis. A way to obtain good resolutions for both the components is to use
two different analysis windows. Therefore w in equation (2.20) is not in general the same window used
to estimate d[n], and the size N is in general small.
Once the subtraction has been performed, there is one more step that can be used to improve the
analysis: namely, tests can be performed on the estimated residual in order to assess how good the analysis
was. If the spectrum of the residual still contains some partials, then the analysis of the deterministic
component has not been performed accurately, and the sound should be re-analyzed until the residual
is free of deterministic components. Ideally the residual should be as close as possible to a stochastic
Figure 2.11: Example of residual magnitude spectrum (solid line) and its line-segment approximation
(dashed line), in an analysis frame. The analyzed sound signal is the same saxophone tone used in
figure 2.8.
signal, therefore one possible test is a measure of correlation of the residual samples.4
where U [k] is the DFT of a white noise sequence and H[k] represents the frequency response of a filter
which varies on a frame-by-frame basis. The stochastic modeling step is summarized in the last block of
figure 2.10.
The filter design problem can be solved using different strategies. One approach that is often adopted
uses some sort of curve fitting (line-segment approximation, spline interpolation, least squares approx-
imation, and so on) of the magnitude spectrum of e in an analysis frame. As an example, line-segment
approximation can be obtained by stepping through the magnitude spectrum, finding local maxima at
each step, and connecting the maxima with straight lines. This procedure can approximate the spec-
tral envelope with reasonable accuracy, depending on the number of points, which in turn can be set
depending on the sound complexity. See Fig. 2.11 for an example.
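A sketch of the line-segment approximation described above (magspec is the magnitude spectrum of
the residual in one frame; npoints controls the accuracy):

function env = linseg_approx(magspec, npoints)
magspec = magspec(:).';
N = length(magspec);
step = floor(N/npoints);
idx = zeros(1,npoints); val = zeros(1,npoints);
for p = 1:npoints
    range = (p-1)*step + (1:step);                % current portion of the spectrum
    [val(p), rel] = max(magspec(range));          % local maximum in this portion
    idx(p) = range(rel);
end
env = interp1(idx, val, 1:N, 'linear', 'extrap'); % connect the maxima with straight lines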
Another possible approach to the filter design problem is Linear Prediction (LP) analysis, which is
a popular technique in speech processing. In this context, however, curve fitting on the noise spectrum
(e.g., line-segment approximation) is usually considered to be a more flexible approach and is preferred
to LP analysis. We will return to Linear Prediction techniques in section 2.5.3.
The next question is how to implement the estimated time-varying filter in the resynthesis step.
4 Note that if the analyzed sound has not been recorded in silent and anechoic settings the residual will contain not only the
stochastic part of the sound, but also reverberation and/or background noise.
Figure 2.12 shows the block diagram of the synthesis process. The deterministic signal, i.e., the si-
nusoidal component, results from the magnitude and frequency trajectories, or their transformation, by
generating a sine wave for each trajectory (additive synthesis). As we have seen, this can either be imple-
mented in the time domain with the traditional oscillator bank method or in the frequency domain using
the inverse-FFT approach.
Concerning the stochastic component, a frequency-domain implementation is usually preferred to a
direct implementation of the time-domain convolution (2.21), due to its computational efficiency5 and
flexibility. In each frame, the stochastic signal is generated by an inverse-FFT of the spectral envelopes.
Similarly to what we have seen for the deterministic synthesis in section 2.4.1, the time-varying characteristics
of the stochastic signal are then obtained using an overlap-and-add process.
In order to perform the IFFT, a magnitude and a phase response have to be generated starting from
the estimated spectral envelope. Generation of the magnitude spectrum is straightforwardly obtained
by first linearly interpolating the spectral envelope to a curve with half the length of the FFT-size, and
then multiplying it by a gain that corresponds to the average magnitude extracted in the analysis. The
estimated spectral envelope gives no information on the phase response. However, since the phase re-
sponse of noise is noise, a phase response can be created from scratch using a random signal generator.
In order to avoid periodicities at the frame rate, new random values should be generated at every frame.
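In each frame this amounts to something like the following sketch (env is the interpolated spectral
envelope, of length half the FFT size; names and the symmetry handling are illustrative):

function eframe = noise_frame(env, gain)
env = env(:).';
N = 2*length(env);                              % FFT size
phase = 2*pi*rand(1, length(env));              % new random phases at every frame
half  = gain * env .* exp(1j*phase);            % positive-frequency half spectrum
spec  = [half, 0, conj(fliplr(half(2:end)))];   % impose conjugate symmetry
eframe = real(ifft(spec));                      % time-domain noise frame, to be overlap-added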
The sines-plus-noise representation is well suited for modification purposes.
• By only working on the deterministic representation and modifying the amplitude-frequency pairs
of the original sound partials, many kinds of frequency and magnitude transformations can be
obtained. As an example, partials can be transposed in frequency. It is also possible to decouple the
sinusoidal frequencies from their amplitude, obtaining pitch-shift effects that preserve the formant
structure.
• Time-stretching transformations can be obtained by resampling the analysis points in time, thus slow-
ing down or speeding up the sound while maintaining pitch and formant structure. Given the
stochastic model that we are using, the noise remains noise and faithful signal resynthesis is pos-
sible even with extreme stretching parameters.
5 In fact, by using a frequency-domain implementation for both the deterministic and the stochastic synthesis one can add
the two spectra and resynthesize both the components at the cost of a single IFFT per frame.
• By acting on the relative amplitude of the two components, interesting effects can be obtained in
which either the deterministic or the stochastic parts are emphasized. As an example, the amount
of “breathiness” of a voiced sound or a wind instrument tone can be adjusted in this way. One must
keep in mind however that, when different transformations are applied to the two representations,
the deterministic and stochastic components in the resulting signal may not be perceived as a single
sound event anymore.
• Sound morphing (or cross-synthesis) transformations can be obtained by interpolating data from
two or more analysis files. These transformations are particularly effective in the case of quasi-
harmonic sounds with smooth parameter curves.
where ss[n] is the sinusoidal component, et[n] is the signal associated to transients, and er[n] is the
noisy residual. The transient model is based on a main underlying idea: we have seen that a slowly
varying sinusoidal signal is impulsive in the frequency domain, and sinusoidal models perform short-time
Fourier analysis in order to track slowly varying spectral peaks (the tips of the impulsive signals) over
time. Transients are very much dual to sinusoidal components: they are impulsive in the time domain,
and consequently they must be oscillatory in the frequency domain. Therefore, although transients cannot
be tracked by a short-time analysis (because their STFT will not contain meaningful peaks), we can track
them by performing sinusoidal modeling in a properly chosen frequency domain. The mapping that we
choose to use is the one provided by the discrete cosine transform (DCT):
S[k] = β[k] Σ_{n=0}^{N−1} s[n] cos[ (2n + 1)kπ / (2N) ],   k = 0, 1, . . . , N − 1,   (2.23)

where β[0] = √(1/N) and β[k] = √(2/N) otherwise. From equation (2.23) one can see that an ideal
impulse δ[n − n0 ] (i.e., a Kronecker delta function centered in n0 ) is transformed into a cosine whose
Figure 2.13: Example of DCT mapping: (a) an impulsive transient (an exponentially decaying sinusoid)
and (b) its DCT as a slowly varying sinusoid.
Figure 2.14: Block diagram of the transient analysis and modeling process, where s[n] is the ana-
lyzed sound signal and ak , fk , ϕk are the estimated amplitude, frequency, and phase of the kth DCT-
transformed transient in the current analysis frame.
frequency is monotonically related to n0 . Figure 2.13(a) shows a more realistic transient signal, a one-
sided exponentially decaying sine wave. Figure 2.13(b) shows the DCT of the transient signal: a slowly
varying sinusoid. These considerations suggest that the time-frequency duality can be exploited to develop
a transient model: the same kind of parameters that characterize the sinusoidal components of a signal
can also characterize the transient components of a signal, although in a different domain.
formed. The block length should be chosen so that a transient appears as “short”, therefore large block
sizes (e.g., 1 s) are usually chosen. The block DCT is followed by a sinusoidal analysis/modeling process
which is identical to what we have seen in section 2.4.2. The analysis can optionally embed some infor-
mation about transient location within the block: there are many possible transient detection strategies,
which we do not want to discuss here. Also, the analysis can perform better if the sinusoid tracking
procedure starts from the end of the DCT-domain signal and moves backwards toward the beginning,
because the beginning of a DCT frame is usually spectrally rich and this can deteriorate the performance
of the analysis (similar considerations were made in Sec. 2.4.2 when discussing sinusoid tracking in the
time domain).
The analysis yields parameters that correspond to slowly varying sinusoids in the DCT domain: each
transient is associated to a triplet {ak , fk , ϕk }, amplitude, frequency, and phase of the kth “partial” in
each STFT analysis frame within a DCT block. By recalling the properties of the DCT one can see that fk
corresponds to the onset location, ak is also the amplitude of the time-domain transient, and ϕk is related to the
time direction (positive or negative) in which the transient evolves. Resynthesis of the transients is then
performed using these parameters to reconstruct the sinusoids in the DCT domain. Finally an inverse
discrete cosine transform (IDCT) on each of the reconstructed signals is used to obtain the transients in
each time-domain block, and the blocks are concatenated to obtain the transients for the entire signal.
It is relatively straightforward to implement a “fast transient reconstruction” algorithm. Without
entering into the details, we just note that the whole procedure can be reformulated using FFT transformations
only: in fact one could verify that the DCT can be implemented using an FFT block plus some post-
processing (multiplication of the real and imaginary parts of the FFT by appropriate cosinusoidal and
sinusoidal signals followed by a sum of the two parts). Furthermore, this kind of approach naturally
leads to an FFT-based implementation of the additive synthesis step (see Sec. 2.4.1).
One nice property of this transient modeling approach is that it fits well within the sines-plus-noise
analysis examined in the previous sections. The processing block depicted in Fig. 2.14 returns an output
signal s̃[n] in which the transient components et have been removed by subtraction: this signal can be
used as the input to the sines-plus-noise analysis, in which the remaining components (deterministic and
stochastic) will be analyzed and modeled. From the implementation viewpoint, one advantage is that
the core components of the transient-modeling algorithm (sinusoid tracking and additive resynthesis) are
identical to those used for the deterministic model. Therefore the same processing blocks can be used in
the two stages, although working on different domains.
s[n] = Σ_{k=0}^{M} bk x[n − k] − Σ_{k=1}^{N} ak s[n − k],   (2.24)
Equation (2.25) shows how the features of source and filter are combined: the spectral fine structure
of the excitation signal is multiplied by the spectral envelope of the filter, which has a shaping effect
on the source spectrum. Therefore, it is possible to control and modify separately different features of
the signal: as an example, the pitch of a speech sound depends on the excitation and can be controlled
separately from the formant structure, which instead depends on the filter. When the filter coefficients
are (slowly) varied over time, the frequency response H(e^{jω_d}) changes. As a consequence, the output
will be a combination of temporal variations of the input and of the filter (cross-synthesis).
Figure 2.16: Spectrally rich waveforms: (a) time-domain square, triangular, sawtooth, and impulse
train waveforms; (b) corresponding spectra (first 20 partials).
frequency f0 Hz) can be conveniently given in the continuous-time domain.6 A set of compact possible
definitions is the following:
x_{square}(t) = \mathrm{sgn}\left[\sin(2π f_0 t)\right], \qquad x_{triang}(t) = \frac{2}{π}\arcsin\left[\sin(2π f_0 t)\right], \qquad x_{saw}(t) = 2\left(f_0 t - \lfloor f_0 t \rfloor\right) - 1,   (2.26)
where ⌊a⌋ = max{n ∈ N : n ≤ a} in the definition of the sawtooth wave indicates the floor function.
These equations define waveforms that take values in the range [−1, 1] and have zero average. The
corresponding waveforms are depicted in Fig. 2.16(a).
These waveforms can be written in terms of the following sinusoidal expansions:

x_{square}(t) = \frac{4}{π}\sum_{k\;\mathrm{odd}} \frac{1}{k}\sin(2π k f_0 t), \qquad x_{triang}(t) = \frac{8}{π^2}\sum_{k\;\mathrm{odd}} \frac{(-1)^{(k-1)/2}}{k^2}\sin(2π k f_0 t), \qquad x_{saw}(t) = -\frac{2}{π}\sum_{k=1}^{\infty} \frac{1}{k}\sin(2π k f_0 t).   (2.27)
Note that here we are using an expansion on real sinusoids, which can be straightforwardly derived from
the usual Fourier expansion on complex sinusoids. The corresponding spectra are shown in Fig. 2.16(b).
As expected, all the waveforms are spectrally rich. In particular, the square and triangular waves contain
only odd harmonics, with higher harmonics rolling off faster in the triangular than in the square wave
(this is in accordance with the triangular wave being –and sounding– smoother than the square wave).
On the other hand, the sawtooth wave has energy on all harmonics.
One more relevant source signal is the ideal impulse train waveform, a sequence of unit impulses
spaced by the desired fundamental period. It is used especially for the simulation of voiced speech
6
In fact continuous-time domain definitions are appropriate since they were used in analog synthesizers.
Figure 2.17: Bandlimited synthesis of the square, triangular, and sawtooth waves, using 3, 8, and 13
sinusoidal components.
sounds, and represents the periodic energy pulses provided by vocal fold movement to the vocal tract
(we will return to this point in Sec. 2.5.2). It has a white spectrum.
M-2.9
Write a function that realizes the generators for the square, triangular, sawtooth, and impulse train signals. The
function will have parameters (t0,a,f): initial time, amplitude envelope, and frequency envelope.
M-2.9 Solution
function s = waveosc(t0,a,f,ph0,wavetype,npart);
%%%% param wavetype can be ’cos’ | ’square’ | ’triang’ | ’saw’ | ’imp’ %%%%
This function utilizes the sinosc function written in Chapter Fundamentals of digital audio processing. We
have used Eq. (2.27), and for each waveform we have summed all the needed harmonic components
up to the Nyquist frequency Fs /2.
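As an illustration of this band-limited approach, the following fragment synthesizes one second of a sawtooth wave by summing the harmonics of its expansion in Eq. (2.27) up to the Nyquist frequency (the sample rate and fundamental frequency below are illustrative values):

Fs = 44100; f0 = 220;                       % illustrative sample rate and fundamental
n  = 0:Fs-1;                                % one second of samples
s  = zeros(1, Fs);
for k = 1:floor((Fs/2)/f0)                  % harmonics up to the Nyquist frequency
  s = s - (2/pi) * sin(2*pi*k*f0*n/Fs)/k;   % k-th term of the sawtooth expansion
end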
where r and ±ωc are the magnitude and phases of the poles, and the condition r < 1 must hold in order
for the filter to be stable. If we assume the filter to be causal, the impulse response is a damped sinusoid that oscillates at the frequency ωc and decays at a rate determined by r; an example is shown in Fig. 2.18(a).
Figure 2.18: Example of a second-order resonator tuned on the center frequency ωc = 2π440/Fs and
with bandwidth B = 2π100/Fs ; (a) impulse response; (b) magnitude response.
The parameter B represents the bandwidth of the resonator, defined as the width of the magnitude
response at the two half-power points (the points where the magnitude response is 1/√2 times smaller
than at the peak value). An equivalent measure is the quality factor q = ωc /B.
A half-power point ω_{hp} is such that |H(e^{jω_{hp}})|^2 / |H(e^{jω_c})|^2 = 1/2, by definition. If we assume
that in the vicinity of one pole the effect of the second pole is negligible, we can derive a simple relation
between B and r:

|H(e^{jω_c})|^2 \sim \frac{b_0^2}{(1-r)^2}, \qquad |H(e^{jω_{hp}})|^2 \sim \frac{b_0^2}{|e^{jω_{hp}} - r e^{jω_c}|^2} = \ldots = \frac{b_0^2}{1 + r^2 - 2r\cos(ω_{hp} - ω_c)}.   (2.31)
By noting that ωhp − ωc ∼ B/2, and by applying the definition of half-power point, we can write
\frac{(1-r)^2}{1 + r^2 - 2r\cos(B/2)} = \frac{1}{2} \;\Rightarrow\; \ldots \;\Rightarrow\; \cos\left(\frac{B}{2}\right) = 2 - \frac{1}{2}\left(r + \frac{1}{r}\right).   (2.32)
This latter equation provides a means to choose r given B. A further useful approximation can be
obtained for very sharp resonances, i.e. when r = 1 − ϵ with ϵ ≪ 1 and the poles are very close to
the unit circle. Taylor expansions of the two sides of the equation give \cos(B/2) \sim 1 - (B/2)^2/2 and
2 - \frac{1}{2}(r + 1/r)|_{r = 1-ϵ} \sim 1 - ϵ^2/2, respectively. Therefore in this limit one can write

r \sim 1 - \frac{B}{2}.   (2.33)
In summary, given two values for ωc and B, the poles can be determined using Eqs. (2.30) and (2.32)
or (2.33). Then the coefficients can be written as functions of the parameters r, ωc as
a_1 = -2r\cos(ω_c), \qquad a_2 = r^2, \qquad b_0 = (1 - r^2)\sin(ω_c),   (2.34)

where b_0 has been determined by imposing that |H(e^{\pm jω_c})| = 1. An example of magnitude response
is shown in Fig. 2.18(b).
M-2.10
Write a function that computes the coefficients of a second-order resonant filter, given the normalized angular
frequency ωc (in radians) and the bandwidth B.
M-2.10 Solution
function [b,a]=reson2(omegac,B); %omegac and B are given in radians
We have followed the Octave/Matlab convention in defining the coefficients b,a, but not the convention
in defining normalized frequencies.
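A possible completion of the reson2 function, which follows Eqs. (2.33) and (2.34) (the narrow-band approximation r = 1 − B/2 is assumed), is the following sketch:

function [b,a]=reson2(omegac,B)    %omegac and B are given in radians
r = 1 - B/2;                       %pole radius from the bandwidth, Eq. (2.33)
a = [1, -2*r*cos(omegac), r^2];    %denominator coefficients, Eq. (2.34)
b = (1 - r^2)*sin(omegac);         %approximately unit gain at the resonance frequency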
Figure 2.19: Parallel structure of digital resonators for the simulation of struck objects – the Ri ’s have
transfer functions of the form (2.28).
M-2.11
Write a function modalosc(t0,tau,a,omegac,B) that realizes the structure of Fig. 2.19. The parameter t0
is the total duration of the sound signal, tau,a define duration and max. amplitude of the striker signal (e.g. a
noise burst), and the vectors omegac,B define center frequencies and bandwidths of a bank of second-order
oscillators.
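A possible sketch of modalosc is the following; it assumes the reson2 function of example M-2.10, a noise-burst striker, and a fixed sample rate Fs (an arbitrary choice here):

function s = modalosc(t0, tau, a, omegac, B)
Fs = 44100;                                        % assumed sample rate
N  = round(t0*Fs);
striker = zeros(1, N);
Nb = round(tau*Fs);
striker(1:Nb) = a*(2*rand(1, Nb) - 1);             % noise-burst excitation signal
s = zeros(1, N);
for i = 1:length(omegac)                           % parallel bank of resonators (Fig. 2.19)
  [b, ai] = reson2(omegac(i), B(i));
  s = s + filter(b, ai, striker);
end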
Figure 2.20: A schematic view of the phonatory system. Solid arrows indicate the direction of the
airflow generated by lung pressure.
natural utterances is probably the easiest way and the most popular approach to produce intelligible
and natural sounding synthetic speech. However, concatenative synthesizers are usually limited to one
speaker and one voice and usually require more memory capacity than other methods.
Formant synthesis is based on the source-filter modeling approach described in Sec. 2.5 above: the
transfer function of the vocal tract is typically represented as a series of resonant filters, each accounting
for one formant. This was the most widely used synthesis method before the development of concatena-
tive methods. Being based on a parametric model rather than on pre-recorded sounds, formant synthesis
techniques are in principle more flexible than concatenative methods. We discuss formant synthesis in
the next section.
Articulatory synthesis attempts to model the human voice production system directly and therefore
belongs to the class of models discussed in Chapter Sound modeling: source based approaches. Articulatory syn-
thesis typically involves models of the vocal folds, the vocal tract, and an associated set of articulators
that define the area function between glottis and mouth. Articulators can be lip aperture, lip protrusion,
tongue tip height and position, etc. Parameters associated to vocal folds can be glottal aperture, fold
tension, lung pressure, etc. Although these methods promise high quality synthesis, computational costs
are high and parametric control is arduous. At the time of writing no existing articulatory synthesizer
can compare with a concatenative synthesizer.
If s[n] is an unvoiced signal, vocal folds do not vibrate and turbulence is produced by the passage
of air through a narrow constriction (such as the teeth). The turbulence can be modeled as white noise.
In this case, the model is expressed in the Z-domain as
where the source signal X(z) is in this case a white noise sequence, while the gain term gu is in general
different from the voiced configuration gain, gv . Note that the vocal fold response G(z) is not included
in the model in this case.
Any voiced or unvoiced sound is modeled by either Eq. (2.35) or (2.36). The complete transfer
function H(z) = S(z)/X(z) may or may not include vocal fold response G(z) depending on whether
the sound is voiced or unvoiced. The block structure of the resulting model is shown in Fig. 2.21.
The filter G(z) shapes the glottal pulses. More specifically, since the input x[n] is a pulse train, the
output from this block is the impulse response g[n] of this filter. We propose two historically relevant
models. The first one is an FIR model with impulse response:
g_{FIR}[n] = \begin{cases} \frac{1}{2}\left[1 - \cos\left(\frac{π n}{N_1}\right)\right], & 0 \le n \le N_1, \\ \cos\left(\frac{π (n - N_1)}{2 N_2}\right), & N_1 \le n \le N_1 + N_2, \\ 0 & \text{elsewhere.} \end{cases}   (2.37)
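A small sketch that computes the impulse response of Eq. (2.37) is the following (the function name and the sample-domain parameters N1, N2 are our own choices):

function g = glottal_pulse_fir(N1, N2)
% FIR glottal pulse of Eq. (2.37): opening phase of N1 samples, closing phase of N2 samples
n1 = 0:N1;                                          % opening phase, 0 <= n <= N1
n2 = (N1+1):(N1+N2);                                % closing phase, N1 < n <= N1+N2
g  = [0.5*(1 - cos(pi*n1/N1)), cos(pi*(n2 - N1)/(2*N2))];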
specified. We denote the filter associated to the ith formant as Vi (z), having center frequency fi and
bandwidth Bi . At least three vocal tract formants are generally required to produce intelligible speech
and up to five formants are needed to produce high quality speech.
Two basic structures, parallel and cascade, can be used in general, but for better performance some
kind of combination of these is usually adopted. A cascade formant synthesizer consists of band-pass
resonators connected in series, i.e. the output of each formant resonator is applied to the input of the
following one. A parallel formant synthesizer consists of resonators connected in parallel, i.e. the same
input is applied to each formant filter and the outputs are summed. The corresponding vocal tract models
are
V_{casc}(z) = g \prod_{i=1}^{K} V_i(z), \qquad V_{par}(z) = \sum_{i=1}^{K} a_i\, V_i(z).   (2.40)
The cascade structure needs only formant frequencies as control information. The main advantage of this
structure is that the relative formant amplitudes for vowels do not need individual controls. A cascade
model of the vocal tract is considered to provide good quality in the synthesis of vowels, but is less
flexible than a parallel structure, which enables control of bandwidth and gain for each formant
individually.
M-2.12
Using the functions waveosc and reson2, realize a parallel formant synthesizer. Use 3 second-order IIR cells,
corresponding to the first 3 vowel formants.
M-2.12 Solution
function s= formant_synth(a,f,vowel);
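A possible sketch for this example is reported below. The interface is simplified with respect to the statement (a scalar amplitude a and pitch f0 are used), the sample rate and the formant frequencies/bandwidths are illustrative values for the vowel /a/, and the reson2 function of example M-2.10 is assumed:

function s = formant_synth(a, f0)
Fs = 16000; dur = 1;                           % assumed sample rate and note duration
x  = zeros(1, round(dur*Fs));
x(1:round(Fs/f0):end) = a;                     % impulse-train excitation at pitch f0
ff = [700 1200 2600]; Bf = [130 70 160];       % illustrative /a/ formant freqs and bandwidths (Hz)
s  = zeros(size(x));
for i = 1:3                                    % parallel combination, as in Eq. (2.40)
  [b, ai] = reson2(2*pi*ff(i)/Fs, 2*pi*Bf(i)/Fs);
  s = s + filter(b, ai, x);
end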
Figure 2.22 shows an example of formant synthesis using a parallel formant filtering structure. In
particular Fig. 2.22(c) shows that when the same vowel is uttered with two different pitches, only the fine
spectral structure is affected, while the overall spectral envelope does not change its shape.
Figure 2.22: Formant synthesis of voice: (a) spectra of two pulse trains with fundamental frequencies
at 150 Hz and 250 Hz; (b) first three formants of the vowel /a/; (c) spectra of the two output signals
obtained by filtering the pulse trains through a parallel combination of the three formants.
passes through the partial peaks. This implies that 1) the peak values have to be retrieved, and 2) an
interpolation scheme should be (arbitrarily) chosen for the completion of the curve in between the peaks.
If the sound contains inharmonic partials or a noisy part, then the notion of a spectral envelope becomes
completely dependent on the definition of what belongs to the “source” and what belongs to the “filter”.
Three main techniques, with many variants, can be used for the estimation of the spectral envelope.
The channel vocoder uses a bank of frequency bands and estimates the amplitude of the signal inside
each band, thus sampling the spectral envelope. Linear prediction estimates an all-pole filter that matches the
spectral content of a sound: when the order of this filter is low, only the formants are captured, hence
the spectral envelope. Cepstrum techniques smooth the logarithm of the FFT spectrum
(in decibels) in order to separate this curve into its slowly varying part (the spectral envelope) and its
quickly varying part (the source signal). In this section we present the basics of Linear Prediction (LP)
techniques; we will return to cepstral analysis in Chapter Auditory based processing.
where g is a gain scaling factor and X(z) and S(z) are the Z-transforms of the source signal x[n] and the
output signal s[n], respectively. This is often termed an ARMA(p, q) (Auto-Regressive Moving Average)
model, in which the output is expressed as a linear combination of p past samples and q + 1 input values.
LP analysis works on an approximation of this system, namely on an all-pole model:
S(z) = g H_{LP}(z) X(z), \qquad \text{with} \qquad H_{LP}(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}.   (2.42)
The time-domain version of this equation reads s[n] = g\,x[n] + \sum_{k=1}^{p} a_k\, s[n-k]. Therefore the output
s[n] can be predicted using a linear combination of its p past values, plus a weighted input term. In
statistical terminology, the output regresses on itself, therefore system (2.42) is often termed an AR(p)
(Auto-Regressive) model.
One justification of this approximation is that the input signal x[n] is generally unknown together
with the filter H(z). A second more substantial reason is that any filter H(z) of the form (2.41) can be
Figure 2.23: LP analysis: (a) the inverse filter A(z), and (b) the prediction error e[n] interpreted as the
unknown input gx[n].
written as H(z) = h_0 H_{min}(z) H_{ap}(z), where h_0 is a constant gain, H_{min}(z) is a minimum-phase filter,
and H_{ap} is an all-pass filter (i.e. |H_{ap}(e^{jω_d})| = 1, ∀ω_d). Moreover, the minimum-phase filter H_{min}(z)
can be expressed as an all-pole system of the form (2.42). Therefore we can say that LP analysis ideally
represents the all-pole minimum-phase portion of the general system (2.41), and therefore yields at least
a correct estimate of the magnitude spectrum.
Given an output signal s[n], Linear Prediction analysis provides a method for determining the “best”
estimate {ã_i} (i = 1, \ldots, p) for the coefficients {a_i} of the filter (2.42). The method can be interpreted
and derived in many ways; here we propose the most straightforward one. Given an estimate {ã_k} of the
filter coefficients, we define the linear prediction s̃[n] of the output s[n] as s̃[n] = \sum_{k=1}^{p} ã_k\, s[n-k]. In
the Z-domain we can write
S̃(z) = P(z) S(z), \qquad \text{with} \qquad P(z) = \sum_{k=1}^{p} ã_k\, z^{-k},   (2.43)
and we call the FIR filter P (z) a prediction filter. We then define the prediction error or residual e[n]
as the difference between the output s[n] and the linear prediction s̃[n]. In the z-domain, the prediction
error e[n] is expressed as
E(z) = A(z) S(z), \qquad \text{with} \qquad A(z) = 1 - \sum_{k=1}^{p} ã_k\, z^{-k}.   (2.44)
Comparison of Eqs. (2.42) and (2.44) shows that, if the speech signal obeys the model (2.42) exactly,
and if ãk = ak , then the residual e[n] coincides with the unknown input x[n] times the gain factor g, and
A(z) is the inverse filter of HLP (z). Therefore LP analysis provides an estimate of the inverse system
of (2.42):
e[n] = gx[n], A(z) = [HLP (z)]−1 . (2.45)
This interpretation is illustrated in Fig. 2.23. If we assume that the prediction error has white (flat)
spectrum, then the all-pole filter HLP (z) completely characterizes the spectrum of s[n]. For this reason
A(z) is also called a whitening filter, because it produces a residual with a flat power spectrum. Two
kinds of residuals, both having a flat spectrum, can be identified: the pulse train and the white noise. If
LP is applied to speech signals, the pulse train represents the idealized vocal-fold excitation for voiced
speech, while white noise represents the idealized excitation for unvoiced speech.
The roots of A(z) (i.e., the poles of HLP (z)) are representative of the formant frequencies. In other
words, the phases of these poles, expressed in terms of analog frequencies, can be used as an estimate of
the formant frequencies, while the magnitudes of the poles relate to the formant bandwidths according to the
equations written in Sec. 2.5.1 when discussing resonant filters.
We now describe the heart of LP analysis and derive the equations that determine the “best” estimate
{ã_i} (i = 1, \ldots, p). In this context “best” means best in a least-squares sense: we seek the {ã_i}s that
minimize the energy E\{e\} = \sum_{m=-\infty}^{+\infty} e^2[m] of the residual, i.e. we set to zero the partial derivatives of
E\{e\} with respect to the ã_i s:
0 = \frac{\partial E\{e\}}{\partial ã_i} = 2 \sum_{m=-\infty}^{+\infty} e[m]\, \frac{\partial e[m]}{\partial ã_i} = -2 \sum_{m=-\infty}^{+\infty} \left\{ \left[ s[m] - \sum_{k=1}^{p} ã_k\, s[m-k] \right] s[m-i] \right\},   (2.46)
for i = 1, \ldots, p. If one defines the temporal autocorrelation of the signal s[n] as the function r_s[i] = \sum_{m=-\infty}^{+\infty} s[m]\, s[m-i], then the above equation can be written as
\sum_{k=1}^{p} ã_k\, r_s[i-k] = r_s[i], \qquad \text{for } i = 1, \ldots, p.   (2.47)
The system (2.47) is often referred to as the normal equations. Solving this system in the p unknowns
yields the desired estimates ã_i.
In practice the analysis is performed on a frame-by-frame basis, and within each frame the autocorrelation is estimated as

r_s[i] \sim \sum_{m=1}^{N} u[m]\, u[m-i], \qquad \text{where } u[m] = s[m]\, w[m]   (2.48)
is a windowed version of s[m] in the considered frame (w[m] is typically a Hamming window), and N
is the length of the frame. Then the system (2.47) is solved within each frame. An efficient solution is
provided by the so-called Levinson-Durbin recursion, an algorithm for solving the problem Ax = b,
with A Toeplitz, symmetric, and positive definite, and b arbitrary. System (2.47) is an instance of this
general problem.
M-2.13
Write a function lp coeffs that computes the LP coefficients of a finite-length signal s[n], given the desired
prediction order p.
M-2.13 Solution
function [a, g] = lp_coeffs(s, p)
% Compute LP coeffs using the autocorrelation method
% s is the (finite-length) signal, p is the prediction order
% a are the computed LP coefficients, g is the gain (sqrt of residual variance)
r = xcorr(s, p);                                % autocorrelation sequence, lags -p..p
[a, v] = levinson(r(p+1:end), p); g = sqrt(v);  % solve the normal equations (2.47)
Figure 2.24: Example of LP analysis/synthesis, with prediction order p = 50; (a) target signal s[n]
(dotted line) and unit variance residual x[n] (solid line); (b) magnitude spectra |S(f )| (thin line) and
|gHLP (f )| (thick line).
Note that we are using the native function levinson, that computes the filter coefficients (as well as
the variance of the residual) given the autocorrelation sequence and the prediction order.
Figure 2.24 shows an example of LP analysis and resynthesis of a single frame of a speech signal.
As shown in Fig. 2.24(a), the analyzed frame is a portion of voiced speech and s[n] is pseudo-periodic.
Correspondingly, the estimated source signal x[n] is a pulse train. Figure 2.24(b) shows the magnitude
responses of the target signal and the estimated transfer function gHLP (z). A typical feature of LP
spectral modeling can also be observed from this figure: the LP spectrum matches the signal spectrum
much more closely in the regions of large signal energy (i.e. near the spectral peaks) than in the
regions of low energy (the spectral valleys).
M-2.14
Write an example script that analyzes frame-by-frame a voice signal using the LP model (2.42).
M-2.14 Solution
[s, Fs] = wavread('la.wav'); %%%% input file
%% analysis parameters
N=2048;                      %block length
Sa=256;                      %analysis hop-size
p=round(Fs/1000)+4;          %prediction order
win=hamming(N); nframes=floor((length(s)-N)/Sa)+1;
for m=1:nframes              %frame-by-frame LP analysis
  frame=s((m-1)*Sa+(1:N)).*win;           %windowed frame
  [a(m,:),g(m)]=lp_coeffs(frame,p);       %LP coefficients (function of M-2.13)
end
Note that we have used the function lp coeffs written in example M-2.13. The signals plotted in
Fig. 2.24 have been computed from this script.
When formant parameters are extracted on a frame-by-frame basis, many discontinuities and local
estimation errors are found. Therefore, proper techniques have to be used in order to de-
termine smooth formant trajectories over the analysis frames. We have already encountered a conceptually
similar problem in Sec. 2.4.2, when we discussed the “sinusoid tracking” procedure.
M-2.15
Plot the formant frequencies as a function of the frame number, i.e., of time, in order to observe the time-evolution
of the vocal tract filter. To this purpose, segment a speech signal s[n] into M Hamming windowed frames sm [n],
with a block length N and a hop-size Sa = N/2. Then, for each frame: a) compute the LP coefficients; b) find
the filter poles and the corresponding formant frequencies; c) discard poles whose magnitude is less than 0.8, as
these are unlikely to represent formants.
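A possible sketch for this example is the following; it assumes the lp_coeffs function of example M-2.13, and the input file and analysis parameters are illustrative choices:

[s, Fs] = wavread('la.wav');
N = 1024; Sa = N/2; p = round(Fs/1000) + 4;
win = hamming(N);
nframes = floor((length(s) - N)/Sa) + 1;
for m = 1:nframes
  frame = s((m-1)*Sa + (1:N)) .* win;          % Hamming-windowed frame
  a = lp_coeffs(frame, p);
  pol = roots(a);                              % poles of the estimated H_LP(z)
  pol = pol(abs(pol) > 0.8 & imag(pol) > 0);   % discard weak poles, keep upper half-plane
  ff  = sort(angle(pol))*Fs/(2*pi);            % candidate formant frequencies (Hz)
  plot(m*ones(size(ff)), ff, 'k.'); hold on;   % one column of points per frame
end
xlabel('frame number'); ylabel('formant frequency (Hz)');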
Figure 2.26: Example of LP spectra for increasing prediction orders p (the target signal is a frame of
voiced speech). For the sake of clarity each spectrum is plotted with a different offset.
A typical speech coding application employs an encoder-decoder system based on LP analysis. Speech is segmented in frames (typical frame lengths can range
from 10 to 20 ms). In this phase a pre-emphasis processing can also be applied: since the lower formants
contain more energy, they are preferentially modeled with respect to higher formants, and a pre-emphasis
filter can compensate for this by boosting the higher frequencies (when reconstructing the signal the
inverse filter should be used).
In its simplest formulation the encoder provides, for every frame, the coefficients ak of the prediction
filter, the gain g, a flag that indicates whether the frame corresponds to voiced or unvoiced speech, and
the pitch (only in the case of voiced speech). The decoder uses this information to re-synthesize the
speech signal. In the case of unvoiced speech, the excitation signal is simply white noise, while in the
case of voiced speech the excitation signal is a pulse train whose period is determined by the encoded
pitch information.
It is clear that most of the bits of the encoded signal are used for the ak parameters. Therefore
the degree of compression is strongly dependent on the order p of the LP analysis, which in turn has
a strong influence on the degree of smoothness of the estimated spectral envelope, and consequently
on the quality of the resynthesis (see Fig. 2.26). A commonly accepted operational rule for achieving
reasonable intelligibility of the resynthesized speech is
p = \begin{cases} F_s/1000 + 4, & \text{for voiced speech,} \\ F_s/1000, & \text{for unvoiced speech,} \end{cases}   (2.49)

where F_s is the sampling frequency in Hz, i.e. the prediction order grows with the sampling frequency expressed in kHz (this is the choice made for p in example M-2.14).
Figure 2.27: Block scheme of a LP-based implementation of cross-synthesis (also known as vocoder
effect) between two input sounds s1 and s2 .
excitation which gives the minimum weighted error between the original and the reconstructed speech is
then chosen by the encoder and used to drive the synthesis filter at the decoder. It is this ‘closed-loop’
determination of the excitation which allows these codecs to produce good quality speech at low bit rates,
at the expense of a much higher complexity of the coding stage.
Within this family of analysis-by-synthesis codecs, many different techniques have been developed
for the estimation of the excitation signal. Historically, the first one is the Multi-Pulse Excited (MPE)
codec. Later the Regular-Pulse Excited (RPE) and the Code-Excited Linear Predictive (CELP) codecs
were introduced. The “Global System for Mobile communications” (GSM), a digital mobile radio system
which is extensively used throughout Europe, and also in many other parts of the world, makes use of an
RPE codec.
As an example, in the case of cross-synthesis between speech and music the played instrumental notes
should fit the rhythm of the syllables of the speech: this may be achieved if either speech or music
comes from a prerecorded source and the other sound is produced to match the recording, or if a
performer is both playing the instrument and speaking, thus producing both signals at the same time.
M-2.16
Realize the cross-synthesis effect depicted in Fig. 2.27.
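A possible frame-by-frame realization is sketched below: the spectral envelope estimated from s1 (e.g. a voice) is imposed on the whitened residual of s2 (e.g. an instrument). Equal-length input signals, the lp_coeffs function of M-2.13, and the analysis parameters are all assumptions of this sketch:

N = 1024; Sa = N/2; p = 30; win = hamming(N);
y = zeros(size(s1));
nframes = floor((length(s1) - N)/Sa) + 1;
for m = 1:nframes
  idx = (m-1)*Sa + (1:N);
  [a1, g1] = lp_coeffs(s1(idx).*win, p);    % spectral envelope of sound 1
  a2 = lp_coeffs(s2(idx).*win, p);          % inverse (whitening) filter of sound 2
  e2 = filter(a2, 1, s2(idx).*win);         % residual of sound 2
  y(idx) = y(idx) + filter(g1, a1, e2);     % overlap-add of cross-synthesized frames
end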
Figure 2.29: Example of output signals from a linear and from a non-linear system, in response to a
sinusoidal input; (a) in a linear system the input and output differ in amplitude and phase only; (b) in a
non-linear system they have different spectra.
Fig. 2.29(a)). On the other hand, if the signal is processed through a non-linear system of the form (2.50),
more substantial modifications of the spectrum occur: the output has in general the form
y[n] = \sum_{k=0}^{N} a_k \cos(k ω_0 n),   (2.51)
and therefore the spectrum of y possesses energy at higher harmonics of ω0 (see Fig. 2.29(b)). This
effect is termed harmonic distortion, and can be quantified through the total harmonic distortion (T HD)
parameter:

THD = \sqrt{ \frac{\sum_{k=2}^{N} a_k^2}{\sum_{k=1}^{N} a_k^2} }.   (2.52)
In many cases one wants to minimize the THD in non-linear processing, but in other cases distortion
is exactly what we want in order to enrich an input sound. An example is the effect of valves, as in
amplifiers for electric guitars. There is no way to interpret harmonic distortion in terms of some transfer
function, because the concept of transfer function itself cannot be defined for a non-linear system (equiv-
alently, the impulse response of a non-linear system does not tell anything about its response to a generic
input).
For a memoryless non-linear system, the THD has a straightforward interpretation if one rewrites the
distortion function in terms of its (truncated) Taylor expansion around the origin:
y[n] = F(x[n]) = \sum_{i=0}^{N} a_i\, x^i[n].   (2.53)
Figure 2.30: Example of quadratic distortion; (a) spectrum of a sinusoid x[n] and (b) spectrum of the
squared sinusoid x2 [n].
Consider the effect of the quadratic term in this summation. The spectrum of x2 [n] is the convolution
of the spectrum of x[n] with itself. Therefore, when x[n] = a cos(ω0 n), i.e. x is a sinusoidal signal
with frequency ω0 , the spectrum of x2 [n] is [X ∗ X](ωd ), and thus contains the frequency 2ω0 as well
as the 0 frequency. The same result may be derived looking at the time-domain signal: straightforward
trigonometry shows that the squaring operation on a sinusoidal input signal produces the output signal
y[n] = x^2[n] = \frac{a^2}{2}\left[1 + \cos(2ω_0 n)\right].   (2.54)
Again, one can see that the output signal contains a DC component and the frequency 2ω0 . For a generic
input x[n], the squaring operation will in general double the bandwidth of the spectrum. In particular,
for an input signal x[n] = \sum_k a_k \cos(ω_k n) the squaring operation produces an output signal that con-
tains all the frequencies 2ωk and moreover all the frequencies ωk1 ± ωk2 (the so-called intermodulation
frequencies, which arise from the cross terms in the square of the sum).
Similar considerations apply to higher-order terms of the Taylor expansion: raising a sinusoidal
signal with frequency ω0 to the i-th power will produce a spectrum that contains every other frequency
up to iω0 . In particular, if the Taylor expansion of F contains only odd (or only even) powers, then the
resulting spectrum will contains only odd (or only even) partials.
One specific application of waveshaping is in the generation of pure harmonics of a sinusoid. As
an example suppose that, given the input x[n] = cos(ω0 n) one wants to generate the output y[n] =
cos(5ω0 n), i.e. the fifth harmonic of the input. It is easily verified that waveshaping the input through
the polynomial distortion function F(x) = 16x^5 − 20x^3 + 5x provides the desired result. More
generally, the polynomial that transforms the sinusoid cos(ω_0 n) into the sinusoid cos(iω_0 n) is the i-th
order Chebyshev polynomial.9 By combining Chebyshev polynomials, one can then produce any desired
superposition of harmonic components in the output signal.
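The following short fragment verifies this numerically for the fifth-order case (sample rate and input frequency are arbitrary choices):

Fs = 8000; f0 = 100; n = 0:Fs-1;
x = cos(2*pi*f0*n/Fs);
y = 16*x.^5 - 20*x.^3 + 5*x;               % F(x) = T_5(x), fifth Chebyshev polynomial
err = max(abs(y - cos(2*pi*5*f0*n/Fs)));   % numerically zero: y is the fifth harmonic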
Figure 2.31: Two implementations of a memoryless non-linear system; (a) non-linear processing in-
serted between oversampling and downsampling; (b) non-linear processing on band-limited versions of
the input.
From the discussion above we know that if F (x) is a polynomial, or if its Taylor series expansion
can be truncated, the bandwidth of the output spectrum induced by harmonic distortion remains limited.
Nonetheless it can easily extend beyond f_{Ny}, and consequently cause aliasing in the output signal. Even
though in some musical effects such an additional aliasing distortion can be tolerated,10 in general it has
to be avoided as much as possible.
A first solution is oversampling, illustrated in Fig. 2.31(a). The input is
oversampled and interpolated to a higher sampling frequency, say LF_s with some L > 1. The distortion
function is applied to this oversampled signal. The resulting output can have spectral energy up to the
new Nyquist frequency LFs /2. Finally the signal has to be converted back to the original sampling
frequency: in order to avoid aliasing at this stage the signal is low-pass filtered back to the original
Nyquist frequency.
If F is a polynomial, or can be reasonably approximated by a polynomial through its truncated Taylor
expansion, an alternative procedure that avoids oversampling can be designed. This is illustrated in
Fig. 2.31(b). The input signal is split into several low-pass versions, and each of them is processed
through one term of the polynomial. In this stage no aliasing is generated by construction. Finally the
output signal is constructed as the sum of the processed low-pass versions. This procedure is equivalent
to the preceding one.
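A minimal sketch of the first scheme (Fig. 2.31(a)) is the following; it relies on the resample function of the Octave signal package / Matlab Signal Processing Toolbox, and the function handle F and oversampling factor L are parameters of this hypothetical helper:

function y = distortion_oversampled(x, F, L)
xu = resample(x, L, 1);    % upsample and interpolate to L*Fs
yu = F(xu);                % apply the memoryless non-linearity at the higher rate
y  = resample(yu, 1, L);   % low-pass filter and decimate back to the original Fs

For instance, y = distortion_oversampled(x, @(v) sign(v).*(1-exp(-6*abs(v))), 8) applies the exponential distortion of Eq. (2.55) with an oversampling factor of 8.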
particular, when the input has a decaying amplitude envelope (as in a note played on a guitar) the output
evolves from a nearly square waveform at the beginning to an almost pure sinusoid at the end.
This kind of effect will be well known to almost any guitarist or anyone who has played an instrument
through an overdriven amplifier. In musical terms, overdrive refers to a nearly linear audio effect device
which can be driven into the non-linear region of its distortion curve only by high input levels. The
transition from the operating linear region to the non-linear region is smooth. Distortion instead refers
to a similar effect, with the difference that the device operates mainly in the non-linear region of the
distortion curve.
The sound of a valve amplifier is based on a combination of various factors: the main processing fea-
tures of valves themselves are important, but the amplifier circuit as well as the chassis and loudspeaker
combination have their influence on the final sound. Foot-operated pedal effects have simpler circuitry
but always include a non-linear stage that introduces harmonic distortion on the input signal, in a faster
way and at lower sound levels than valve amplifiers. The simplest digital emulations of overdrive and
distortion effects can be obtained by using a static non-linearity that simulates some form of saturation
and clipping.
M-2.17
Write a function that operates a distortion on a guitar input sound using a static non-linearity.
M-2.17 Solution
function y=distortion(x,dtype,params)
y=zeros(1,length(x));
for i=1:length(x)                       % memoryless non-linearity, applied sample by sample
  if strcmp(dtype,'symm'),      y(i)=symm_overdrive(x(i));
  elseif strcmp(dtype,'asymm'), y(i)=asymm_overdrive(x(i),params(1),params(2));
  elseif strcmp(dtype,'exp'),   y(i)=exp_distortion(x(i));
  end
end
Note that this is a naive implementation, that can potentially introduce aliasing in the output signal.
Let us now examine some specific non-linear functions that can be used to realize these effects.
Symmetric distortion is based on static non-linearities that are odd with respect to the origin, are approx-
imately linear for low input values, and saturate (i.e. progressively decrease their slope) with increasing
input signals. As a consequence, these non-linearities produce a symmetric (with respect to positive and
negative input values) clipping of the signal. A couple of possible parametrizations which have been
proposed in the literature are the following:
F(x) = \begin{cases} 2x, & 0 \le |x| \le 1/3, \\ \mathrm{sgn}(x)\,\frac{3 - (2 - 3|x|)^2}{3}, & 1/3 < |x| \le 2/3, \\ \mathrm{sgn}(x), & 2/3 < |x| \le 1, \end{cases} \qquad\quad F(x) = \mathrm{sgn}(x)\left(1 - e^{-q|x|}\right),   (2.55)
where the parameter q in the second equation controls the amount of clipping (higher values provide
faster saturation). Both functions are shown in Fig. 2.32(a). The first one is claimed to be well suited for
implementing a soft overdrive effect, since it realizes a smooth transition from the linear behaviour for low-
level signals to saturation for high-level sounds, resulting in a warm and smooth sound. The second one
realizes a stronger clipping and is claimed to be more effective for implementing a distortion effect. Note
that, since these functions are odd, their Taylor expansions only contain odd terms and consequently only
odd harmonics are generated.
Figure 2.32: Simulation of overdrive and distortion; (a) soft overdrive and exponential distortion (with
q = 6); (b) asymmetric clipping (with q = −0.2 and d = 8).
Asymmetric overdrive effects (reminiscent of the effect of triode valves in the analog domain)
are based on distortion curves that clip positive and negative input values in different ways. Since the
distortion curve is no longer odd, also even harmonics are generated in this case. A proposal for a
function that simulates asymmetric clipping is
F(x) = \frac{x - q}{1 - e^{-d(x-q)}} + \frac{q}{1 - e^{dq}}.   (2.56)
Note that this function is still linear for small input values (f ′ (x) → 1 and f (x) → 0 for x → 0).
The parameter q scales the range of linear behavior (more negative values increase the linear region of
operation) and d controls the smoothness of the transition to clipping (higher values provide stronger
distortions). A plot of this function is shown in Fig. 2.32(b).
M-2.18
Implement the three functions used in the previous example. For each of them, study the output spectrum when
sinusoidal inputs with various amplitudes are provided.
M-2.18 Solution
function y=symm_overdrive(x)
if abs(x)<1/3,     y=2*x;
elseif abs(x)<2/3, y=sign(x)*(3-(2-3*abs(x))^2)/3;
else               y=sign(x);
end
Figure 2.33: Realization of a non-linear system with memory using the Volterra series (truncated at the
order N ).
(e.g. current-voltage relations at each stage of the circuit) and then solving the system numerically. Here
instead we stick to a signal-based approach and discuss a generalization of the concepts examined so far.
The Volterra series is a model for non-linear systems that can be seen, on one hand, as a generalization
of the Taylor series for a non-linear function and, on the other hand, as a generalization of the impulse
response concept for an LTI system. Given an input signal x[n], the output y[n] of a discrete-time non-
linear time-invariant system can be expanded in Volterra series as
y[n] = \sum_{n_1=0}^{+\infty} h_1[n_1]\, x[n-n_1] + \sum_{n_1=0}^{+\infty}\sum_{n_2=0}^{+\infty} h_2[n_1,n_2]\, x[n-n_1]\, x[n-n_2] + \ldots + \sum_{n_1=0}^{+\infty} \cdots \sum_{n_k=0}^{+\infty} h_k[n_1,\ldots,n_k]\, x[n-n_1] \cdots x[n-n_k] + \ldots   (2.57)
The first term of the series corresponds to usual convolution of an impulse response with the input.
However now higher terms are also present, all of which perform multiple convolutions and therefore
depend in principle on the input at all past instants. Note also that if the multidimensional impulse
responses hk reduce to unit impulses, then the Volterra expansion reduces to a Taylor expansion.
The main advantage in representing a non-linear system through Eq. (2.57) is that various methods
exist for estimating the responses hk from measurements on real systems. If estimates for these responses
are available, then Eq. (2.57) also suggests an implementation scheme, depicted in Fig. 2.33.
On the other hand, this kind of representation is useful only for systems with mild non-linearities:
for highly non-linear systems the Volterra series does not converge quickly enough, and even with many
terms it does not provide a sufficiently accurate representation of the system behavior.
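As an illustration, the following sketch implements a Volterra filter truncated at the second order and with finite memory M (the function name and the kernel format, a row vector h1 and an M-by-M matrix h2, are our own choices):

function y = volterra2(x, h1, h2)
% x and h1 are row vectors; h2 is an M-by-M matrix (second-order kernel)
M = length(h1);
N = length(x);
y = zeros(1, N);
for n = M:N
  xv = x(n:-1:n-M+1);                   % [x(n), x(n-1), ..., x(n-M+1)]
  y(n) = h1*xv.' + xv*h2*xv.';          % linear term + quadratic term of Eq. (2.57)
end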
and, besides their historical relevance, they still offer a wide range of original synthesis and processing
schemes.
Ring modulation is simply the multiplication of two signals, s[n] = x_1[n]\, x_2[n], and its spectrum is the convolution of the two input signal spectra, i.e. S(ω_d) = [X_1 * X_2](ω_d).
Most typically one of the two signals is a sinusoid with frequency ωc , and is called the carrier signal
c[n], while the second signal is the input that will be transformed by the ring modulation and is called
the modulating signal m[n]:

s[n] = c[n]\, m[n], \qquad \text{with } c[n] = \cos(ω_c n + ϕ_c).
This modulation scheme is shown in Fig. 2.34. Note that the resulting modulated signal s[n] is formally
identical to signals with time-varying amplitude examined in other occasions (e.g. sinusoidal oscillators
controlled in amplitude). The only but fundamental difference is that in this case the amplitude signal is
not a “slow” control signal, but varies at audio rate, and consequently it is perceived in a different way:
due to the limited resolution of the human ear, a modulation slower than ∼ 20 Hz will be perceived in
the time-domain as a time-varying amplitude envelope, whereas a modulation faster than that will be
perceived as distinct spectral components. More precisely, the spectrum of s[n] = c[n]m[n] is
S(ω_d) = \frac{1}{2}\left[ M(ω_d - ω_c)\, e^{jϕ_c} + M(ω_d + ω_c)\, e^{-jϕ_c} \right],   (2.60)
i.e. S(ωd ) is composed of two copies of the spectrum of M (ωd ), symmetric around ωc : a lower sideband
(LSB), reversed in frequency, and an upper sideband (USB). When the bandwidth of M (ωd ) extends
beyond ωc , part of the LSB extends to the negative region of the frequency axis, and this part is aliased.
A variant of ring modulation is amplitude modulation:
s[n] = \{1 + α\, m[n]\}\, c[n], \qquad S(ω_d) = C(ω_d) + \frac{α}{2}\left[ M(ω_d - ω_c)\, e^{jϕ_c} + M(ω_d + ω_c)\, e^{-jϕ_c} \right],   (2.61)
where α is the amplitude modulation index. In this case the spectrum S(ωd ) contains also the carrier
spectral line, plus side-bands of the form (2.60). From the expression for S(ωd ) one can see that α
controls the amplitude of the sidebands.
s[n] = \sum_{k=1}^{N} \frac{b_k}{2} \left\{ \cos\left[(ω_c + kω_m)n + ϕ_k\right] - \cos\left[(ω_c - kω_m)n + ϕ_k\right] \right\}.   (2.62)
If ωc − kωm < 0 for some k, then the corresponding spectral line will be aliased around zero. The
resulting spectrum has partials at frequencies | ωc ± kωm | with k = 1, . . . , N , where the absolute value
is used to take into account the possible aliasing around the origin.
Spectra of this kind can be characterized through the ratio ωc /ωm , sometimes also called c/m ratio.
When this ratio is rational (i.e. ωc /ωm = N1 /N2 with N1 , N2 ∈ N and mutually prime), the resulting
sound is periodic: more precisely all partials are multiples of the fundamental frequency ω0 = ωc /N1 =
ωm /N2 and ωc , ωm coincide with the N1 th and N2 th harmonic partial, respectively. As a special case, if
N2 = 1 all the harmonics are present and the components with k < −N1 , i.e. with negative frequency,
overlap some components with positive k. In general the N1 /N2 ratio can be considered as an index
of the harmonicity of the spectrum. The sound spectrum more closely resembles a complete harmonic
spectrum when the N_1/N_2 ratio is simple. The simplest possible c/m ratio is 1/2: in this case the effect
of ring modulation is simply that of producing a sound whose fundamental frequency is half that of the
modulating sound, with a limited distortion of the overall spectral envelope. This is a kind of octave
divider effect.
The c/m ratios can be grouped in families. All ratios of the type | ωc ± kωm | /ωm produce the
same components. As an example, the ratios 2/3, 5/3, 1/3, 4/3, 7/3 and so on produce the same set of
partials, in which only those that are multiples of 3 are missing. As a consequence, one family of ratios
can be identified through a normal form ratio, i.e. the smallest ratio (the normal form ratio in the previous
example is 1/3).
When the ω_c/ω_m ratio is irrational, the resulting sound is inharmonic. This configuration can be
used to create inharmonic sounds, such as bells. As an example, if ω_c/ω_m = 1/\sqrt{2} the sound contains
partials with frequencies |ω_c ± k\sqrt{2}\,ω_c| and no implied fundamental pitch is audible. Of particular interest is
the case of an ωc /ωm ratio approximating a simple rational value, that is,
\frac{ω_c}{ω_m} = \frac{N_1}{N_2} + ϵ, \qquad \text{with } ϵ \ll 1.   (2.63)
In this case the fundamental frequency is still ω_0 = ω_m/N_2, but the partials are shifted from the harmonic
series by ±ϵω_m, so that the spectrum becomes slightly inharmonic. A small shift of ω_c does not change
the pitch, but it slightly spreads the partials and makes the sound more lively.
We have already seen in Chapter Fundamentals of digital audio processing how to compute the signal phase ϕ[n]
when the instantaneous frequency f0 [n] is varying at frame rate. We now face the problem of computing
ϕ[n] when the instantaneous frequency varies at audio rate. A way of approximating ϕ[n] is through a
first-order expansion.
Recalling that, in continuous time, phase and instantaneous frequency are related through 2πf0 (t) =
dϕ/dt(t) (see Chapter Fundamentals of digital audio processing), we can approximate this relation over two con-
secutive discrete time instants as
\frac{dϕ}{dt}\big((n-1)T_s\big) \sim 2π\, \frac{f_0(nT_s) + f_0((n-1)T_s)}{2},   (2.64)
i.e. the phase derivative is approximated as the average of the instantaneous frequency at two consecutive
instants. Using this approximation, a first-order expansion of the phase can be approximated in discrete
time as
ϕ[n] = ϕ[n-1] + \frac{π}{F_s}\left( f_0[n] + f_0[n-1] \right).   (2.65)
M-2.19
Write a function that realizes a frequency-modulated sinusoidal oscillator, with input parameters t0 (initial time),
a (frame-rate amplitude vector), f (audio signal representing the instantaneous frequency vector), and ph0 (initial
phase).
M-2.19 Solution
function s=fm_osc(t0,a,f,ph0)
Fs=44100; SpF=512;            % assumed sample rate and samples-per-frame (control rate)
nframes=length(a);            % one amplitude value per control frame
s=zeros(1,nframes*SpF); lastph=ph0; lastf=f(1);   % output vector and oscillator state
for (i=1:nframes)
  phase=zeros(1,SpF); %phase vector in a frame
  for(k=1:SpF) % work at sample rate
    phase(k)=lastph + pi/Fs*(f((i-1)*SpF+k)+lastf); %compute phase, Eq. (2.65)
    lastph=phase(k); lastf=f((i-1)*SpF+k); %save last values
  end
  s(((i-1)*SpF+1):i*SpF)=a(i).*sin(phase);
end
s=[zeros(1,round(t0*Fs)) s]; %add initial silence of t0 sec.
Compare this function with the sinosc function discussed in Chapter Fundamentals of digital audio
processing. The only difference is that in this case the frequency is given at audio rate. Consequently
the phase computation differs.
Although early realizations of FM synthesis were implemented in this fashion, in the remainder of
this section we will follow an equivalent “phase-modulation” formulation, according to which the FM
oscillator is written as:
s[n] = a[n] · sin (ωc [n]n + ϕ[n]) , (2.66)
Figure 2.35: Simple modulation: (a) block scheme; (b) the first 10 Bessel functions; (c) spectra produced
by simple modulation with ωc = 2π700 Hz, fm = 2π100 Hz, and I varying from 0.5 to 3 (for the sake
of clarity each spectrum is plotted with a different offset).
where a[n] is the (frame rate) amplitude signal, ωc [n] is the (frame rate) carrier frequency, and ϕ[n] is
the (audio rate) modulating signal. In this case the iterative computation used in the example M-2.19
can be substituted by the following:
φ[n] = φ[n-1] + ω_c[n] + ϕ[n], \qquad y[n] = a[n] \cdot \sin(φ[n]),   (2.67)
where φ[n] is a state variable representing the instantaneous phase of the oscillator.
In the simplest scheme (simple modulation, Fig. 2.35(a)) the modulating signal is a single sinusoid, ϕ[n] = I[n] \sin(ω_m[n]\, n), where I[n] is the modulation index. The resulting signal can be expanded as

s[n] = a[n] \sum_{k=-\infty}^{+\infty} J_k(I[n]) \sin\left\{ (ω_c[n] + kω_m[n])\, n \right\},   (2.69)

where J_k(I[n]) is the k-th order Bessel function of the first kind, evaluated in the point I[n]. From
Eq. (2.69) we can see that the spectrum has partials at frequencies | ωc ± kωm | (as already discussed for
ring modulation, negative frequencies are aliased around the origin). Each partial has amplitude Jk (I): a
plot of the first Bessel functions is shown in Fig. 2.35(b), from which one can see that partial amplitudes
are modulated in a very complex fashion when the modulation index I is varied.
Note that an infinite number of partials is generated, so that the signal bandwidth is not limited.
In practice however only a few low-order Bessel functions take significantly non-null values for small
values of I. As I increases, the number of significantly non-null Bessel functions increases too. A way of
characterizing the bandwidth of s[n] is by saying that the number M of lateral spectral lines | ωc ± kωm |
that are greater than 1/100 of the unmodulated signal is given by M(I) = I + 2.4 · I^{0.27}: therefore
M(I) ∼ I for not too small values of I, and the bandwidth around ω_c is approximately 2Iω_m. Manipulation
of the modulation index produces an effect similar to low-pass filtering with varying cut-off frequency,
and with smooth variation of the amplitude of partials. Figure 2.35(c) shows the spectra produced by
simple modulation, with varying modulation index values: as the index increases, the energy of the
carrier frequency is progressively transferred to the lateral bands, according to the predicted behaviour.
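The following fragment reproduces the setup of Fig. 2.35(c) with constant control parameters (only the modulation index I needs to be changed to obtain the different spectra):

Fs = 44100; n = 0:Fs-1;                       % one second of signal
fc = 700; fm = 100; I = 2;                    % carrier, modulator, modulation index
s = sin(2*pi*fc*n/Fs + I*sin(2*pi*fm*n/Fs));  % simple (phase) modulation
S = abs(fft(s));                              % inspect the magnitude spectrum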
where the integers k1 , . . . , kN all vary between −∞ and +∞. Therefore s[n] possesses all the partials
with frequencies | ωc ± k1 ωm,1 ± · · · ± kN ωm,N | with amplitudes given by the product of N Bessel
functions. If the ratios between the ωm,i s are sufficiently simple, then the spectrum is again of the type
| ωc ± kωm |. Otherwise the spectrum is highly inharmonic (and takes a noisy character for high index
values).
M-2.20
Synthesize a frequency modulated sinusoid in the case of compound modulation, and study the signal spectra
when control parameters are varied.
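A possible starting point for this example is the fragment below; the two modulating frequencies and indices are illustrative values, to be varied in the study:

Fs = 44100; n = 0:Fs-1;
fc = 700; fm1 = 100; fm2 = 30; I1 = 2; I2 = 1.5;        % illustrative parameters
phi = I1*sin(2*pi*fm1*n/Fs) + I2*sin(2*pi*fm2*n/Fs);    % sum of two sinusoidal modulators
s = sin(2*pi*fc*n/Fs + phi);                            % compound modulation
S = abs(fft(s));                                        % inspect the magnitude spectrum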
A more complex FM scheme is nested modulation (shown in Fig. 2.36(b)) , in which a sinusoidal
modulator is itself modulated by a second one, i.e. ϕ[n] = I1 [n] sin [ωm,1 [n]n + I2 [n] sin(ωm,2 [n]n)].
In this case the resulting signal is
s[n] = a[n] \sin\left\{ ω_c[n] n + I_1[n] \sin\left[ ω_{m,1}[n] n + I_2[n] \sin(ω_{m,2}[n] n) \right] \right\}
     = a[n] \sum_{k=-\infty}^{+\infty} \sum_{n=-\infty}^{+\infty} J_k(I_1[n])\, J_n(k I_2[n]) \sin\left\{ (ω_c[n] + kω_{m,1}[n] + nω_{m,2}[n])\, n \right\}.   (2.71)
The result can be interpreted as if each partial produced by the modulating frequency ωm,1 were mod-
ulated by ωm,2 with modulation index kI2 . The spectral structure is similar to that produced by two
sinusoidal modulators, but with larger bandwidth.
The last FM scheme that we examine is feedback modulation (shown in Fig. 2.36(c)), in which past
values of the output signal are used as a modulating signal, i.e. ϕ[n] = βs[n − n0 ]. If n0 = 1, the
modulated signal is
s[n] = a[n] \sin\left( ω_c[n] n + β s[n-1] \right) = a[n] \sum_{k=-\infty}^{+\infty} \frac{2}{kβ} J_k(kβ) \sin(k ω_c[n] n),   (2.72)
and β (called the feedback factor) acts as a scale factor or feedback modulation index. For increasing
values of β the resulting signal is periodic of frequency ωc and changes smoothly from a sinusoid to a
sawtooth waveform. Moreover one may vary the delay n0 in the feedback, and observe emergence of
chaotic behaviors for suitable combinations of the parameters n0 and β.
(a) (b) (c)
Figure 2.36: Basic FM schemes; (a) compound modulation, (b) nested modulation, and (c) feedback
modulation.
1980s, mainly by Miller Puckette, and is today represented in three main software implementations:
Max/MSP, jMax, and Pd. The “Max paradigm” (so named in honor of Max Mathews) is described by
Puckette [Puckette, 2002] as a way of combining pre-designed building blocks into sound-processing
“patches”, to be used in real-time settings. This includes a scheduling protocol for both control- and
audio-rate computations, modularization and component intercommunication, and a graphical environ-
ment to represent and edit patches.
About sampling and wavetable synthesis. Contemporary music synthesizers are still based on these
techniques, and allow ever increasing quality thanks to the ever increasing availability of storage ca-
pacity. A multi-sampled instrument can occupy several GB. From the point of view of music history,
these techniques are rooted in several works from the ’50s, especially by the composer Pierre Schaeffer and
coworkers, who experimented with the use of recorded environmental sounds as sonic material in their
compositions. This approach to musical composition has been termed musique concrète.
About granular synthesis. The scientific foundations of these approaches can be found in the work
of the Hungarian physicist Dennis Gabor (see [Gabor, 1947]). The composer Iannis Xenakis developed
this method in the field of analog electronic music. Starting from Gabor's theory, Xenakis suggested
a compositional method based on the organization of the grains by means of screen sequences, which
specify frequency and amplitude parameters of the grains at discrete points in time. In this way a common
conceptual approach is used both for micro and macro musical structure: “All sound, even continuous
musical variation, is conceived as an assemblage of a large number of elementary sounds adequately
disposed in time. In the attack, body and decline of a complex sound, thousands of pure sounds appear
in a more or less short time interval of time ∆t” [Xenakis, 1992].
The most widely treated case is asynchronous granular synthesis, where simple grains are dis-
tributed irregularly. A classic introduction to the topic is [Roads, 1991]. In particular, figure 2.4 in this
chapter is based on an analogous figure in [Roads, 1991]. In another classic work, Truax [Truax, 1988]
describes the granulation of recorded waveforms.
About recent corpus-based concatenative synthesis techniques. A review is provided in [Schwarz,
2007].
About overlap-add techniques. The pitch-synchronous overlap-add algorithm for time-stretching was
introduced by Moulines and Charpentier [1990] in the context of speech processing applications.
Additive synthesis was one of the first sound modeling techniques adopted in computer music and has
been extensively used in speech applications as well. The main ideas of the synthesis by analysis tech-
niques that we have reviewed date back to the work by McAulay and Quatieri [McAulay and Quatieri,
1986]. In the same period, Smith and Serra started working on “sines-plus-noise” representations, usu-
ally termed SMS (Spectral Modeling Synthesis) by Serra. A very complete coverage of the topic is
provided in [Serra, 1997]. The extension of the additive approach to a “sines-plus-transients-plus-noise”
representation is more recent, and has been proposed by Verma and Meng [Verma and Meng, 2000].
Subtractive synthesis techniques became extremely popular in the 1960’s and 1970’s, with the advent
of analog voltage controlled synthesizers. The Moog synthesizers were especially successful and were
based on a range of signal generators, filters, and control modules, which could be easily interconnected
to each other. The central component was the voltage-controlled oscillator (VCO), which could produce
a variety of waveforms and could be connected to other modules such as voltage-controlled amplifiers
(VCA), voltage-controlled filters (VCF), envelope generators, and other devices. Moog’s innovations
were first presented in [Moog, 1965]. Several techniques for antialiasing digital oscillators, to be used in
digital emulation of analog subtractive synthesis, are discussed in [Välimäki and Huovilainen, 2007].
A tutorial about filter design techniques, including normalization approaches that use L_1, L_2, and
L_∞ norms of the amplitude response, is [Dutilleux, 1998]. Introductions to formant speech synthesis and
linear prediction techniques and their applications in speech technology can be found in many textbooks.
See e.g. [Rabiner and Schafer, 1978] (our Fig. 2.21 is based on a similar figure in this book). Another
useful reference on the topic is [Deller et al., 1993]. One technique that is alternative to linear prediction
and widely used is digital all-pole modeling (DAP) [El-Jaroudi and Makhoul, 1991].
The use of frequency modulation as a sound synthesis algorithm was first explored by Chown-
ing [1973] (later reprinted in [Roads and Strawn, 1985]), although these techniques had already been
used for decades in electrical communications. While performing experiments on different extents of
vibrato applied to simple oscillators, Chowning realized that when vibrato rates entered the audio range,
dramatic timbral changes were produced. Soon after, FM became very popular and was also applied
to the simulation of real sounds: see [Schottstaedt, 1977] (also reprinted in [Roads and Strawn, 1985]).
Our example of a synthetic piano tone at the end of Sec. 2.6.3 is taken from this latter work. The FM
algorithms used for the DX7 synth are discussed at length in [Chowning and Bristow, 1986].
References
John Chowning. The synthesis of complex audio spectra by means of Frequency Modulation. J. Audio Engin. Soc., 21(7),
1973.
John Chowning and David Bristow. FM Theory and applications. Yamaha Music Foundation, Tokyo, 1986.
John R Deller, John G. Proakis, and John. H.L. Hansen. Discrete-Time Processing of Speech Signals. Macmillan, New York,
1993.
P. Dutilleux. Filters, Delays, Modulations and Demodulations: A Tutorial. In Proc. COST-G6 Conf. Digital Audio Effects
(DAFx-98), pages 4–11, Barcelona, 1998.
Amro El-Jaroudi and John Makhoul. Discrete all-pole modeling. IEEE Trans. Sig. Process., 39:411–423, Feb. 1991.
Dennis Gabor. Acoustical quanta and the theory of hearing. Nature, 159(4044):591–594, 1947.
R. McAulay and T. F. Quatieri. Speech Analysis/Synthesis Based on a Sinusoidal Speech Model. IEEE Trans. Acoust., Speech,
and Sig. Process., 34:744–754, 1986.
Robert A. Moog. Voltage-controlled electronic music modules. J. Audio Eng. Soc., 13(3):200–206, July 1965.
E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text to speech synthesis using
diphones. Speech Communication, 9(5/6):453–467, 1990.
L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ, 1978.
C. Roads. Asynchronous granular synthesis. In G. De Poli, A. Piccialli, and C. Roads, editors, Representations of Musical
Signals, pages 143–186. MIT Press, 1991.
C. Roads and J. Strawn, editors. Foundations of Computer Music. MIT Press, 1985.
William Schottstaedt. The simulation of natural instrument tones using frequency modulation with a complex modulating wave.
Computer Music J., 1(4):46–50, 1977.
Diemo Schwarz. Corpus-based concatenative synthesis. IEEE Signal Processing Magazine, pages 92–104, Mar. 2007.
X. Serra. Musical sound modeling with sinusoids plus noise. In C. Roads, S. Pope, A. Piccialli, and G. De Poli, editors, Musical
Signal Processing, pages 91–122. Swets & Zeitlinger, 1997. http://www.iua.upf.es/∼xserra/articles/msm/.
B. Truax. Real-time granular synthesis with a digital signal processor. Computer Music J., 12(2):14–26, 1988.
Vesa Välimäki and Antti Huovilainen. Antialiasing oscillators in subtractive synthesis. IEEE Signal Processing Magazine, 24
(2):116–125, Mar. 2007.
B. Vercoe. Csound: A manual for the audio processing system and supporting programs with tutorials. Technical report, Media
Lab, M.I.T., Cambridge, Massachusetts, 1993. Software and Manuals available from ftp://ftp.maths.bath.ac.uk/pub/dream/.
T. S. Verma and T. H. Y. Meng. Extending Spectral Modeling Synthesis with Transient Modeling Synthesis. Computer Music
J., 24(2):47–59, 2000.
Iannis Xenakis. Formalized music: Thought and Mathematics in Composition. Pendragon Press, Stuyvesant, NY, 1992.