A Simple LPC Vocoder

Bob Beauchaine
EE586, Spring 2004
Least squares models of the human vocal tract consider the speech production process to be sufficiently well modeled provided certain conditions are met by the input signal. This does not imply that the model is a particularly “good” fit to the real data – just that it is the best fit that can be achieved for the chosen model order.

Vocal Tract Modeling

In the broadest of terms, speech can be broken down into two major categories. The first type of speech is called voiced speech.
It is produced by the periodic excitation of the vocal tract at the larynx via vibration of the vocal cords, excited by the passage of air expelled from the lungs (this excitation source is generally referred to as a glottal pulse). Voiced speech is characterized by a pseudo-stationary primary frequency (called a formant) plus harmonics. Voiced speech is typically associated with sounds produced by an “open” vocal tract – in English, vowels provide the best example of voiced speech.
The other primary form of speech is called unvoiced. Unvoiced speech is produced by turbulence when air is forced through a constriction in the vocal tract – by pursed lips, tongue against teeth, or other combinations. Unvoiced speech closely resembles random noise, showing little periodicity and little correlation between samples. Of course, nothing as complex as speech production can be reduced to a binary model without some compromise. Much more robust models segment speech into many more categories – sonorants, voiced consonants, nasals, semi-vowels, fricatives – all of which possess quantifiable differences that could in theory be used to improve LPC-based speech analysis and production, but which in reality only manage to complicate the analysis.

Figure 2 shows a comparison of the spectrum of a segment of voiced and unvoiced speech (above) and the impulse response of a 12th order LPC filter predicted from that speech (below). A cursory analysis shows that the fine details of the speech are clearly lost, but that the general envelope of the spectrum is retained. It is this ability of the LPC model to track the gross spectral characteristics of the human vocal tract that makes it such a successful speech coding model.
Figure 2 - LPC modeling of voiced/unvoiced speech
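To make this envelope-tracking behavior concrete, here is a minimal numpy sketch (mine, not the paper's code) that fits a 12th order model to a synthetic frame by the autocorrelation method and locates the peak of the resulting all-pole envelope; the test frame and every name in it are illustrative:

    import numpy as np

    def lpc_autocorrelation(frame, order):
        """Fit an all-pole (LPC) model by the autocorrelation method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        return np.linalg.solve(R, r[1:order + 1])    # predictor taps a_1..a_N

    fs = 10000
    rng = np.random.default_rng(0)
    n = np.arange(300)                               # one 30 ms frame at 10 kHz
    frame = (np.sin(2 * np.pi * 120 * n / fs)        # 120 Hz "fundamental"
             + 0.4 * np.sin(2 * np.pi * 240 * n / fs)
             + 0.01 * rng.standard_normal(300)) * np.hamming(300)

    a = lpc_autocorrelation(frame, order=12)

    # LPC spectral envelope |1/A(e^jw)| versus the raw FFT magnitude spectrum.
    A = np.fft.rfft(np.concatenate(([1.0], -a)), 1024)
    envelope = 1.0 / np.abs(A)
    spectrum = np.abs(np.fft.rfft(frame, 1024))
    freqs = np.fft.rfftfreq(1024, d=1.0 / fs)
    print("envelope peak near", freqs[np.argmax(envelope)], "Hz")
    print("spectrum peak near", freqs[np.argmax(spectrum)], "Hz")

The fine harmonic structure of the frame is absent from the envelope, which is exactly the behavior Figure 2 illustrates.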
LPC Vocoder

Now that we have a rudimentary understanding of speech production, we may finally discuss the basic LPC speech encoder/decoder system (see Figure 3 – LPC encoder block diagram). First, speech is sampled at a frequency appropriate to capture all of the frequency components important for processing and recognition. For voice transmission, 10 kHz is typically the sampling frequency of choice, though 8 kHz is not unusual. This is because, for almost all speakers, all significant speech energy is contained in those frequencies below 4 kHz (although some women and children violate this assumption). The speech is then segmented into blocks for processing. Simple LPC analysis uses equal-length blocks of between 10 and 30 ms. Less than 10 ms does not encompass a full period of some low-frequency voiced sounds for male speakers – my own experiments with male speech sounded synthetic at 10 ms sample windows when, for certain frames, pitch detection became impossible. More than 30 ms violates the basic principle of stationarity upon which the least squares method relies.
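As a sketch of this segmentation step (my own illustration; the function name and the 20 ms choice are mine, simply a value inside the 10-30 ms range recommended above):

    import numpy as np

    def frame_signal(x, fs, frame_ms=20.0):
        """Split sampled speech into equal-length, non-overlapping analysis frames."""
        frame_len = int(fs * frame_ms / 1000.0)
        n_frames = len(x) // frame_len
        return x[:n_frames * frame_len].reshape(n_frames, frame_len)

    fs = 10000                          # 10 kHz sampling, as suggested above
    x = np.random.randn(fs)             # one second of stand-in "speech"
    print(frame_signal(x, fs).shape)    # (50, 200): fifty 20 ms frames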
Once segmented, the speech type is determined as either voiced, unvoiced, or silence. This simple-sounding task is indeed the most difficult issue in LPC vocoders – the past 25 years have seen dozens of papers published on varying methods of accomplishing this feat reliably across speakers and environments.
If the speech is classified as voiced, then its pitch must be determined. This is because the human ear, while tolerant of many errors in speech coding, is somewhat sensitive to errors in pitch: speech produced with pitch errors is quite annoying and typically sounds synthetic. Once the type of speech is determined, the LPC parameters are estimated. Crucial to the speech production model is the prediction order – the number of taps used in the estimation process. Considerable analysis of the prediction error has appeared in the literature: it has been shown that the normalized error drops steeply for prediction orders from 0 to 4, then decreases gradually up to order 12, thereafter flattening out into a region of diminishing returns [1]. Thus, since LPC systems are almost universally used in low bit-rate applications, and since a prediction order greater than 14 produces little improvement, LPC coders most often use a model of order 10-14.

[1] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, p. 427.
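The cited error curve is easy to reproduce qualitatively. The sketch below (an illustration under my own assumptions, run on a synthetic frame) computes the normalized minimum prediction error for several orders via the autocorrelation method:

    import numpy as np

    def normalized_error(frame, order):
        """Normalized minimum prediction error, E_N / r(0), autocorrelation method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        return (r[0] - np.dot(a, r[1:order + 1])) / r[0]

    rng = np.random.default_rng(0)
    n = np.arange(300)
    frame = np.sin(2 * np.pi * 0.03 * n) + 0.2 * rng.standard_normal(300)
    for N in (1, 2, 4, 8, 12, 16):
        print(f"order {N:2d}: normalized error {normalized_error(frame, N):.4f}")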
Figure 4 - LPC decoder

In reality, the determination of LPC parameters can be performed at any point in the analysis process, and in fact some V/U/S detectors make use of the LPC parameters. Once the LPC parameters are determined, the energy of the signal segment is determined. This is required at the synthesis end to equalize energy levels of the synthesized speech. Again, the energy content of the signal is, as we shall see, a very powerful discriminator of speech type.
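The per-frame energy measurement itself is nearly a one-liner; this sketch (mine; the dB floor and any thresholds one might hang off it are hypothetical, not values from the paper) computes the log energy that would be transmitted as the gain term:

    import numpy as np

    def frame_energy_db(frames):
        """Short-time log energy of each analysis frame, in dB."""
        return 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

    frames = np.random.randn(50, 200)     # stand-in frames (50 x 20 ms @ 10 kHz)
    e = frame_energy_db(frames)
    print(f"energy range: {e.min():.1f} to {e.max():.1f} dB")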
At the end of this process, we have a model of the speech segment that captures its type, energy content, pitch, and LPC parameters. If it is not clear at this point, it should be stated that, for an LPC vocoder, no portion of the actual speech waveform is transmitted. The model is sufficient for the receiver to synthesize speech at the receiver end that can be a remarkable facsimile of the original.
This model is obtained by least squares: the LP coefficients a_i minimize the mean squared prediction error of the frame,

    \varepsilon = \frac{1}{M} \sum_{k=-\infty}^{\infty} \Big( s(k) - \sum_{i=1}^{N} a_i \, s(k-i) \Big)^2 ,

and setting the derivatives with respect to the a_i to zero yields the normal equations. In the autocorrelation formulation these are

    \sum_{i=1}^{N} a_i \left[ \frac{1}{M} \sum_{k} s(k-i)\, s(k-j) \right] = \frac{1}{M} \sum_{k} s(k)\, s(k-j), \qquad j = 1, \dots, N,

while the covariance formulation gives

    \sum_{i=1}^{N} a_i \left[ \frac{1}{M} \sum_{m} s(m)\, s(m+j-i) \right] = \frac{1}{M} \sum_{m} s(m)\, s(m+j), \qquad j = 1, \dots, N.

Either way, the system can be written compactly in matrix form as RA = C [2].

[2] Jerry D. Gibson et al., Digital Compression for Multimedia.
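In code, the autocorrelation version of RA = C is a small linear solve. This is a minimal numpy sketch of mine, not the paper's implementation:

    import numpy as np

    def solve_normal_equations(frame, order):
        """Build R (Toeplitz) and C from the frame autocorrelation, solve R A = C."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        C = r[1:order + 1]
        return np.linalg.solve(R, C)

    frame = np.sin(2 * np.pi * 0.04 * np.arange(240)) \
            + 0.01 * np.random.default_rng(0).standard_normal(240)
    print(solve_normal_equations(frame, order=4))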
The fundamental difference between these two representations is the form of the covariance or correlation matrix. In the autocorrelation system, the R matrix is Toeplitz; in the covariance representation, it is not. Toeplitz matrices lend themselves to a particularly elegant, efficient, and recursive solution known as the Durbin or Levinson-Durbin algorithm [3]. Durbin's recursion computes the LP coefficients one at a time in terms of the previous coefficients. While the covariance coefficients can be calculated somewhat more efficiently than by brute-force matrix inversion through Cholesky decomposition [4], the Levinson-Durbin algorithm has a decided advantage.

[3] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, p. 418.
[4] Ibid.
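A sketch of the recursion (my own rendering of the standard algorithm, checked against the direct solve above; the variable names are mine):

    import numpy as np

    def levinson_durbin(r, order):
        """Solve the Toeplitz normal equations one order at a time."""
        a = np.zeros(0)                 # predictor coefficients, grown per order
        err = r[0]                      # zeroth-order prediction error
        for m in range(1, order + 1):
            k = (r[m] - np.dot(a, r[m - 1:0:-1])) / err   # reflection coefficient
            a = np.concatenate((a - k * a[::-1], [k]))    # order-update step
            err *= 1.0 - k * k                            # error shrinks each order
        return a

    rng = np.random.default_rng(0)
    frame = np.sin(2 * np.pi * 0.04 * np.arange(240)) + 0.01 * rng.standard_normal(240)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    a_fast = levinson_durbin(r, 10)
    R = np.array([[r[abs(i - j)] for j in range(10)] for i in range(10)])
    print(np.allclose(a_fast, np.linalg.solve(R, r[1:11])))   # True

Each pass of the loop costs O(m) work, so the full solve is O(N^2) rather than the O(N^3) of general matrix inversion, which is the decided advantage referred to above.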
As we have seen, speech type determination is the most important, and single most difficult, part of the LPC vocoder process. On the surface, the determination problem would not seem to be all that difficult: a quick survey of the characteristics of voiced speech versus unvoiced speech would lead one to expect that the two are easily separable. Let's take such a survey. First, let's compare the energy content of voiced and unvoiced speech segments.

Another metric widely used in speech type analysis is the zero crossing rate.

Figure 6 - Zero crossing rate, utterance of the word “six”

In voiced speech, with its characteristically lower frequency content, there is usually a strong correlation from one sample to the next. Unvoiced speech, being much closer to random, does not have this property, and the difference manifests itself in properties of the LP coefficients, most notably the first LP coefficient. Consider Figure 7, which shows the first LP coefficient calculated for segments of the utterance of “six” we have been using. Notice how the first coefficient clusters around the value 1 for the unvoiced segments.

Figure 10 - Speech type estimators, cont'd
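These survey metrics are each a few lines of numpy. The sketch below (my illustration on synthetic frames; sign conventions for the first LP coefficient vary between texts, so only the separation between the two frame types matters here) computes the zero crossing rate and the first-order predictor coefficient for a low-frequency frame and a noise-like frame:

    import numpy as np

    def zero_crossing_rate(frame):
        """Fraction of adjacent sample pairs that differ in sign."""
        return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

    def first_lp_coefficient(frame):
        """First-order predictor tap: the lag-1 autocorrelation ratio r(1)/r(0)."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        return float(r[1] / r[0])

    n = np.arange(200)
    voiced_like = np.sin(2 * np.pi * 0.02 * n)                 # strongly correlated
    unvoiced_like = np.random.default_rng(1).standard_normal(200)
    for name, f in (("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)):
        print(f"{name:13s} ZCR={zero_crossing_rate(f):.3f} "
              f"a1={first_lp_coefficient(f):.3f}")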
Theoretically, pitch detection is straightforward: a Fourier transform of the frame can typically pick out the fundamental frequency for any given frame. The problem, of course, is the processing power required.
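A naive transform-based detector looks like the sketch below (mine; the search band and frame length are assumptions, and a real detector must also guard against picking a harmonic instead of the fundamental):

    import numpy as np

    def pitch_by_fft(frame, fs, fmin=50.0, fmax=400.0):
        """Return the strongest spectral peak inside a plausible pitch band."""
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 4096))
        freqs = np.fft.rfftfreq(4096, d=1.0 / fs)
        band = (freqs >= fmin) & (freqs <= fmax)
        return freqs[band][np.argmax(spectrum[band])]

    fs = 10000
    n = np.arange(int(0.03 * fs))                       # one 30 ms frame
    frame = np.sign(np.sin(2 * np.pi * 120 * n / fs))   # 120 Hz voiced stand-in
    print(f"{pitch_by_fft(frame, fs):.1f} Hz")          # close to 120 Hz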
A cheaper route runs a time-domain analysis on center-clipped speech. In center-clipped speech, a variable clipping level C_L is determined, usually as some fixed fraction of the frame's peak amplitude (an example is provided in the presentation). The clipper maps the input as

    y(n) = C[x(n)] =
    \begin{cases}
    x(n) - C_L, & x(n) > C_L \\
    0, & |x(n)| \le C_L \\
    x(n) + C_L, & x(n) < -C_L
    \end{cases}

Figure 11 - Center clipping
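The clipper itself, as a minimal sketch (the 30% fraction is a common textbook choice, not a value taken from this paper):

    import numpy as np

    def center_clip(x, fraction=0.3):
        """Zero everything within +/-C_L; shift the rest toward zero by C_L."""
        cl = fraction * np.max(np.abs(x))
        y = np.zeros_like(x)
        y[x > cl] = x[x > cl] - cl
        y[x < -cl] = x[x < -cl] + cl
        return y

    x = np.sin(2 * np.pi * 0.01 * np.arange(400))
    y = center_clip(x)
    print(np.count_nonzero(y), "of", len(y), "samples survive clipping")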
LPC decoder block diagram (Figure 4): excitation selection (periodic pulses or a random noise generator), inverse LPC filter, de-emphasis 1/(1 - 0.9z^-1)
[8] Wai C. Chu, Speech Coding Algorithms, p. 133.
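Putting the decoder together, here is a numpy-only sketch of one synthesis frame (my own rendering of the block diagram above; the pitch period, gain handling, and the 0.9 de-emphasis pole follow the labels recovered from the figure, and everything else is an assumption):

    import numpy as np

    def all_pole(coeffs, x):
        """y(n) = x(n) + sum_i coeffs[i-1] * y(n-i): direct-form 1/A(z) filter."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            acc = x[n]
            for i, c in enumerate(coeffs, start=1):
                if n - i >= 0:
                    acc += c * y[n - i]
            y[n] = acc
        return y

    def synthesize_frame(a, gain, pitch_period, frame_len, voiced):
        if voiced:
            e = np.zeros(frame_len)
            e[::pitch_period] = 1.0               # periodic glottal pulses
        else:
            e = np.random.default_rng(0).standard_normal(frame_len)
        y = all_pole(a, e)                        # inverse LPC (synthesis) filter
        y *= gain / (np.sqrt(np.mean(y ** 2)) + 1e-12)   # restore frame energy
        return all_pole([0.9], y)                 # de-emphasis, 1/(1 - 0.9 z^-1)

    print(synthesize_frame(np.array([0.5]), 1.0, 80, 200, voiced=True).shape)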
Some encoders adapt the segmentation itself, placing frame boundaries at speech boundaries so that each frame contains only a single type of speech. I have no working experience with this type of encoder, and cannot attest to its worth, but it has apparently found its way into commercial products.
The binary speech type decision also causes problems for LPC. Certain kinds of sounds in speech do not fall cleanly into either the voiced or unvoiced category – these include sounds like the letter “z” and some of the nasals. These sounds show a noticeable mix of excitations, appearing as noisy periodic signals. Handling this problem in LPC has created a class of mixed excitation coders that use both periodic and noise components, suitably balanced, as the excitation source for the synthesizer.
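A sketch of such a blended source (my own illustration; the voicing fraction and the unit-RMS normalization are assumptions, not details taken from the coders mentioned):

    import numpy as np

    def mixed_excitation(frame_len, pitch_period, voicing, seed=0):
        """Blend periodic pulses and noise by a 0-1 voicing fraction."""
        pulses = np.zeros(frame_len)
        pulses[::pitch_period] = 1.0
        noise = np.random.default_rng(seed).standard_normal(frame_len)
        pulses /= np.sqrt(np.mean(pulses ** 2))   # unit-RMS periodic component
        noise /= np.sqrt(np.mean(noise ** 2))     # unit-RMS noise component
        return voicing * pulses + (1.0 - voicing) * noise

    e = mixed_excitation(frame_len=200, pitch_period=80, voicing=0.6)
    print(f"mixed excitation RMS: {np.sqrt(np.mean(e ** 2)):.3f}")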