Speech Signal Processing
Lizy Abraham, Assistant Professor, Department of ECE, LBS Institute of Technology for Women (A Govt. of Kerala Undertaking), Poojappura, Trivandrum 695012, Kerala, India. [email protected], +919495123331
Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification).
Speech Analysis: Short-time speech analysis. Time-domain analysis (short-time energy, short-time zero-crossing rate, ACF). Frequency-domain analysis (filter banks, STFT, spectrogram, formant estimation and analysis). Cepstral analysis.
Parametric Representation of Speech: AR model, ARMA model. LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form). LSF, LAR, MFCC, sinusoidal model, GMM, HMM.
Speech Coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector quantization based coders, CELP.
Speech Processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.
REFERENCES
1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, 2nd edition, IEEE Press, 1999. ISBN 0780334493.
2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999. ISBN 0471351547.
3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, 1st edition, Prentice Hall. ISBN 013242942X.
6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999. ISBN 0471349593.
For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, of which any five shall be answered; it shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) / term project.
Algorithms (Programming)
Acoustics
Information Theory
Phonetics
Speech can be defined as an acoustic pressure signal that is articulated in the vocal tract.
Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.
This air flow is referred to as the excitation signal. The excitation causes the vocal cords to vibrate and propagates energy to excite the oral and nasal openings, which play a major role in shaping the sound produced. Vocal tract components: the oral tract (from the lips to the vocal cords) and the nasal tract (from the velum to the nostrils).
Larynx: the source of speech. Vocal cords (folds): the two folds of tissue in the larynx; they can open and shut like a pair of fans. Glottis: the gap between the vocal cords. As air is forced through the glottis the vocal cords start to vibrate and modulate the air flow. The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).
Places of articulation
Unvoiced sounds
Produced by forcing air at high velocity through a constriction, giving noise-like turbulence. These sounds show little long-term periodicity, though short-term correlations are still present. E.g., /s/, /f/.
Plosive sounds
Produced by a complete closure of the vocal tract; air pressure builds up behind the closure and is released suddenly. E.g., /b/, /p/.
Speech Model
SPEECH SOUNDS
Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme; allophones are context-dependent realizations of a phoneme.
PHONEME HIERARCHY
Speech sounds are language dependent; there are about 50 in English.
Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Consonants:
Nasal: m, n, ng
Fricative: f, v, th, dh, s, z, sh, zh, h
Lateral liquid: l
Retroflex liquid: r
Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.
Spectral envelope.
Formants.
FORMANTS
Formants can be recognized in the frequency content of a signal segment.
They are best described as high-energy peaks in the frequency spectrum of a speech sound.
The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.
Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form:
A detailed acoustic theory must consider the effects of the following:
- Time variation of the vocal tract shape
- Losses due to heat conduction and viscous friction at the vocal tract walls
- Softness of the vocal tract walls
- Radiation of sound at the lips
- Nasal coupling
- Excitation of sound in the vocal tract
Let us begin by considering a simple case of a lossless tube.
28 December 2012
Consider an N-tube model as in the previous figure, where tube k has length l_k and cross-sectional area A_k. Assume: no losses, and planar wave propagation.
(Derivation in the Quatieri text, pages 122-125.) The poles of the transfer function T(jΩ) occur where cos(Ωl/c) = 0.
The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the wavelength of the ith natural frequency.
VOWELS
Modeled as a tube closed at one end and open at the other:
- the closure is a membrane with a slit in it; the membrane represents the source of energy (the vocal folds)
- the tube has uniform cross-sectional area
- the energy travels through the tube; the tube generates no energy of its own
- the tube represents an important class of resonators, with the odd-quarter-wavelength relationship Fn = (2n - 1) c / (4 l)
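The odd-quarter-wavelength relationship Fn = (2n - 1)c/(4l) above can be checked numerically. This is a minimal sketch; the 17 cm tract length and c = 340 m/s are illustrative values (not from the slides) that give the textbook neutral-vowel formants near 500, 1500, and 2500 Hz.

```python
# Resonances of a uniform lossless tube closed at one end (glottis) and
# open at the other (lips): Fn = (2n - 1) * c / (4 * l).
# Tube length 0.17 m and c = 340 m/s are illustrative assumptions.

def tube_formants(length_m, c=340.0, n_formants=3):
    """First n resonant frequencies (Hz) of a uniform quarter-wave tube."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

print(tube_formants(0.17))  # approximately [500.0, 1500.0, 2500.0] Hz
```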
VOWELS
Filter characteristics for vowels:
- the vocal tract is a dynamic, frequency-dependent filter
- it has, theoretically, an infinite number of resonances
- each resonance has a center frequency, an amplitude, and a bandwidth
- for speech, these resonances are called formants
- formants are numbered in succession from the lowest: F1, F2, F3, etc.
Fricatives: modeled as a tube with a very severe constriction. The air exiting the constriction is turbulent; because of the turbulence there is no periodicity unless the sound is accompanied by voicing.
For all-pole linear systems, the input and output are related by a difference equation of the form:
s[n] = Σ a_k s[n-k] + G e[n]   (k = 1 to p)
The operator T{ } defines the nature of the short-time analysis function, and w[n - m] represents a time-shifted window sequence.
SHORT-TIME ENERGY
Short-time energy is simple to compute, and useful for estimating properties of the excitation function in the model.
Since |sgn{x[m]} - sgn{x[m-1]}| is nonzero only if x[m] and x[m-1] have different algebraic signs and 0 if they have the same sign, the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n - m].
The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.
Short time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore, they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n in jumps of more than one sample
During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.
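The voiced/unvoiced contrast above can be sketched numerically. This is a minimal illustration, not the textbook implementation: a low-frequency tone stands in for voiced speech and scaled noise for unvoiced speech, analyzed with the 25 ms Hamming window at 16 kHz mentioned above; the hop size of 100 samples is an assumption.

```python
import numpy as np

fs = 16000
win = np.hamming(401)   # 25 ms window at 16 kHz, as in the figure

def short_time_energy(x, win, hop=100):
    """Energy at each window position: sum of squared windowed samples."""
    N = len(win)
    return np.array([np.sum((x[i:i + N] * win) ** 2)
                     for i in range(0, len(x) - N + 1, hop)])

def short_time_zcr(x, win, hop=100):
    """Weighted count of sign alternations under the shifted window."""
    s = np.where(x >= 0, 1.0, -1.0)
    alt = 0.5 * np.abs(np.diff(s))          # 1 at each sign change, else 0
    N = len(win)
    return np.array([np.sum(alt[i:i + N - 1] * win[:-1])
                     for i in range(0, len(x) - N + 1, hop)])

rng = np.random.default_rng(0)
t = np.arange(fs // 4) / fs
voiced = np.sin(2 * np.pi * 100 * t)            # stand-in for voiced speech
unvoiced = 0.1 * rng.standard_normal(fs // 4)   # stand-in for unvoiced speech

# Voiced: high energy, low ZCR. Unvoiced: low energy, high ZCR.
print(short_time_energy(voiced, win).mean(), short_time_energy(unvoiced, win).mean())
print(short_time_zcr(voiced, win).mean(), short_time_zcr(unvoiced, win).mean())
```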
e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that the autocorrelation of s[n] = e[n] * h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
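Because the excitation's autocorrelation peaks at multiples of the pitch period, the short-time autocorrelation can be used for pitch estimation. A minimal sketch, with a crude 125 Hz square-wave frame standing in for voiced speech (the search range 50-500 Hz matches the pitch range stated earlier):

```python
import numpy as np

def acf_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch from the autocorrelation peak in the lag range
    corresponding to plausible pitch periods (fs/fmax .. fs/fmin)."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi + 1])
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sign(np.sin(2 * np.pi * 125 * t))   # crude 125 Hz "voiced" frame
print(acf_pitch(frame, fs))                    # close to 125 Hz
```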
FILTERING VIEW
OVERLAP-ADD METHOD
Just as the FBS method was motivated from the filtering view of the STFT, the OLA method is motivated from the Fourier transform view of the STFT. In this method, for each fixed time we take the inverse DFT of the corresponding frequency function. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-add operation between the short-time sections.
Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by
Furthermore, if the discrete STFT has been decimated in time by a factor L, it can similarly be shown that perfect reconstruction is obtained when the analysis window satisfies the overlap-add constraint that its shifted copies, hopped by L samples, sum to a constant: Σ w[n - rL] = W(0)/L for all n (sum over all integers r).
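The constant-overlap-add property can be verified numerically. A minimal sketch under an assumed but standard choice: a periodic Hann window of length N with hop L = N/2, whose hopped copies sum exactly to 1.

```python
import numpy as np

# Check the OLA constraint: shifted copies of the analysis window,
# hopped by L samples, should sum to a constant.
N, L = 512, 256
w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hann

total = np.zeros(N * 4)
for r in range(0, len(total) - N + 1, L):
    total[r:r + N] += w

# Away from the edges (where fewer windows overlap) the sum is flat:
interior = total[N:-N]
print(interior.min(), interior.max())   # both 1.0 up to rounding
```

With this window and hop, dividing the OLA sum by the constant recovers the original signal exactly.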
PHASE VOCODER
The Fourier series is computed over a sliding window of a single pitch period in duration, and provides a measure of the amplitude and frequency trajectories of the musical tones.
This can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" being the kth filter's center frequency. We write the STFT of a continuous-time signal as,
where the additive constant is an initial phase condition. The corresponding magnitude is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with, in general, time-varying amplitude and frequency modulation. An alternative expression is,
We can sample the continuous-time STFT with sampling interval T to obtain the discrete-time STFT.
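The phase-vocoder idea above (phase advance of a channel gives its instantaneous frequency) can be sketched numerically: take the STFT phase of one channel at two frames a hop apart, unwrap the difference around the channel's expected carrier advance, and convert to Hz. The 440 Hz tone, 8 kHz rate, 1024-point window, and 128-sample hop are illustrative assumptions.

```python
import numpy as np

fs, N, R = 8000, 1024, 128
f0 = 440.0
n = np.arange(N + R)
x = np.cos(2 * np.pi * f0 * n / fs)

w = np.hanning(N)
X1 = np.fft.rfft(w * x[:N])          # frame starting at sample 0
X2 = np.fft.rfft(w * x[R:R + N])     # frame one hop (R samples) later

k = int(round(f0 * N / fs))          # channel nearest the tone
expected = 2 * np.pi * k * R / N     # phase advance of the channel carrier
dphi = np.angle(X2[k]) - np.angle(X1[k]) - expected
dphi = (dphi + np.pi) % (2 * np.pi) - np.pi      # wrap to (-pi, pi]
f_inst = (k / N + dphi / (2 * np.pi * R)) * fs   # instantaneous frequency
print(f_inst)   # close to 440 Hz, even though the bin center is 437.5 Hz
```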
SPEECH MODIFICATION
That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(e^jw)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.
RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM
Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros all lie inside the unit circle. An example would be the impulse response of an all-pole vocal tract model with system function
In this case, all the poles c_k must be inside the unit circle for stability of the system.
SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (Rabiner & Schafer, page 63)
The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high quefrency components would correspond to the more rapid fluctuations of the log spectrum.
The spectrum of the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic voiced waveform. This periodic structure in the log spectrum manifests itself as a cepstrum peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech; furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum. The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure, so there is no strong peak indicating periodicity as in the voiced case.
These slowly varying log spectra clearly retain the general spectral shape with peaks corresponding to the formant resonance structure for the segment of speech under analysis.
For positions 1 through 5, the window includes only unvoiced speech; for positions 6 and 7, the signal within the window is partly voiced and partly unvoiced; for positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure, while the spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic voiced waveform.
The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.
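This quefrency-peak pitch detector can be sketched in a few lines. A minimal illustration, with a synthetic 100 Hz pulse train standing in for a voiced frame; the 8 kHz rate and 100 ms frame are assumptions, and the search range again corresponds to 50-500 Hz pitch.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Pitch estimate from the real-cepstrum peak in the quefrency
    range of plausible pitch periods."""
    w = np.hamming(len(frame))
    spec = np.abs(np.fft.fft(frame * w))
    ceps = np.fft.ifft(np.log(spec + 1e-12)).real
    lo, hi = int(fs / fmax), int(fs / fmin)
    q = lo + np.argmax(ceps[lo:hi + 1])   # quefrency of the cepstral peak
    return fs / q

fs = 8000
frame = np.zeros(800)
frame[::80] = 1.0                 # pulse train with a 10 ms (100 Hz) period
print(cepstral_pitch(frame, fs))  # close to 100 Hz
```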
The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n is defined for r = 1, 2, ..., R as
is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n[m], i.e.,
The figure shows the result of mfcc analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra are different, but they have in common peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum shows more smoothing due to the structure of the filter bank.
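The MFCC pipeline described above (power spectrum, triangular mel filterbank, log, DCT) can be sketched compactly. This is a minimal illustration, not a reference implementation: the filter-edge placement, 512-point FFT, and 13 output coefficients are common but assumed choices; only the 22 filters and 4 kHz Nyquist match the text.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers spaced uniformly on the mel scale."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0),
                                n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for r in range(1, n_filters + 1):
        l, c, h = bins[r - 1], bins[r], bins[r + 1]
        for k in range(l, c):
            fb[r - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, h):
            fb[r - 1, k] = (h - k) / max(h - c, 1)   # falling edge
    return fb

def mfcc(frame, fs, n_filters=22, n_coeffs=13, n_fft=512):
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    logmel = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-12)
    # DCT-II of the log filter-bank outputs:
    m = np.arange(n_coeffs)[:, None]
    r = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (2 * r + 1) / (2 * n_filters))
    return dct @ logmel

np.random.seed(0)
coeffs = mfcc(np.random.randn(400), 8000)
print(coeffs.shape)   # 13 cepstral coefficients per frame
```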
In order to keep the representation smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length. Such a function of two variables can be plotted on a two-dimensional surface as either a grayscale or a color-mapped image; the bars on the right calibrate the color map (in dB).
If the analysis window is short, the spectrogram is called a wide-band spectrogram, characterized by good time resolution and poor frequency resolution. When the window length is long, the spectrogram is a narrow-band spectrogram, characterized by good frequency resolution and poor time resolution.
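The trade-off above can be illustrated numerically: two tones 100 Hz apart are resolved by a long (narrow-band) window but merge under a short (wide-band) window. The tone frequencies, window lengths, and sampling rate here are illustrative assumptions, not values from the slides.

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
x = np.cos(2 * np.pi * 1000 * t) + np.cos(2 * np.pi * 1100 * t)

def frame_spectrum(sig, n_win, n_fft=8192):
    """Magnitude spectrum of one Hamming-windowed frame (zero-padded)."""
    return np.abs(np.fft.rfft(sig[:n_win] * np.hamming(n_win), n_fft))

wide   = frame_spectrum(x, 160)    # 20 ms window: poor frequency resolution
narrow = frame_spectrum(x, 1600)   # 200 ms window: good frequency resolution

freqs = np.fft.rfftfreq(8192, 1 / fs)
k_tone = np.argmin(np.abs(freqs - 1000))
k_mid  = np.argmin(np.abs(freqs - 1050))
# Only the narrow-band analysis leaves a deep valley between the tones:
print(narrow[k_mid] / narrow[k_tone], wide[k_mid] / wide[k_tone])
```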
THE SPECTROGRAM
Note the three broad peaks in the spectrum slice at time tr = 430 ms, and observe that similar slices would be obtained at other times around tr = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.
The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations since several periods are included in the window.
ACF
CEPSTRUM
SPEECH WAVE (S) = EXCITATION (E) . FILTER (H)
(H): the vocal tract filter
(E): the glottal excitation from the vocal cords (glottis)
http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
CEPSTRAL ANALYSIS
Signal (s) = convolution (*) of glottal excitation (e) and vocal tract filter (h):
s(n) = e(n) * h(n), where n is the time index
In the frequency domain: S(w) = E(w) . H(w)
Taking the magnitude of the spectrum: |S(w)| = |E(w)| . |H(w)|
log10 |S(w)| = log10 |E(w)| + log10 |H(w)|
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
CEPSTRUM
c(n) = IDFT[ log10 |S(w)| ] = IDFT[ log10 |E(w)| + log10 |H(w)| ]
s(n) -> windowing -> DFT -> S(w) -> log10 |S(w)| -> IDFT -> c(n)
In c(n), the excitation component and the vocal tract component appear at two different quefrency positions. Applications: (i) glottal excitation analysis, (ii) vocal tract filter analysis.
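The central property above (magnitudes multiply, so log spectra and hence cepstra add) can be checked numerically. A minimal sketch with synthetic stand-ins for e and h; circular convolution is used so that the DFTs multiply exactly.

```python
import numpy as np

np.random.seed(1)
N = 256
e = np.random.randn(N)        # stand-in for the excitation
h = 0.9 ** np.arange(N)       # decaying stand-in for the vocal tract response

def real_cepstrum(x):
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real

s = np.fft.ifft(np.fft.fft(e) * np.fft.fft(h)).real   # circular e (*) h
diff = real_cepstrum(s) - (real_cepstrum(e) + real_cepstrum(h))
print(np.max(np.abs(diff)))   # ~0: the cepstra add under convolution
```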
EXAMPLE OF CEPSTRUM
sampling frequency 22.05 kHz
The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. The splitting can be iterated from high to low by dividing the full band into two, then the resulting lower band into two, and so on.
This octave-band splitting, together with the iterated decimation, can be shown to yield a perfect reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
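A two-band analysis/synthesis split with exact perfect reconstruction can be sketched with the Haar filter pair, the simplest quadrature-mirror-style pair. This is an illustrative toy, not the coder design the slides describe; practical subband coders use longer filters.

```python
import numpy as np

def analyze(x):
    """Split into lowpass and highpass halves, each decimated by 2."""
    x = x[:len(x) // 2 * 2]
    low  = (x[0::2] + x[1::2]) / np.sqrt(2)
    high = (x[0::2] - x[1::2]) / np.sqrt(2)
    return low, high

def synthesize(low, high):
    """Invert the Haar analysis step exactly."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

np.random.seed(0)
x = np.random.randn(256)
low, high = analyze(x)
print(np.max(np.abs(synthesize(low, high) - x)))   # ~0: perfect reconstruction
```

Iterating `analyze` on the lowpass branch gives the octave-band (wavelet-like) splitting described above.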
N(z) = Σ b(j) z^-j (j = 0 to q)  and  D(z) = Σ a(i) z^-i (i = 0 to p)
Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).
The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.
e(n) = y(n) - ŷ(n) = Σ a(i) y(n-i)  (i = 0 to p, with a(0) = 1)
To derive the predictor we use the orthogonality principle: the desired coefficients are those which make the error orthogonal to the samples y(n-1), y(n-2), ..., y(n-p).
⟨ y(n-j) Σ a(i) y(n-i) ⟩ = 0,  j = 1, 2, ..., p  (i = 0 to p)
Interchanging the operation of averaging and summing, and representing ⟨·⟩ by a sum over n, we have
E = Σ e²(n) = Σ y(n) e(n)
Or,
Σ a(i) r|i-j| = 0,  j = 1, 2, ..., p  (i = 0 to p)
Σ a(i) r_i = E  (i = 0 to p)
where
r_i = Σ y(n) y(n-i), the sum running over all n
H(z) = 1 / A(z),  where A(z) = 1 - Σ a_i z^-i  (i = 1 to p)
For a normal vocal tract there is an average of about one formant per kilohertz of bandwidth. One formant requires two complex-conjugate poles, so every formant requires two predictor coefficients, i.e., two coefficients per kilohertz of bandwidth.
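The normal equations above can be solved efficiently with the Levinson-Durbin recursion mentioned in the syllabus. A minimal sketch: the recursion below follows the standard algorithm (written in the prediction-coefficient convention ŷ(n) = Σ a(i) y(n-i)), and the AR(2) test signal and its coefficients are illustrative assumptions used to check that the recursion recovers them.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the LPC normal equations for the predictor coefficients
    a[1..p] via the Levinson-Durbin recursion; returns (a, error E)."""
    a = np.zeros(p + 1)
    E = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coefficient
        a[i] = k
        a[1:i] = a[1:i] - k * a[i - 1:0:-1]              # update lower orders
        E *= (1.0 - k * k)                               # shrink the error
    return a[1:], E

# Synthetic AR(2) signal: s[n] = 0.75 s[n-1] - 0.5 s[n-2] + w[n]
np.random.seed(0)
n = 20000
w = np.random.randn(n)
s = np.zeros(n)
for i in range(2, n):
    s[i] = 0.75 * s[i - 1] - 0.5 * s[i - 2] + w[i]

r = np.array([np.dot(s[:n - k], s[k:]) for k in range(3)]) / n
a, E = levinson_durbin(r, 2)
print(a)   # close to [0.75, -0.5]
```

Each pass of the loop also yields the reflection coefficient k, which is exactly the parameter used in the lattice form of the LPC analysis filter.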
(figure: LPC synthesis model: a voiced/unvoiced (V/U) switch selects the excitation source, with an uncorrelated noise source used for unvoiced sounds; the excitation drives the LP filter H(z) to produce the speech signal)