This paper describes how a speech synthesizer can be controlled by a small computer in real time. The
synthesizer allows precise control of the speech output that is necessary for experimental purposes. The
control information is computed in real time during synthesis in order to reduce data storage. The
properties of the synthesizer and the control program are presented along with an example of the speech
synthesis.
Real-time minicomputers are now fairly commonplace in the psychological sciences. The psychologist expects the computer to control the sequence of events in an experiment, present the stimuli, record the participant's response, and analyze the cumulative results. Although the computer appears to be well educated, he (or she) has developed language abilities in the reverse order of the scientist. Man spent many centuries speaking before writing was developed, whereas the computer writes (or at least prints) but "speak less than thou knowest" (King Lear, Act I, Scene IV). If only our small computers could speak scholarly and wisely (or at least intelligibly), they could be assigned to many additional useful tasks. This paper describes how a relatively cheap synthesizer can be controlled by a small computer in real time.

Artificial speech has been synthesized by mechanical, electronic, and computer simulation techniques. [Coker, Denes, & Pinson, Note 1; Dudley & Tarnoczy (1950), Mattingly (1968), Holmes (1972), and Flanagan (1972) discuss the historical developments in speech synthesis.] The electronic resonance synthesizer is currently one of the most efficient and popular techniques of speech synthesis. Whereas mechanical synthesis attempted to simulate the articulatory properties of speech production, electronic synthesis focuses on the acoustic structure of speech. The focus on acoustic structure rather than articulatory structure in synthesizing speech led to synthesizers that were terminal analogs rather than direct analogs of speech. Whereas a direct analog synthesizer would have a direct representation of each component movement or sound in the vocal tract, the terminal-analog synthesizer simply attempts to duplicate the final speech output. The acoustic speech signal can be considered as a sound resulting from a two-stage process. The sound source of the first stage is modified by the time-varying filter characteristics of the vocal tract at the second stage. In this model, the sound source and the characteristics of the resonant circuits representing the vocal tract can be independently varied to produce the sound output.

The desired sound can, therefore, be produced by specifying a small number of parameters controlling the significant acoustic dimensions of the speech. The sound source can be either voice or noise. The voice source simulates the vibration of the vocal cords in real speech; it consists of a periodic quasi-sawtooth-shaped sequence of pulses. The frequency of pulsing is referred to as the fundamental (F0) and is heard as the pitch of the speaker's voice and the intonation pattern of the message. The noise source simulates the forcing of air through some constriction in the vocal tract. It has the properties of a pseudorandom noise generator.

The sound source is fed into the resonant circuits at the second stage of synthesis. For the production of vowel sounds, the resonant circuits are set to correspond to the acoustic resonances or formants of the vocal tract. The effect of each resonator is to emphasize the energy at its set frequency and thereby produce a formant of the sound to be synthesized. The resonators can be arranged in parallel or in series. Parallel synthesizers combine the outputs of individual resonating circuits (Mattingly, 1968). In series synthesizers, the resonating circuits are arranged so that the sound source is fed into the first resonator, the resulting output is the input of the second resonator, and so on. The series synthesizer better approximates the vocal tract, in which the sound is modified in a serial fashion as it flows through the tract (Fant & Martony, 1962; Flanagan, 1957).

There are a number of dimensions that must be considered in determining the most appropriate speech synthesis system for experimental use. For the synthesizer these are cost, flexibility, degree of control, and programming complexity. For the control programs one must consider the power, speed, and memory capacity of the controlling machine. We desired a synthesizer that allowed precise control over the synthesized signal since ...

The preparation of this paper as well as the purchase of the synthesizer was supported in part by U.S. Public Health Service Grant MH-19399. We would like to thank Jim Bryant for his help in preparing the demonstration tape for the conference.
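To make the series (cascade) arrangement described above concrete, the sketch below passes a pulse source through three second-order digital resonators in turn. It is written in Python purely as an illustration of the general technique, not of the OVE-IIId circuitry; the sampling rate, formant frequencies, and bandwidths are assumed values.

import math

def resonator(signal, freq_hz, bw_hz, fs_hz):
    """Two-pole digital resonator: emphasizes energy near freq_hz (one formant)."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c                      # normalizes the gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs = 10000                               # sampling rate in Hz (assumed)
f0 = 126                                 # fundamental frequency in Hz
n = int(0.3 * fs)                        # 300 msec of signal

# A periodic pulse train stands in for the quasi-sawtooth voice source.
period = int(fs / f0)
source = [1.0 if i % period == 0 else 0.0 for i in range(n)]

# Series connection: the output of each formant resonator feeds the next.
speech = source
for freq, bw in [(700, 60), (1220, 70), (2600, 110)]:   # F1-F3 for an /a/-like vowel (assumed)
    speech = resonator(speech, freq, bw, fs)

A parallel synthesizer would instead pass the same source through each resonator separately and sum the individual outputs.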
Parameter                   Mnemonic  Address   Data bits  Range            Resolution
                            SP         0  0000
Vowel excitation            AV         1  0001   XXXX      -∞, 2-30 dB      2 dB
Aspirative excitation       AH         2  0010   XXXX      -∞, 2-30 dB      2 dB
Nasal excitation            AN         3  0011   XXXX      -∞, 2-30 dB      2 dB
Pitch                       F0         4  0100   XXXXXX    50-308 Hz        3%
Vowel formant 1             F1         5  0101   XXXXXX    200-1234 Hz      3%
Vowel formant 2             F2         6  0110   XXXXXX    504-3109 Hz      3%
Vowel formant 3             F3         7  0111   XXXXXX    800-4935 Hz      3%
Nasal formant               N1         8  1000   XXXXXX    200-1234 Hz      3%
F1 bandwidth increment      B1         9  1001   XXXX      0-188 Hz         12 Hz
F2 bandwidth increment      B2        10  1010   XXXX      0-470 Hz         31 Hz
F3 bandwidth increment      B3        11  1011   XX        0-600 Hz         200 Hz
Fricative excitation        AC        12  1100   XXXX      -∞, 2-30 dB      2 dB
Fricative formant 1         K1        13  1101   XXXXXX    800-4935 Hz      3%
Fricative formant 2         K2        14  1110   XXXXXX    1600-9870 Hz     3%
Fric. pole/zero ratio       AK        15  1111   XXXXXX    0-31.5 dB        0.5 dB

Figure 3. Specification of speech synthesizer control parameters.
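The ranges and resolutions in Figure 3 suggest how a desired physical value is reduced to the small integer actually loaded at a given parameter address. The following Python sketch assumes a logarithmic scale over the listed range with one code per 3% step; the authors' own frequency table is not reproduced here, so this mapping is an assumption.

import math

def freq_to_code(freq_hz, lo_hz, hi_hz, bits=6):
    """Nearest code on a logarithmic scale spanning lo_hz..hi_hz in
    2**bits - 1 equal-ratio steps (about 3% per step for the Figure 3 ranges)."""
    steps = 2 ** bits - 1
    ratio = (hi_hz / lo_hz) ** (1.0 / steps)     # frequency multiplier per step
    code = round(math.log(freq_hz / lo_hz) / math.log(ratio))
    return max(0, min(steps, code))

# Pitch (F0) spans 50-308 Hz in 3% steps; a target of 126 Hz maps to code 32,
# i.e., 40 octal, matching the value cited later in the text.
print(freq_to_code(126, 50, 308), oct(freq_to_code(126, 50, 308)))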
... F3. The intensity levels are adjusted automatically as a function of the formant frequencies. Some control of the formant intensity levels is possible through specification of the bandwidth controls B1, B2, and B3. Narrowing the bandwidth increases the peak intensity of the appropriate formant.

The fricative sounds make use of the fricative branch of the synthesizer. The source CO, a pseudorandom noise generator, is modified by two cascaded formant resonators, K1 and K2. The frequencies of K1 and K2 may be independently controlled. Control AK allows the introduction of a variable antiformant into the fricative branch. The intensity level of the resulting signal is controlled by the fricative excitation AC.

The stop consonants make use of both the vowel and fricative branches. A stop consonant-vowel syllable can be partitioned into burst, transition, and steady state vowel segments. The burst created when the stop consonant is released is synthesized along the vowel and/or fricative branches. The burst is followed by a transition to the levels of the following steady state vowel. Voiceless aspirated sounds are synthesized by introducing the noise source CO into the vowel branch through the level control amplifier AH. The presence of voicing without aspiration vs. the presence of aspiration without voicing during the transition period are major cues to distinguishing /b, d, g/ from /p, t, k/.

The nasal sounds /m, n, ŋ/ are similar to the voiced stop consonants except that the additional nasal formant N1 is used. The nasal formant is excited by the voice source F0, and the intensity level is controlled by AN. The glides or semivowels are synthesized through the vowel branch.

To communicate with the synthesizer it was necessary to construct an appropriate interface. On the PDP-8/L most I/O is accomplished through the accumulator (AC). From the AC, a 12-bit buffered AC (BAC) bus is distributed to all peripheral devices. There is also a set of six lines (BMB) for device selection and several buffered I/O pulse (BIOP) lines. Both the PDP-8/L and the input stage of the OVE-IIId use TTL logic, so no level conversion was necessary. A schematic of the interface is shown in Figure 4. When an I/O instruction specifying device 47 is executed, the 7430 gate is set to false and, negated by the 7402, enables the BIOP gates. If a BIOP-2 pulse or an initializing pulse is issued by the computer, a master clear pulse is sent to the OVE-IIId for about 600 nsec. This clears all registers in the OVE-IIId. If a BIOP-4 pulse is issued, the 74121 issues a 3-μsec pulse which gates the BAC lines to the OVE-IIId by enabling the 7408s and sends a set request pulse to the synthesizer.

STIMULUS SPECIFICATION

In order to control the synthesizer, detailed information about the speech sound to be produced must be specified. This information must then be coded and typed into a file on the computer disk. A suitable computer program (the PALO assembler with a supplemented symbol table) can then translate the code into a form acceptable to the synthesizer control subroutine.

The first step in stimulus description is to divide the speech sound into timed segments. For example, in the coding of a simplified syllable /ba/, one would have two segments, first a transition period and then a steady state period. This is illustrated in the schematic spectrogram in Figure 5. After dividing the sample into segments, one must specify the desired values of the control parameters at the segment boundaries. These values may be obtained from a frequency table. For example, if F0 is desired to be 126 Hz, the proper value is 40₈. Only those parameters that are to be changed from one segment to the next need be specified. In our example, at time a one would specify values for AV, F0, F1, and F2. At time b one would only specify values for F1 and F2. The programmer indicates whether the parameters that differ between adjacent boundaries should be interpolated or whether they should maintain their present values until the next boundary (steady state). Currently, all interpolation is carried out in a linear fashion, but we are developing an exponential interpolation for more realistic synthesis.
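The segment-by-segment specification just described can be sketched in a few lines of Python. The boundary values and segment durations below are illustrative assumptions for the simplified /ba/ of Figure 5, not the authors' stimulus file; only the linear-interpolation and hold-until-next-boundary rules follow the text.

def interpolate_segment(start, end, duration_ms, frame_ms=10):
    """Linearly interpolate between two boundary parameter dictionaries.
    Parameters missing from `end` simply hold their `start` value (steady state)."""
    n = max(duration_ms // frame_ms, 1)
    frames = []
    for i in range(n + 1):
        t = i / n
        frames.append({name: v0 + (end.get(name, v0) - v0) * t
                       for name, v0 in start.items()})
    return frames

# Segment 1 of /ba/: transition from the consonant onset to the vowel target.
boundary_a = {"AV": 20, "F0": 126, "F1": 300, "F2": 900}   # values at time a (assumed)
boundary_b = {"F1": 700, "F2": 1200}                       # at time b only F1 and F2 change
transition = interpolate_segment(boundary_a, boundary_b, duration_ms=40)

# Segment 2: steady state vowel; nothing is respecified, so every value holds.
steady = interpolate_segment(transition[-1], {}, duration_ms=200)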
Figure 4. Schematic diagram of computer-synthesizer interface.
Figure 5. Schematic diagram of simplified /ba/ (Segment 1: transition; Segment 2: steady state; F1 and F2 plotted against time, with boundaries at times a, b, and c).
... where PN = parameter address name, e.g., "AV," "F0," ...

Figure 9. Schematic diagram of simplified /bag/.
Figure 10. Schematic diagram of the syllable /zi/ (F0, F1, F2, and K1 as a function of time in msec).
Before the first CB or PL, one must include a memory origin statement of the form: *1XXXX, where 0 ≤ XXXX ≤ 7377₈. This will cause the first CB or PL to be assembled at location 1XXXX. After the last CB or PL, one must include an end flag, simply: PAUSE. Comments may be added in the coding by preceding them with a /.

SYNTHESIZER PERFORMANCE

... spectrogram of the resulting sound is shown in Figure 12.

In coding speech sounds, we have found that the most natural sounding results are obtained by gradually bringing up the amplitude at the beginning and gradually bringing it down at the end. In our example, the first 10 msec, controlled by CON1, and the last 10 msec, controlled by CON8, accomplish this purpose. Note that the frequencies for the vowel in /zi/ are specified in the first parameter list, PAR1, although they are not actually ... at the same time as setting the formant frequencies sometimes results in a distorted signal. In general, a certain amount of care must be taken whenever specifying rapid parameter transitions. Especially susceptible are ...

Another problem that we have had with the synthesizer is that of repeatability. Each time we synthesize a sound, we may not get exactly the same sound. This occurs because both the noise and voicing sources are free-running; if we start our speech sample at a different ...

Figure 11. Data coding for the syllable /zi/.
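The onset and offset shaping described above can be expressed as a small helper that generates the AV control points. The Python sketch below is illustrative only: the 10-msec ramps and the 2-dB step come from the text and Figure 3, but the (time, level) list representation is an assumption rather than the authors' CB/PL coding.

def av_ramp(target_db, total_ms, ramp_ms=10, step_db=2, floor_db=2):
    """(time_ms, AV_dB) control points: raise the vowel amplitude in 2-dB steps
    over the first ramp_ms and lower it over the last ramp_ms; the synthesizer
    holds the last specified value between the two ramps."""
    levels = list(range(floor_db, target_db + 1, step_db))   # e.g., 2, 4, ..., 30 dB
    n = max(len(levels) - 1, 1)
    onset = [(i * ramp_ms // n, db) for i, db in enumerate(levels)]
    offset = [(total_ms - ramp_ms + i * ramp_ms // n, db)
              for i, db in enumerate(reversed(levels))]
    return onset + offset

# A 300-msec stimulus whose vowel amplitude peaks at 30 dB (illustrative values).
print(av_ramp(30, 300))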
"--.'~~-~j[
o 100 .%00 300
TIME (MSEC)
Figure 12. Sound spectrogram of hs],
JMS I SPEAK
ARG1
ARG2
ARG1 may include one or both of the two commands, plot enable (PE) and clear (CL). When PE is specified, the spectrogram will be plotted. If CL is specified, ...

Figure 14. Computer-generated display of /sIdauski/.
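A rough Python analogue of the JMS I SPEAK calling sequence is sketched below purely to make the argument convention concrete. The PE and CL bit positions, and the handling of ARG2, are hypothetical; the actual word layout is not given here.

# Hypothetical command bits for ARG1; the real positions are not specified here.
PE = 0o0001   # plot enable: display the stimulus while it is synthesized
CL = 0o0002   # clear: presumably resets the display before plotting

def speak(arg1, arg2=None):
    """Stand-in for the JMS I SPEAK calling sequence: decode ARG1 command bits."""
    if arg1 & CL:
        print("display cleared")
    if arg1 & PE:
        print("stimulus will be plotted during synthesis")
    # ARG2 would identify the stimulus data; its format is not shown here.

speak(PE | CL)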
... to the 0 point of the time scale. The display routine will display the stimulus while the synthesizer produces it. This feature is quite useful for debugging stimuli during preparation. Figure 13 shows the scope display of the syllable /zi/ as programmed in Figures 10 and 11. Figure 14 shows the display of a slightly more complex example, /sIdauski/.

REFERENCE NOTE

1. Coker, C. H., Denes, P. B., & Pinson, E. N. Speech synthesis: An experiment in electronic speech production. Bell Telephone Laboratories, 1963.

REFERENCES

Dudley, H., & Tarnoczy, T. H. The speaking machine of Wolfgang von Kempelen. Journal of the Acoustical Society of America, 1950, 22, 151-166.

Fant, G., & Martony, J. Instrumentation for parametric synthesis (OVE-II). Quarterly progress report, Speech Transmission Laboratory, Stockholm, July 1962. Pp. 18-24.

Flanagan, J. L. Note on the design of "terminal analog" speech synthesizers. Journal of the Acoustical Society of America, 1957, 29, 306-310.

Flanagan, J. L. The synthesis of speech. Scientific American, 1972, 226, 48-58.

Holmes, J. N. Speech synthesis. London: Mills & Boon, 1972.

Holmes, J. N., Mattingly, I. G., & Shearme, J. N. Speech synthesis by rule. Language and Speech, 1964, 7, 127-143.

Liljencrants, J. C. W. A. The OVE-III speech synthesizer. IEEE Transactions on Audio and Electroacoustics, 1968, AU-16, 137-140.

Mattingly, I. G. Synthesis by rule of general American English. Status reports on speech research (Suppl.). New York: Haskins Laboratories, 1968.

Rahimi, M. A., & Eulenberg, J. B. A computer terminal with synthetic speech output. Behavior Research Methods & Instrumentation, 1974, 6, 255-258.

Tomlinson, I. G. SPASS-An improved terminal analog speech synthesizer. Quarterly progress report, MIT Research Laboratory of Electronics, Cambridge, Mass., Vol. 80, 1966.

NOTE

1. The OVE-IIId speech synthesizer is manufactured by A. B. Fonema, Box 1010, S-640 25 Julita, Sweden.