Behavior Research Methods & Instrumentation
1976, Vol. 8 (2), 189-196


SESSION X
CONTRIBUTED PAPERS: STIMULUS GENERATION
DOMINIC W. MASSARO, University of Wisconsin, Presider

Real-time speech synthesis


MICHAEL M. COHEN and DOMINIC W. MASSARO
University of Wisconsin, Madison, Wisconsin 53706

This paper describes how a speech synthesizer can be controlled by a small computer in real time. The
synthesizer allows precise control of the speech output that is necessary for experimental purposes. The
control information is computed in real time during synthesis in order to reduce data storage. The
properties of the synthesizer and the control program are presented along with an example of the speech
synthesis.
Real-time minicomputers are now fairly commonplace in the psychological sciences. The psychologist expects the computer to control the sequence of events in an experiment, present the stimuli, record the participant's response, and analyze the cumulative results. Although the computer appears to be well educated, he (or she) has developed language abilities in the reverse order of the scientist. Man spent many centuries speaking before writing was developed, whereas the computer writes (or at least prints) but "speak less than thou knowest" (King Lear, Act I, Scene IV). If only our small computers could speak scholarly and wisely (or at least intelligibly), they could be assigned to many additional useful tasks. This paper describes how a relatively cheap synthesizer can be controlled by a small computer in real time.

Artificial speech has been synthesized by mechanical, electronic, and computer simulation techniques. [Coker, Denes, & Pinson, Note 1; Dudley & Tarnoczy (1950), Mattingly (1968), Holmes (1972), and Flanagan (1972) discuss the historical developments in speech synthesis.] The electronic resonance synthesizer is currently one of the most efficient and popular techniques of speech synthesis. Whereas mechanical synthesis attempted to simulate the articulatory properties of speech production, electronic synthesis focuses on the acoustic structure of speech. The focus on acoustic structure rather than articulatory structure in synthesizing speech led to synthesizers that were terminal analogs rather than direct analogs of speech. Whereas a direct analog synthesizer would have a direct representation of each component movement or sound in the vocal tract in the synthesizer, the terminal-analog synthesizer simply attempts to duplicate the final speech output. The acoustic speech signal can be considered as a sound resulting from a two-stage process. The sound source of the first stage is modified by the time-varying filter characteristics of the vocal tract at the second stage. In this model, the sound source and the characteristics of the resonant circuits representing the vocal tract can be independently varied to produce the sound output.

The desired sound can, therefore, be produced by specifying a small number of parameters controlling the significant acoustic dimensions of the speech. The sound source can be either voice or noise. The voice source simulates vibration of the vocal cords in real speech; it consists of a periodic quasi-sawtooth-shaped sequence of pulses. The frequency of pulsing is referred to as the fundamental (F0) and is heard as the pitch of the speaker's voice and the intonation pattern of the message. The noise source simulates the forcing of air through some constriction in the vocal tract. It has the properties of a pseudorandom noise generator.

The sound source is fed into the resonant circuits at the second stage of synthesis. For the production of vowel sounds, the resonant circuits are set to correspond to the acoustic resonances or formants of the vocal tract. The effect of each resonator is to emphasize the energy at its set frequency and to produce a formant of the sound to be synthesized. The resonators can be arranged in parallel or in series. Parallel synthesizers combine the output of individual resonating circuits (Mattingly, 1968). In series synthesizers, the resonating circuits are arranged so that the sound source is fed into the first resonator, the resulting output is the input of the second resonator, and so on. The series synthesizer better approximates the vocal tract, in which the sound is modified in a serial fashion as it flows through the vocal tract (Fant & Martony, 1962; Flanagan, 1957).

There are a number of dimensions that must be considered in determining the most appropriate speech synthesis system for experimental use. For the synthesizer these are cost, flexibility, degree of control, and programming complexity. For the control programs one must consider the power, speed, and memory capacity of the controlling machine. We desired a synthesizer that allowed precise control over the synthesized signal, since our research program involves the manipulation of the microstructure of speech stimuli.

The preparation of this paper as well as the purchase of the synthesizer was supported in part by U.S. Public Health Service Grant MH-19399. We would like to thank Jim Bryant for his help in preparing the demonstration tape for the conference.
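The two-stage source-filter scheme just described is easy to sketch digitally. The following is a minimal illustration only, not the OVE-IIId's analog circuitry: a periodic pulse source (standing in for the quasi-sawtooth voice source) passed through a series cascade of two-pole resonators. The formant frequencies and bandwidths here are invented for illustration.

```python
import math

def resonator(signal, freq, bw, fs):
    """Two-pole digital resonator: emphasizes energy near `freq` (Hz)
    with bandwidth `bw` (Hz), the standard terminal-analog building block."""
    c = -math.exp(-2 * math.pi * bw / fs)
    b = 2 * math.exp(-math.pi * bw / fs) * math.cos(2 * math.pi * freq / fs)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def pulse_train(f0, dur, fs):
    """Crude periodic pulse source standing in for the voice source."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(dur * fs))]

fs = 10000
source = pulse_train(100, 0.05, fs)                  # 100-Hz "voicing"
out = source
for f, bw in [(700, 60), (1200, 70), (2600, 110)]:   # illustrative formants
    out = resonator(out, f, bw, fs)                  # series: each output feeds the next
```

In a series connection, as in the OVE-IIId's vowel branch, each resonator's output is the next one's input; a parallel synthesizer would instead sum the resonators' outputs.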
Accordingly, the commercial synthesizers that have phoneme or syllable sounds hard wired into the system would not be appropriate (e.g., the VOTRAX synthesizer; Rahimi & Eulenberg, 1974). Figure 1 shows the general layout of equipment used in our laboratory for speech perception research. Given the small memory capacity (8K) of our PDP-8/L and the fact that the synthesis would have to be carried out during the experiment proper, we decided that rather than calculating the control information in advance, as done by the Haskins system (Mattingly, 1968), all interpolations would have to be carried out dynamically during synthesis. This method allows for very compact and flexible specification of the stimuli. Given that synthesis is carried out during the experiment itself, the control program must be very small. Memory storage must also be allocated to the experimental control subroutines, the main experimental program, data storage, synthesizer control specifications, and the disk system resident. Our current synthesizer program, written as an assembly language subroutine callable from a main experimental program, takes only 317 decimal locations. (A somewhat smaller program could probably be written for a machine that had hardware arithmetic.) The cost of the synthesizer and the interface to the computer had to be minimal. The actual cost came to less than $4000.

Our present synthesizer operating system is not meant to compete with large-scale speech synthesis programs. We do not presently foresee using our machine for synthesis by rule, i.e., a system that automatically takes the user from a phonemic transcription input to synthesized speech output (Holmes, Mattingly, & Shearme, 1964; Mattingly, 1968). Rather than quantity, we are striving for quality within the confines of simplicity and compactness.

Figure 1. Hardware configuration of laboratory used for speech perception research.

THE SYNTHESIZER

OVE-IIId (Note 1) is a formant series synthesizer simulating the vocal tract. The synthesizer is a very compact device measuring 19 x 14 x 1.75 in. high and is rack mounted. The OVE-IIId synthesizer is the theoretical descendant of such synthesizers as OVE-II (Fant & Martony, 1962) and SPASS (Tomlinson, 1966). [For a more detailed description of the OVE-III see Liljencrants (1968).] The synthesizer incorporates three parallel branches for the synthesis of vowels, fricatives, and nasals (see Figure 2). The synthesizer is digitally controlled. Control data are received over a 10-bit bus and stored in digital registers. One 10-bit control word contains a 4-bit address code and a 6-bit logarithmic data code (see Figure 3). The frequencies are incremented in 3% steps and the amplitudes in 2-dB steps.

When a parameter word is presented to the synthesizer along with a set command, a 1-µsec control cycle starts. A demultiplexer within the synthesizer then gates the data to the appropriate register according to the address code. The data in these registers are used as coefficients by the analog circuitry in generating the pertinent waveforms. Parameter words can also be entered manually from toggle switches and a set button on the front panel.

Vowels may be synthesized by introducing the voice source to the vowel formant branch through the level control amplifier AV. The frequency of the voice source is controlled by F0. After passing through a correction network and mixer KH, the sound is directed through formant resonators corresponding to the first five formants, F1-F5 (cf. Figure 2). Only the first three formants can be controlled; formants F4 and F5 are preset at 3.5 and 4.0 kHz, respectively. The center frequencies of the first three formants are controlled by F1, F2, and

Figure 2. Block diagram of OYE-llId


speech synthesizer. See text for explanation.
Figure 3. Specification of speech synthesizer control parameters.

Parameter                 Mnemonic  Address (dec)  Address (binary)  Data bits  Range          Step
                          SP        0              0000              -          -              -
Vowel excitation          AV        1              0001              4          -inf, 2-30 dB  2 dB
Aspirative excitation     AH        2              0010              4          -inf, 2-30 dB  2 dB
Nasal excitation          AN        3              0011              4          -inf, 2-30 dB  2 dB
Pitch                     F0        4              0100              6          50-308 Hz      3%
Vowel formant 1           F1        5              0101              6          200-1234 Hz    3%
Vowel formant 2           F2        6              0110              6          504-3109 Hz    3%
Vowel formant 3           F3        7              0111              6          800-4935 Hz    3%
Nasal formant             N1        8              1000              6          200-1234 Hz    3%
F1 bandwidth increment    B1        9              1001              4          0-188 Hz       12 Hz
F2 bandwidth increment    B2        10             1010              4          0-470 Hz       31 Hz
F3 bandwidth increment    B3        11             1011              2          0-600 Hz       200 Hz
Fricative excitation      AC        12             1100              4          -inf, 2-30 dB  2 dB
Fricative formant 1       K1        13             1101              6          800-4935 Hz    3%
Fricative formant 2       K2        14             1110              6          1600-9870 Hz   3%
Fric. pole/zero ratio     AK        15             1111              6          0-31.5 dB      0.5 dB
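Given the format in Figure 3, assembling a control word amounts to packing the 4-bit register address and the 6-bit data code into one 10-bit word. A minimal sketch (the assumption that the address occupies the high-order bits is ours; the paper specifies only the field widths):

```python
def control_word(address: int, data: int) -> int:
    """Pack a 4-bit register address and a 6-bit data code into one
    10-bit OVE-IIId control word (address in the high-order bits is
    an assumption; the paper gives only the field widths)."""
    if not 0 <= address <= 0xF:
        raise ValueError("address must fit in 4 bits")
    if not 0 <= data <= 0x3F:
        raise ValueError("data must fit in 6 bits")
    return (address << 6) | data

# F0 register (address 4) set to data code 40 octal (decimal 32):
word = control_word(4, 0o40)
```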
F3. The intensity levels are adjusted automatically as a function of the formant frequencies. Some control of the formant intensity levels is possible through specification of the bandwidth controls B1, B2, and B3. Narrowing the bandwidth increases the peak intensity of the appropriate formant.

The fricative sounds make use of the fricative branch of the synthesizer. The source CO, a pseudorandom noise generator, is modified by two cascaded formant resonators, K1 and K2. The frequencies of K1 and K2 may be independently controlled. Control AK allows the introduction of a variable antiformant into the fricative branch. The intensity level of the resulting signal is controlled by the fricative excitation AC.

The stop consonants make use of both the vowel and fricative branches. A stop consonant-vowel syllable can be partitioned into burst, transition, and steady state vowel segments. The burst created when the stop consonant is released is synthesized along the vowel and/or fricative branches. The burst is followed by a transition to the levels of the following steady state vowel. Voiceless aspirated sounds are synthesized by introducing the noise source CO into the vowel branch through the level control amplifier AH. The presence of voicing without aspiration vs. the presence of aspiration without voicing during the transition period are major cues to distinguishing /b, d, g/ from /p, t, k/.

The nasal sounds /m, n, ŋ/ are similar to the voiced stop consonants except that the additional nasal formant N1 is used. The nasal formant is excited by the voice source F0, and the intensity level is controlled by AN. The glides or semivowels are synthesized through the vowel branch.

To communicate with the synthesizer it was necessary to construct an appropriate interface. On the PDP-8/L most I/O is accomplished through the accumulator (AC). From the AC, a 12-bit buffered AC (BAC) bus is distributed to all peripheral devices. There is also a set of six lines (BMB) for device selection and several buffered I/O pulse (BIOP) lines. Both the PDP-8/L and the input stage of the OVE-IIId use TTL logic, so no level conversion was necessary. A schematic of the interface is shown in Figure 4. When an I/O instruction specifying device 47 is executed, the 7430 gate is set to false and, negated by the 7402, enables the BIOP gates. If a BIOP-2 pulse or an initializing pulse is issued by the computer, a master clear pulse is sent to the OVE-IIId for about 600 nsec. This clears all registers in the OVE-IIId. If a BIOP-4 pulse is issued, the 74121 issues a 3-µsec pulse which gates the BAC lines to the OVE-IIId by enabling the 7408s and sends a set request pulse to the synthesizer.

STIMULUS SPECIFICATION

In order to control the synthesizer, detailed information about the speech sound to be produced must be specified. This information must then be coded and typed into a file on the computer disk. A suitable computer program (the PAL8 assembler with a supplemented symbol table) can then translate the code into a form acceptable to the synthesizer control subroutine.

The first step in stimulus description is to divide the speech sound into timed segments. For example, in the coding of a simplified syllable /ba/, one would have two segments, first a transition period and then a steady state period. This is illustrated in the schematic spectrogram in Figure 5. After dividing the sample into segments, one must specify the desired values of the control parameters at the segment boundaries. These values may be obtained from a frequency table. For example, if F0 is desired to be 126 Hz, the proper value is 40 (octal). Only those parameters that are to be changed from one segment to the next need be specified. In our example, at time a one would specify values for AV, F0, F1, and F2. At time b one would only specify values for F1 and F2. The programmer indicates whether the parameters that differ between adjacent boundaries should be interpolated or whether they should maintain their present values until the next boundary (steady state). Currently, all interpolation is carried out in a linear fashion, but we are developing an exponential interpolation for more realistic synthesis.
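The frequency-table lookup can be approximated by a formula. Assuming the 6-bit code maps its printed range end to end logarithmically (63 steps of roughly 3%, matching the step size quoted with Figure 3), the paper's example value for F0 falls out directly; this is our reconstruction, since the paper tabulates the values rather than giving a formula:

```python
import math

def freq_to_code(freq_hz, lo_hz, hi_hz, bits=6):
    """Map a frequency to its logarithmic data code, assuming the
    2**bits - 1 steps span [lo_hz, hi_hz] end to end (a reconstruction;
    the paper tabulates these values rather than giving a formula)."""
    steps = (1 << bits) - 1
    ratio = math.log(freq_hz / lo_hz) / math.log(hi_hz / lo_hz)
    return round(steps * ratio)

# The paper's example: an F0 of 126 Hz within the 50-308 Hz range
# codes as octal 40 (decimal 32).
print(oct(freq_to_code(126, 50, 308)))  # -> 0o40
```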
Figure 4. Schematic diagram of computer-synthesizer interface.

Figure 5. Schematic diagram of simplified /ba/.

The data structure of the sample specifications in the computer memory consists of two types of elements: control blocks (CB) and parameter lists (PL). Each of these is composed of a group of control fields in adjacent 12-bit memory locations. Figure 6 shows the structure of the elements. The CB always consists of three computer words. The first field in the CB is the time field (T), which indicates how much time there is to the temporally adjacent boundary (in 5-msec units). The next field is the plot (P) field. This is used by a display program that will be described later. The control field (C) indicates whether the segment is steady state (SS: C = 1), interpolating forward (IF: C = 3), or interpolating backward (IB: C = 2). If C is SS or IF, T specifies the time taken to get from the currently specified parameter values to those next specified. If C is IB, T specifies the time taken to get from the parameter values at the preceding segment boundary to those addressed by the current CB. The reason for these two interpolation directions will be explained shortly. The PPL field contains the memory address of the PL associated with the CB. If PPL = 0, then no PL is associated with this CB. The PCB field contains the memory address of the next CB. If PCB = 0, then the present CB is the last.

Figure 6. Structure of data specification elements.

Figure 7. Connections of control blocks (CB) and parameter lists (PL) in memory. A field with a dot holds the address indicated by the pointed arrow.

The PL is composed of a variable number of computer words, and each word is divided into three fields. The PN field contains the 4-bit address of the specified parameter in the synthesizer (see Figure 3). The PV field contains the 6-bit value of that parameter. The E field indicates whether or not the PL word is the last in the list. If it is the last, E is set to 1; otherwise it is 0. Figure 7 shows how these data structures are connected in memory. An
arrow in the illustration indicates that a given field holds the address of what is pointed to. Each CB need not reference a unique PL; rather, a PL can be referenced by many CBs. In Figure 7, for example, PL1 is referenced by both CB1 and CB4.

Let us return for a bit to the syllable /ba/. The data structure for our example is illustrated in Figure 8. AV* refers to the 4-bit number which corresponds to the AV register address within the synthesizer. Note that, although /ba/ has three segment boundaries, only two CBs are necessary to describe it. In general, a sound divided into m segments requires m CBs.

Figure 8. Data structure for /ba/.

Figure 9 shows a schematized diagram of the syllable /bag/ cut into three segments. The data in Figure 8 specify the sound up to the c boundary. From c to d it is necessary to interpolate the values of F1 and F2. Rather than specifying F1 and F2 in a CB representing c and constructing another CB to represent the values at d, we can make use of the IB feature. It is possible to construct a CB for the d boundary in the IB mode which specifies the time from d back to c. In this case, there is no CB representing the segment boundary c.

Figure 9. Schematic diagram of simplified /bag/.

This scheme of data specification precludes the direct specification of transitions across two or more segments. Consider the case in which it is desirable to have F0 fall linearly from time b to time d in /bag/ (cf. Figure 9). It would not be sufficient to specify the F0 values at times b and d, respectively. The programmer must calculate the appropriate F0 value at time c and include this value in the PL of the CB representing c. In general, to have a parameter interpolate across a segment boundary, one must calculate the value at the intermediate boundary and include it in the PL of the intermediate CB.

The syntax of the language used to describe the speech sample is as follows:

CONTROL BLOCK FORMAT:

    CBNAME, CC TT P
            PNAME
            NCNAME

where CBNAME, PNAME, and NCNAME are symbolic names of up to six alphanumeric symbols starting with a letter. They name the CB, the PL, and the next CB. CC = "SS," "IB," or "IF" for steady state, interpolate backward, or interpolate forward, respectively. TT = number of 5-msec time units in octal, 0 ≤ TT ≤ 77 (octal). P = optional flag for the display routine.

PARAMETER LIST FORMAT:

    PNAME, PN1 PV1
           ...
           PNm PVm E

where PN = parameter address name, e.g., "AV," "F0," "K1"; PV = parameter data value in octal, 0 ≤ PV ≤ 77 (octal); and E = end flag, included only for the last parameter in the current parameter list. According to this syntax, we would code the sound displayed in Figure 8 as follows:

    *12000
    CONTR1, IF 10
            LIST1
            CONTR2
    LIST1,  AV 40
            F0 40
            F1 05
            F2 20 E
    CONTR2, SS 50
            LIST2
    LIST2,  F1 15
            F2 30 E
            PAUSE

Given that the data structure is held together with address pointers, this particular ordering of the CBs and PLs is not mandatory.
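The CB/PL scheme above amounts to a linked list of timed segments, each optionally pointing to a list of boundary values, with linear interpolation toward the next boundary. A rough Python model of the /ba/ coding (the class names and rendering loop are our own reconstruction; the real program stores packed octal words and runs as a PDP-8 assembly subroutine, and the IB mode is omitted here for brevity):

```python
class PL:
    """Parameter list: parameter name -> 6-bit value at a segment
    boundary (the E end flag is implicit in the dict's extent)."""
    def __init__(self, values):
        self.values = dict(values)

class CB:
    """Control block: mode ('SS' or 'IF'; 'IB' omitted), duration in
    5-msec ticks, pointer to a PL, and pointer to the next CB."""
    def __init__(self, mode, ticks, pl=None, nxt=None):
        self.mode, self.ticks, self.pl, self.nxt = mode, ticks, pl, nxt

def render(first_cb):
    """Walk the CB chain, emitting one {param: value} frame per 5-msec
    tick. 'IF' ramps linearly from this boundary's values toward the
    next CB's values; 'SS' holds them."""
    frames, state, cb = [], {}, first_cb
    while cb is not None:
        if cb.pl:
            state.update(cb.pl.values)          # values at this boundary
        nxt = cb.nxt.pl.values if (cb.nxt and cb.nxt.pl) else {}
        for t in range(cb.ticks):
            frame = dict(state)
            if cb.mode == 'IF':
                for k, v in nxt.items():
                    if k in state:              # ramp toward next boundary
                        frame[k] = state[k] + (v - state[k]) * t / cb.ticks
            frames.append(frame)
        cb = cb.nxt
    return frames

# The /ba/ coding from the text: an 8-tick (10 octal, 40-msec)
# interpolated transition, then a 40-tick (50 octal, 200-msec)
# steady state vowel.
list2 = PL([('F1', 0o15), ('F2', 0o30)])
contr2 = CB('SS', 0o50, list2)
list1 = PL([('AV', 0o40), ('F0', 0o40), ('F1', 0o05), ('F2', 0o20)])
contr1 = CB('IF', 0o10, list1, contr2)
frames = render(contr1)
```

As in the paper's scheme, a PL object could be shared by several CBs, since the CBs hold only references.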
Two additional codes are necessary. Before the first CB or PL, one must include a memory origin statement of the form *1XXXX, where 0 ≤ XXXX ≤ 7377 (octal). This will cause the first CB or PL to be assembled at location 1XXXX. After the last CB or PL, one must include an end flag, simply: PAUSE. Comments may be added in the coding by preceding them with a /.

Figure 10. Schematic diagram of the syllable /zi/.

SYNTHESIZER PERFORMANCE

In this section, some comments on the performance of the synthesizer will be presented along with a more complex example of stimulus coding and synthesis. For an example, we will consider the synthesis of the syllable /zi/. A schematic diagram of the syllable is shown in Figure 10; the coding is shown in Figure 11, and a spectrogram of the resulting sound is shown in Figure 12.

Figure 11. Data coding for the syllable /zi/.

In coding speech sounds, we have found that the most natural sounding results are obtained by gradually bringing up the amplitude at the beginning and gradually bringing it down at the end. In our example, the first 10 msec, controlled by CON1, and the last 10 msec, controlled by CON8, accomplish this purpose. Note that the frequencies for the vowel in /zi/ are specified in the first parameter list, PAR1, although they are not actually used until, at 70 msec, PAR3 sets AV to 16 dB. This is done because setting AV to a high value directly from -∞ at the same time as setting the formant frequencies sometimes results in a distorted signal. In general, a certain amount of care must be taken whenever specifying rapid parameter transitions. Especially susceptible are the bandwidth controls. Bandwidth transitions which change several steps at a time will usually cause sharp transients (i.e., clicks) to occur in the output.

Another problem that we have had with the synthesizer is that of repeatability. Each time we synthesize a sound, we may not get exactly the same sound. This occurs because both the noise and voicing sources are free-running; if we start our speech sample at a different
time, the source may be intercepted at a different point. With the noise source, this is really not a problem. With the voice source, however, the difference is detectable. As a solution to this problem, we have installed circuitry to allow the computer to monitor the state of the F0 pulse. In the circuit, the F0 pulse is fed through a voltage follower to a Schmitt trigger which sets a flip-flop when the F0 pulse reaches a certain voltage in a positive direction. When a skip-on-F0 instruction is executed by the computer, the output of this flip-flop is gated to the PDP-8/L skip bus. If the F0 flip-flop has been set, the skip will occur. If the skip fails, the program jumps back to test again, until it succeeds. By delaying initiation of synthesis until the rising edge of the F0 pulse is encountered, repeatability of the stimuli is insured.

Figure 12. Sound spectrogram of /zi/.

USING THE PROGRAM

In order to use the synthesizer control subroutine from a main program, one uses the following code:

    JMS I SPEAK
    ARG1
    ARG2

ARG1 may include one or both of the two commands, plot enable (PE) and clear (CL). When PE is specified, the spectrogram will be plotted. If CL is specified, certain tables within the program are cleared. If the tables are not cleared, one can use a SS or IB CB to continue or interpolate from the values set at the end of the last call to the subroutine. ARG2 is the memory address of the first CB of the sound to be synthesized.

Figure 13. Computer-generated display of /zi/.

Figure 14. Computer-generated display of /sIdauski/.

The display routine plots axes and the center frequency of each formant on a Tektronix RM503 oscilloscope. Solid lines are plotted when a formant has an amplitude other than -∞, unless it has a bandwidth larger than the minimum, in which case a dashed line is plotted. The horizontal axis of the display represents time in 100-msec increments, covering from 0 to 1,000 msec. The vertical axis represents frequency from 0 to 10 kHz. Setting the P flag in a CB will reset the display
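The plotting rule just described (no line when the formant is off, dashed when its bandwidth has been widened, solid otherwise) reduces to a small decision function; a sketch, with our own names for the inputs:

```python
def line_style(amplitude_db, bandwidth_increment):
    """Choose how a formant track is drawn, following the display rules
    in the text: nothing when the formant is off (amplitude of -inf),
    dashed when its bandwidth is widened beyond the minimum, solid
    otherwise."""
    if amplitude_db == float('-inf'):
        return None         # formant off: no line plotted
    if bandwidth_increment > 0:
        return 'dashed'     # widened bandwidth
    return 'solid'
```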
to the 0 point of the time scale. The display routine will display the stimuli while the synthesizer produces them. This feature is quite useful for debugging stimuli during preparation. Figure 13 shows the scope display of the syllable /zi/ as programmed in Figures 10 and 11. Figure 14 shows the display of a slightly more complex example, /sIdauski/.

REFERENCE NOTE

1. Coker, C. H., Denes, P. B., & Pinson, E. N. Speech synthesis: An experiment in electronic speech production. Bell Telephone Laboratories, 1963.

REFERENCES

Dudley, H., & Tarnoczy, T. H. The speaking machine of Wolfgang von Kempelen. Journal of the Acoustical Society of America, 1950, 22, 151-166.

Fant, G., & Martony, J. Instrumentation for parametric synthesis (OVE-II). Quarterly progress report, Speech Transmission Laboratory, Stockholm, July 1962. Pp. 18-24.

Flanagan, J. L. Note on the design of "terminal analog" speech synthesizers. Journal of the Acoustical Society of America, 1957, 29, 300-310.

Flanagan, J. L. The synthesis of speech. Scientific American, 1972, 226, 48-58.

Holmes, J. N. Speech synthesis. London: Mills & Boon, 1972.

Holmes, J. N., Mattingly, I. G., & Shearme, J. N. Speech synthesis by rule. Language and Speech, 1964, 7, 127-143.

Liljencrants, J. C. W. A. The OVE-III speech synthesizer. IEEE Transactions on Audio and Electroacoustics, 1968, AU-16, 137-140.

Mattingly, I. G. Synthesis by rule of general American English. Status reports on speech research (Suppl.). New York: Haskins Laboratories, 1968.

Rahimi, M. A., & Eulenberg, J. B. A computer terminal with synthetic speech output. Behavior Research Methods & Instrumentation, 1974, 6, 255-258.

Tomlinson, I. G. SPASS: An improved terminal analog speech synthesizer. Quarterly progress report, MIT Research Laboratory of Electronics, Cambridge, Mass., Vol. 80, 1966.

NOTE

1. The OVE-IIId speech synthesizer is manufactured by A. B. Fonema, Box 1010, S-640 25 Julita, Sweden.
