Survey: Exploring Disfluencies for Speech To Text Machine Translation

Nikhil Saini, Preethi Jyothi and Pushpak Bhattacharyya


Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai, India
{nikhilra, pjyothi, pb}@cse.iitb.ac.in

Abstract

Spoken language differs from written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent text to fluent text is of great value. This survey paper discusses disfluencies present in speech and its transcriptions. We then describe methodologies for correcting disfluencies in the transcriptions of a spoken utterance via two approaches, viz., a) style transfer for disfluency correction, and b) transfer learning and language model pretraining. We observe that disfluency is an inherent speech phenomenon and that its correction is crucial for downstream NLP tasks.

1 Introduction

Natural Language and Speech Processing strives to build machines that understand, respond to and generate text and voice data in the same way humans do. NLP and speech come under the umbrella of Artificial Intelligence, a branch of computer science. NLP and speech processing have come a long way from rule-based systems to traditional statistical systems to machine learning and deep learning based systems. NLP and speech systems enable machines to understand the whole meaning of text or speech, as well as the intent and sentiment of the writer or speaker.

Natural language and speech processing is the driving force behind many computer programs: those that translate text from one language to another, respond to spoken commands, correct spelling and grammar and prompt suggestions on keyboards, recommend movies and shows on streaming websites, recommend products on e-shopping websites, and power speech-to-text dictation systems, chat-bots, search engines, fitness apps, sleep monitoring, spam detection in email, and many more.

In Natural Language Processing (NLP), it becomes more and more critical to deal with spontaneous speech, such as dialogs between two people or even multi-party meetings. The goal of this processing can be translation, text summarization, spoken language translation, real-time audio dubbing or subtitle generation, or simply the archiving of a dialog or a meeting in written form.

Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filler pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence “well we’re actually uh we’re getting ready” has the fluent form “we’re getting ready”. Here, “well”, “uh”, and “we’re actually” are discourse, filler, and restart disfluencies, respectively.

Disfluencies in text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pretrained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent text to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). The effect is profound for pre-trained language models (Devlin et al., 2019; Edunov et al., 2018) that are typically trained to expect fluent language. Various systems, such as system user interfaces and speech-to-speech translation systems, suffer due to disfluencies. Additionally, it is crucial to model disfluencies for higher-level natural language processing tasks such as information extraction, summarization, and parsing from transcribed textual inputs.
In the tasks of parsing and machine translation (Rao et al., 2007), it has been observed that disfluencies adversely affect performance. Most existing NLP tools, such as pre-trained language models (Devlin et al., 2019) and translators (Edunov et al., 2018), are developed for well-formed fluent text without consideration of disfluency. Therefore, in spite of their very high accuracy on fluent text, utilizing them on disfluent (transcribed from spoken) text is relatively less accurate. For example, to predict sentiment in a customer-care scenario, we could potentially use pre-trained language models and sentence classifiers if we could make the transcribed text nearly fluent.

2 Disfluency

2.1 Conversational Speech

In contrast to well-formed texts such as newspapers, Wikipedia pages, blogs, books, manuscripts, formal letters/documents, etc., conversational/spontaneous speech has a very high degree of freedom and includes a very high number of utterances that are not fluent or clean. The elements that make an utterance non-fluent are termed disfluencies.

Disfluent speech and its disfluent transcriptions pose problems for various downstream NLP tasks, since these tasks mostly deal with text that is well-formed and formatted. It is therefore difficult for such models to handle the irregularities present in speech data in the form of disfluencies. Moreover, since speech is becoming increasingly important across the linguistic landscape, it is of utmost importance to remove irregularities present in speech utterances so that a clean utterance can be utilized by other NLP applications like Machine Translation, Speech-to-Speech Translation, Summarization, Question Answering, etc.

The problems pertaining to transcripts of conversational speech can be broadly summarized as follows (but are not limited to these); a small cleanup sketch illustrating the first problem follows the list:

1. Presence of disfluent terms/phrases: Spoken utterances usually contain several disfluent terms in a single utterance, which the speaker did not intend to speak and which must be processed before use in a downstream NLP task.
Disfluent: “well we’re actually uh we’re getting ready”

2. Incorrect grammar in the spoken utterance: Often, speakers do not care much about exact grammar when communicating via speech. This introduces irregularity in the utterance.
Incorrect grammar: “i are getting ready”

3. Incomplete utterances: Automatic speech recognition systems generate transcriptions by segmenting input speech into fixed slots (say, 5 seconds). This leads to utterances that may be the beginning, middle or end of a sentence, and downstream NLP tasks cannot reliably handle such incomplete utterances. The related task is known as sentence boundary detection in ASR transcriptions.
Incomplete utterance: “and i told her to create”

4. Other errors introduced by the ASR system: ASR systems introduce other errors due to several factors like speaker variabilities (change in voice due to age, illness, tiredness, etc.), spoken language variabilities (pronunciation variation due to dialects and co-articulation), and mismatch factors (i.e., mismatch in recording conditions between training and testing data).
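To make the first problem concrete, the sketch below applies a naive, rule-based cleanup to the example utterance, assuming small hand-written lists of fillers and discourse markers (the word lists and the function name are illustrative, not from the paper). Such surface rules strip fillers like “uh” but leave restarts such as “we’re actually” untouched, which is what motivates the learned correction approaches surveyed in Section 4.

```python
# Illustrative lexicons; not an exhaustive or official list.
FILLERS = {"uh", "um", "ah"}
DISCOURSE_MARKERS = {"well", "okay", "so"}

def naive_cleanup(utterance: str) -> str:
    """Drop filler words and a leading discourse marker from an ASR transcript."""
    tokens = utterance.lower().split()
    kept = [t for t in tokens if t not in FILLERS]
    # Remove a discourse marker only at the start of the utterance.
    if kept and kept[0] in DISCOURSE_MARKERS:
        kept = kept[1:]
    return " ".join(kept)

print(naive_cleanup("well we're actually uh we're getting ready"))
# -> "we're actually we're getting ready"  (the restart survives; rules are not enough)
```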
2.2 Surface Structure of Disfluencies

In this section, a pattern is described that demonstrates the structure of disfluencies. These patterns are called the surface structure of disfluencies, since only those characteristics of disfluencies that are observable from the text are considered. A disfluency can be divided into three parts: the reparandum, then an interruption point, after which comes the interregnum, followed by the repair.

Figure 1 shows a breakdown example. The reparandum contains the words that were originally not intended to be in the utterance. Thus it consists of one or more words that will ultimately be repeated or corrected (in the case of a repetition/correction) or abandoned completely (in the case of a false start). The interruption point marks the offset of the reparandum. It is not connected with any pause or audible phenomenon. The interregnum can consist of an editing term, a non-lexicalized pause like uh or uhm, or simply an empty pause, i.e., a short moment of silence.
In many cases, however, the interregnum of a disfluency is empty and the repair follows directly after the reparandum. In the repair, the words from the reparandum are finally corrected or repeated (repetition/correction) or a completely new sentence is started (false start). Note that in the latter case, the extent of the repair cannot be determined.

The three terms reparandum, interregnum, and repair can be used to explain repetitions, false starts, and editing terms. The reparandum and interregnum can be empty in a disfluent sentence. This situation fits the criteria for three disfluency types, viz., discourse markers, filled pauses and interjections, which consist only of an interregnum. Figure 2 shows a breakdown with an empty interregnum, and Figure 3 shows a breakdown with an empty reparandum and an empty repair. A minimal data-structure sketch of this decomposition follows the figures.

Figure 1: Surface structure of a disfluency. Example: “Let us, okay, let us take a look here.”, segmented into reparandum (“Let us”), interruption point, interregnum (“okay”), and repair (“let us”).

Figure 2: Disfluencies with an empty interregnum. Example: “So we will, , we can take a look here.”, with reparandum “So we will”, an empty interregnum, and repair “we can”.

Figure 3: Disfluencies with an empty reparandum and an empty repair. Example: “How about, , well, , next week?”, where only the interregnum (“well”) is filled.
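To make the surface structure concrete, here is a minimal data-structure sketch of the reparandum / interregnum / repair decomposition, applied to the Figure 1 example; the class and field names are illustrative, not taken from the paper. A disfluency correction system effectively has to recover this segmentation so that only the fluent context and the repair are kept.

```python
from dataclasses import dataclass

@dataclass
class SurfaceDisfluency:
    """One disfluency region inside an utterance (surface structure of Section 2.2)."""
    prefix: str       # fluent words before the disfluency
    reparandum: str   # words the speaker abandons; the interruption point follows them
    interregnum: str  # optional editing term or filled pause (may be empty)
    repair: str       # the corrected / repeated continuation
    suffix: str       # fluent words after the disfluency

    def fluent(self) -> str:
        """Drop reparandum and interregnum; keep prefix + repair + suffix."""
        parts = [self.prefix, self.repair, self.suffix]
        return " ".join(p for p in parts if p)

# Figure 1 example: "Let us, okay, let us take a look here."
d = SurfaceDisfluency(prefix="", reparandum="Let us,", interregnum="okay,",
                      repair="let us", suffix="take a look here.")
print(d.fluent())  # -> "let us take a look here."
```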
3 Types of Disfluencies

This section describes the different types of disfluencies that can be found in disfluent text. These disfluency types are present in the Switchboard corpus; the annotation of disfluencies can vary slightly from corpus to corpus. Disfluencies can be divided into two sub-groups, viz., simpler and complex disfluencies. Filled pauses like oh, uh, um and discourse markers like yeah, well, okay, you know are considered simpler disfluencies. Sometimes a single-word discourse marker, like the word Yeah in the sentence “Yeah, we are leaving now.”, is considered a filled pause. We differentiate between filler words and discourse markers, even in single-word occurrences, as this distinction is also present in the annotated Switchboard corpus. Next, we look into the complex disfluency types, viz., Repetition or Correction, False Start, Edit, and Aside. For the distinction between the categories Repetition or Correction and False Start, it is important to consider whether the phrase that has been abandoned is repeated with only slight or no changes in the syntactic structure. The change can be in the form of insertion, deletion, or substitution of words. A slight or absent change identifies a Repetition or Correction disfluency. On the other hand, if a completely different syntactic structure with different semantics is chosen for the repair, the observed disfluency is a false start.

The disfluency classification is important and is used to determine the type of disfluencies one wants to correct in the disfluent text. It also forms the basis for the classifiers one can train to learn disfluency type domain embeddings. Generally, the approaches do not depend on the type of disfluencies, but making explicit use of the annotated corpus and incorporating knowledge of specific disfluency types into the models is beneficial. Table 1 describes the different disfluency types, their definitions and examples; a small lexicon sketch for the simpler types follows the table.
Disfluency Type: Filled Pause
Description: Non-lexicalized sounds with no semantic content.
Constituents: uh, um, ah, etc.
Example: We’re uh getting ready.

Disfluency Type: Interjection
Description: A restricted group of non-lexicalized sounds indicating affirmation or negation. An interjection is a part of speech that demonstrates the emotion or feeling of the author.
Constituents: uh-huh, mhm, mm, uh-uh, nah, oops, yikes, woops, phew, alas, blah, gee, ugh.
Examples: 1. I dropped my phone again, ugh. 2. Oops, I didn’t mean it.

Disfluency Type: Discourse Marker
Description: Words that are related to the structure of the discourse in so far as they help beginning or keeping a turn or serve as acknowledgment. They do not contribute to the semantic content. These are also called linking words.
Constituents: okay, so, well, you know, etc.
Examples: 1. Well, this is good. 2. This is, you know, a pretty good report.

Disfluency Type: Restart or Correction
Description: Exact repetition or correction of words previously uttered. A correction may involve substitutions, deletions or insertions of words. However, the correction continues with the same idea or train of thought started previously.
Constituents: -
Examples: 1. This is is a bad bad situation. 2. Are you you happy?

Disfluency Type: False Start
Description: An utterance is aborted and restarted with a new idea or train of thought.
Constituents: -
Examples: 1. We’ll never find a day what about next month? 2. Yes no I’m not coming.

Disfluency Type: Edit
Description: Phrases of words which occur after that part of a disfluency which is repeated or corrected afterwards or even abandoned completely. They refer explicitly to the words which have just been said, indicating that those words are not intended to belong to the utterance.
Constituents: -
Example: We need two tickets, I’m sorry, three tickets for the flight to Boston.

Table 1: Disfluency Types, Description and Examples.
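The simpler types in Table 1 are essentially closed word lists, so they can be encoded directly as lexicons; the sketch below does so with an enum and a naive per-token lookup. The word lists come from the Constituents column above but are not exhaustive, and the complex types (restarts, false starts, edits) cannot be detected this way because they depend on syntactic context.

```python
from enum import Enum

class DisfluencyType(Enum):
    FILLED_PAUSE = "filled_pause"
    INTERJECTION = "interjection"
    DISCOURSE_MARKER = "discourse_marker"
    RESTART_OR_CORRECTION = "restart_or_correction"
    FALSE_START = "false_start"
    EDIT = "edit"

# Lexicons for the "simpler" types, taken from Table 1 (not exhaustive).
LEXICON = {
    DisfluencyType.FILLED_PAUSE: {"uh", "um", "ah"},
    DisfluencyType.INTERJECTION: {"uh-huh", "mhm", "mm", "uh-uh", "nah", "oops",
                                  "yikes", "woops", "phew", "alas", "blah", "gee", "ugh"},
    DisfluencyType.DISCOURSE_MARKER: {"okay", "so", "well"},
}

def tag_simple_disfluencies(tokens):
    """Label each token with a simple disfluency type, or None if no lexicon matches.
    Multi-word markers like "you know" would need phrase matching; this is per-token only."""
    labels = []
    for tok in tokens:
        hit = next((t for t, words in LEXICON.items() if tok.lower() in words), None)
        labels.append(hit)
    return labels

print(tag_simple_disfluencies(["Well", "this", "is", "uh", "good"]))
# [DisfluencyType.DISCOURSE_MARKER, None, None, DisfluencyType.FILLED_PAUSE, None]
```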

4 Approaches

In this section, we discuss two approaches to correct disfluencies in disfluent text. The problem statement is: “Correct disfluencies present in transcribed utterances (e.g., noisy ASR output) of conversational speech (e.g., telephonic conversations, delivered lectures, etc.) by removing the “disfluent” part without changing the intended meaning of the speaker.”

4.1 Style Transfer for Disfluency Correction

1. Architecture
Figure 4 shows the two directions of translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e., disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss; a sketch of this training step follows.
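The reconstruction objective just described can be sketched as follows, assuming a seq2seq model hidden behind a hypothetical translate(batch, to_domain=...) interface; the function and argument names, and the treatment of the soft translations as fixed pseudo-sources (standard back-translation practice), are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reconstruction_step(model, fluent_batch, disfluent_batch, pad_id=0):
    """One unsupervised training step: translate each domain across, translate back,
    and score the reconstruction with token-level cross-entropy."""
    # x: disfluent sentences, y: fluent sentences (token-id tensors of shape [B, T]).
    x, y = disfluent_batch, fluent_batch

    # Intermediate translations x_bar (disfluent -> fluent) and y_bar (fluent -> disfluent),
    # generated without gradients, as in standard back-translation.
    with torch.no_grad():
        x_bar = model.translate(x, to_domain="fluent")       # hypothetical interface
        y_bar = model.translate(y, to_domain="disfluent")

    # Translate back and reconstruct the original mini-batches.
    x_rec_logits = model(src=x_bar, tgt=x, to_domain="disfluent")  # [B, T, V]
    y_rec_logits = model(src=y_bar, tgt=y, to_domain="fluent")

    loss_x = F.cross_entropy(x_rec_logits.transpose(1, 2), x, ignore_index=pad_id)
    loss_y = F.cross_entropy(y_rec_logits.transpose(1, 2), y, ignore_index=pad_id)
    return loss_x + loss_y
```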
Components in a neural model can be shared minimally, completely, or in a controlled fashion. Here, complete parameter sharing is used, which treats the model as a single black box for both translation directions and offers maximum simplicity.

Advantages of parameter sharing:
• In sequence-to-sequence tasks, sharing parameters between encoders helps to improve accuracy when the different sources are related.
• Similarly, when the targets are related, parameter sharing helps to improve accuracy.
• Parameter sharing allows the model to benefit from the learning signal back-propagated from the other translation direction. Since we operate only on the English language on both the source (disfluent) and target (fluent) sides, it is imperative to utilize the benefit of parameter sharing.

Disadvantages of parameter sharing:
• Sometimes, sharing of encoders and decoders burdens the parameters with learning a large representation in limited capacity.

This bottleneck can be avoided by increasing the number of layers in both encoders and decoders. The encoder and decoder are shared for both translation directions, disfluent-to-fluent and fluent-to-disfluent. In a sequence-to-sequence transduction task, the encoder takes an input and generates a representation in the latent space; the decoder then takes it and generates a sequence in the target domain. Since the disfluent and fluent domains share almost all of the vocabulary and are in the same language, the components can learn representations from each other's loss. Moreover, since we are operating in an unsupervised setting, the sharing of parameters forces the encoder to limit the representations of both domains to a common space, thereby allowing the model to mix the knowledge of the two domains. A minimal sketch of such a fully shared model follows.
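As a structural illustration of complete parameter sharing, the sketch below uses a single encoder and a single decoder for both directions, with the direction indicated only by a domain id. The layer sizes, module choices, and the additive way the domain embedding enters the decoder input are illustrative assumptions (item 3 below concatenates the domain embedding instead).

```python
import torch
import torch.nn as nn

class SharedSeq2Seq(nn.Module):
    """One encoder + one decoder shared by both translation directions."""
    def __init__(self, vocab_size, d_model=512, n_domains=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.domain_emb = nn.Embedding(n_domains, d_model)  # 0 = disfluent, 1 = fluent
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, to_domain):
        # The same parameters are used regardless of direction; only the target-domain
        # embedding added to the decoder input tells the model which way to translate.
        # (Causal and padding masks are omitted for brevity.)
        memory = self.encoder(self.tok_emb(src))
        dom = self.domain_emb(to_domain).unsqueeze(1)         # [B, 1, d_model]
        out = self.decoder(self.tok_emb(tgt) + dom, memory)   # broadcast over time
        return self.proj(out)                                 # [B, T, vocab]
```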
Borrowing from prior work on an unsupervised style transfer model (He et al., 2020), the decoder is conditioned on a domain embedding that specifies the direction of translation. There are two types of embeddings: a vanilla binary domain embedding, which takes a bit as input to indicate whether the input text is fluent or disfluent, and a classifier-based domain embedding. The latter is obtained from a standalone trained CNN-based classifier (Kim, 2014) that predicts the disfluency type of a disfluent input sentence. (Here, it is assumed that disfluency type labels are available for the disfluent sentences in the training data.) The penultimate layer of the classifier acts as the classifier embedding, which is then used to condition the decoder. It is hypothesized that additional information about disfluency types via the classifier-based embedding might help guide the process of disfluency correction better; a sketch of this classifier-based embedding is given below.

Figure 4: Illustration of the style transfer model, modified to use a type embedding drawn from a pretrained CNN classifier. The figure shows the fluent and disfluent domains with shared transduction (encoder-decoder) parameters, their latent representations, and a pretrained classifier (convolutional and fully connected layers) that supplies the type embedding.
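A minimal sketch of the classifier-based domain embedding, assuming a Kim (2014)-style CNN sentence classifier trained on disfluency-type labels: the penultimate fully connected layer provides the embedding used to condition the decoder. Dimensions, filter sizes, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DisfluencyTypeCNN(nn.Module):
    """Kim (2014)-style CNN classifier; its penultimate layer doubles as the
    classifier-based domain embedding for the style-transfer decoder."""
    def __init__(self, vocab_size, n_types, emb_dim=300, n_filters=100,
                 kernel_sizes=(3, 4, 5), emb_out=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), emb_out)  # penultimate layer
        self.out = nn.Linear(emb_out, n_types)                       # disfluency types

    def forward(self, tokens):                        # tokens: [B, T]
        x = self.emb(tokens).transpose(1, 2)          # [B, emb_dim, T]
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        feats = torch.relu(self.fc(torch.cat(pooled, dim=1)))   # [B, emb_out]
        return self.out(feats), feats                 # logits + classifier embedding

# Usage: the logits train the classifier; `feats` conditions the decoder.
```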
2. Choice of Encoder-Decoder Cells:
• Bi-LSTM
• Transformer

3. Domain Embedding in Transformer: Figure 5 illustrates the conditioning of the transformer-based decoder. A dimensionality-reduced word embedding is concatenated with the domain embedding DE at every time-step t to form the input to the decoder; a small sketch follows the figure.

Figure 5: Induction of the domain embedding: demonstration of domain embeddings in the transformer decoder. Pred(i = 1) and Input(i = 1) are the decoder's prediction and the input to the decoder at the i-th time-step, respectively; at each step the (dimensionality-reduced) token embedding is concatenated with the domain embedding DE.
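A sketch of the decoder-input construction in item 3: a dimensionality-reduced token embedding is concatenated with the domain embedding DE at every time-step. The 1024/512 sizes follow the labels visible in Figure 5 but, like the linear projection used for the reduction, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DomainConditionedDecoderInput(nn.Module):
    """Builds decoder inputs by concatenating a reduced token embedding with a
    domain embedding DE at every time-step, as in Figure 5."""
    def __init__(self, vocab_size, n_domains=2, word_dim=1024, reduced_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.reduce = nn.Linear(word_dim, reduced_dim)       # dimensionality reduction
        self.domain_emb = nn.Embedding(n_domains, reduced_dim)

    def forward(self, tokens, domain_id):
        # tokens: [B, T] token ids; domain_id: [B] target-domain ids.
        w = self.reduce(self.word_emb(tokens))               # [B, T, reduced_dim]
        d = self.domain_emb(domain_id)                       # [B, reduced_dim]
        d = d.unsqueeze(1).expand(-1, w.size(1), -1)         # repeat DE at every step
        return torch.cat([w, d], dim=-1)                     # [B, T, reduced_dim * 2]

# The concatenated [B, T, 1024] tensor is then fed to the transformer decoder.
```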

4.2 Seq2Seq with MASS Pretraining Objective

This is an encoder-decoder model built on Transformer encoder-decoder cells. MASS, Masked Sequence to Sequence pre-training (Song et al., 2019), is a novel pretraining method for language generation tasks. It randomly masks a fragment of the input sequence on the encoder side and then predicts that fragment on the decoder side. Figure 6 shows the masked language modeling objective for language generation.

Figure 6: A novel pretraining objective for language generation.

Figure 7 shows a novel pretraining loss for large-scale supervised neural machine translation.

Figure 7: A novel pretraining loss for supervised learning.

Masked Language Modeling (MLM) can be seen in BERT, which is built on Transformer encoder layers; Standard Language Modeling (SLM) can be seen in GPT-2, which is built on Transformer decoder layers. Let the number of masked/hidden words be defined by a parameter k. Masked language modeling in BERT corresponds to k = 1, and standard language modeling in GPT-2 corresponds to k = m (where m is the length of the output sequence). The model structure of MASS varies between k = 1 and k = m; a small sketch of the masking operation follows.
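A minimal sketch of the MASS masking operation described above: a contiguous fragment of k tokens is masked on the encoder side and becomes the decoder's prediction target. Token lists and helper names are illustrative; k = 1 degenerates to a BERT-like masked-LM case and k = m (the whole sequence) to a GPT-2-like standard-LM case.

```python
import random

MASK = "[MASK]"

def mass_mask(tokens, k, start=None):
    """Mask a contiguous fragment of length k; return (encoder_input, decoder_target)."""
    m = len(tokens)
    k = max(1, min(k, m))
    if start is None:
        start = random.randint(0, m - k)          # random span position
    enc_in = tokens[:start] + [MASK] * k + tokens[start + k:]
    dec_tgt = tokens[start:start + k]             # fragment the decoder must predict
    return enc_in, dec_tgt

sent = ["we", "are", "getting", "ready", "now"]
print(mass_mask(sent, k=2, start=2))  # (['we', 'are', '[MASK]', '[MASK]', 'now'], ['getting', 'ready'])
print(mass_mask(sent, k=1, start=1))  # BERT-like: a single masked token is predicted
print(mass_mask(sent, k=len(sent)))   # GPT-2-like: the whole sequence is predicted
```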
To pretrain the language model, a publicly available clean text corpus in the desired language is used. These sentences do not contain disfluencies and serve as a proxy for a large fluent corpus in the desired language. The following experimental settings can be used to train and evaluate the model (a configuration sketch follows the list):

1. Language Modeling: pretraining only on fluent sentences.

2. Language Modeling and Supervised Training: both the language modeling and supervised training steps are run on their respective datasets in each epoch.

3. Supervised Training: training on a disfluent-to-fluent parallel corpus.

4. Language Modeling and Supervised Training with a pretrained encoder: reload the encoder from a model pretrained in the same language.

5. Language Modeling and Supervised Training with a pretrained encoder and decoder: reload both the encoder and the decoder from a pretrained model (where the source language is the same as the language being considered for disfluency correction).
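These five regimes differ only in which objectives are active and which weights are reloaded, so they can be summarized in a small configuration sketch; the field and setting names below are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrainingSetting:
    name: str
    language_modeling: bool       # MASS-style pretraining on fluent text
    supervised: bool              # disfluent-to-fluent parallel training
    reload_encoder: bool = False  # start from a pretrained encoder
    reload_decoder: bool = False  # start from a pretrained decoder

SETTINGS = [
    TrainingSetting("1. LM only",                  True,  False),
    TrainingSetting("2. LM + supervised",          True,  True),
    TrainingSetting("3. Supervised only",          False, True),
    TrainingSetting("4. LM + supervised, enc",     True,  True, reload_encoder=True),
    TrainingSetting("5. LM + supervised, enc+dec", True,  True, reload_encoder=True,
                    reload_decoder=True),
]
```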
5 Conclusion

This paper presented the problem of disfluency as an inherent phenomenon in conversational speech and its transcriptions. We discussed the definition of disfluency, the types of disfluencies, and their surface structures. We then discussed two broad approaches to correct disfluencies in ASR transcriptions: i) style-transfer based disfluency correction and ii) disfluency correction using the pretraining and language modeling objectives of MASS. We observed that comparatively little research has been done on the correction of disfluencies in text and speech, although much more exists on their detection. However, disfluency correction in languages beyond English, and end-to-end disfluency correction combined with other downstream NLP tasks like machine translation and speech-to-text translation, is an active and promising area of research.
References

Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Qianqian Dong, Feng Wang, Zhen Yang, Wei Chen, Shuang Xu, and Bo Xu. 2019. Adapting translation models for transcript disfluency detection. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6351–6358.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.

Mark Johnson and Eugene Charniak. 2004. A TAG-based noisy channel model of speech repairs. pages 33–39.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Sharath Rao, Ian Lane, and Tanja Schultz. 2007. Improving spoken language translation by automatic disfluency removal: Evidence from conversational speech transcripts. Machine Translation Summit XI.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936.

Feng Wang, Wei Chen, Zhen Yang, Qianqian Dong, Shuang Xu, and Bo Xu. 2018. Semi-supervised disfluency detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3529–3538, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2016. Disfluency detection using a bidirectional LSTM. CoRR, abs/1604.03209.
