translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e., disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.
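As a rough formalization of this objective (notation ours): write x for a disfluent sentence and y for a fluent one, and let ȳ = f_{D→F}(x) and x̄ = f_{F→D}(y) be the soft translations produced by the two directions for the current mini-batch (one consistent reading of the x̄ and ȳ in Figure 4). The reconstruction loss is then the sum of token-level cross-entropies between each input and its back-translated reconstruction,

\[
\mathcal{L}_{\text{rec}} = -\sum_{t} \log p_{F\to D}\big(x_t \mid \bar{y},\, x_{<t}\big) \;-\; \sum_{t} \log p_{D\to F}\big(y_t \mid \bar{x},\, y_{<t}\big),
\]

summed over all sentences in the mini-batch.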
Components in a neural model can be shared minimally, completely, or in a controlled fashion. Here, complete parameter sharing is used, which treats the model as a black box for both translation directions and offers maximum simplicity.

Advantages of Parameter Sharing:
• In sequence-to-sequence tasks, sharing parameters between encoders helps to improve accuracy when the different sources are related.
• Similarly, when the targets are related, parameter sharing helps to improve accuracy.
• Parameter sharing lets the model benefit from the learning signal back-propagated through the losses of the different translation directions. Since both the source (disfluent) and the target (fluent) are English, it is imperative to exploit this benefit.

Disadvantages of Parameter Sharing:
• Sometimes, sharing encoders and decoders burdens the parameters with learning a large representation in limited space.

This bottleneck can be avoided by increasing the number of layers in both encoders and decoders. The encoders and decoders are shared for both translation directions, disfluent-to-fluent and fluent-to-disfluent.
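A minimal sketch of this complete sharing (ours, in PyTorch, not the authors' code; the vocabulary size, model dimensions, and toy batches are arbitrary, and attention masks are omitted for brevity): a single encoder-decoder is used for both directions, so the losses of both directions back-propagate into the same parameters.

import torch
import torch.nn as nn

class SharedTransducer(nn.Module):
    """One encoder and one decoder shared by both translation directions."""
    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # The same parameters are used whether the source is disfluent or fluent.
        out = self.transformer(self.embed(src_tokens), self.embed(tgt_tokens))
        return self.proj(out)

model = SharedTransducer()
criterion = nn.CrossEntropyLoss()

# Toy mini-batches of token ids standing in for the two domains.
disfluent = torch.randint(0, 32000, (4, 20))
fluent = torch.randint(0, 32000, (4, 18))

# Two directions, one set of weights: both losses update the shared model.
loss_d2f = criterion(model(disfluent, fluent[:, :-1]).reshape(-1, 32000),
                     fluent[:, 1:].reshape(-1))
loss_f2d = criterion(model(fluent, disfluent[:, :-1]).reshape(-1, 32000),
                     disfluent[:, 1:].reshape(-1))
(loss_d2f + loss_f2d).backward()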
In a sequence-to-sequence transduction task, the encoder takes an input and generates a representation in the latent space; the decoder then takes that representation and generates a sequence in the target domain. Since the disfluent and fluent domains share almost all of their vocabulary and belong to the same language, the shared components can learn representations from each other's loss. Moreover, since we are operating in an unsupervised setting, the sharing of parameters forces the encoder to confine the representations of both domains to a common space, thereby allowing the model to mix the knowledge of the two domains.

[Figure: shared transduction parameters, with encoder/decoder (Enc, Dec) pairs and embeddings (Emb) for the Fluent and Disfluent domains, and a pretrained latent classifier built from a convolutional layer and fully connected (FC) layers.]

3. Domain Embedding in Transformer: Figure 5 illustrates the conditioning of the transformer-based decoder. The dimensionality-reduced word embedding is concatenated with the domain embedding DE at every time-step t to form the input for the decoder (a small sketch of this concatenation follows the figure).

[Figure 5: conditioning of the transformer-based decoder. At each time step the decoder input (BOS, Pred 1, ...) is the word embedding concatenated with the domain embedding DE and a type embedding; the figure annotates dimensions of 512 and 1024.]
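A minimal sketch of this conditioning (ours, in PyTorch; the embedding sizes and the two-domain setup are illustrative): the token embedding at each decoder time step is concatenated with a learned domain embedding DE before being fed to the decoder.

import torch
import torch.nn as nn

vocab_size, tok_dim, dom_dim = 32000, 512, 512
tok_embed = nn.Embedding(vocab_size, tok_dim)  # (dimensionality-reduced) word embeddings
dom_embed = nn.Embedding(2, dom_dim)           # DE: 0 = disfluent domain, 1 = fluent domain

def decoder_inputs(tokens, domain_id):
    # tokens: (batch, T) ids of the previously generated tokens (BOS, Pred 1, ...);
    # domain_id selects the target domain for the whole sequence.
    batch, T = tokens.shape
    tok = tok_embed(tokens)                                               # (batch, T, 512)
    dom = dom_embed(torch.full((batch, T), domain_id, dtype=torch.long))  # (batch, T, 512)
    return torch.cat([tok, dom], dim=-1)                                  # (batch, T, 1024)

# Build decoder inputs for generating fluent text from a toy batch.
prev_tokens = torch.randint(0, vocab_size, (4, 10))
dec_in = decoder_inputs(prev_tokens, domain_id=1)
print(dec_in.shape)  # torch.Size([4, 10, 1024])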
4. Language Modeling and Supervised Training with Pretrained Encoder: Reload the encoder from a model pretrained on the same language.

5. Language Modeling and Supervised Training with Pretrained Encoder and Decoder: Reload both the encoder and the decoder from a pretrained model whose source language is the same as the language being considered for disfluency correction (a weight-loading sketch covering variants 4 and 5 is given below).
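As a minimal weight-loading sketch for both variants (ours; the checkpoint path and the "encoder."/"decoder." parameter prefixes are hypothetical, and the checkpoint is assumed to be a plain state_dict):

import torch

def load_pretrained(model, ckpt_path, load_decoder=False):
    # Variant 4 reloads only the encoder; variant 5 reloads encoder and decoder.
    state = torch.load(ckpt_path, map_location="cpu")
    prefixes = ("encoder.", "decoder.") if load_decoder else ("encoder.",)
    subset = {k: v for k, v in state.items() if k.startswith(prefixes)}
    model.load_state_dict(subset, strict=False)  # leave the remaining modules untouched
    return model

# Variant 4: pretrained encoder only; the decoder stays randomly initialised.
#   model = load_pretrained(model, "pretrained_lm.pt")
# Variant 5: pretrained encoder and decoder.
#   model = load_pretrained(model, "pretrained_lm.pt", load_decoder=True)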
5 Conclusion

This paper presented the problem of disfluency as an inherent phenomenon in conversational speech and its transcriptions. We discussed the definition of disfluency, the types of disfluencies, and their surface structures. We discussed two broad approaches to correcting disfluencies in ASR transcriptions: i) style-transfer based disfluency correction and ii) disfluency correction using the pretraining and language modeling objectives of MASS. We observed that very little research has been done on correcting disfluencies in text and speech, although much more has been done on detecting them. However, disfluency correction in languages

References

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.

Mark Johnson and Eugene Charniak. 2004. A TAG-based noisy channel model of speech repairs. pages 33–39.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Sharath Rao, Ian Lane, and Tanja Schultz. 2007. Improving spoken language translation by automatic disfluency removal: Evidence from conversational speech transcripts. Machine Translation Summit XI.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936.

Feng Wang, Wei Chen, Zhen Yang, Qianqian Dong, Shuang Xu, and Bo Xu. 2018. Semi-supervised disfluency detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3529–3538, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2016. Disfluency detection using a bidirectional LSTM. CoRR, abs/1604.03209.