
CONTINUAL LEARNING IN MACHINE SPEECH CHAIN

USING GRADIENT EPISODIC MEMORY

Geoffrey Tyndall1∗, Kurniawati Azizah1, Dipta Tanaya1, Ayu Purwarianti2, Dessi Puji Lestari2, Sakriani Sakti3,4

1 University of Indonesia, Indonesia
2 Bandung Institute of Technology, Indonesia
3 Nara Institute of Science and Technology, Japan
4 Japan Advanced Institute of Science and Technology, Japan

[email protected], {kurniawati.azizah,diptatanaya}@cs.ui.ac.id, {ayu,dessipuji}@itb.ac.id, [email protected]

arXiv:2411.18320v1 [cs.CL] 27 Nov 2024

∗ This work was conducted while the first author was doing an internship at the HA3CI Laboratory, JAIST, Japan, under the JST Sakura Science Program.

ABSTRACT

Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. We show the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.

Index Terms— Machine Speech Chain, Continual Learning, Gradient Episodic Memory

1. INTRODUCTION

The exceptional performance of deep learning architectures, as illustrated by the Transformer model [1], has enabled state-of-the-art automatic speech recognition (ASR) systems to reach levels of accuracy comparable to human performance [2, 3, 4]. These advancements have significantly enhanced speech recognition capabilities. However, a critical challenge persists: ASR systems should be capable of recognizing a continuous stream of tasks. Despite the existence of large-scale speech models [5, 6] that excel in multitask performance, these models demand substantial resources in terms of data and computational power, and they require the availability of all tasks from the beginning, i.e., offline learning.

An alternative approach to this issue is fine-tuning, which transfers knowledge from one task to another, or multitask learning, where the model is trained from scratch using both previous and new task data simultaneously. Unfortunately, the former approach (transfer learning) can degrade the model's performance on earlier tasks due to catastrophic forgetting [7]. Meanwhile, the latter approach necessitates retaining old data to mix with new task data, potentially leading to privacy concerns.

Continual learning is a paradigm designed to allow models to learn new tasks sequentially without compromising their ability to perform previous tasks or violating data privacy. Its effectiveness in sequentially handling multiple recognition tasks was recently demonstrated in [8].

Unlike existing fully supervised methods for conducting continual learning experiments on ASR, this paper proposes a semi-supervised approach within the machine speech chain framework [9]. Our method integrates text-to-speech (TTS) to support a replay mechanism in continual learning. We adopt gradient episodic memory (GEM) [10] as our chosen implementation for this replay-based continual learning scenario.

We evaluate our proposed method against other prevalent learning paradigms such as fine-tuning and multitask learning. Our results indicate that continual learning within the machine speech chain framework offers superior performance compared to these traditional methods and serves as a viable alternative to fully supervised continual learning. Although the upper bound fully supervised continual learning achieves a lower error rate, our approach manages to achieve a 40% average error rate reduction relative to fine-tuning. Therefore, our contributions include: (1) proposing a machine speech chain-based method for enabling continual learning in speech recognition; and (2) conducting experiments to validate our method using the LJ Speech dataset.

2. RELATED WORK

2.1. Machine Speech Chain

Machine speech chain is an architecture that connects sequence-to-sequence models of automatic speech recognition (ASR) and text-to-speech (TTS) in a closed-loop framework. This integration was proposed as a representation of the human speech chain mechanism [9], which is listening while speaking [11]. To date, the machine speech chain has been used in various works, including adaptive Lombard TTS [12], data augmentation [13], and code-switching [14].
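To illustrate the closed loop, the sketch below shows the two unsupervised directions of a machine speech chain cycle in the spirit of [9]: ASR transcribes unpaired speech so that TTS can be trained to reconstruct it, and TTS synthesizes speech from unpaired text so that ASR can be trained to transcribe it. The function names (`speech_chain_cycle`, `mse`, and the `asr`, `tts`, `text_loss` arguments) are illustrative placeholders of our own, not an API from the paper.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two (pseudo-)spectrogram sequences."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def speech_chain_cycle(asr, tts, unlabeled_speech, unlabeled_text, text_loss):
    """One semi-supervised cycle of the machine speech chain (after [9]).

    The two models supervise each other through each other's output:
      speech -> ASR -> text_hat   -> TTS -> speech_hat   (speech-only data)
      text   -> TTS -> speech_hat -> ASR -> text_hat     (text-only data)
    """
    # listening-while-speaking direction: the reconstruction loss is used to update TTS
    speech_hat = tts(asr(unlabeled_speech))
    tts_loss = mse(speech_hat, unlabeled_speech)

    # speaking-while-listening direction: the transcription loss is used to update ASR
    text_hat = asr(tts(unlabeled_text))
    asr_loss = text_loss(text_hat, unlabeled_text)

    return asr_loss, tts_loss

# toy usage with identity "models": both losses are zero
loss_asr, loss_tts = speech_chain_cycle(lambda s: s, lambda t: t,
                                        unlabeled_speech=[0.1, 0.2],
                                        unlabeled_text=[1, 2],
                                        text_loss=lambda a, b: float(a != b))
print(loss_asr, loss_tts)  # -> 0.0 0.0
```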

2.2. Gradient Episodic Memory

Gradient episodic memory (GEM) is a replay-based method in the continual learning paradigm [10]. GEM exploits samples from past tasks' data when encountering the data of a new task, minimizing the L2 distance between the gradient of the new task's data and the gradients of the old tasks' data, i.e.,

\min_{\tilde{g}} \; \frac{1}{2} \lVert g - \tilde{g} \rVert_2^2 \quad \text{s.t.} \; \langle \tilde{g}, g_k \rangle \geq 0, \; \forall k \in \{0, \ldots, i-1\},   (1)

where g, \tilde{g}, g_k \in \mathbb{R}^{|\theta|} and |\theta| is the number of model parameters. In a previous finding (see [8]), an ASR model equipped with GEM outperformed regularization-based methods, such as synaptic intelligence [15] or knowledge distillation [16], in a continual learning scenario where different acoustic and topic domains acted as task boundaries. In this paper, we introduce the use of GEM in the machine speech chain and demonstrate its potential first-hand.
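The projection in Eq. (1) can be made concrete with a small numerical sketch. The snippet below is our own illustration, not code from [10] or from this paper: with a single stored task the projection has a closed form, while the general multi-constraint case is solved as a small quadratic program in [10]; iterating one constraint at a time, as done here, is only an approximation of that QP.

```python
import numpy as np

def project_gradient(g, memory_grads):
    """Illustrative sketch of the GEM projection in Eq. (1).

    g            : gradient of the current task's loss, shape (|theta|,)
    memory_grads : list of gradients g_k computed on episodic memories M_k

    If no constraint <g, g_k> >= 0 is violated, g is returned unchanged.
    """
    g_tilde = g.copy()
    for g_k in memory_grads:
        dot = np.dot(g_tilde, g_k)
        if dot < 0:  # the update would increase the loss on the old task
            # closed-form projection onto the half-space <g_tilde, g_k> >= 0
            g_tilde = g_tilde - (dot / np.dot(g_k, g_k)) * g_k
    return g_tilde

# toy usage: two parameters, one previous task
g_new = np.array([1.0, -1.0])
g_old = np.array([0.0, 1.0])             # memory gradient of the base task
print(project_gradient(g_new, [g_old]))  # -> [1. 0.], the conflict is removed
```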

3. MACHINE SPEECH CHAIN USING GEM

We introduce a three-stage mechanism designed to enable ASR models to perform continual learning in a semi-supervised manner, achieving satisfactory results with minimal forgetting. These three stages are depicted in Figure 1: the first and second stages build upon the process proposed in [9], while our continual learning method is introduced in the third stage.
Fig. 1. Continual learning in the machine speech chain framework.

1. First stage: Supervised learning on the base task. Here, ASR and TTS are trained separately in a supervised manner to ensure strong baseline performance for the subsequent training stages.

2. Second stage: Semi-supervised learning. At this stage, ASR and TTS mutually enhance each other by training on unlabeled data from the base task, using unsupervised methods to improve performance.

3. Third stage: Continual learning. ASR engages in continual learning for new tasks using replayed inputs from the base task, synthesized by TTS.

In our approach, the replay process for speech recognition leverages TTS as a synthesis model to generate pseudo-samples of the base task. These pseudo-samples are stored in episodic memory and used by GEM to regulate the gradients for both new and previous tasks.
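To make the replay mechanism concrete, the sketch below shows one way the episodic memory of base-task pseudo-samples could be populated: transcripts (in the paper, the labels of the incoming task; see Eq. (2) below) are forwarded to the TTS trained on the base task, and the synthesized speech is stored together with the text. `EpisodicMemory`, `build_pseudo_sample_memory`, and `dummy_tts` are illustrative placeholders of our own, not components released with the paper.

```python
import random
from typing import Callable, List, Tuple

class EpisodicMemory:
    """Fixed-size buffer of (speech, text) pairs used for GEM replay."""

    def __init__(self, capacity: int = 100):  # 100 samples per task, as stated in Sec. 4.1
        self.capacity = capacity
        self.samples: List[Tuple[list, str]] = []

    def add(self, speech, text) -> None:
        if len(self.samples) < self.capacity:
            self.samples.append((speech, text))

def build_pseudo_sample_memory(transcripts: List[str],
                               tts: Callable[[str], list],
                               capacity: int = 100) -> EpisodicMemory:
    """Populate the base-task memory M0 with TTS pseudo-samples (cf. Eq. (2) below)."""
    memory = EpisodicMemory(capacity)
    for text in random.sample(transcripts, min(capacity, len(transcripts))):
        pseudo_speech = tts(text)  # clean speech synthesized by the base-task TTS
        memory.add(pseudo_speech, text)
    return memory

# toy usage with a dummy TTS that returns a fake "spectrogram"
dummy_tts = lambda text: [0.0] * len(text)
memory = build_pseudo_sample_memory(["first transcript", "second transcript"], dummy_tts, capacity=2)
print(len(memory.samples))  # -> 2
```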
During the third stage, when the machine speech chain encounters incoming tasks as

[D_1, \ldots, D_n] = [(x^1, y^1), \ldots, (x^n, y^n)]
                   = [\{(x^1_1, y^1_1), \ldots, (x^1_{|D_1|}, y^1_{|D_1|})\}, \ldots, \{(x^n_1, y^n_1), \ldots, (x^n_{|D_n|}, y^n_{|D_n|})\}],

where x is the input and y is the label, we forward the speech data label to TTS to generate pseudo-samples of the base task, i.e., \hat{x}^0 \sim \mathrm{TTS}(y^i). These synthesized samples are stored, along with the data from the incoming task, and processed as follows:

M_0 \leftarrow M_0 \cup (\hat{x}^0, y^i)   (2)
M_i \leftarrow M_i \cup (x^i, y^i)   (3)
g \leftarrow \nabla_\theta \, \ell(\mathrm{ASR}_\theta(x^i), y^i)   (4)
g_k \leftarrow \nabla_\theta \, \ell(\mathrm{ASR}_\theta, M_k) \quad \text{for all } k < i   (5)
\tilde{g} \leftarrow \mathrm{PROJECT}(g, g_0, g_1, \ldots, g_{i-1}), \ \text{see (1)}   (6)
\theta \leftarrow \theta - \delta \tilde{g},   (7)

where M represents the episodic memory and \delta denotes the weight assigned for updating the model parameters during continual learning for the i-th task (i > 0).
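The update rules (2)-(7) can be summarized in a short training-step sketch. This is an illustrative outline under our own assumptions: `asr_loss_grad` and `project_gradient` are hypothetical helpers (the latter standing in for PROJECT in Eq. (6)), and a real implementation would operate on framework tensors rather than NumPy arrays.

```python
import numpy as np

def gem_update_step(theta, x_i, y_i, memories, asr_loss_grad, project_gradient, delta=1e-3):
    """One continual-learning update following Eqs. (4)-(7).

    theta            : flattened model parameters, shape (|theta|,)
    x_i, y_i         : a batch from the incoming task i
    memories         : list [(x_0, y_0), ..., (x_{i-1}, y_{i-1})] of episodic
                       memories; (x_0, y_0) holds the TTS pseudo-samples of Eq. (2)
    asr_loss_grad    : callable returning the gradient of the ASR loss w.r.t. theta
    project_gradient : the projection of Eq. (1)/(6), e.g. the sketch given earlier
    delta            : update weight of Eq. (7)
    """
    g = asr_loss_grad(theta, x_i, y_i)                   # Eq. (4)
    memory_grads = [asr_loss_grad(theta, x_k, y_k)       # Eq. (5)
                    for (x_k, y_k) in memories]
    g_tilde = project_gradient(g, memory_grads)          # Eq. (6)
    return theta - delta * g_tilde                       # Eq. (7)

# toy usage with a quadratic "loss" whose gradient is (theta - target)
grad = lambda theta, x, y: theta - y
proj = lambda g, gs: g  # identity projection for the toy example
theta = np.zeros(2)
theta = gem_update_step(theta, None, np.array([1.0, 1.0]), [], grad, proj, delta=0.5)
print(theta)  # -> [0.5 0.5]
```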
To our knowledge, our proposed mechanism is the first to incorporate TTS within the continual learning framework for ASR. While prior works in continual learning have utilized various generative models [17, 18], none has specifically employed TTS for continual learning in ASR.

Table 1. CER results for different methods applied on the ASR model. The color-coded rows (■ 1st, ■ 2nd, ■ 3rd) represent each stage of our proposed machine speech chain-based method.

Model              CER (%) LJ Original   CER (%) LJ Noisy
ASRLower
  Pre-trained             9.2                 82.6
  Fine-tuning            19.0                 31.3
  GEM                     8.5                 15.8
  Multitask              74.8                 76.7
ASRSpeechChain
  Pre-trained             6.4                 95.7
  Fine-tuning            12.7                 33.1
  GEM                    11.1                 15.5
ASRUpper
  Pre-trained             1.9                108.4
  Fine-tuning             6.7                 15.6
  GEM                     5.2                  8.4
  Multitask               3.8                 10.9

Table 2. Results for the ASRSpeechChain with different ratios of labeled and unlabeled data during base-task learning in the first and second stages of the framework.

Split Ratio (Labeled / Unlabeled)   CER (%) LJ Original   CER (%) LJ Noisy
  30 / 70                                 11.1                 15.5
  50 / 50                                  4.8                 11.5
  70 / 30                                  4.0                 10.9

4. EXPERIMENTS

4.1. Experimental Setup

We prepared two tasks for the ASR models to recognize. The first task, referred to as the base task, utilized the clean original dataset of LJ Speech [19], consisting of 24 hours of audio. To simulate a different scenario for the subsequent task, we created a noisy version of the original speech dataset. This noisy dataset also comprises 24 hours of audio, but with added white noise at a signal-to-noise ratio (SNR) of 0 dB. Consequently, the base task is denoted as LJ Original, and the subsequent task is denoted as LJ Noisy. Both datasets were split into train, dev, and test sets with a ratio of 94%, 3%, and 3%, respectively.
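For reference, mixing white noise into a waveform at a target SNR (0 dB for LJ Noisy) can be done as in the sketch below; this is our own illustration of the standard procedure, not code released with the paper.

```python
import numpy as np

def add_white_noise(speech: np.ndarray, snr_db: float = 0.0, seed: int = 0) -> np.ndarray:
    """Mix white Gaussian noise into a waveform at the requested SNR (in dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.shape)
    signal_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # scale the noise so that signal_power / (scaled noise power) == 10 ** (snr_db / 10)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# at 0 dB SNR the scaled noise has the same average power as the speech signal
clean = np.sin(np.linspace(0, 100, 16000))
noisy = add_white_noise(clean, snr_db=0.0)
```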
For the ASR architecture, we employed the Speech-Transformer [20], while the TTS architecture was based on the Transformer-based Tacotron 2 [21]. The ASR models did not involve hyperparameter tuning, since they already employed almost identical hyperparameters to those used in [20]. The architecture of the ASR models employed 12 encoder blocks, 6 decoder blocks, 4 attention heads, and a feed-forward hidden layer size of 2048. We used 80 dimensions for the Mel-spectrogram input. We trained the models using the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and employed cross-entropy loss with neighborhood smoothing. The episodic memory that we used for continual learning had a size of 100 samples per task, or in other words 1% of the dataset size.

For the TTS models needed in the machine speech chain condition, we configured them to consist of 6 encoder blocks for the Transformer-based encoder, 6 decoder blocks for the autoregressive decoder, 8 attention heads, and a feed-forward hidden layer size of 2048. These values were identical to the best configuration used in [21]. The TTS input was the character sequence, and the output was the 80 dimensions of the Mel-spectrogram. We used the Adam optimizer with the same β1, β2, ϵ values and employed cross-entropy loss.
continual learning for the i-th task (i > 0). loss.
To our knowledge, our proposed mechanism is the first to Our experiment involved training ASR models under su-
incorporate TTS within the continual learning framework for pervised conditions: lower bound and upper bound, and our
ASR. While prior works in continual learning have utilized proposed method that involved semi-supervised condition:
various generative models [17, 18], none has specifically em- machine speech chain. The upper and lower bound refers
ployed TTS for continual learning in ASR. to the amount of base task data provided to the ASR model
Fig. 2. Learning curves of models in continual learning paradigm and their respective metrics.

before it engages in learning with the subsequent task. Specifically, we varied the proportion of the LJ Original training data while keeping the LJ Noisy training data constant at 100% of the train set. We used 30% of the LJ Original train set for the lower bound condition, 30% of the train set as labeled data and 70% of the train set as unlabeled data for the machine speech chain condition, and 100% of the train set for the upper bound condition.

4.2. Experiment Results

4.2.1. Continual Learning Performance

The experimental results, as detailed in Table 1, demonstrate the efficacy of various continual learning approaches applied to the ASR model in both clean (LJ Original) and noisy (LJ Noisy) conditions. The ASRLower results show that the GEM approach significantly reduces the character error rate (CER) compared to fine-tuning and multitask learning. For instance, GEM achieved a CER of 8.5% on LJ Original and 15.8% on LJ Noisy, outperforming the fine-tuning method, which resulted in CERs of 19.0% and 31.3%, respectively. Multitask learning, however, showed the highest CERs of 74.8% and 76.7%, indicating its limitation in handling noise without an optimal balance of data.
The ASRSpeechChain model trained with GEM outperformed the fine-tuning method, achieving CERs of 11.1% and 15.5% for LJ Original and LJ Noisy, respectively. This is a significant improvement over fine-tuning, which recorded CERs of 12.7% and 33.1%. Furthermore, comparing the GEM method across different models, ASRUpper using GEM achieved the lowest CERs at 5.2% and 8.4%, compared to the fine-tuning and multitask methods. However, it is important to highlight that the ASRSpeechChain model, despite not reaching the lowest error rates, still showed substantial improvements. The ASRSpeechChain model with GEM achieved significant error rate reductions, comparable to the ASRUpper model, with a 40% error rate reduction relative to the respective fine-tuning methods. We also demonstrate the results with different split ratios of labeled and unlabeled data of the base task in Table 2, where we can observe that with increasing labeled data the error rates become smaller. These results emphasize that our proposed method is effective, mitigating catastrophic forgetting and maintaining consistent performance across tasks and varying semi-supervised learning scenarios.
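One way to arrive at the roughly 40% figure from Table 1 is to average each method's CER over the two tasks and compare GEM against fine-tuning; the snippet below reproduces that arithmetic. This is our reading of the reported numbers, not a computation published by the authors.

```python
# CERs from Table 1 as (LJ Original, LJ Noisy)
fine_tuning = {"speech_chain": (12.7, 33.1), "upper": (6.7, 15.6)}
gem         = {"speech_chain": (11.1, 15.5), "upper": (5.2,  8.4)}

for model in fine_tuning:
    ft_avg  = sum(fine_tuning[model]) / 2
    gem_avg = sum(gem[model]) / 2
    reduction = 100 * (ft_avg - gem_avg) / ft_avg
    print(f"{model}: {reduction:.1f}% relative reduction in average CER")
# speech_chain: ~41.9%, upper: ~39.0% -- both close to the reported 40%
```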
Fig. 2. Learning curves of models in the continual learning paradigm and their respective metrics.

4.2.2. Continual Learning Comparison

We also compared our semi-supervised method to other continual learning methods that are carried out in a fully supervised scenario. These other methods were gradient episodic memory (GEM) and elastic weight consolidation (EWC) [22]. We can see from Figure 2 that the learning curves exhibit the superiority of GEM, as models that leveraged GEM as their replay process were able to prevent catastrophic forgetting. Although EWC had worse forgetting prevention, it performed better on learning the new task because of its fully supervised scenario.

We also computed continual learning metrics, namely the average (AVG), backward transfer (BWT), and forward transfer (FWT) character error rate, as shown in Figure 2, which were useful for comparing the three models to each other. In our experiment, BWT is defined as the ability of a model to transfer the lowest possible error to the previous task it has encountered, while FWT is defined as the ability of a model to learn a new task with the lowest possible error compared to the error rate attained by the standard fine-tuning method.
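These metrics can be computed from a matrix R of CERs, where R[i, j] is the CER on task j after training through task i. The sketch below adapts the accuracy-based definitions of [10] to error rates; the FWT term follows the paper's description of comparing against a fine-tuning baseline, so the exact `baseline_cer` formulation is our assumption.

```python
import numpy as np

def continual_learning_metrics(R: np.ndarray, baseline_cer: np.ndarray):
    """AVG, BWT, and FWT computed from a CER matrix (lower CER is better).

    R            : (T, T) array, R[i, j] = CER on task j after training through task i
    baseline_cer : (T,) array of reference CERs for FWT, e.g. the fine-tuning
                   results (our assumption about the baseline used in the paper)
    """
    T = R.shape[0]
    avg = R[T - 1].mean()                                         # final average CER
    bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])  # > 0 means forgetting
    fwt = np.mean([R[j, j] - baseline_cer[j] for j in range(1, T)])  # < 0 means better than baseline
    return avg, bwt, fwt

# two-task toy example: base task, then noisy task
R = np.array([[8.0, 40.0],
              [12.0, 15.0]])
print(continual_learning_metrics(R, baseline_cer=np.array([8.0, 16.0])))  # -> (13.5, 4.0, -1.0)
```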
GEM, when applied in a supervised ASR system, achieved, as expected, the lowest values on all the metrics. EWC had a slightly lower AVG, at 12.5%, than ASRSpeechChain, which achieved 13.3%. Our model performed well in reducing forgetting, introducing a lower error to the previous task with a BWT of 4.7% compared to EWC's 7.8%. For the FWT metric, our model and EWC performed relatively similarly at -0.3% and -0.1%, respectively. From these results, we can observe that our model works as intended to learn sequential tasks, prevent catastrophic forgetting, and exploit accumulated knowledge to learn a new task, which are all the properties of a functioning continual learning process.

5. CONCLUSION

We proposed a novel method that allows an automatic speech recognition (ASR) model to perform continual learning in a semi-supervised manner within the machine speech chain. We then demonstrated first-hand the implementation of such a replay method with gradient episodic memory (GEM). Although our upper bound supervised model achieved a lower CER than our proposed method, the machine speech chain-based method managed to achieve the same 40% averaged error rate reduction. Furthermore, we compared the machine speech chain trained under the proposed continual learning scenario with the machine speech chain under the fine-tuning scenario. We found that our method worked and achieved minimal forgetting, or prevented catastrophic forgetting. This showed that our novel method has potential for further application in speech recognition and can serve as an alternative to the fully supervised mechanism of continual learning. We believe this paper provides the first exploration of continual learning in the machine speech chain framework and makes a step towards realizing effective and efficient learning for speech recognition.

6. LIMITATIONS

We acknowledge the need for further experiments to assess the generalizability of our approach. While this work demonstrates success on a simple task boundary of noise variation, future work will involve applying our method to a wider range of tasks, such as multilingual speech recognition (where the model needs to adapt to different phonetic inventories) or task-agnostic continual learning (where tasks are not predefined). This will allow us to investigate the effectiveness of our method in handling more complex scenarios and potentially lead to more robust continual learning for ASR in the machine speech chain framework.

7. ETHICS STATEMENT

Our study followed scientific methodology and ethics. The LJ Speech dataset that we used is a public domain dataset, which is not in violation of licensing and data ethics. The LJ Speech dataset is an English-language speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. The audio was recorded and donated voluntarily by the speaker to the public domain. The texts that were read by the speaker are also in the public domain. We are aware of the usage of synthetic data generated by text-to-speech (TTS) to assist the continual learning of automatic speech recognition (ASR). There is potential to perpetuate ethical risks, such as bias and attribution issues in the synthetic samples. However, our proposed method utilizes TTS within a closed-loop framework, allowing us to better control the generation process and mitigate such issues. Furthermore, we believe this method can alleviate key challenges, such as the reliance on large quantities of real human speech data.

8. ACKNOWLEDGEMENTS

Part of this work is supported by JSPS KAKENHI Grant Numbers JP21H05054 and JP23K21681, as well as the JST Sakura Science Program.

9. REFERENCES
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.

[2] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde, "Jasper: An end-to-end convolutional neural acoustic model," arXiv preprint arXiv:1904.03288, 2019.

[3] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.

[4] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," in Conference of the International Speech Communication Association (INTERSPEECH), 2020.

[5] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning (ICML), 2023.

[6] Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang, "SeamlessM4T: Massively multilingual & multimodal machine translation," arXiv preprint arXiv:2308.11596, 2023.

[7] Michael McCloskey and Neal J. Cohen, Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem, Academic Press, 1989.

[8] Heng-Jui Chang, Hung-yi Lee, and Lin-shan Lee, "Towards lifelong learning of end-to-end ASR," in Conference of the International Speech Communication Association (INTERSPEECH), 2021.

[9] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Machine speech chain," IEEE Transactions on Audio, Speech, and Language Processing, 2020.

[10] David Lopez-Paz and Marc'Aurelio Ranzato, "Gradient episodic memory for continual learning," in Advances in Neural Information Processing Systems, 2017.

[11] Peter B. Denes and Elliot Pinson, The Speech Chain, Worth Publishers, 1993.

[12] Sashi Novitasari, Sakriani Sakti, and Satoshi Nakamura, "A machine speech chain approach for dynamically adaptive Lombard TTS in static and dynamic noise environments," IEEE Transactions on Audio, Speech, and Language Processing, 2022.

[13] Heli Qi, Sashi Novitasari, Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "SpeeChain: A speech toolkit for large scale machine speech chain," arXiv preprint arXiv:2301.02966, 2023.

[14] Rais Vaza Man Tazakka, Dessi Lestari, Ayu Purwarianti, Dipta Tanaya, Kurniawati Azizah, and Sakriani Sakti, "Indonesian-English code-switching speech recognition using the machine speech chain based semi-supervised learning," in Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024.

[15] Friedemann Zenke, Ben Poole, and Surya Ganguli, "Continual learning through synaptic intelligence," in International Conference on Machine Learning (ICML), 2017.

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.

[17] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim, "Continual learning with deep generative replay," in Advances in Neural Information Processing Systems, 2017.

[18] Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins, "Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks," arXiv preprint arXiv:1802.03875, 2018.

[19] Keith Ito and Linda Johnson, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

[20] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

[21] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

[22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell, "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, 2017.