Continual Learning in Machine Speech Chain Using Gradient Episodic Memory
3 Nara Institute of Science and Technology, Japan
4 Japan Advanced Institute of Science and Technology, Japan
[email protected],
{kurniawati.azizah,diptatanaya}@cs.ui.ac.id,
{ayu,dessipuji}@itb.ac.id, [email protected]
\min_{\tilde{g}} \; \frac{1}{2} \lVert g - \tilde{g} \rVert_2^2 \quad (1)
\text{s.t.} \; \langle \tilde{g}, g_k \rangle \ge 0, \;\; \forall k \in \{0, \ldots, i-1\},
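To make Eq. (1) concrete, the sketch below shows how the GEM projection can be applied during training. This is a minimal sketch rather than the implementation used in our experiments: it assumes a PyTorch model, the helper names (`flat_grad`, `write_grad`, `gem_step`) are illustrative, and it uses the closed-form projection that is exact for a single stored task; with several memory constraints, GEM instead solves the small dual quadratic program of Eq. (1).

```python
import torch

def flat_grad(model):
    # Flatten all parameter gradients into one vector (the "g" of Eq. (1)).
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()
                      if p.grad is not None])

def write_grad(model, flat):
    # Copy a flat gradient vector back into the model's .grad buffers.
    offset = 0
    for p in model.parameters():
        if p.grad is None:
            continue
        n = p.grad.numel()
        p.grad.copy_(flat[offset:offset + n].reshape(p.grad.shape))
        offset += n

def gem_step(model, loss_fn, optimizer, current_batch, memory_batches):
    # Gradients of the loss on the episodic memory of each previous task (g_k).
    mem_grads = []
    for mem_batch in memory_batches:
        optimizer.zero_grad()
        loss_fn(model, mem_batch).backward()
        mem_grads.append(flat_grad(model).detach().clone())

    # Gradient of the loss on the current task's mini-batch (g).
    optimizer.zero_grad()
    loss_fn(model, current_batch).backward()
    g_tilde = flat_grad(model).detach().clone()

    # Enforce <g_tilde, g_k> >= 0.  With one stored task this projection is
    # exact; with several, GEM solves the dual QP of Eq. (1) instead.
    for g_k in mem_grads:
        violation = torch.dot(g_tilde, g_k)
        if violation < 0:
            g_tilde -= (violation / torch.dot(g_k, g_k)) * g_k

    write_grad(model, g_tilde)  # replace g with the projected gradient
    optimizer.step()
```

In a two-task setting such as LJ Original followed by LJ Noisy, only one memory constraint is active while learning the second task, so the closed-form projection above coincides with the full quadratic program.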
before it engages in learning with a subsequent task. Specifically, we varied the proportion of the LJ Original training data while keeping the LJ Noisy training data constant at 100% of the train set. We used 30% of the LJ Original train set for the lower bound condition, 30% of the train set as labeled data and 70% of the train set as unlabeled data for the machine speech chain condition, and 100% of the train set for the upper bound condition.
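For concreteness, the snippet below sketches how the three training conditions above could be expressed as splits of the base-task data. This is only an illustration under our own naming (`split_base_task`, the `conditions` dictionary, and the toy utterance IDs); the paper does not prescribe a particular implementation.

```python
import random

def split_base_task(utterance_ids, labeled_fraction, unlabeled_fraction, seed=0):
    # Shuffle the base-task (LJ Original) utterances and carve out the
    # labeled and unlabeled portions used by a given training condition.
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    n_lab = int(labeled_fraction * len(ids))
    n_unlab = int(unlabeled_fraction * len(ids))
    return ids[:n_lab], ids[n_lab:n_lab + n_unlab]

# The three conditions described above, as fractions of the LJ Original train set.
conditions = {
    "lower_bound":  {"labeled_fraction": 0.3, "unlabeled_fraction": 0.0},
    "speech_chain": {"labeled_fraction": 0.3, "unlabeled_fraction": 0.7},
    "upper_bound":  {"labeled_fraction": 1.0, "unlabeled_fraction": 0.0},
}

labeled_ids, unlabeled_ids = split_base_task(range(10000), **conditions["speech_chain"])
```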
4.2. Experiment Result

4.2.1. Continual Learning Performance

The experimental results, as detailed in Table 1, demonstrate the efficacy of the various continual learning approaches applied to the ASR model in both clean (LJ Original) and noisy (LJ Noisy) conditions. The ASRLower results show that the GEM approach significantly reduces the character error rate (CER) compared to fine-tuning and multitask learning. For instance, GEM achieved a CER of 8.5% on LJ Original and 15.8% on LJ Noisy, outperforming the fine-tuning method, which resulted in CERs of 19.0% and 31.3% respectively. Multitask learning, however, showed the highest CERs, at 74.8% and 76.7%, indicating its limitation in handling noise without an optimal balance of data.

The ASRSpeechChain model trained with GEM outperformed the fine-tuning method, achieving CERs of 11.1% and 15.5% for LJ Original and LJ Noisy respectively. This is a significant improvement over fine-tuning, which recorded CERs of 12.7% and 33.1%. Furthermore, comparing the GEM method across the different models, ASRUpper using GEM achieved the lowest CERs, at 5.2% and 8.4%, compared to the fine-tuning and multitask methods. However, it is important to highlight that the ASRSpeechChain model, despite not reaching the lowest error rates, still showed substantial improvements. The ASRSpeechChain model with GEM achieved significant error rate reductions, comparable to the ASRUpper model, with a 40% error rate reduction relative to the respective fine-tuning methods. We also present results with different split ratios of labeled and unlabeled data for the base task in Table 2, where we can observe that the error rates become smaller as the amount of labeled data increases. These results emphasize that our proposed method is effective, mitigating catastrophic forgetting and maintaining consistent performance across tasks and varying semi-supervised learning scenarios.

4.2.2. Continual Learning Comparison

We also compared our semi-supervised method to other continual learning methods carried out in a fully supervised scenario, namely gradient episodic memory (GEM) and elastic weight consolidation (EWC) [22]. We can see from Figure 2 that the learning curves exhibit the superiority of GEM: models that leveraged GEM as their replay process were able to prevent catastrophic forgetting. Although EWC had worse forgetting prevention, it performed better on learning the new task because of its fully supervised scenario.

We also computed the continual learning metrics, namely the average (AVG), backward transfer (BWT), and forward transfer (FWT) character error rates, as shown in Figure 2, which are useful for comparing the three models to each other. In our experiment, BWT is defined as the ability of a model to transfer the lowest possible error to the previous task it has encountered, while FWT is defined as the ability of a model to learn a new task with the lowest possible error compared to the error rate attained by the standard fine-tuning method.
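To make these definitions concrete, the sketch below computes AVG, BWT, and FWT from a matrix of per-task CERs. It reflects one plausible reading of the definitions above, adapting the accuracy-based metrics of the GEM paper [10] to error rates; the function name, the matrix R, and the fine-tuning reference values are our own illustration, not the paper's evaluation code.

```python
import numpy as np

def continual_cer_metrics(R, finetune_cer):
    """AVG, BWT, and FWT for error rates (lower is better).

    R[i][j]        : CER (%) on task j measured after training on task i (tasks in order).
    finetune_cer[j]: CER (%) on task j obtained by plain fine-tuning (FWT reference).
    """
    R = np.asarray(R, dtype=float)
    T = R.shape[0]
    avg = R[-1].mean()                                               # mean CER after the last task
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])        # growth of CER on earlier tasks (forgetting)
    fwt = np.mean([R[j, j] - finetune_cer[j] for j in range(1, T)])  # new-task CER relative to fine-tuning
    return avg, bwt, fwt

# Toy two-task example (LJ Original, then LJ Noisy); all numbers are made up.
R = [[10.0, 60.0],
     [14.0, 16.0]]
print(continual_cer_metrics(R, finetune_cer=[10.0, 16.5]))
```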
GEM applied in a fully supervised ASR system, as expected, achieved the lowest values on all the metrics. EWC had a slightly lower AVG, at 12.5%, than ASRSpeechChain, which achieved 13.3%. Our model performed well in reducing forgetting, introducing a lower error on the previous task with a BWT of 4.7% compared to EWC's 7.8%. For the FWT metric, our model and EWC performed relatively similarly, at -0.3% and -0.1% respectively. From these results, we can observe that our model works as intended: it learns sequential tasks, prevents catastrophic forgetting, and exploits accumulated knowledge to learn a new task, which are all properties of a functioning continual learning process.

5. CONCLUSION

We proposed a novel method that allows an automatic speech recognition (ASR) model to perform continual learning in the semi-supervised manner of the machine speech chain. We then demonstrated first-hand the implementation of such a replay method with gradient episodic memory (GEM). Although our upper bound supervised model achieved a lower CER than our proposed method, the machine speech chain-based method attained the same 40% averaged error rate reduction. Furthermore, we compared the machine speech chain trained under the proposed continual learning scenario with the machine speech chain trained under the fine-tuning scenario. We found that our method worked and achieved minimal forgetting, or even prevented catastrophic forgetting. This shows that our novel method has potential for further applications in speech recognition and can serve as an alternative to the fully supervised mechanism of continual learning. We believe this paper provides the first exploration of continual learning in the machine speech chain framework and makes a step towards realizing effective and efficient learning for speech recognition.

6. LIMITATIONS

We acknowledge the need for further experiments to assess the generalizability of our approach. While this work demonstrates success on a simple task boundary of noise variation, future work will involve applying our method to a wider range of tasks, such as multilingual speech recognition (where the model needs to adapt to different phonetic inventories) or task-agnostic continual learning (where tasks are not predefined). This will allow us to investigate the effectiveness of our method in handling more complex scenarios and potentially lead to more robust continual learning for ASR in the machine speech chain framework.

7. ETHICS STATEMENT

Our study followed scientific methodology and ethics. The LJ Speech dataset that we used is in the public domain, so its use does not violate any license or data ethics. LJ Speech is an English-language speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. The audio was recorded and donated voluntarily by the speaker to the public domain, and the texts that were read by the speaker are also in the public domain. We are aware of the use of synthetic data generated by text-to-speech (TTS) to assist the continual learning of automatic speech recognition (ASR). There is potential to perpetuate ethical risks, such as bias and attribution issues in the synthetic samples. However, our proposed method utilizes TTS within a closed-loop framework, allowing us to better control the generation process and mitigate such issues. Furthermore, we believe this method can alleviate key challenges, such as the reliance on large quantities of real human speech data.

8. ACKNOWLEDGEMENTS

Part of this work is supported by JSPS KAKENHI Grant Numbers JP21H05054 and JP23K21681, as well as the JST Sakura Science Program.

9. REFERENCES

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.

[2] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde, "Jasper: An end-to-end convolutional neural acoustic model," arXiv preprint arXiv:1904.03288, 2019.

[3] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.

[4] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," in Conference of the International Speech Communication Association (INTERSPEECH), 2020.

[5] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning (ICML), 2023.

[6] Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar,
Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang, "SeamlessM4T: Massively multilingual & multimodal machine translation," arXiv preprint arXiv:2308.11596, 2023.

[7] Michael McCloskey and Neal J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, Academic Press, 1989.

[8] Heng-Jui Chang, Hung-yi Lee, and Lin-shan Lee, "Towards lifelong learning of end-to-end ASR," in Conference of the International Speech Communication Association (INTERSPEECH), 2021.

[9] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Machine speech chain," IEEE Transactions on Audio, Speech, and Language Processing, 2020.

[10] David Lopez-Paz and Marc'Aurelio Ranzato, "Gradient episodic memory for continual learning," in Advances in Neural Information Processing Systems, 2017.

[11] Peter B. Denes and Elliot Pinson, The Speech Chain, Worth Publishers, 1993.

[12] Sashi Novitasari, Sakriani Sakti, and Satoshi Nakamura, "A machine speech chain approach for dynamically adaptive Lombard TTS in static and dynamic noise environments," IEEE Transactions on Audio, Speech, and Language Processing, 2022.

[14] …, "… learning," in Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024.

[15] Friedemann Zenke, Ben Poole, and Surya Ganguli, "Continual learning through synaptic intelligence," in International Conference on Machine Learning (ICML), 2017.

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.

[17] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim, "Continual learning with deep generative replay," in Advances in Neural Information Processing Systems, 2017.

[18] Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins, "Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks," arXiv preprint arXiv:1802.03875, 2018.

[19] Keith Ito and Linda Johnson, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

[20] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

[21] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

[22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell, "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, 2017.