Delve Deep Into End-To-End Automatic Speech Recognition Models
Abstract— Automatic Speech Recognition (ASR) has experienced significant advancements in recent years, with end-to-end approaches emerging as a promising paradigm shift. Unlike traditional ASR systems that rely on a pipeline of separate components, end-to-end models aim to directly transcribe speech inputs into text using deep learning architectures. In this paper, we conduct a comprehensive study on end-to-end ASR models. We review various architectures employed in end-to-end ASR, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based models. We explore different training methodologies, loss functions, and optimization algorithms used in end-to-end ASR. Additionally, we discuss large-scale datasets commonly used for training and evaluate the performance of end-to-end models using established evaluation metrics such as word error rate (WER). Furthermore, we analyze the strengths and weaknesses of end-to-end ASR models, highlight their applications in real-world scenarios, and discuss open challenges and future directions. By providing this comprehensive study, we aim to facilitate a deeper understanding of end-to-end ASR models and their potential for driving advancements in speech recognition technology.
I. INTRODUCTION
Deep Neural Network (DNN) speech recognition models, more precisely hybrid Automatic Speech Recognition (ASR) models, have replaced the traditional ASR models, but they still retain all the disjoint components of traditional models (fig. 1), such as the lexicon, acoustic, and language models [1][2]. Speech recognition modeling has made a major leap from hybrid speech recognition models to End-to-End (E2E) models [3][4][5][6][7]. Compared to hybrid models, E2E models are composed of a single block [8] that jointly optimizes the acoustic, lexical, and language models. With E2E architectures, a single model needs training (fig. 1), possibly using only the speech signals and their target transcripts, without the need for word alignments or lexicon dictionaries. These E2E models are groundbreaking because they overturn all the traditional ASR system modeling components that have been used for so many years. With E2E models, we can directly transcribe an input speech signal into an output text using just a single neural network, which was impossible with hybrid models. With these E2E architectures, we will be able to train speech recognition models for languages and dialects that have not been covered so far due to a lack of data resources, as in the case of speech recognition for the Moroccan dialect 'Darija' [9]. E2E will significantly reduce the time needed to train speech recognition models for different languages and their variants compared to hybrid models.

Fig. 1. End-to-End vs Hybrid ASR models

Given the fast evolution of E2E speech recognition approaches, it is opportune to benchmark the most promising and popular E2E models in the ASR field. Among the widely used E2E approaches are the Recurrent Neural Network-Transducer (RNN-T) [13][14][15], Connectionist Temporal Classification (CTC) [10][11][12], and attention-based encoder-decoder (AED) models, mainly RNN-AED and Transformer-AED [16][17]. CTC was the earliest E2E approach that could match the input voice signal to the target tags with no need for external pre-alignments, but it assumes
that frames are independent. RNN-T extends the modeling philosophy of CTC and changes the model architecture and the objective function to account for the dependence between frames; it has also been able to replace hybrid models, specifically in streaming contexts [18][19]. AED models were primarily proposed for machine translation [19] but have also been successfully used in speech recognition [20][21][22][3]. Recently, Transformer-AED with self-attention has gained prominence and is currently used as the fundamental block of encoder and decoder models [23].

The rest of this paper is organized as follows: In Section 2, we give an overview of the most popular end-to-end (E2E) speech recognition architectures. In Section 3, we present a benchmarking of the different E2E models in the speech recognition field. Finally, we conclude the paper in Section 4.

II. SPEECH RECOGNITION E2E MODELS

E2E models have achieved great results in the majority of benchmark tests in terms of ASR accuracy and efficiency. In this section, we give an overview of the three popular categories of speech recognition E2E models, namely CTC, RNN-T, and AED, the latter consisting of Transformer-AED and RNN-AED. The architectures of these models share similarities in the encoding part, while they differ in the decoding part, where each model uses a specific decoding mechanism.

A. Connectionist temporal classification models

CTC models were the earliest E2E approach that matched the input voice signal to its target tags without needing to align the signal to a reference transcription, by assuming that frames are independent. CTC-based models are popular among the speech recognition community [24][25][26][27] due to their ease of training and efficiency in decoding [28].

CTC was specially designed for temporal classification sequence labeling tasks without knowledge of any prior alignment between input and output sequences [29]. CTC-based models allow repetition of labels and work by adding a special blank label to distinguish the less informative frames. CTC also removes the state alignment step in training by automatically inferring the frame alignment between speech and labels. Meanwhile, CTC E2E models give competitive results when used in conjunction with sequence-to-sequence attention-based models [17][30][27][31].
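To make the alignment-free training objective concrete, the following PyTorch sketch computes the CTC loss between a batch of per-frame label log-probabilities and unaligned target label sequences; the batch size, sequence lengths, vocabulary size, and blank index are illustrative assumptions rather than values taken from the papers cited above.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the paper)
batch, time_steps, vocab = 4, 100, 29      # 28 labels + 1 blank
blank_id = 0

# Encoder outputs: per-frame log-probabilities over the vocabulary,
# shaped (T, N, C) as expected by torch.nn.CTCLoss.
log_probs = torch.randn(time_steps, batch, vocab, requires_grad=True).log_softmax(dim=-1)

# Unaligned target label sequences (no frame-level alignment is provided).
targets = torch.randint(low=1, high=vocab, size=(batch, 20))
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

# CTC marginalizes over all valid alignments, inserting the blank label
# to absorb uninformative frames and repeated labels.
ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the encoder during training
```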
Fig. 2. RNN-T architecture

B. Recurrent Neural Network-Transducer models

RNN-T models are the most popular E2E speech recognition models and the predominant ones in the ASR industry nowadays [18][32][33][34]. RNN-T is a sequence-to-sequence model that was initially proposed by Graves in 2012 [13]. For a long time these RNN-T models saw little real usage, but considerable attention is now turning toward this kind of E2E model after the achievements of Google research, which confirmed the low latency and the speech recognition improvements brought by the RNN-T transducer [33].

The RNN-T model architecture consists of an encoder network, a predictor network, and a joiner network, as illustrated in figure 2 (fig. 2). RNN-T removes the conditional independence assumption present in CTC-based models by adding both the predictor and the joiner networks. The encoder network starts by converting the acoustic features at time step t into a high-level representation H_t^{enc}; then the predictor network, also called the decoder, takes the previous outputs as input to predict, in an autoregressive manner, a high-level representation of the next output H_u^{pre}. The joiner network is a simple neural network that takes the encoder output H_t^{enc} and the predictor network output H_u^{pre} as input and combines them to produce a joint representation H_{t,u}. This joint representation is then used to calculate the softmax output (Eq. 1).

H_{t,u} = f^{joint}(H_t^{enc}, H_u^{pre})    (1)
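As an illustration of Eq. 1, the sketch below shows one common way a joiner network can be written in PyTorch: the encoder output H_t^{enc} and the predictor output H_u^{pre} are projected, combined over the (t, u) grid, and mapped to per-label log-probabilities. The layer sizes and the additive combination are assumptions made for illustration, not details taken from [13] or [33].

```python
import torch
import torch.nn as nn

class Joiner(nn.Module):
    """Combines encoder and predictor representations into H_{t,u} (Eq. 1)."""
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab=29):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)   # labels + blank

    def forward(self, h_enc, h_pred):
        # h_enc: (N, T, enc_dim), h_pred: (N, U, pred_dim)
        joint = self.enc_proj(h_enc).unsqueeze(2) + self.pred_proj(h_pred).unsqueeze(1)
        # joint: (N, T, U, joint_dim) -- one vector per (t, u) pair
        return self.out(torch.tanh(joint)).log_softmax(dim=-1)

# Toy usage with random encoder/predictor outputs
joiner = Joiner()
h_enc = torch.randn(2, 50, 512)    # 50 acoustic frames
h_pred = torch.randn(2, 10, 512)   # 10 labels predicted so far
log_probs = joiner(h_enc, h_pred)  # (2, 50, 10, 29), fed to the RNN-T loss
```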
C. Attention-encoder-decoder models

Fig. 3. AED models Architecture

The AED models are another type of E2E ASR model [35][20] known for their attention structure. This category of E2E models shares the same overall architecture, mainly consisting of an encoder network, an attention module, and a decoder network, as illustrated in figure 3 (fig. 3). The encoder converts the input features into a hidden feature sequence, while the attention module produces a context vector by calculating the attention weights between the previous decoder output and each frame of the encoder output. The decoder network then uses the preceding output label as well as the context vector to produce its output in an autoregressive manner, based on the antecedent label outputs, without the conditional independence presumption.
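A minimal sketch of such an attention module, assuming a generic additive-attention formulation: each encoder frame is scored against the previous decoder state, the scores are normalized into attention weights, and the context vector is the weighted sum of the encoder outputs. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Context vector from the previous decoder state and the encoder output."""
    def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_out):
        # dec_state: (N, dec_dim), enc_out: (N, T, enc_dim)
        energy = self.score(torch.tanh(self.w_enc(enc_out) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(energy.squeeze(-1), dim=-1)            # (N, T) attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (N, enc_dim) context vector
        return context, weights

attn = AdditiveAttention()
ctx, w = attn(torch.randn(2, 512), torch.randn(2, 80, 512))  # 80 encoder frames
```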
D. RNN-AED

Speech recognition AED-based models are mainly divided into two categories. The first one is RNN-AED, which uses Long Short-Term Memory (LSTM) RNNs for both the encoder and the decoder. The encoder part of RNN-AED models is similar to the encoder part of RNN-T models. However, the decoder is enhanced by the attention mechanism. The attention mechanism is used to calculate a context vector,
which is a representation of the encoder output that is relevant to the current decoder state. The context vector is then used to generate the next token in the output sequence, as illustrated in Eq. 2.

H_u^{dec} = LSTM(C_u, Y_{u-1}, H_{u-1}^{dec})    (2)
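To illustrate the recurrence of Eq. 2, the sketch below implements one decoder step with an LSTM cell whose input combines the context vector C_u with the embedding of the previous label Y_{u-1}; the concatenation scheme and layer sizes are assumptions for illustration, as implementations differ on how these inputs are fed to the LSTM.

```python
import torch
import torch.nn as nn

class RNNAEDDecoderStep(nn.Module):
    """One autoregressive step: H_u^dec = LSTM(C_u, Y_{u-1}, H_{u-1}^dec) (Eq. 2)."""
    def __init__(self, vocab=29, emb_dim=256, ctx_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + ctx_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, y_prev, context, state):
        # y_prev: (N,) previous label ids, context: (N, ctx_dim),
        # state: tuple (h_{u-1}, c_{u-1}) of the LSTM cell
        x = torch.cat([self.embed(y_prev), context], dim=-1)
        h, c = self.cell(x, state)
        return self.out(h).log_softmax(dim=-1), (h, c)

step = RNNAEDDecoderStep()
state = (torch.zeros(2, 512), torch.zeros(2, 512))           # initial decoder state
logits, state = step(torch.tensor([1, 3]), torch.randn(2, 512), state)
```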
E. Transformer-AED

The second category of AED models is Transformer-AED, which is mainly based on the Transformer concept in both the encoder and decoder parts. Transformers are known for capturing long-term dependencies, and they outperform RNNs at this level. In Transformer-AED models, the encoder consists of a stack of Transformer blocks, in which each block has a feedforward layer and a multi-head self-attention layer, and the connections between the different layers and blocks are performed using layer normalization [36] and residual connections. In the decoder part, a third layer is used to perform multi-head attention on the encoder output.
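The encoder block described above (multi-head self-attention plus a feedforward layer, tied together with residual connections and layer normalization [36]) can be sketched as follows; the post-norm arrangement and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Multi-head self-attention + feedforward, with residuals and layer norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: (N, T, d_model) sequence of acoustic frame embeddings
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))       # residual + layer norm
        x = self.norm2(x + self.drop(self.ff(x)))     # residual + layer norm
        return x

block = TransformerEncoderBlock()
encoded = block(torch.randn(2, 120, 512))   # (2, 120, 512)
```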
Transformer models based on the attention mechanism have been extensively adopted for sequence modeling due to their training efficiency and their ability to capture long-range interactions [37]; however, they are not very effective at extracting local feature patterns. Recent works show that combining transformers with convolution improves their capabilities compared to using either alone [38]. In [39], a combination of a transformer and a convolutional neural network, named Conformer, has been proposed to benefit from the best of both architectures. Transformers with attention learn the global interactions, while the convolutional neural network captures local correlations based on the relative offset. Conformer-based models have shown competitive accuracy compared to existing transformer-based attention end-to-end speech recognition models.

III. BENCHMARKING

Advances in deep learning techniques have enhanced the performance of ASR systems. End-to-end ASR approaches take advantage of these advances by dispensing with the independent intermediate models (acoustic, pronunciation, and language models) and by reducing the complexity of the ASR model training process. Several easy-to-use and easy-to-update E2E models have emerged, based on the main categories of E2E ASR architectures: CTC, RNN-T, and AED. These end-to-end models require access to a massive training dataset to train extensive and complex deep architectures. The findings reported by the many works dedicated to E2E ASR models depend on the scenarios and the availability of datasets. Most of these end-to-end ASR models have reached state-of-the-art accuracy on the LibriSpeech dataset.
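Since all results below are reported as word error rate, the following sketch shows how WER is commonly computed: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The example sentences are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (S + I + D) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```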
The first E2E models based on the RNN-T architecture used Long Short-Term Memory models (LSTMs) as encoders. Over time, replacing the LSTM encoders with Transformer encoders gave a competitive model named Transformer-Transducer. The experiments done in this work [37] found that, with an equal number of parameters, Transformer-Transducer models trained much faster than RNN-T models based on LSTMs. Also, the proposed model can be improved by applying the RNN-T loss function, which is suitable for synchronized decoding and efficiently marginalizes all possible alignments. This Transformer-Transducer is suitable for streaming ASR by limiting the context of the labels and audio used in self-attention. The results obtained while training the Transformer-Transducer model on the LibriSpeech [40] dataset show that, with 139M parameters and without a language model, this model achieves a WER of 2.4%/5.6% on the LibriSpeech clean/other test sets. With the usage of an external language model, the Transformer-Transducer model on the same dataset achieves a WER of 2%/4.6%.

Transformer and convolutional neural network (CNN) models have achieved promising results in ASR. Benefiting from the fact that transformers are good at capturing content-based global interactions and from the effective exploitation of local features by CNNs, new E2E transformer-CNN-based models have been proposed. In this context, a convolution-augmented transformer model for speech recognition has been introduced under the name of Conformer.

Conformer significantly outperforms the previous ASR models based on Transformers and CNNs trained on the widely used LibriSpeech speech recognition dataset. Without a language model, Conformer achieves a WER of 2.1% on the LibriSpeech test-clean set, which corresponds to an accuracy of about 98%, and a WER of 4.3% on the test-other set, while with an external language model Conformer achieves a WER of 1.9% on the test-clean set and 3.9% on the test-other set. These results were achieved with the large version of this model, with about 118M parameters. Other competitive performances of Conformer have been achieved by the medium and small versions of this model, with about 30M and 10M parameters respectively. Conformer shows a 15% improvement compared to Transformer-Transducer-based models.

Inspired by the wav2letter approach, another family of neural architectures for E2E speech recognition has been presented, named Jasper. This model consists in replacing the acoustic and pronunciation models with a convolutional neural network. Jasper models follow a block architecture denoted Jasper BxR, with B the number of blocks and R the number of sub-blocks within each block, where each sub-block consists of a 1D convolution, batch normalization, ReLU, and dropout layers, with residual connections to enable the deep architecture. The smaller version of Jasper uses 34 convolutional layers with about 201M parameters, while the deepest version of Jasper uses 54 convolutional layers with about 333M parameters. For more training efficiency, the NovoGrad optimizer, a new variant of the Adam optimizer with a smaller memory footprint, has been used with this model. Evaluated on LibriSpeech, Jasper shows competitive results: with the Jasper 10x5 architecture, with about 201M parameters, this model achieves a 2.95%/8.79% WER on the LibriSpeech clean/other test sets using an external language model with a beam-search decoder, and a 3.86%/11.95% WER with a greedy decoder without a language model.

Large E2E models have achieved very good accuracy, but at the cost of high computational and memory requirements. Some research has focused on building E2E ASR models that can achieve the same accuracy but are faster to train, require fewer parameters, provide a higher inference rate, and are easy to deploy on hardware with limited computing memory.

In this work [41], an end-to-end neural acoustic model with fewer parameters, named QuartzNet, was proposed. This model falls under the CNN E2E models and is
designed based on the Jasper E2E model architecture [42], with the one-dimensional convolutions replaced by one-dimensional time-channel separable convolutions. The model architecture consists mainly of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers, and the model uses the CTC loss as the training loss function. This model is one of the most accurate speech recognition models on the LibriSpeech dataset. It achieves a Word Error Rate (WER) of 3.9% and 11.28% on the clean and other LibriSpeech test sets, respectively, without using an external language model. With the usage of an external language model, the WER is further improved to 2.69% and 7.25%. The small size of this model offers new scope for speech recognition on embedded and mobile devices.
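The time-channel separable convolution at the heart of this model can be sketched as a depthwise convolution over time followed by a pointwise 1x1 convolution, with batch normalization and ReLU as described above; the channel counts and kernel size are illustrative assumptions, not the exact QuartzNet configuration.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Depthwise temporal conv + pointwise 1x1 conv, as in QuartzNet-style blocks."""
    def __init__(self, in_ch=256, out_ch=256, kernel=33):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size=kernel,
                                   padding=kernel // 2, groups=in_ch)  # per-channel over time
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)       # mixes channels
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (N, channels, T) sequence of acoustic features
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

layer = TimeChannelSeparableConv1d()
out = layer(torch.randn(2, 256, 400))   # (2, 256, 400)
```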
Despite the promising results obtained by the CNN E2E speech recognition models, these models do not yet equal the performance of transformer-based or RNN-based models. Recent research [43] has focused on creating a fully convolutional encoder capable of incorporating global contextual information into the convolution layers by adding squeeze-and-excite modules. This proposed model, called ContextNet, is based on the RNN-Transducer architecture. Evaluated on the clean/other LibriSpeech test sets, ContextNet with about 112M parameters achieves a WER of 2.1%/4.6% without a language model and 1.9%/4.1% when using an external language model. ContextNet also showed competitive results with its 10M and 30M parameter versions.
versions. CitriNet-1024[44] 142 2 4.69 2.52 6.22
Sequence-to-sequence models and RNN-T models are Jasper DR 10x5[42] 201 3.86 11.95 2.95 8.79
autoregressive which makes them much slower to train or JasperDR 10x5 (+
evaluate, CTC base models are non-autoregressive which 333 4.32 11.82 2.84 7.84
Time/Freq Masks)[42]
makes them more stable and much easier to train.
Nevertheless, Seq2Seq and RNN-T models outperform CTC
models' accuracy. Due to the CTC conditional independence From the benchmark above we deduce that the main
assumption, it is necessary to use a language model when CTC advantage of a CNN model is its parameter throughput; to
is used. In [44] a new deep convolutional CTC model named improve both the speed and accuracy of the CNN models the
CitriNet, has been introduced to overcome the CTC model's depth-separable convolutions [45] have been used [46],
weaknesses and benefit from the advances in neural network however, the overall WER obtained by the first CNNs models,
architectures. Jasper and QuartzNet [41], is still higher than that of the
RNN/transformer-based models [47][37], which is argued by
CitriNet is a non-autoregressive CTC-based model the small length of CNN models context. RNN/Transformer
proposed to bridge the gap between CTC and the best Seq2Seq models Benefit from the bidirectional nature of RNN models,
and Transducers models, by introducing an encoder that which allows information access in the whole context, and
integrates the squeeze-and-excite mechanism of the from the attention mechanism of transformers models.
ContextNet model and the 1D time-channel separable ContextNet incorporates both the advantages of CNN models
convolutions of QuartzNet model. Citrinet-1024 the large and RNN/ transformer models, the state of art results of this
version of Citrinet with about 142M parameters Trained on model shows a competitive WER compared to the existing
LibriSpeech clean/other test sets achieved a WER of RNN/Transformer models with 112M parameters on
2%/4.69% without a language model and a WER of LibriSpeech clean/other test sets this model achieved a WER
2.52%/6.22% with an external language model. Thus, CitriNet 2.1/4.6 without incorporating a language model and 1.9/4.1
model accuracy on LibriSpeech dataset without any external with the inclusion of a language model which is clearly under
language model is close to the autoregressive models' the WER achieved by the RNN/Transformer models on the
accuracy, which goes against the popular notion that CTC same LibriSpeech test sets.
models need an external language model to output accurate
results. By comparing the different WER displayed on the
TABLE.I we can see that the training method impact the
IV. DISCUSSION performance of the E2E ASR models. Except CitriNet model
In this work, we have collected the most promising E2E the WER achieved when train the E2E ASR model with a
models in the speech recognition fields. Each of these models language model is lower than the one achieved when no
has its advantages and drawbacks. We realize that it is always language model is used. Also, we can see that the models
necessary to think about the combination of the advantages of WER decrease when using more parameters except the case
of JasperDR (10x5).
Fig. 4. Comparison of End-to-End models trained with and without a language model
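To reproduce the kind of comparison summarized in Fig. 4, the short script below takes the test-clean WERs from TABLE I and reports, for each model, the absolute change obtained by adding an external language model; the numbers are copied from the table above, while the script itself is only an illustrative aid.

```python
# Test-clean WER (%) from TABLE I: (without LM, with LM)
wer_clean = {
    "Conformer(S)": (2.7, 2.1), "Conformer(M)": (2.3, 2.0), "Conformer(L)": (2.1, 1.9),
    "ContextNet(S)": (2.9, 2.3), "ContextNet(M)": (2.4, 2.0), "ContextNet(L)": (2.1, 1.9),
    "Transformer-Transducer": (2.4, 2.0), "QuartzNet 15x5": (3.9, 2.69),
    "CitriNet-256": (2.52, 3.78), "CitriNet-512": (2.19, 3.11),
    "CitriNet-768": (2.04, 2.57), "CitriNet-1024": (2.0, 2.52),
    "Jasper DR 10x5": (3.86, 2.95), "JasperDR 10x5 (+ masks)": (4.32, 2.84),
}

for model, (no_lm, with_lm) in wer_clean.items():
    delta = with_lm - no_lm   # negative = the external language model helps
    print(f"{model:28s} {no_lm:5.2f} -> {with_lm:5.2f}  ({delta:+.2f})")
```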
[12] A. Hannun et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[13] A. Graves, "Sequence Transduction with Recurrent Neural Networks," 2012. [Online]. Available: http://arxiv.org/abs/1211.3711.
[14] J. Li, R. Zhao, H. Hu, and Y. Gong, "Improving RNN Transducer Modeling for End-to-End Speech Recognition," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2019, pp. 114–121, doi: 10.1109/ASRU46091.2019.9003906.
[15] H. Hu, R. Zhao, J. Li, L. Lu, and Y. Gong, "Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7079–7083, doi: 10.1109/ICASSP40776.2020.9054663.
[16] L. Lu, X. Zhang, and S. Renals, "On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 5060–5064, doi: 10.1109/ICASSP.2016.7472641.
[17] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, "Improved Training of End-to-end Attention Models for Speech Recognition," in Interspeech 2018, Sep. 2018, pp. 7–11, doi: 10.21437/Interspeech.2018-1616.
[18] Y. He et al., "Streaming End-to-end Speech Recognition for Mobile Devices," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6381–6385, doi: 10.1109/ICASSP.2019.8682336.
[19] G. Saon, Z. Tuske, D. Bolanos, and B. Kingsbury, "Advancing RNN Transducer Technology for Speech Recognition," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 5654–5658, doi: 10.1109/ICASSP39728.2021.9414716.
[20] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, vol. 2017-December.
[21] C. Shan, J. Zhang, Y. Wang, and L. Xie, "Attention-Based End-to-End Speech Recognition on Voice Search," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 4764–4768, doi: 10.1109/ICASSP.2018.8462492.
[22] M. Sperber, G. Neubig, J. Niehues, and A. Waibel, "Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation," Trans. Assoc. Comput. Linguist., vol. 7, pp. 313–325, Nov. 2019, doi: 10.1162/tacl_a_00270.
[23] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 5884–5888, doi: 10.1109/ICASSP.2018.8462506.
[24] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning - ICML '06, 2006, pp. 369–376, doi: 10.1145/1143844.1143891.
[25] D. Amodei et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[26] S. Kim, M. L. Seltzer, J. Li, and R. Zhao, "Improved training for online end-to-end speech recognition systems," arXiv preprint arXiv:1711.02212, 2017.
[27] A. Das, J. Li, R. Zhao, and Y. Gong, "Advancing Connectionist Temporal Classification with Attention Modeling," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 4769–4773, doi: 10.1109/ICASSP.2018.8461558.
[28] T. Zhao, "A Novel Topology for End-to-end Temporal Classification and Segmentation with Recurrent Neural Network," arXiv preprint arXiv:1912.04784, 2019.
[29] A. Graves, "Connectionist Temporal Classification," 2012, pp. 61–93.
[30] T. Hori, S. Watanabe, and J. R. Hershey, "Joint CTC/attention decoding for end-to-end speech recognition," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 518–529.
[31] C.-X. Qin, W.-L. Zhang, and D. Qu, "A new joint CTC-attention-based speech recognition model with multi-level multi-head attention," EURASIP J. Audio, Speech, Music Process., vol. 2019, no. 1, p. 18, Dec. 2019, doi: 10.1186/s13636-019-0161-0.
[32] S. Punjabi et al., "Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 7218–7222, doi: 10.1109/ICASSP39728.2021.9413734.
[33] J. Li et al., "Developing RNN-T models surpassing high-performance hybrid models with customization capability," 2020.
[34] T. Makino et al., "Recurrent Neural Network Transducer for Audio-Visual Speech Recognition," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2019, pp. 905–912, doi: 10.1109/ASRU46091.2019.9004036.
[35] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 4945–4949, doi: 10.1109/ICASSP.2016.7472618.
[36] J. Xu, X. Sun, Z. Zhang, G. Zhao, and J. Lin, "Understanding and improving layer normalization," arXiv preprint arXiv:1911.07013, 2019.
[37] Q. Zhang et al., "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7829–7833, doi: 10.1109/ICASSP40776.2020.9053896.
[38] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3286–3295.
[39] A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[40] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210, doi: 10.1109/ICASSP.2015.7178964.
[41] S. Kriman et al., "Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 6124–6128, doi: 10.1109/ICASSP40776.2020.9053889.
[42] J. Li et al., "Jasper: An end-to-end convolutional neural acoustic model," Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2019-September, pp. 71–75, 2019, doi: 10.21437/Interspeech.2019-1819.
[43] W. Han et al., "ContextNet: Improving convolutional neural networks for automatic speech recognition with global context," Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2020-October, no. 1, pp. 3610–3614, 2020, doi: 10.21437/Interspeech.2020-2059.
[44] S. Majumdar, J. Balam, O. Hrinchuk, V. Lavrukhin, V. Noroozi, and B. Ginsburg, "Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition," pp. 1–5, 2021. [Online]. Available: http://arxiv.org/abs/2104.01721.
[45] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 1800–1807, 2017, doi: 10.1109/CVPR.2017.195.
[46] A. Hannun, A. Lee, Q. Xu, and R. Collobert, "Sequence-to-sequence speech recognition with time-depth separable convolutions," Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2019-September, pp. 3785–3789, 2019, doi: 10.21437/Interspeech.2019-2460.
[47] S. Karita et al., "A Comparative Study on Transformer vs RNN in Speech Applications," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2019, pp. 449–456, doi: 10.1109/ASRU46091.2019.9003750.