
DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition

Guo Zixun1, Chen Chen1, Chng Eng Siong1
1 Nanyang Technological University, Singapore
[email protected]

Abstract

The performance of automatic speech recognition (ASR) systems degrades drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, is able to enhance ASR systems under such conditions by simulating in-domain noisy speech from its clean counterpart. However, existing distortion models are either non-trainable or unexplainable, and often lack controllability and generalization ability. In this paper, we propose a fully explainable and controllable model, DENT-DDSP, to achieve EDM. DENT-DDSP utilizes trainable differentiable digital signal processing (DDSP) components and requires only 10 seconds of training data to achieve high fidelity. Experiments show that the simulated noisy data from DENT-DDSP achieves the best simulation quality compared to other static or GAN-based distortion models in terms of multi-scale spectral loss (MSSL). Furthermore, a downstream ASR task is designed to evaluate whether the simulated noisy data can be utilized for ASR and achieve performance similar to that obtained with real noisy data. Experiments show that the ASR model trained with simulated data from DENT-DDSP achieves the lowest word error rate (WER) among all distortion models and performs comparably to the upper-bound model trained with in-domain real noisy data. The code of the model is released on GitHub1.

Index Terms: explicit distortion modelling, DDSP, noise-robust ASR

1. Introduction

End-to-end automatic speech recognition (ASR) systems have achieved remarkable success, yet such systems are prone to noisy conditions. Several datasets [1, 2, 3, 4] containing noisy speech and the corresponding transcribed text have been collected and utilized to boost ASR performance, thus achieving noise robustness. Yet, such datasets are usually scarce and hard to obtain. Explicit distortion modelling (EDM) [5], as an alternative, is able to simulate noisy speech from its clean counterpart. However, distortion models with high simulation quality are hard to obtain.

To achieve EDM, traditional methods such as parallel model combination (PMC) [6] and vector Taylor series (VTS) [7, 8, 9] were widely used to estimate the distortion model before the deep learning era. An intuitive method is proposed in [10], where static noise within the limited real noisy data is aggregated and added to the clean speech. Yet, this method cannot model convolutive channel distortion. Another method [11] adds pre-recorded noise to the clean speech and passes it through pre-defined codecs to obtain the simulated noisy data. These two static methods [10, 11], however, are untrainable and thus cannot be adapted to different circumstances. Recently, GAN-based methods [12, 13] have treated EDM as a style transfer problem and obtained promising results. We find SimuGAN [12] the closest match to our work. It operates directly on magnitude spectrograms and uses GAN and contrastive learning methods to distort the clean spectrograms. However, GAN-based methods contain large amounts of parameters, making the distortion model unexplainable and uncontrollable.

Recently, with the advent of differentiable digital signal processing (DDSP) [14], traditional digital signal processors (DSP) have become differentiable and trainable, and have seen success in the field of speech and voice synthesis [15, 16]. Inspired by the success of DDSP in the synthesis domain, we find DDSP a viable solution to achieve EDM. In this paper, we focus on simulating VHF/UHF transmitted data (e.g., air traffic control speech). By comparing the characteristics of VHF/UHF transmitted data with their clean counterparts, we observe that the transmitted data are distorted, compressed to a fixed dynamic range, equalized, and also contain colored noise. We hence find DDSP capable of achieving exactly this conversion.

In this work, we propose a data-efficient noisy speech generator using novel DDSP components (DENT-DDSP) to simulate VHF/UHF transmitted data. To the best of our knowledge, DENT-DDSP is the first model that utilizes DDSP to achieve explicit distortion modelling. It requires only 10 seconds of training data, yet the trained distortion model achieves high simulation quality. As a result, VHF/UHF transmitted data can easily be simulated by DENT-DDSP, mimicking real noisy data behaviour. The simulated data can be further used for downstream tasks (e.g., noise-robust ASR, speech enhancement).

To be more specific, the proposed DENT-DDSP consists of two signal chains: an audio signal chain and a noise signal chain. Each signal chain consists of trainable DDSP components connected in series. The audio signal chain distorts, compresses and equalizes the clean speech. The noise signal chain filters the input white noise to the desired spectral behaviour. The outputs of the two signal chains are added to form the final simulated noisy speech. In the audio signal chain, two novel DDSP components are proposed: a waveshaper and a computationally efficient dynamic range compressor. Experiments show that the proposed computationally efficient dynamic range compressor drastically improves computational efficiency while maintaining simulation quality.

Extensive experiments show that DENT-DDSP has the highest simulation quality among all distortion models in terms of multi-scale spectral loss (MSSL). Moreover, it shows strong generalization ability on data unseen during training and outperforms GAN-based models. To further evaluate the simulation quality, we perform a downstream noise-robust ASR task using a dual-path ASR system [12] trained with real and simulated noisy data. The model trained with simulated data from DENT-DDSP achieves performance similar to the upper-bound model trained with in-domain real noisy data, with a 2.7% difference in word error rate (WER). It also outperforms the best baseline distortion model by 7.3% WER.

1 https://github.com/guozixunnicolas/DENT-ddsp
2. Model Architecture

The model architecture of DENT-DDSP is shown in Figure 1. DENT-DDSP contains two parallel signal chains. The audio signal chain contains a waveshaper, a DRC and an equalizer, and receives clean speech s_in(t) as input. The noise signal chain contains only an equalizer and receives white noise n_in(t) as input. In each signal chain, the input signal passes through each DDSP component in series. The simulated noisy speech s_simulated(t) is formed by adding the audio signal chain output s_out(t) and the weighted noise signal chain output n_out(t), following Equation 1. λ is a non-trainable weighting parameter and is set to 1 during training. However, it can be adjusted during the generation phase to simulate noisy data with different SNRs in order to achieve data augmentation.

$$s_{simulated}(t) = s_{out}(t) + \lambda \cdot n_{out}(t) \quad (1)$$

A spectral loss, the multi-scale spectral loss (MSSL) [14], is calculated between the simulated noisy speech s_simulated(t) and the real noisy data to update the model's parameters via backpropagation. We calculate MSSL using the following FFT sizes: 2048, 1024, 512, 256, 128, 64, to reflect the spectral distance at different spectral resolutions.
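For reference, here is a minimal NumPy sketch of a multi-scale spectral loss over the FFT sizes listed above. In DDSP [14] the per-scale term combines linear- and log-magnitude L1 distances; this sketch keeps only the linear term, and the 75% frame overlap is our assumption rather than a detail stated in the paper.

```python
import numpy as np
from scipy.signal import stft

FFT_SIZES = (2048, 1024, 512, 256, 128, 64)

def mssl(real, simulated, sr=16000, fft_sizes=FFT_SIZES):
    """Multi-scale spectral loss: sum of L1 distances between magnitude
    spectrograms at several FFT resolutions (illustrative sketch; the
    log-magnitude term of [14] is omitted here)."""
    loss = 0.0
    for n_fft in fft_sizes:
        # 75% overlap is a common choice; the paper does not state it.
        _, _, S_real = stft(real, fs=sr, nperseg=n_fft, noverlap=3 * n_fft // 4)
        _, _, S_sim = stft(simulated, fs=sr, nperseg=n_fft, noverlap=3 * n_fft // 4)
        loss += np.mean(np.abs(np.abs(S_real) - np.abs(S_sim)))
    return loss
```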

We now formulate the DDSP components in the signal chains and introduce their functionalities. For simplicity, we denote each component's input as x(t) and output as y(t); x_dB(t) and y_dB(t) represent the input and output on the dB scale.

2.1. Waveshaper

The waveshaper introduces a distortion effect to the input signal, as shown in Equation 2. The distortion gain g_distort controls the amount of distortion added to the input signal. With g_distort close to 0, the transfer function of the waveshaper is almost linear, hence little distortion is applied. With increasing g_distort, however, the transfer function becomes non-linear and saturates quickly. Such non-linearity introduces the distortion to the input signal. The distortion gain g_distort is a trainable parameter.

$$y(t) = \frac{2}{\pi}\arctan\!\left(g_{distort} \cdot \frac{\pi}{2} \cdot x(t)\right), \quad g_{distort} > 0 \quad (2)$$
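Equation 2 translates directly into a few lines of NumPy; the sketch below is a non-trainable version, whereas in DENT-DDSP g_distort would be a learnable parameter inside an autodiff framework.

```python
import numpy as np

def waveshaper(x, g_distort):
    """Arctan waveshaper of Equation 2. For g_distort -> 0 the arctan
    argument stays in its linear region, so y ~= g_distort * x (little
    distortion); large g_distort drives the curve into saturation."""
    assert g_distort > 0
    return (2.0 / np.pi) * np.arctan(g_distort * (np.pi / 2.0) * x)
```

The 2/π prefactor keeps the output bounded in (-1, 1), so the waveshaper never clips beyond full scale regardless of how large g_distort becomes.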
2.2. Dynamic range compressor (DRC)

Dynamic range compressors (DRCs) [17] are widely used in music production to limit the dynamic range of different audio tracks. By observation, VHF/UHF transmitted data are compressed to a fixed dynamic range in the time domain. We hence find the DRC a suitable DDSP component to achieve such limiting behaviour. A detailed survey [17] introduces different kinds of DRCs, and we choose a hard-knee DRC as our DDSP backbone. We then formulate our novel computationally efficient DRC.

The equations of the computationally efficient DRC are shown in Equations 3-7. Equation 3 calculates the amplitude reduction gain g(t) based on the compression threshold T and compression ratio R. Equations 4-6 show our efficient implementation of gain smoothing using a novel companding operation. The amplitude reduction gain g(t) is first downsampled via linear interpolation by a downsampling factor ds_factor in Equation 4. The downsampled gain function g_d(t) is then smoothed by α_A and α_R, which represent the attack and release time constants respectively, in Equation 5. The smoothed gain function g_ds(t) is then upsampled with overlapping Hanning windows by an upsampling factor us_factor in Equation 6. ds_factor and us_factor are set to be equal such that g_es(t) has the same shape as g(t). Computing g_ds(t) is the main performance bottleneck in the gain calculation since the operation is auto-regressive and recursive, with time complexity O(L/ds_factor), where L is the total number of audio samples. Hence, the time complexity increases linearly with a decreasing ds_factor. We show in Section 3.3 that the companding operation in Equations 4 and 6 drastically improves efficiency while maintaining simulation quality. Finally, the output y_dB(t) is obtained in Equation 7 with an additional makeup gain g_makeup. The following parameters are trainable: T, R, α_A, α_R, g_makeup.

$$g(t) = \begin{cases} 0 & x_{dB}(t) \le T \\ \dfrac{x_{dB}(t) - T}{R} & x_{dB}(t) > T \end{cases} \quad (3)$$

$$g_d(t) = \mathrm{downsample}(g(t),\, ds\_factor) \quad (4)$$

$$g_{ds}(t) = \begin{cases} \alpha_A\, g_{ds}(t-1) + (1-\alpha_A)\, g_d(t) & g_d(t) > g_{ds}(t-1) \\ \alpha_R\, g_{ds}(t-1) + (1-\alpha_R)\, g_d(t) & g_d(t) \le g_{ds}(t-1) \end{cases} \quad (5)$$

$$g_{es}(t) = \mathrm{upsample}(g_{ds}(t),\, us\_factor) \quad (6)$$

$$y_{dB}(t) = x_{dB}(t) + g_{es}(t) + g_{makeup} \quad (7)$$
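To make the companding operation concrete, the sketch below implements Equations 3-7 in NumPy. The overall structure follows the text above, but the interpolation grid, the Hann overlap-add scheme, and its normalization are our assumptions, not the paper's exact implementation, and the sign convention of the reduction gain in Equation 3 is reproduced as extracted.

```python
import numpy as np

def drc_gain(x_db, T, R, alpha_a, alpha_r, g_makeup, ds_factor):
    """Computationally efficient hard-knee DRC (Eqs. 3-7), NumPy sketch."""
    # Eq. 3: static gain computer (hard knee); sign convention as extracted.
    g = np.where(x_db > T, (x_db - T) / R, 0.0)

    # Eq. 4: downsample the gain curve by linear interpolation.
    L = len(g)
    t_coarse = np.linspace(0, L - 1, L // ds_factor)
    g_d = np.interp(t_coarse, np.arange(L), g)

    # Eq. 5: one-pole attack/release smoothing on the coarse grid.
    # This recursive loop is the bottleneck, now only L/ds_factor long.
    g_ds = np.empty_like(g_d)
    g_ds[0] = g_d[0]
    for t in range(1, len(g_d)):
        a = alpha_a if g_d[t] > g_ds[t - 1] else alpha_r
        g_ds[t] = a * g_ds[t - 1] + (1 - a) * g_d[t]

    # Eq. 6: upsample back via overlap-add of Hann-windowed impulses
    # (one plausible realization of the overlapping-window upsampler).
    win = np.hanning(2 * ds_factor + 1)
    g_es = np.zeros(L)
    norm = np.zeros(L)
    for i, v in enumerate(g_ds):
        c = int(t_coarse[i])
        lo, hi = max(0, c - ds_factor), min(L, c + ds_factor + 1)
        w = win[lo - (c - ds_factor): hi - (c - ds_factor)]
        g_es[lo:hi] += v * w
        norm[lo:hi] += w
    g_es /= np.maximum(norm, 1e-8)

    # Eq. 7: apply the smoothed gain plus the makeup gain, in dB.
    return x_db + g_es + g_makeup
```

Note how the expensive recursion in Equation 5 runs on a signal that is ds_factor times shorter than the input, which is exactly where the O(L/ds_factor) complexity comes from.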
2.3. Equalizer (EQ)

By observation, VHF/UHF transmitted data are band-limited and equalized in the frequency domain. To achieve equalization, we adopt the equalizer (EQ) implementation from [14], where the EQ is implemented as a linear time-invariant FIR filter (LTI-FIR). For each EQ, the trainable parameters are the magnitudes of the filter frequency response FR_mag, and the number of frequency bins is set to 1000. Additionally, for the EQ in the noise signal chain, the noise amplitude is made trainable in order to adjust the volume of the filtered noise.
Figure 1: DENT-DDSP architecture. DRC stands for dynamic range compressor and EQ stands for equalizer. The parameters of the DDSP components are trainable and controllable.

3. Experiments and results

3.1. Dataset

The Robust Automatic Transcription of Speech (RATS) project [1] collected a parallel dataset containing clean speech, the corresponding VHF/UHF transmitted noisy speech, and the transcribed text. To obtain this dataset, pre-recorded conversational speech was broadcast over 8 radio channels (Channels A-H) and the transmitted noisy audio was captured concurrently. We select data from Channel A, which contains 57.4 hours of data, for training and testing. To address the data scarcity problem mentioned in Section 1, less than 60 seconds of parallel audio is selected as training data. During training, the data are batched into 1-second chunks.

3.2. Experiment setup

3.2.1. Training data selection and model comparison

To obtain the best performing distortion model, several models are trained and compared using different amounts of training data, from 10 seconds to 60 seconds, with various speech-to-total ratios (s2t). s2t is defined as the duration of active speech over the total duration of each 1-second chunk. Active speech duration is calculated from the clean audio energy in a sliding window with a threshold of 50 dB. By analyzing the dataset, we find that with s2t < 0.4 the data contain mostly non-speech audio or silence. Hence, we only select data with 0.4 ≤ s2t < 1.0. To acquire the desired training data, 1-second data chunks with the desired s2t are randomly chosen and aggregated to the desired total duration; a sketch of this selection procedure is given below.
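A minimal sketch of the s2t computation follows, assuming (our reading; the paper does not spell out the rule) that a frame counts as active speech when its energy is within 50 dB of the loudest frame in the chunk. Frame and hop lengths are illustrative values for 16 kHz audio.

```python
import numpy as np

def s2t(chunk, frame_len=400, hop=160, thresh_db=50.0):
    """Speech-to-total ratio of a 1-second chunk: fraction of frames whose
    energy is within thresh_db of the loudest frame (one plausible reading
    of the 50 dB sliding-window threshold; the exact rule may differ)."""
    frames = np.lib.stride_tricks.sliding_window_view(chunk, frame_len)[::hop]
    e_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return np.mean(e_db > e_db.max() - thresh_db)

# Selection: keep only chunks with 0.4 <= s2t < 1.0, then aggregate
# randomly chosen chunks until the desired duration (e.g., 10 s) is reached.
```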
To evaluate the simulation quality, clean test speech is input to the trained distortion model and the MSSL [14] is calculated between the real and simulated noisy data. We also compare the proposed DENT-DDSP with the following distortion models in terms of simulation quality:

• Clean augment model [10]: simulated noisy speech is obtained by adding the aggregated stationary noise from RATS Channel A to the clean speech.

• G.726 augment model [11]: the clean speech is first passed through the G.726 codec. The aggregated stationary noise from RATS Channel A is added subsequently to the codec output.

• Codec2 augment model [11]: we replace the codec choice in the G.726 augment model with Codec2 at 700 bits/s.

• SimuGAN [12]: simulated noisy speech is obtained by feeding the clean speech to a trained SimuGAN.

3.2.2. Noise-robust automatic speech recognition (ASR)

A downstream noise-robust ASR task is performed to validate the effectiveness of DENT-DDSP. The noise-robust ASR models are trained using simulated data generated by the distortion models as well as real noisy data. We take the 57.4-hour parallel data from Channel A and split them into 3 folds: 44.3 hours for training, 4.9 hours for validation and 8.2 hours for testing. During testing, the ASR models are evaluated using real noisy data from the test set and the WERs are compared. We adopt the conformer-based [18] dual-path ASR system from [12], with 12 conformer layers in the encoder and 6 transformer layers in the decoder. The dual-path ASR system is specifically designed for noise-robust ASR. Parallel data containing clean and noisy speech are fed to the clean and noisy ASR paths, which share parameters. The KL-divergence between the two paths is optimized during training. We pre-train a language model [19] with 2 RNN layers using the existing transcribed text and utilize it during decoding.
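For readers unfamiliar with the dual-path setup, the sketch below illustrates a KL consistency term between the clean-path and noisy-path output distributions. It is our schematic reading of the loss in [12], not its exact formulation; the function name and logit shapes are hypothetical.

```python
import numpy as np

def kl_consistency(clean_logits, noisy_logits):
    """KL(clean || noisy) between the two paths' per-frame output
    distributions, averaged over frames: a schematic version of the
    dual-path consistency term in [12]."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(clean_logits)   # shape: (frames, vocab)
    q = softmax(noisy_logits)
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))
```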
3.3. Results

3.3.1. Effect of different amounts of training data with various s2t and model comparison

Table 1 shows the MSSL between the real and simulated audio using different amounts of training data with various s2t. In general, MSSL decreases with less training data and higher s2t. The best DENT-DDSP model is hence obtained using 10 seconds of training data with 0.8 ≤ s2t < 1.0. Comparing DENT-DDSP with the other distortion models, we observe that DENT-DDSP is trainable, contains far fewer trainable parameters, and requires less training data. Moreover, noisy audio simulated by DENT-DDSP has lower MSSL, which indicates that DENT-DDSP has the best simulation quality among all baseline models. More listening samples are available online2. To visualize the simulation quality, example spectrograms of the clean audio, the audio simulated by the various distortion models, and the real noisy data are plotted in Figure 2.

2 https://guozixunnicolas.github.io/DENT-DDSP-demo/
Table 1: Simulation quality of DENT-DDSP using different amounts of training data with various s2t, and model comparison

model           no. of trainable param.   s2t          amount of training data   MSSL
DENT-DDSP       2k                        [0.8, 1.0)   60-sec parallel           0.184
                                          [0.6, 0.8)   60-sec parallel           0.187
                                          [0.4, 0.6)   60-sec parallel           0.186
                                          [0.8, 1.0)   40-sec parallel           0.177
                                          [0.6, 0.8)   40-sec parallel           0.187
                                          [0.4, 0.6)   40-sec parallel           0.188
                                          [0.8, 1.0)   20-sec parallel           0.170
                                          [0.6, 0.8)   20-sec parallel           0.189
                                          [0.4, 0.6)   20-sec parallel           0.189
                                          [0.8, 1.0)   10-sec parallel           0.170
                                          [0.6, 0.8)   10-sec parallel           0.181
                                          [0.4, 0.6)   10-sec parallel           0.189
SimuGAN         14M                       -            10-min unparalleled       0.173
Clean augment   non-trainable             -            -                         0.192
G.726 augment   non-trainable             -            -                         0.197
Codec2 augment  non-trainable             -            -                         0.197
Figure 2: Magnitude spectrogram comparison between clean audio, audio simulated by various distortion models, and the real noisy data. (b) represents the real noisy data from Channel A of the RATS dataset.

3.3.2. Effectiveness of the companding operation

Table 2 shows the effectiveness of the proposed companding operation in the computationally efficient DRC. For the best DENT-DDSP model obtained, decreasing ds_factor increases the training time linearly, yet without obvious improvement in MSSL. This justifies the effectiveness of the companding operation, which increases efficiency while maintaining generation quality.

Table 2: Training time and MSSL comparison among DENT-DDSPs with different ds_factors

ds_factor   training time   MSSL
16          7 min           0.170
8           13.5 min        0.169
4           25.5 min        0.170
2           60.2 min        0.169

3.3.3. Generalization ability

To assess the generalization ability of the trainable distortion models, DENT-DDSP and SimuGAN, non-speech audio is used to test them. The non-speech audio (available online2) includes ambient noise, guitar and piano music, and synthesized audio. The spectrograms of the test non-speech audio and the simulated noisy audio from the two models are shown in Figure 3. By observation, DENT-DDSP simulated data have consistent noise spectra and the audio is distorted to the desired spectral behaviour: the spectrum becomes blurred with certain frequencies dampened. SimuGAN simulated audio, however, fails to produce such consistent distortion behaviour. Moreover, SimuGAN simulated results either contain artefacts or become unintelligible. Hence, DENT-DDSP shows stronger generalization ability than SimuGAN due to the use of explainable DDSP components.

Figure 3: Noisy data simulated by DENT-DDSP and SimuGAN using unseen non-speech data.

3.3.4. Noise-robust automatic speech recognition (ASR)

Table 3 shows the WER comparison using the same dual-path ASR model trained with simulated data from different distortion models and with real noisy data. Data simulated by the DENT augment model are obtained by setting λ = 0.79 and 1.26 (see Equation 1) in a trained DENT-DDSP model so that the noise gain of n_out(t) becomes ±2 dB. Additionally, a clean model is trained using only real clean speech. The DENT augment model achieves the lowest WER among all baselines and outperforms the best baseline model, SimuGAN, by 7.3%. Moreover, it achieves a WER close to that of the upper-bound model trained with in-domain real noisy data, with a difference of 2.7%. This shows that DENT-DDSP is able to simulate noisy data with characteristics similar to the real noisy data and to enhance ASR systems under noisy conditions.
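The two λ values follow directly from the stated ±2 dB noise gain, since a gain of g dB scales amplitude by 10^(g/20), as the short check below confirms.

```python
# A gain of g dB scales amplitude by 10 ** (g / 20), reproducing the
# two lambda values used for the DENT augment model:
print(10 ** (-2 / 20))  # 0.794... -> lambda = 0.79 (noise gain -2 dB)
print(10 ** (+2 / 20))  # 1.258... -> lambda = 1.26 (noise gain +2 dB)
```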
Table 3: WER comparison using the same ASR model with simulated data from different distortion models and real noisy data

model           noisy speech source                      WER
Clean           -                                        93.4%
Clean augment   clean speech + aggregated noise          73.8%
G.726 augment   G.726 codec speech + aggregated noise    75.6%
Codec2 augment  Codec2 speech + aggregated noise         83.2%
SimuGAN         SimuGAN simulated                        65.9%
DENT            DENT simulated                           66.4%
DENT augment    DENT simulated with noise gain = ±2 dB   58.6%
Upper bound     real noisy data                          55.9%

4. Conclusion

We propose a fully explainable and controllable model, DENT-DDSP, based on novel DDSP components, which utilizes only 10 seconds of training data to achieve explicit distortion modelling. Besides the existing DDSP components, we propose two novel DDSP components, the waveshaper and the computationally efficient DRC, which can easily be integrated into other DDSP models. Experiments show that data simulated by DENT-DDSP achieve the highest simulation quality among all distortion models. Moreover, the noise-robust ASR system trained with DENT-DDSP simulated data achieves a WER similar to the same model trained with real noisy data, with a 2.7% difference. It also achieves a 7.3% WER improvement over the best baseline distortion model.
5. References

[1] D. Graff, K. Walker, S. Strassel, X. Ma, K. Jones, and A. Sawyer, “The RATS collection: Supporting HLT research with degraded audio data,” in Proc. of the 9th Int. Conf. on Lang. Resour. and Eval. (LREC), 2014, pp. 1970–1977.
[2] S. Badrinath and H. Balakrishnan, “Automatic speech recognition
for air traffic control communications,” Transportation Research
Record, vol. 2676, no. 1, pp. 798–810, 2022.
[3] T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The air-
bus air traffic control speech recognition 2018 challenge: towards
atc automatic transcription and call sign detection,” in Proc. Int.
Speech Commun. Assoc.(Interspeech), 2018, pp. 2993–2997.
[4] J. Godfrey, E. Holliman, and J. McDaniel, “Switchboard: tele-
phone speech corpus for research and development,” in Proc.
IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1992, pp. 517–520 vol.1.
[5] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of
noise-robust automatic speech recognition,” IEEE ACM Trans. on
Audio, Speech, and Lang. Process., vol. 22, no. 4, pp. 745–777,
2014.
[6] M. J. F. Gales, “Model-based techniques for noise robust speech recognition,” Ph.D. thesis, University of Cambridge, 1995.
[7] Y. Minami and S. Furui, “A maximum likelihood procedure for
a universal adaptation method based on hmm composition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1995, pp. 129–132.
[8] A. Sankar and C.-H. Lee, “Robust speech recognition based on
stochastic matching,” in Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), vol. 1, 1995, pp. 121–
124.
[9] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “HMM adaptation using vector Taylor series for noisy speech recognition,” in Proc. Int. Conf. on Spoken Lang. Process. (ICSLP), 2000, pp. 869–872.
[10] D. Ma, G. Li, H. Xu, and E. S. Chng, “Improving code-switching
speech recognition with data augmentation and system combi-
nation,” in Proc. Asia-Pacific Signal and Inf. Process. Assoc.
Annu. Summit and Conf. (APSIPA ASC), 2019, pp. 1308–1312.
[11] M. Ferràs, S. Madikeri, P. Motlicek, S. Dey, and H. Bourlard, “A
large-scale open-source acoustic simulator for speaker recogni-
tion,” IEEE Signal Processing Letters, vol. 23, no. 4, pp. 527–531,
2016.
[12] C. Chen, N. Hou, Y. Hu, S. Shirol, and E. S. Chng, “Noise-robust
speech recognition with 10 minutes unparalleled in-domain data,”
in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process-
ing (ICASSP), 2022.
[13] H. Hu, T. Tan, and Y. Qian, “Generative adversarial networks
based data augmentation for noise robust speech recognition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), 2018, pp. 5044–5048.
[14] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: differ-
entiable digital signal processing,” in Proc. Int. Conf. on Learn. Representations (ICLR), 2020.
[15] G. Fabbro, V. Golkov, T. Kemp, and D. Cremers, “Speech
synthesis and control using differentiable dsp,” 2020. [Online].
Available: https://arxiv.org/abs/2010.15084
[16] J. Alonso and C. Erkut, “Latent space explorations of singing
voice synthesis using ddsp,” in Proc. of 18th Sound and Music
Computing Conf., 2021.
[17] D. Giannoulis, M. Massberg, and J. Reiss, “Digital dynamic range
compressor design—a tutorial and analysis,” J. of the Audio Eng.
Soc.(AES), vol. 60, pp. 399–408, 2012.
[18] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu,
W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer:
Convolution-augmented transformer for speech recognition,” in
Proc. Int. Speech Commun. Assoc.(Interspeech), 2020, pp. 5036–
5040.

[19] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, “A comparative study on transformer vs RNN in speech applications,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019.
