We then formulate the DDSP components in the signal chains and introduce their functionalities. For simplicity, we represent each component's input as x(t) and output as y(t). x_dB(t) and y_dB(t) represent the input and output in the dB scale.

2.1. Waveshaper

The waveshaper aims to introduce a distortion effect to the input signal as shown in Equation 2. The distortion gain g_distort represents the amount of distortion added to the input signal. With g_distort close to 0, the transfer function of the waveshaper is almost linear, hence little distortion is applied. However, with an increasing g_distort, the transfer function becomes non-linear and saturates quickly. Such non-linearity introduces the distortion to the input signal. The distortion gain g_distort is set to be a trainable parameter.

g_es(t) = upsample(g_ds(t), us_factor)        (6)

y_dB(t) = x_dB(t) + g_es(t) + g_makeup        (7)

2.3. Equalizer (EQ)

By observation, the VHF/UHF transmitted data are band-limited and equalized in the frequency domain. To achieve equalization, we adopt the equalizer (EQ) implementation from [14], where the EQ is implemented as a linear time-invariant FIR filter (LTI-FIR). For each EQ, the trainable parameters are set to be the magnitude of the filter frequency response FR_mag, and the number of frequency bins is set to 1000. Additionally, for the EQ in the noise signal chain, the noise amplitude is made trainable in order to adjust the volume of the filtered noise.
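To make the signal chain concrete, here is a minimal NumPy sketch of the waveshaper, the dB-domain gain application of Equations (6)-(7), and the LTI-FIR EQ. The tanh-style transfer curve, the zero-order-hold upsampling and the FIR construction from FR_mag are illustrative assumptions; Equation 2 and the paper's exact implementations are not reproduced in this excerpt.

    import numpy as np

    def waveshaper(x, g_distort):
        # Illustrative tanh-style transfer curve (Sec. 2.1): near-linear for small
        # g_distort, saturating (hence distorting) as g_distort grows.
        return np.tanh(g_distort * x) / np.tanh(g_distort)

    def apply_smoothed_gain_db(x_db, g_ds, us_factor, g_makeup):
        # Eq. (6): upsample the downsampled gain curve back to the signal rate
        # (zero-order hold used here for simplicity).
        g_es = np.repeat(g_ds, us_factor)[: len(x_db)]
        # Eq. (7): add the smoothed gain and the make-up gain in the dB domain.
        return x_db + g_es + g_makeup

    def lti_fir_eq(x, fr_mag):
        # Sec. 2.3: turn a (trainable) magnitude response, e.g. 1000 bins, into a
        # roughly linear-phase FIR kernel and filter the signal with it.
        kernel = np.fft.irfft(fr_mag)
        kernel = np.roll(kernel, len(kernel) // 2)  # centre the impulse response
        return np.convolve(x, kernel, mode="same")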
3.2. Experiment setup

3.2.1. Training data selection and model comparison

To obtain the best-performing distortion model, several models are trained and compared using different amounts of training data, from 10 seconds to 60 seconds, with various speech-to-total ratios (s2t). s2t is defined as the duration of active speech over the total duration of each 1-second chunk. The active speech duration is calculated from the clean audio energy in a sliding window with a threshold of 50 dB. Analysis of the dataset shows that with s2t < 0.4 the data contain mostly non-speech audio or silence. Hence, we only select data with 0.4 ≤ s2t < 1.0. To acquire the desired training data, 1-second data chunks with the desired s2t are randomly chosen and aggregated to the desired total duration (a code sketch of this chunk selection follows the model list below). To evaluate the simulation quality, testing clean speeches are input to the trained distortion model and the MSSL [14] is calculated between the real and simulated noisy data. We also compare the proposed DENT-DDSP with the following distortion models in terms of simulation quality:

• Clean augment model [10]: simulated noisy speeches are obtained by adding the aggregated stationary noise from RATs Channel A to the clean speeches.

• G.726 augment model [11]: the clean speeches are first passed through the G.726 codec. The aggregated stationary noise from RATs Channel A is subsequently added to the codec output.

• Codec2 augment model [11]: we replace the codec in the G.726 augment model with codec2 at 700 bits/s.

• SimuGAN [12]: simulated noisy speeches are obtained by feeding the clean speeches to a trained SimuGAN.
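As a hypothetical illustration of the chunk selection described in Section 3.2.1, the sketch below marks a short frame as active speech when its energy lies within 50 dB of the loudest frame, computes s2t for each 1-second chunk, and aggregates randomly chosen chunks. The frame length and the exact thresholding convention are assumptions, not taken from the paper.

    import numpy as np

    def s2t_per_chunk(clean, sr, frame_sec=0.02, thresh_db=50.0):
        # Frame energy in dB; a frame counts as active speech if it is within
        # thresh_db of the loudest frame (one plausible reading of the 50 dB
        # threshold above). Returns the active-speech ratio of each 1-s chunk.
        hop = int(frame_sec * sr)
        n_frames = len(clean) // hop
        frames = clean[: n_frames * hop].reshape(n_frames, hop)
        energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        active = energy_db > energy_db.max() - thresh_db
        per_sec = int(round(sr / hop))
        n_chunks = n_frames // per_sec
        return active[: n_chunks * per_sec].reshape(n_chunks, per_sec).mean(axis=1)

    def aggregate_chunks(clean, sr, target_sec, s2t_lo=0.8, s2t_hi=1.0, seed=0):
        # Randomly pick 1-second chunks whose s2t falls in [s2t_lo, s2t_hi) and
        # concatenate them until the desired total duration is reached.
        rng = np.random.default_rng(seed)
        s2t = s2t_per_chunk(clean, sr)
        ok = np.flatnonzero((s2t >= s2t_lo) & (s2t < s2t_hi))
        picked = rng.choice(ok, size=target_sec, replace=False)
        return np.concatenate([clean[i * sr:(i + 1) * sr] for i in picked])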
3.2.2. Noise-robust automatic speech recognition (ASR)

A downstream noise-robust ASR task is performed to validate the effectiveness of DENT-DDSP. The noise-robust ASR models are trained using simulated data generated by the distortion models as well as real noisy data. We take the 57.4-hour parallel data from Channel A and split them into 3 folds: 44.3 hours for training, 4.9 hours for validation and 8.2 hours for testing. During testing, the ASR models are evaluated using real noisy data from the test set and the WERs are compared. We adopt the conformer-based [18] dual-path ASR system from [12] with 12 conformer layers in the encoder and 6 transformer layers in the decoder. The dual-path ASR system is specifically designed for noise-robust ASR. Parallel data containing clean and noisy speeches are fed to the clean and noisy ASR paths, which share parameters. The KL-divergence between the two paths is optimized during training (a sketch of one such objective is given below). We pre-train a language model [19] with 2 RNN layers using existing transcribed text and utilize it during decoding.
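The dual-path objective itself is defined in [12]; as a rough, hypothetical sketch of how a KL term between the two paths could be combined with the per-path ASR losses (the weighting and the symmetrisation are assumptions, not taken from [12]):

    import torch.nn.functional as F

    def dual_path_loss(logits_clean, logits_noisy, asr_loss_clean, asr_loss_noisy,
                       kl_weight=1.0):
        # ASR losses from the clean and noisy paths plus a symmetric KL term that
        # pulls the two output distributions together (illustrative only).
        log_p_clean = F.log_softmax(logits_clean, dim=-1)
        log_p_noisy = F.log_softmax(logits_noisy, dim=-1)
        kl = 0.5 * (F.kl_div(log_p_noisy, log_p_clean, log_target=True,
                             reduction="batchmean")
                    + F.kl_div(log_p_clean, log_p_noisy, log_target=True,
                               reduction="batchmean"))
        return asr_loss_clean + asr_loss_noisy + kl_weight * kl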
3.3. Results

3.3.1. Effect of different amounts of training data with various s2t and model comparison

Table 1 shows the MSSL between the real and simulated audio using different amounts of training data with various s2t. In general, MSSL decreases with smaller amounts of training data and increasing s2t. The best DENT-DDSP model is hence obtained using 10 seconds of training data with 0.8 ≤ s2t < 1.0. Comparing DENT-DDSP with the other distortion models, we observe that DENT-DDSP is trainable, contains far fewer trainable parameters and requires less training data. Moreover, noisy audio simulated by DENT-DDSP has lower MSSL, which indicates that DENT-DDSP has the best simulation quality among all baseline models. More listening samples are available online². To visualize the simulation quality, example spectrograms of the clean audio, of audio simulated by the various distortion models and of the real noisy data are plotted in Fig. 2.

² https://guozixunnicolas.github.io/DENT-DDSP-demo/
Table 1: Simulation quality of DENT-DDSP using different amounts of training data with various s2t and model comparison

model            no. of trainable param.   s2t          amount of training data   MSSL
DENT-DDSP        2k                        [0.8, 1.0)   60-sec parallel           0.184
                                           [0.6, 0.8)   60-sec parallel           0.187
                                           [0.4, 0.6)   60-sec parallel           0.186
                                           [0.8, 1.0)   40-sec parallel           0.177
                                           [0.6, 0.8)   40-sec parallel           0.187
                                           [0.4, 0.6)   40-sec parallel           0.188
                                           [0.8, 1.0)   20-sec parallel           0.170
                                           [0.6, 0.8)   20-sec parallel           0.189
                                           [0.4, 0.6)   20-sec parallel           0.189
                                           [0.8, 1.0)   10-sec parallel           0.170
                                           [0.6, 0.8)   10-sec parallel           0.181
                                           [0.4, 0.6)   10-sec parallel           0.189
SimuGAN          14M                       -            10-min unparallel         0.173
Clean augment    non-trainable             -            -                         0.192
G.726 augment    non-trainable             -            -                         0.197
Codec2 augment   non-trainable             -            -                         0.197
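The MSSL figures in Table 1 refer to the multi-scale spectral loss of [14]; below is a minimal sketch of that style of metric. The FFT sizes, hop, window and log-term weighting are illustrative choices rather than the exact configuration used in the paper.

    import numpy as np

    def mag_stft(x, n_fft, hop):
        # Minimal magnitude STFT (Hann window, no padding).
        win = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop: i * hop + n_fft] * win for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1))

    def mssl(x, y, fft_sizes=(2048, 1024, 512, 256, 128, 64), alpha=1.0, eps=1e-7):
        # Multi-scale spectral loss: L1 distance between magnitude spectrograms
        # (linear and log) accumulated over several FFT resolutions.
        loss = 0.0
        for n_fft in fft_sizes:
            sx = mag_stft(x, n_fft, n_fft // 4)
            sy = mag_stft(y, n_fft, n_fft // 4)
            loss += np.mean(np.abs(sx - sy))
            loss += alpha * np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
        return loss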
Figure 2: Magnitude spectrogram comparison between clean audio, audio simulated from various distortion models and the real noisy data. (b) represents the real noisy data from Channel A of the RATs dataset.

3.3.2. Effectiveness of the companding operation

Table 2 shows the effectiveness of the proposed companding operation in the computationally-efficient DRC. With decreasing ds_factor for the best DENT-DDSP obtained, the training time increases roughly linearly, yet there is no obvious improvement in MSSL. This justifies the effectiveness of the companding operation, which increases efficiency while maintaining the generation quality.

Table 2: Training time and MSSL comparison among DENT-DDSPs with different ds_factors

ds_factor   training time   MSSL
16          7 min           0.170
8           13.5 min        0.169
4           25.5 min        0.170
2           60.2 min        0.169
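To relate Table 2 to Equation (6): in the computationally-efficient DRC the gain curve is computed on a ds_factor-times shorter signal and then upsampled, so the per-sample gain computation shrinks by roughly ds_factor. A hypothetical sketch (the gain computer itself is not reproduced here):

    import numpy as np

    def companded_gain_db(level_db, ds_factor, gain_fn):
        # Compute the (relatively expensive) gain curve only on a decimated level
        # signal, then upsample it back to the full rate as in Eq. (6).
        g_ds = gain_fn(level_db[::ds_factor])          # gain at the reduced rate
        return np.repeat(g_ds, ds_factor)[: len(level_db)]

Table 2 indicates that the largest factor tested (ds_factor = 16) already matches the MSSL of the smaller factors while cutting training time, which is the trade-off this downsampling exposes.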
3.3.3. Generalization ability

To reflect the generalization ability of the trainable distortion models, DENT-DDSP and SimuGAN, non-speech audio is used to test the distortion models. The non-speech audio (available online²) includes ambient noise, guitar and piano music, and synthesized audio. The spectrograms of the testing non-speech audio and of the simulated noisy audio from the two models are shown in Figure 3. By observation, DENT-DDSP simulated data have consistent noise spectra and the audio is distorted to the desired spectral behaviour: the spectrum becomes blurred with certain frequencies being dampened. Yet, SimuGAN simulated audio fails to produce such consistent distortion behaviour. Moreover, SimuGAN simulated results either contain artefacts or become unintelligible. Hence, DENT-DDSP shows stronger generalization ability than SimuGAN due to the use of explainable DDSP components.

Figure 3: Noisy data simulated by DENT-DDSP and SimuGAN using unseen non-speech data
3.3.4. Noise-robust automatic speech recognition (ASR)

Table 3 shows the WER comparison using the same dual-path ASR model trained with simulated data from different distortion models and with real noisy data. Data simulated by the DENT augment model are obtained by setting λ = 0.79, 1.26 (see Equation 1) in a trained DENT-DDSP model so that the noise gain of n_out(t) becomes ±2 dB. Additionally, a clean model is trained using only real clean speeches. The DENT augment model achieves the lowest WER among all baselines and outperforms the best baseline model, SimuGAN, by 7.3%. Moreover, it achieves a WER close to that of the upper-bound model, which is trained using in-domain real noisy data, with a difference of 2.7%. This proves that DENT-DDSP is able to simulate noisy data with similar characteristics to the real noisy data and enhance ASR systems under noisy conditions.

Table 3: WER comparison using the same ASR model with simulated data from different distortion models and real noisy data

model            noisy speech source                      WER
Clean            -                                        93.4%
Clean augment    clean speech + aggregated noise          73.8%
G.726 augment    g726 codec speech + aggregated noise     75.6%
Codec2 augment   codec2 speech + aggregated noise         83.2%
SimuGAN          SimuGAN simulated                        65.9%
DENT             DENT simulated                           66.4%
DENT augment     DENT simulated with noise gain = ±2dB    58.6%
Upper bound      real noisy data                          55.9%
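As a quick check of the ±2 dB setting, assuming λ in Equation 1 scales the noise amplitude linearly (the equation itself is not reproduced in this excerpt), the corresponding gain in dB is 20·log10(λ):

    import numpy as np

    # 1.26 and 0.79 correspond to roughly +2 dB and -2 dB of noise gain.
    for lam in (1.26, 0.79):
        print(f"lambda = {lam}: {20 * np.log10(lam):+.2f} dB")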
4. Conclusion

We propose DENT-DDSP, a fully explainable and controllable model based on novel DDSP components, which utilizes only 10 seconds of training data to achieve explicit distortion modelling. Besides the existing DDSP components, we propose two novel DDSP components, the waveshaper and the computationally-efficient DRC, which can be easily integrated into other DDSP models. Experiments have shown that data simulated by DENT-DDSP achieve the highest simulation quality among all distortion models. Moreover, the noise-robust ASR system trained with DENT-DDSP simulated data achieves a WER within 2.7% of the same model trained with real noisy data. It also achieves a 7.3% WER improvement over the best baseline distortion model.
5. References

[1] D. Graff, K. Walker, S. Strassel, X. Ma, K. Jones, and A. Sawyer, “The RATS collection: Supporting HLT research with degraded audio data,” in Proc. of the 9th Int. Conf. on Lang. Resour. and Eval. (LREC), 2014, pp. 1970–1977.
[2] S. Badrinath and H. Balakrishnan, “Automatic speech recognition
for air traffic control communications,” Transportation Research
Record, vol. 2676, no. 1, pp. 798–810, 2022.
[3] T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The air-
bus air traffic control speech recognition 2018 challenge: towards
atc automatic transcription and call sign detection,” in Proc. Int.
Speech Commun. Assoc.(Interspeech), 2018, pp. 2993–2997.
[4] J. Godfrey, E. Holliman, and J. McDaniel, “Switchboard: tele-
phone speech corpus for research and development,” in Proc.
IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1992, pp. 517–520 vol.1.
[5] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of
noise-robust automatic speech recognition,” IEEE ACM Trans. on
Audio, Speech, and Lang. Process., vol. 22, no. 4, pp. 745–777,
2014.
[6] M. J. F. Gales, “Model-based techniques for noise robust speech
recognition,” in Ph.D. thesis, University of Cambridge, 1995.
[7] Y. Minami and S. Furui, “A maximum likelihood procedure for
a universal adaptation method based on hmm composition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1995, pp. 129–132.
[8] A. Sankar and C.-H. Lee, “Robust speech recognition based on
stochastic matching,” in Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), vol. 1, 1995, pp. 121–
124.
[9] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “Hmm adapta-
tion using vector taylor series for noisy speech recognition,” 2000,
pp. 869–872.
[10] D. Ma, G. Li, H. Xu, and E. S. Chng, “Improving code-switching
speech recognition with data augmentation and system combi-
nation,” in Proc. Asia-Pacific Signal and Inf. Process. Assoc.
Annu. Summit and Conf. (APSIPA ASC), 2019, pp. 1308–1312.
[11] M. Ferràs, S. Madikeri, P. Motlicek, S. Dey, and H. Bourlard, “A
large-scale open-source acoustic simulator for speaker recogni-
tion,” IEEE Signal Processing Letters, vol. 23, no. 4, pp. 527–531,
2016.
[12] C. Chen, N. Hou, Y. Hu, S. Shirol, and E. S. Chng, “Noise-robust
speech recognition with 10 minutes unparalleled in-domain data,”
in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process-
ing (ICASSP), 2022.
[13] H. Hu, T. Tan, and Y. Qian, “Generative adversarial networks
based data augmentation for noise robust speech recognition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), 2018, pp. 5044–5048.
[14] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: differ-
entiable digital signal processing,” in Proc. Int. Conf. on Learn.
Representations(ICLR), 2020.
[15] G. Fabbro, V. Golkov, T. Kemp, and D. Cremers, “Speech
synthesis and control using differentiable dsp,” 2020. [Online].
Available: https://arxiv.org/abs/2010.15084
[16] J. Alonso and C. Erkut, “Latent space explorations of singing
voice synthesis using ddsp,” in Proc. of 18th Sound and Music
Computing Conf., 2021.
[17] D. Giannoulis, M. Massberg, and J. Reiss, “Digital dynamic range
compressor design—a tutorial and analysis,” J. of the Audio Eng.
Soc.(AES), vol. 60, pp. 399–408, 2012.
[18] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Int. Speech Commun. Assoc. (Interspeech), 2020, pp. 5036–5040.
[19] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, “A comparative study on transformer vs rnn in speech applications,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019.