We then formulate the DDSP components in the signal chains and introduce their functionalities. For simplicity, we represent each component's input as x(t) and output as y(t). x_dB(t) and y_dB(t) represent the input and output in the dB scale.

2.1. Waveshaper

The waveshaper aims to introduce a distortion effect to the input signal as shown in Equation 2. The distortion gain g_distort represents the amount of distortion added to the input signal. With g_distort close to 0, the transfer function of the waveshaper is almost linear, hence little distortion is applied. However, with an increasing g_distort, the transfer function becomes non-linear and saturates quickly. Such non-linearity introduces the distortion to the input signal. The distortion gain g_distort is set to be a trainable parameter.

g_es(t) = upsample(g_ds(t), us_factor)        (6)

y_dB(t) = x_dB(t) + g_es(t) + g_makeup        (7)

2.3. Equalizer (EQ)

By observation, the VHF/UHF transmitted data are band-limited and equalized in the frequency domain. To achieve equalization, we adopt the equalizer (EQ) implementation from [14], where the EQ is implemented as a linear time-invariant FIR filter (LTI-FIR). For each EQ, the trainable parameters are set to be the magnitude of the filter frequency response FR_mag, and the number of frequency bins is set to 1000. Additionally, for the EQ in the noise signal chain, the noise amplitude is made trainable in order to adjust the volume of the filtered noise.
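To make the signal chain concrete, here is a minimal NumPy sketch of the waveshaper, the dB-domain gain application of Equations (6)-(7), and the LTI-FIR EQ. The tanh-style transfer curve, the zero-order-hold upsampling and the FIR construction from FR_mag are illustrative assumptions; Equation 2 and the paper's exact implementations are not reproduced in this excerpt.

    import numpy as np

    def waveshaper(x, g_distort):
        # Illustrative tanh-style transfer curve (Sec. 2.1): near-linear for small
        # g_distort, saturating (hence distorting) as g_distort grows.
        return np.tanh(g_distort * x) / np.tanh(g_distort)

    def apply_smoothed_gain_db(x_db, g_ds, us_factor, g_makeup):
        # Eq. (6): upsample the downsampled gain curve back to the signal rate
        # (zero-order hold used here for simplicity).
        g_es = np.repeat(g_ds, us_factor)[: len(x_db)]
        # Eq. (7): add the smoothed gain and the make-up gain in the dB domain.
        return x_db + g_es + g_makeup

    def lti_fir_eq(x, fr_mag):
        # Sec. 2.3: turn a (trainable) magnitude response, e.g. 1000 bins, into a
        # roughly linear-phase FIR kernel and filter the signal with it.
        kernel = np.fft.irfft(fr_mag)
        kernel = np.roll(kernel, len(kernel) // 2)  # centre the impulse response
        return np.convolve(x, kernel, mode="same")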
3.2. Experiment setup

3.2.1. Training data selection and model comparison

To obtain the best-performing distortion model, several models are trained and compared using different amounts of training data, from 10 seconds to 60 seconds, with various speech-to-total ratios (s2t). s2t is defined as the duration of active speech over the total duration of each 1-second chunk. The active speech duration is calculated from the clean audio energy in a sliding window with a threshold of 50 dB. Analysis of the dataset shows that with s2t < 0.4 the data contain mostly non-speech audio or silence. Hence, we only select data with 0.4 ≤ s2t < 1.0. To acquire the desired training data, 1-second data chunks with the desired s2t are randomly chosen and aggregated to the desired total duration (a code sketch of this chunk selection follows the model list below). To evaluate the simulation quality, testing clean speeches are input to the trained distortion model and the MSSL [14] is calculated between the real and simulated noisy data. We also compare the proposed DENT-DDSP with the following distortion models in terms of simulation quality:

• Clean augment model [10]: simulated noisy speeches are obtained by adding the aggregated stationary noise from RATs Channel A to the clean speeches.

• G.726 augment model [11]: the clean speeches are first passed through the G.726 codec. The aggregated stationary noise from RATs Channel A is subsequently added to the codec output.

• Codec2 augment model [11]: we replace the codec in the G.726 augment model with codec2 at 700 bits/s.

• SimuGAN [12]: simulated noisy speeches are obtained by feeding the clean speeches to a trained SimuGAN.
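As a hypothetical illustration of the chunk selection described in Section 3.2.1, the sketch below marks a short frame as active speech when its energy lies within 50 dB of the loudest frame, computes s2t for each 1-second chunk, and aggregates randomly chosen chunks. The frame length and the exact thresholding convention are assumptions, not taken from the paper.

    import numpy as np

    def s2t_per_chunk(clean, sr, frame_sec=0.02, thresh_db=50.0):
        # Frame energy in dB; a frame counts as active speech if it is within
        # thresh_db of the loudest frame (one plausible reading of the 50 dB
        # threshold above). Returns the active-speech ratio of each 1-s chunk.
        hop = int(frame_sec * sr)
        n_frames = len(clean) // hop
        frames = clean[: n_frames * hop].reshape(n_frames, hop)
        energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        active = energy_db > energy_db.max() - thresh_db
        per_sec = int(round(sr / hop))
        n_chunks = n_frames // per_sec
        return active[: n_chunks * per_sec].reshape(n_chunks, per_sec).mean(axis=1)

    def aggregate_chunks(clean, sr, target_sec, s2t_lo=0.8, s2t_hi=1.0, seed=0):
        # Randomly pick 1-second chunks whose s2t falls in [s2t_lo, s2t_hi) and
        # concatenate them until the desired total duration is reached.
        rng = np.random.default_rng(seed)
        s2t = s2t_per_chunk(clean, sr)
        ok = np.flatnonzero((s2t >= s2t_lo) & (s2t < s2t_hi))
        picked = rng.choice(ok, size=target_sec, replace=False)
        return np.concatenate([clean[i * sr:(i + 1) * sr] for i in picked])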
3.2.2. Noise-robust automatic speech recognition (ASR)

A downstream noise-robust ASR task is performed to validate the effectiveness of DENT-DDSP. The noise-robust ASR models are trained using simulated data generated by the distortion models as well as real noisy data. We take the 57.4-hour parallel data from Channel A and split them into 3 folds: 44.3 hours for training, 4.9 hours for validation and 8.2 hours for testing. During testing, the ASR models are evaluated using real noisy data from the test set and the WERs are compared. We adopt the conformer-based [18] dual-path ASR system from [12] with 12 conformer layers in the encoder and 6 transformer layers in the decoder. The dual-path ASR system is specifically designed for noise-robust ASR. Parallel data containing clean and noisy speeches are fed to the clean and noisy ASR paths, which share parameters. The KL-divergence between the two paths is optimized during training (a sketch of one such objective is given below). We pre-train a language model [19] with 2 RNN layers using existing transcribed text and utilize it during decoding.
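The dual-path objective itself is defined in [12]; as a rough, hypothetical sketch of how a KL term between the two paths could be combined with the per-path ASR losses (the weighting and the symmetrisation are assumptions, not taken from [12]):

    import torch.nn.functional as F

    def dual_path_loss(logits_clean, logits_noisy, asr_loss_clean, asr_loss_noisy,
                       kl_weight=1.0):
        # ASR losses from the clean and noisy paths plus a symmetric KL term that
        # pulls the two output distributions together (illustrative only).
        log_p_clean = F.log_softmax(logits_clean, dim=-1)
        log_p_noisy = F.log_softmax(logits_noisy, dim=-1)
        kl = 0.5 * (F.kl_div(log_p_noisy, log_p_clean, log_target=True,
                             reduction="batchmean")
                    + F.kl_div(log_p_clean, log_p_noisy, log_target=True,
                               reduction="batchmean"))
        return asr_loss_clean + asr_loss_noisy + kl_weight * kl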
3.3. Results

3.3.1. Effect of different amounts of training data with various s2t and model comparison

Table 1 shows the MSSL between the real and simulated audio using different amounts of training data with various s2t. In general, MSSL decreases with smaller amounts of training data and increasing s2t. The best DENT-DDSP model is hence obtained using 10 seconds of training data with 0.8 ≤ s2t < 1.0. Comparing DENT-DDSP with the other distortion models, we observe that DENT-DDSP is trainable, contains far fewer trainable parameters and requires less training data. Moreover, noisy audio simulated by DENT-DDSP has lower MSSL, which indicates that DENT-DDSP has the best simulation quality among all baseline models. More listening samples are available online². To visualize the simulation quality, example spectrograms of the clean audio, of audio simulated by the various distortion models and of the real noisy data are plotted in Fig. 2.

² https://guozixunnicolas.github.io/DENT-DDSP-demo/
Table 1: Simulation quality of DENT-DDSP using different amounts of training data with various s2t and model comparison

model            no. of trainable param.   s2t          amount of training data   MSSL
DENT-DDSP        2k                        [0.8, 1.0)   60-sec parallel           0.184
                                           [0.6, 0.8)   60-sec parallel           0.187
                                           [0.4, 0.6)   60-sec parallel           0.186
                                           [0.8, 1.0)   40-sec parallel           0.177
                                           [0.6, 0.8)   40-sec parallel           0.187
                                           [0.4, 0.6)   40-sec parallel           0.188
                                           [0.8, 1.0)   20-sec parallel           0.170
                                           [0.6, 0.8)   20-sec parallel           0.189
                                           [0.4, 0.6)   20-sec parallel           0.189
                                           [0.8, 1.0)   10-sec parallel           0.170
                                           [0.6, 0.8)   10-sec parallel           0.181
                                           [0.4, 0.6)   10-sec parallel           0.189
SimuGAN          14M                       -            10-min unparallel         0.173
Clean augment    non-trainable             -            -                         0.192
G.726 augment    non-trainable             -            -                         0.197
Codec2 augment   non-trainable             -            -                         0.197
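The MSSL figures in Table 1 refer to the multi-scale spectral loss of [14]; below is a minimal sketch of that style of metric. The FFT sizes, hop, window and log-term weighting are illustrative choices rather than the exact configuration used in the paper.

    import numpy as np

    def mag_stft(x, n_fft, hop):
        # Minimal magnitude STFT (Hann window, no padding).
        win = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop: i * hop + n_fft] * win for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1))

    def mssl(x, y, fft_sizes=(2048, 1024, 512, 256, 128, 64), alpha=1.0, eps=1e-7):
        # Multi-scale spectral loss: L1 distance between magnitude spectrograms
        # (linear and log) accumulated over several FFT resolutions.
        loss = 0.0
        for n_fft in fft_sizes:
            sx = mag_stft(x, n_fft, n_fft // 4)
            sy = mag_stft(y, n_fft, n_fft // 4)
            loss += np.mean(np.abs(sx - sy))
            loss += alpha * np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
        return loss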
Figure 2: Magnitude spectrogram comparison between clean audio, audio simulated from various distortion models and the real noisy data. (b) represents the real noisy data from Channel A of the RATs dataset.

3.3.2. Effectiveness of the companding operation

Table 2 shows the effectiveness of the proposed companding operation in the computationally-efficient DRC. With decreasing ds_factor for the best DENT-DDSP obtained, the training time increases roughly linearly, yet there is no obvious improvement in MSSL. This justifies the effectiveness of the companding operation, which increases efficiency while maintaining the generation quality.

Table 2: Training time and MSSL comparison among DENT-DDSPs with different ds_factors

ds_factor   training time   MSSL
16          7 min           0.170
8           13.5 min        0.169
4           25.5 min        0.170
2           60.2 min        0.169
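To relate Table 2 to Equation (6): in the computationally-efficient DRC the gain curve is computed on a ds_factor-times shorter signal and then upsampled, so the per-sample gain computation shrinks by roughly ds_factor. A hypothetical sketch (the gain computer itself is not reproduced here):

    import numpy as np

    def companded_gain_db(level_db, ds_factor, gain_fn):
        # Compute the (relatively expensive) gain curve only on a decimated level
        # signal, then upsample it back to the full rate as in Eq. (6).
        g_ds = gain_fn(level_db[::ds_factor])          # gain at the reduced rate
        return np.repeat(g_ds, ds_factor)[: len(level_db)]

Table 2 indicates that the largest factor tested (ds_factor = 16) already matches the MSSL of the smaller factors while cutting training time, which is the trade-off this downsampling exposes.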
3.3.3. Generalization ability

To reflect the generalization ability of the trainable distortion models, DENT-DDSP and SimuGAN, non-speech audio is used to test the distortion models. The non-speech audio (available online²) includes ambient noise, guitar and piano music, and synthesized audio. The spectrograms of the testing non-speech audio and of the simulated noisy audio from the two models are shown in Figure 3. By observation, DENT-DDSP simulated data have consistent noise spectra and the audio is distorted to the desired spectral behaviour: the spectrum becomes blurred with certain frequencies being dampened. Yet, SimuGAN simulated audio fails to produce such consistent distortion behaviour. Moreover, SimuGAN simulated results either contain artefacts or become unintelligible. Hence, DENT-DDSP shows stronger generalization ability than SimuGAN due to the use of explainable DDSP components.

Figure 3: Noisy data simulated by DENT-DDSP and SimuGAN using unseen non-speech data
3.3.4. Noise-robust automatic speech recognition (ASR)

Table 3 shows the WER comparison using the same dual-path ASR model trained with simulated data from different distortion models and with real noisy data. Data simulated by the DENT augment model are obtained by setting λ = 0.79, 1.26 (see Equation 1) in a trained DENT-DDSP model so that the noise gain of n_out(t) becomes ±2 dB. Additionally, a clean model is trained using only real clean speeches. The DENT augment model achieves the lowest WER among all baselines and outperforms the best baseline model, SimuGAN, by 7.3%. Moreover, it achieves a WER close to that of the upper-bound model, which is trained using in-domain real noisy data, with a difference of 2.7%. This proves that DENT-DDSP is able to simulate noisy data with similar characteristics to the real noisy data and enhance ASR systems under noisy conditions.

Table 3: WER comparison using the same ASR model with simulated data from different distortion models and real noisy data

model            noisy speech source                      WER
Clean            -                                        93.4%
Clean augment    clean speech + aggregated noise          73.8%
G.726 augment    g726 codec speech + aggregated noise     75.6%
Codec2 augment   codec2 speech + aggregated noise         83.2%
SimuGAN          SimuGAN simulated                        65.9%
DENT             DENT simulated                           66.4%
DENT augment     DENT simulated with noise gain = ±2dB    58.6%
Upper bound      real noisy data                          55.9%
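As a quick check of the ±2 dB setting, assuming λ in Equation 1 scales the noise amplitude linearly (the equation itself is not reproduced in this excerpt), the corresponding gain in dB is 20·log10(λ):

    import numpy as np

    # 1.26 and 0.79 correspond to roughly +2 dB and -2 dB of noise gain.
    for lam in (1.26, 0.79):
        print(f"lambda = {lam}: {20 * np.log10(lam):+.2f} dB")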
4. Conclusion

We propose DENT-DDSP, a fully explainable and controllable model based on novel DDSP components, which utilizes only 10 seconds of training data to achieve explicit distortion modelling. Besides the existing DDSP components, we propose two novel DDSP components, the waveshaper and the computationally-efficient DRC, which can be easily integrated into other DDSP models. Experiments have shown that data simulated by DENT-DDSP achieve the highest simulation quality among all distortion models. Moreover, the noise-robust ASR system trained with DENT-DDSP simulated data achieves a WER within 2.7% of the same model trained with real noisy data. It also achieves a 7.3% WER improvement over the best baseline distortion model.
5. References

[1] D. Graff, K. Walker, S. Strassel, X. Ma, K. Jones, and A. Sawyer, “The RATS collection: Supporting HLT research with degraded audio data,” in Proc. of the 9th Int. Conf. on Lang. Resour. and Eval. (LREC), 2014, pp. 1970–1977.
[2] S. Badrinath and H. Balakrishnan, “Automatic speech recognition
for air traffic control communications,” Transportation Research
Record, vol. 2676, no. 1, pp. 798–810, 2022.
[3] T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The air-
bus air traffic control speech recognition 2018 challenge: towards
atc automatic transcription and call sign detection,” in Proc. Int.
Speech Commun. Assoc.(Interspeech), 2018, pp. 2993–2997.
[4] J. Godfrey, E. Holliman, and J. McDaniel, “Switchboard: tele-
phone speech corpus for research and development,” in Proc.
IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1992, pp. 517–520 vol.1.
[5] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of
noise-robust automatic speech recognition,” IEEE ACM Trans. on
Audio, Speech, and Lang. Process., vol. 22, no. 4, pp. 745–777,
2014.
[6] M. J. F. Gales, “Model-based techniques for noise robust speech
recognition,” in Ph.D. thesis, University of Cambridge, 1995.
[7] Y. Minami and S. Furui, “A maximum likelihood procedure for
a universal adaptation method based on hmm composition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), vol. 1, 1995, pp. 129–132.
[8] A. Sankar and C.-H. Lee, “Robust speech recognition based on
stochastic matching,” in Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), vol. 1, 1995, pp. 121–
124.
[9] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “Hmm adapta-
tion using vector taylor series for noisy speech recognition,” 2000,
pp. 869–872.
[10] D. Ma, G. Li, H. Xu, and E. S. Chng, “Improving code-switching
speech recognition with data augmentation and system combi-
nation,” in Proc. Asia-Pacific Signal and Inf. Process. Assoc.
Annu. Summit and Conf. (APSIPA ASC), 2019, pp. 1308–1312.
[11] M. Ferràs, S. Madikeri, P. Motlicek, S. Dey, and H. Bourlard, “A
large-scale open-source acoustic simulator for speaker recogni-
tion,” IEEE Signal Processing Letters, vol. 23, no. 4, pp. 527–531,
2016.
[12] C. Chen, N. Hou, Y. Hu, S. Shirol, and E. S. Chng, “Noise-robust
speech recognition with 10 minutes unparalleled in-domain data,”
in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process-
ing (ICASSP), 2022.
[13] H. Hu, T. Tan, and Y. Qian, “Generative adversarial networks
based data augmentation for noise robust speech recognition,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), 2018, pp. 5044–5048.
[14] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: differ-
entiable digital signal processing,” in Proc. Int. Conf. on Learn.
Representations(ICLR), 2020.
[15] G. Fabbro, V. Golkov, T. Kemp, and D. Cremers, “Speech
synthesis and control using differentiable dsp,” 2020. [Online].
Available: https://arxiv.org/abs/2010.15084
[16] J. Alonso and C. Erkut, “Latent space explorations of singing
voice synthesis using ddsp,” in Proc. of 18th Sound and Music
Computing Conf., 2021.
[17] D. Giannoulis, M. Massberg, and J. Reiss, “Digital dynamic range
compressor design—a tutorial and analysis,” J. of the Audio Eng.
Soc.(AES), vol. 60, pp. 399–408, 2012.
[18] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Int. Speech Commun. Assoc. (Interspeech), 2020, pp. 5036–5040.
[19] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, “A comparative study on transformer vs rnn in speech applications,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019.