Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Rongjie Huang * 1  Jiawei Huang * 1  Dongchao Yang * 2  Yi Ren 3  Luping Liu 1  Mingze Li 1  Zhenhui Ye 1
Jinglin Liu 1  Xiang Yin 3  Zhou Zhao 1
2022), it achieves a deep level of language understanding with high-fidelity generation.

While conceptually simple and easy to train, Make-An-Audio yields surprisingly strong results. Both subjective and objective evaluations demonstrate that Make-An-Audio achieves a new state-of-the-art in text-to-audio with natural and controllable synthesis. Make-An-Audio exhibits superior audio quality and text-audio alignment faithfulness on the benchmark AudioCaption dataset, and even generalizes well to the unsupervised Clotho dataset in a zero-shot fashion.

For the first time, we contextualize the need for audio generation with different input modalities. Besides natural language, Make-An-Audio generalizes well to multiple user-defined input modalities (audio, image, and video), which empowers humans to create rich and diverse audio content and opens up a host of applications for personalized transfer and fine-grained control.

Key contributions of the paper include:

• We present Make-An-Audio – an effective method that leverages latent diffusion with a spectrogram autoencoder to model long continuous waveforms.

• We introduce pseudo prompt enhancement with a distill-then-reprogram approach, which covers a large number of concept compositions by opening up the usage of language-free audios to alleviate data scarcity.

• We investigate textual representation and emphasize the advantages of contrastive language-audio pretraining for a deep understanding of natural language with computational efficiency.

• We evaluate Make-An-Audio and present state-of-the-art quantitative results together with a thorough evaluation of qualitative findings.

• We generalize the powerful model to X-to-Audio generation, for the first time unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input.

2. Related Works

2.1. Text-Guided Image/Video Synthesis

With the rapid development of deep generative models, text-guided synthesis has been widely studied in images and videos. The pioneering work of DALL-E (Ramesh et al., 2021) encodes images into discrete latent tokens using VQ-VAE (Van Den Oord et al., 2017) and considers T2I generation as a sequence-to-sequence translation problem. More recently, impressive visual results have been achieved by leveraging large-scale diffusion models. GLIDE (Nichol et al., 2021) trains a T2I upsampling model for cascaded generation. Imagen (Saharia et al., 2022) presents T2I with an unprecedented degree of photorealism and a deep level of language understanding. Stable Diffusion (Rombach et al., 2022) utilizes latent-space diffusion instead of pixel-space diffusion to improve computational efficiency. A large body of work also explores the usage of T2I models for video generation. CogVideo (Hong et al., 2022) is built on top of a CogView2 (Ding et al., 2022) T2I model with a multi-frame-rate hierarchical training strategy. Make-A-Video (Singer et al., 2022) extends a diffusion-based T2I model to T2V through a spatiotemporally factorized diffusion model. Moving beyond visual generation, our approach aims to generate high-fidelity audio from arbitrary natural language, which has been relatively overlooked.

2.2. Text-Guided Audio Synthesis

While there is remarkable progress in text-guided visual generation, text-to-audio (T2A) generation lags behind, mainly for two reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous waveform data. DiffSound (Yang et al., 2022) is the first to explore text-to-audio generation with a discrete diffusion process that operates on audio codes obtained from a VQ-VAE, leveraging masked text generation with CLIP representations. AudioLM (Borsos et al., 2022) introduces the discretized activations of a masked language model pre-trained on audio and generates syntactically plausible speech or music.

Very recently, the concurrent work AudioGen (Kreuk et al., 2022) proposes to generate audio samples autoregressively conditioned on text inputs, while our proposed method differs from it in the following ways: 1) we introduce pseudo prompt enhancement and leverage the power of contrastive language-audio pre-training and diffusion models for high-fidelity generation; 2) we predict continuous spectrogram representations, significantly improving computational efficiency and reducing training costs.

2.3. Audio Representation Learning

Different from modeling fine-grained details of the signal, the usage of high-level self-supervised learning (SSL) (Baevski et al., 2020; Hsu et al., 2021; He et al., 2022) has been shown to effectively reduce the sampling space of generative algorithms. Inspired by vector quantization (VQ) techniques, SoundStream (Zeghidour et al., 2021) presents a hierarchical architecture for high-level representations that carry semantic information. Data2vec (Baevski et al., 2022) uses a fast convolutional decoder and explores contextualized target representations in a self-supervised manner.
[Figure 2 diagram: the audio encoder E maps the mel-spectrogram x to a latent z0; the diffusion process q(xt | xt−1) and the denoising process pθ(xt−1 | xt), realized by a text-conditioned U-Net with cross-attention over the Transformer text encoder output (repeated ×N), recover z0, which the decoder G and the vocoder turn into the generated audio.]
Figure 2. A high-level overview of Make-An-Audio. Note that some modules (marked with a lock) are frozen when training the T2A model.
[Figure 3 diagram: in expert distillation, audio captioning and audio-text retrieval experts produce candidate captions c̃ for a language-free audio clip, and CLAPS selects the best-aligned one (e.g., “Rain falls softly in the distance”); in dynamic reprogramming, events sampled from the database (e.g., birds, footsteps) are composed with the base sample to form “Rain falls softly in the distance before hearing sounds of birds and footsteps”.]
Figure 3. The process of pseudo prompt enhancement. Our semi-parametric diffusion model consists of a fixed expert distillation and a
dynamic reprogramming stage. The database D contains audio examples with a sampling strategy ξ to create unseen object compositions.
We use CLAPS to denote the CLAP selection.
Recently, spectrogram autoencoders (akin to autoencoders over 1-channel 2D images) (Gong et al., 2022; He et al., 2022) with a reconstruction objective as self-supervision have demonstrated the effectiveness of heterogeneous image-to-audio transfer, advancing the field of speech and audio processing on a variety of downstream tasks. Among these approaches, Xu et al. (2022) apply Masked Autoencoders (MAE) (He et al., 2022) to self-supervised representation learning from audio spectrograms. Gong et al. (2022) adopt an audio spectrogram transformer with joint discriminative and generative masked spectrogram modeling. Inspired by these, we inherit the recent success of spectrogram SSL in the frequency domain, which guarantees efficient compression and high-level semantic understanding.

3. Make-An-Audio

In this section, we first overview the Make-An-Audio framework and illustrate pseudo prompt enhancement to better align text and audio semantics, following which we introduce textual and audio representations for multimodal learning. Together with the power of diffusion models with classifier-free guidance, Make-An-Audio exhibits high-fidelity synthesis with superior generalization.

3.1. Overview

Deep generative models have achieved leading performances in text-guided visual synthesis. However, the current development of text-to-audio (T2A) generation is hampered by two major challenges: 1) model training is faced with data scarcity, as human-labeled audios are expensive to create and few audio resources provide natural language descriptions; 2) modeling long continuous waveforms (e.g., typically 16,000 data points for a 1-second 16 kHz waveform) poses a challenge for all high-quality neural synthesizers.

As illustrated in Figure 2, Make-An-Audio consists of the following main components: 1) the pseudo prompt enhancement to alleviate the issue of data scarcity, opening up the usage of orders of magnitude more language-free audios; 2) a spectrogram autoencoder for predicting self-supervised representations instead of long continuous waveforms; 3) a diffusion model that maps natural language to latent representations with the power of contrastive language-audio pretraining (CLAP); and 4) a separately-trained neural vocoder to convert mel-spectrograms to raw waveforms. In the following sections, we describe these components in detail.

3.2. Pseudo Prompt Enhancement: Distill-then-Reprogram

To mitigate data scarcity, we propose to construct prompts aligned well with audios, enabling a better understanding of the text-audio dynamics from orders of magnitude more unsupervised data. As illustrated in Figure 3, it consists of two stages: an expert distillation approach to produce prompts aligned with audio, and a dynamic reprogramming procedure to construct a variety of concept compositions.

3.2.1. Expert Distillation

We consider pre-trained automatic audio captioning (Xu et al., 2020) and audio-text retrieval (Deshmukh et al., 2022; Koepke et al., 2022) systems as our experts for prompt generation. Captioning models aim to generate diverse natural language sentences to describe the content of audio clips. Audio-text retrieval takes a natural language query to retrieve relevant audio files in a database. To this end, the experts jointly distill knowledge to construct a caption aligned with the audio, following which we select from these candidates the one with the highest CLAP (Elizalde et al., 2022) score as the final caption (we include a threshold to selectively consider faithful results). This simple yet effective procedure largely alleviates data scarcity issues and enables explicit generalization to different audio domains; we refer the reader to Section 6.3.2 for a summary of our findings. Details are attached in Appendix E.2.
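For illustration, the CLAP-based selection over expert candidates can be sketched as follows; the precomputed-embedding interface and the threshold value are assumptions rather than the released pipeline.

```python
# Sketch of the expert-distillation selection step (Sec. 3.2.1): candidate captions
# come from the captioning and retrieval experts; the one with the highest CLAP
# audio-text similarity is kept, subject to a minimum-score threshold.
# The embedding inputs and the 0.3 threshold are illustrative assumptions.
import numpy as np

def select_caption(audio_emb, caption_embs, captions, threshold=0.3):
    """Return the candidate caption best aligned with the audio, or None."""
    # Cosine similarity between the audio embedding and each caption embedding.
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ a
    best = int(np.argmax(scores))
    # Treat the clip as unusable (language-free) if no candidate is faithful enough.
    return captions[best] if scores[best] >= threshold else None
```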
3.2.2. Dynamic Reprogramming

To prevent overfitting and enable a better understanding of concept compositions, we introduce a dynamic reprogramming technique that constructs a variety of concept compositions. It proceeds in three steps, as illustrated in Figure 3: 1) we first prepare a sound event database D annotated with single labels; 2) each time, N concepts are sampled from the database D, where N ∈ {0, 1, 2}; 3) the original text-audio pair is randomly concatenated with the sampled events according to a template, constructing a new training example with varied concept compositions. This can be conducted online, significantly reducing the time consumed for data preparation. The reprogramming templates are attached in Appendix F.

3.3. Textual Representation

Text-guided synthesis models need powerful semantic text encoders to capture the meaning of arbitrary natural language inputs, which can be grouped into two major categories: 1) Contrastive pretraining. Similar to CLIP (Radford et al., 2021) pre-trained on image-text data, recent progress on contrastive language-audio pretraining (CLAP) (Elizalde et al., 2022) brings audio and text descriptions into a joint space and demonstrates remarkable zero-shot generalization to multiple downstream domains. 2) Large-scale language modeling (LLM). Saharia et al. (2022) and Kreuk et al. (2022) utilize language models (e.g., BERT (Devlin et al., 2018), T5 (Raffel et al., 2020)) for text-guided generation. Language models are trained on text-only corpora significantly larger than paired multimodal data, and are thus exposed to a rich distribution of text.

Following common practice (Saharia et al., 2022; Ramesh et al., 2022), we freeze the weights of these text encoders. We find that both CLAP and T5-Large achieve similar results on benchmark evaluation, while CLAP can be more efficient without the offline computation of embeddings required by an LLM. We refer the reader to Section 6.3.1 for a summary of our findings.

3.4. Audio Representation

Recently, spectrogram autoencoders (akin to autoencoders over 1-channel 2D images) (Gong et al., 2022; He et al., 2022) with a reconstruction objective as self-supervision have demonstrated the effectiveness of heterogeneous image-to-audio transfer, advancing the field of speech and audio processing on a variety of downstream tasks. The audio signal is a mel-spectrogram sample x ∈ [0, 1]^{Ca×T}, where Ca and T respectively denote the number of mel channels and the number of frames. Our spectrogram autoencoder is composed of 1) an encoder network E, which takes samples x as input and outputs latent representations z; 2) a decoder network G, which reconstructs the mel-spectrogram signal x′ from the compressed representation z; and 3) a multi-window discriminator Dis, which learns to distinguish the generated samples G(z) from real ones over different multi-receptive fields of mel-spectrograms.

The whole system is trained end-to-end to minimize 1) a reconstruction loss Lre, which improves training efficiency and the fidelity of the generated spectrograms; 2) GAN losses LGAN, where the discriminator and generator play an adversarial game; and 3) a KL-penalty loss LKL, which restricts the spectrogram encoder to learn a standard z and avoid arbitrarily high-variance latent spaces.

To this end, Make-An-Audio takes advantage of the spectrogram autoencoder to predict self-supervised representations instead of waveforms. It largely alleviates the challenge of modeling long continuous data and guarantees high-level semantic understanding.

3.5. Generative Latent Diffusion

We implement our method over Latent Diffusion Models (LDMs) (Rombach et al., 2022), a recently introduced class of Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) that operate in the latent space. It is conditioned on the textual representation, breaking the generation process into several conditional diffusion steps. The training loss is defined as the mean squared error in the noise space ε ∼ N(0, I), and efficient training amounts to optimizing a random term of t with stochastic gradient descent:

Lθ = ‖εθ(zt, t, c) − ε‖₂²,   (1)

where εθ denotes the denoising network and c denotes the textual condition. To conclude, the diffusion model can be efficiently trained by optimizing the ELBO without adversarial feedback, ensuring extremely faithful reconstructions that match the ground-truth distribution. The detailed formulation of DDPM is attached in Appendix D.
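For illustration, a minimal PyTorch-style sketch of one training step for the objective in Eq. (1) is given below; the denoiser stands in for the text-conditional U-Net, and the noise schedule is an assumed precomputed cumulative-product table, not the exact released implementation.

```python
# Sketch of one training step for the latent diffusion objective in Eq. (1).
# `denoiser(z_t, t, c)` stands in for the text-conditional U-Net; `alphas_cumprod`
# is a precomputed DDPM noise schedule. Both are assumptions for illustration.
import torch

def diffusion_loss(denoiser, z0, cond, alphas_cumprod):
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)                              # ε ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise    # forward diffusion q(z_t | z_0)
    eps_hat = denoiser(z_t, t, cond)                          # predict the added noise
    return torch.nn.functional.mse_loss(eps_hat, noise)       # ‖ε_θ(z_t, t, c) − ε‖²
```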
3.6. Classifier-Free Guidance

For classifier-free guidance (Dhariwal & Nichol, 2021; Ho & Salimans, 2022), by jointly training a conditional and an unconditional diffusion model, it is possible to combine the conditional and unconditional scores to attain a trade-off between sample quality and diversity. The textual condition in the latent diffusion model εθ(zt, t, c) is replaced by an empty prompt c∅ with a fixed probability during training. During sampling, the output of the model is extrapolated further in the direction of εθ(zt, t, c) and away from εθ(zt, t, c∅) with the guidance scale s ≥ 1:

ε̃θ(zt, t, c) = εθ(zt, t, c∅) + s · (εθ(zt, t, c) − εθ(zt, t, c∅)).   (2)

4. X-To-Audio: No Modality Left Behind

In this section, we generalize our powerful conditional diffusion model to X-To-Audio generation. For the first time, we contextualize the need for audio generation with different conditional modalities, including 1) text, 2) audio (inpainting), and 3) visual input. Make-An-Audio empowers humans to create rich and diverse audio content with unprecedented ease, unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input.

4.1. Personalized Text-To-Audio Generation

Adapting models (Chen et al., 2020b; Huang et al., 2022) to a specific individual or object is a long-standing goal in machine learning research. More recently, personalization (Gal et al., 2022; Benhamdi et al., 2017) efforts can be found in vision and graphics, which allow users to inject unique objects into new scenes, transform them across different styles, and even produce new products. For instance, when asked to generate “baby crying” given the initial sound of “thunder”, our model produces realistic and faithful audio describing “a baby cries in the thunder day”. Distinctly, this has a wide range of uses for audio mixing and tuning, e.g., adding background sound to an existing clip or editing audio by inserting a speaking object.

We investigate personalized text-to-audio generation via stochastic differential editing (Meng et al., 2021), which has been demonstrated to produce realistic samples with high-fidelity manipulation. Given input audio with a user guide (prompt), we select a particular time t0 with total denoising steps N, and add noise to the raw data z0 to obtain zT (T = t0 × N) according to Equation 4. It is then subsequently denoised through a reverse process parameterized by the shared θ to increase its realism according to Equation 6.

A trade-off between faithfulness (text-caption alignment) and realism (audio quality) can be witnessed: as T increases, a larger amount of noise is added to the initial audio, and the generated samples become more realistic but less faithful. We refer the reader to Figure 5 for a summary of our findings.

4.2. Audio Inpainting

Inpainting (Liu et al., 2020; Nazeri et al., 2019) is the task of filling masked regions of an audio clip with new content when parts of the audio are corrupted or undesired. Though diffusion-model inpainting can be performed by adding noise to the initial audio and sampling with SDEdit, it may result in undesired edge artifacts, since there can be an information loss during the sampling process (the model can only see a noised version of the context). To achieve better results, we explicitly fine-tune Make-An-Audio for audio inpainting.

During training, the way masks are generated greatly influences the final performance of the system. As such, we adopt the irregular masks (thick, medium, and thin masks) suggested by LaMa (Suvorov et al., 2022), which uniformly uses polygonal chains dilated by a high random width (wide masks) and rectangles of arbitrary aspect ratios (box masks). In addition, we investigate the frame-based masking strategy commonly adopted in the speech literature (Baevski et al., 2020; Hsu et al., 2021). It is implemented using the algorithm from wav2vec 2.0 (Baevski et al., 2020), where spans of a given length are masked with probability p.
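For illustration, the frame-based strategy can be sketched as follows; the span length and masking probability are illustrative values rather than the exact fine-tuning configuration.

```python
# Sketch of frame-based masking over a mel-spectrogram (Sec. 4.2).
# Each frame is chosen as a span start with probability p and the following
# `span` frames are masked, loosely following wav2vec 2.0. Values are illustrative.
import numpy as np

def frame_mask(num_frames, p=0.065, span=10, rng=None):
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_frames, dtype=bool)
    starts = np.nonzero(rng.random(num_frames) < p)[0]
    for s in starts:
        mask[s:s + span] = True          # mask a contiguous span of frames
    return mask                          # True = masked (to be inpainted)

# Usage: zero out the masked frames of an 80 x 624 spectrogram before fine-tuning.
spec = np.random.rand(80, 624)
m = frame_mask(spec.shape[1])
spec_masked = np.where(m[None, :], 0.0, spec)
```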
4.3. Visual-To-Audio Generation

Recent advances in deep generative models have shown impressive results in visually-induced audio generation (Su et al., 2020; Gan et al., 2020), towards generating realistic audio that describes the content of images or videos: Hsu et al. (2020) show that spoken language can be learned by a visually-grounded generative model of speech. Iashin & Rahtu (2021) propose a multi-class visually guided sound synthesis that relies on a codebook prior-based transformer.

To pursue this research further, we extend Make-An-Audio to visual-to-audio generation. Given the lack of large-scale visual-audio datasets for image-to-audio (I2A) research, our main idea is to utilize contrastive language-image pretraining (CLIP) with a CLIP-guided T2A model and leverage textual representations to bridge the modality gap between the visual and audio worlds. As CLIP encoders embed images and text into a joint latent space, our T2A model provides a unique opportunity to visualize what the CLIP image encoder is seeing. Considering the complexity of V2A generation, it is natural to leverage image priors for videos to simplify the learning process. On this account, we uniformly pick 4 frames from the video and pool their CLIP image features to formulate the “averaged” video representation, which then reduces V2A to I2A generation.

[Figure 4: the visual-to-audio inference pipeline, in which pooled CLIP image features condition the pre-trained T2A model; the legend distinguishes pre-trained models, unused modules, and user-optional flows.]

To conclude, the visual-to-audio inference scheme is formulated in Figure 4. It significantly reduces the requirement for paired visual-audio datasets, and the plug-and-play module with pre-trained Make-An-Audio empowers humans to create rich and diverse audio content from the visual world.

5. Training and Evaluation

5.1. Dataset

We train on a combination of several datasets: AudioSet, BBC Sound Effects, Audiostock, AudioCaps-train, ESC-50, FSD50K, Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, MACS, Epidemic Sound, UrbanSound8K, WavText5Ks, LibriSpeech, and Medley-solos-DB. For audios without natural language annotation, we apply the pseudo prompt enhancement to construct captions aligned well with the audio. Overall, we have ∼3k hours with 1M audio-text pairs of training data. For evaluating text-to-audio models (Yang et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is adopted as the standard benchmark, which contains 494 samples with five human-annotated captions for each audio clip. For a more challenging zero-shot scenario, we also provide results on the Clotho (Drossos et al., 2020) validation set, which contains multiple audio events per clip. A more detailed data setup is attached in Appendix A.

We conduct preprocessing on the text and audio data: 1) convert the sampling rate of audios to 16 kHz and pad short clips to 10 seconds; 2) extract the spectrogram with an FFT size of 1024 and hop size of 256, and crop it to a mel-spectrogram of size 80 × 624; 3) normalize non-standard words (e.g., abbreviations, numbers, and currency expressions) and semiotic classes (Taylor, 2009) (text tokens that represent particular entities that are semantically constrained, such as measure phrases, addresses, and dates).

5.2. Model Configurations

We train a continuous autoencoder to compress the perceptual space with downsampling to a 4-channel latent representation, which balances efficiency and perceptually faithful results. For our main experiments, we train a U-Net (Ronneberger et al., 2015) based text-conditional diffusion model, which is optimized on 18 NVIDIA V100 GPUs for 2M optimization steps. The base learning rate is set to 0.005, and we scale it by the number of GPUs and the batch size following LDM. We utilize HiFi-GAN (Kong et al., 2020) (V1) trained on the VGGSound dataset (Chen et al., 2020a) as the vocoder to synthesize waveforms from the generated mel-spectrograms in all our experiments. Hyperparameters are included in Appendix B.

5.3. Evaluation Metrics

We evaluate models using objective and subjective metrics over audio quality and text-audio alignment faithfulness. Following common practice (Yang et al., 2022; Iashin & Rahtu, 2021), the key automated performance metrics used are melception-based (Koutini et al., 2021) FID (Heusel et al., 2017) and KL divergence to measure audio fidelity. Additionally, we introduce the CLAP score to measure audio-text alignment for this work. The CLAP score is adapted from the CLIP score (Hessel et al., 2021; Radford et al., 2021) to the audio domain and is a reference-free evaluation metric that closely correlates with human perception.

For subjective metrics, we use crowd-sourced human evaluation via Amazon Mechanical Turk, where raters are asked to rate MOS (mean opinion score) on a 20-100 Likert scale. We assess audio quality and text-audio alignment faithfulness by scoring MOS-Q and MOS-F, respectively, which are reported with 95% confidence intervals (CI). More information on evaluation is attached in Appendix C.

6. Results

6.1. Quantitative Results

Automatic Objective Evaluation  The objective evaluation comparison with the baseline DiffSound (the only publicly-available T2A generation model) is presented in Table 1, and we have the following observations: 1) In terms of audio quality, Make-An-Audio achieves the highest perceptual quality on AudioCaption with an FID of 4.61 and a KL of 2.79. For zero-shot generation, it also demonstrates results superior to the baseline model. 2) On text-audio similarity, Make-An-Audio scores the highest CLAP with a gap of 0.037 compared to the ground-truth audio, suggesting Make-An-Audio's ability to generate faithful audio that aligns well with descriptions.
Table 2. Audio inpainting evaluation with various masking strategies.

Table 3. Image/Video-to-audio evaluation.

Subjective Human Evaluation  The evaluation of T2A models is very challenging due to its subjective nature in perceptual quality, and thus we also include a human evaluation in Table 1: Make-An-Audio (CLAP) achieves the highest perceptual quality with a MOS-Q of 72.5 and a MOS-F of 78.6. It indicates that raters prefer our model's synthesis over the baselines in terms of audio naturalness and faithfulness.

For audio inpainting, we compare different masking designs, including the irregular (thick, medium, and thin) strategy from the visual world (Suvorov et al., 2022), as well as the frame-based (with varying p) strategy commonly used in speech (Baevski et al., 2020; Hsu et al., 2021). During evaluation, we randomly mask wide or narrow regions and utilize the FID and KL metrics to measure performance. The results are presented in Table 2, and we have the following observations: 1) For both the frame-based and irregular strategies, larger masked regions during training lead to improved perceptual quality, forcing the network to fully exploit the high receptive field of continuous spectrograms. 2) For a similar size of masked region, the frame-based strategy consistently outperforms the irregular one, suggesting that it is better to mask audio spectrograms in a way that is aligned along the time axis.

We also present our visual-to-audio generation results in Table 3. As can be seen, Make-An-Audio can generalize to a wide variety of images and videos. Leveraging contrastive pre-training, the model provides a high-level understanding of visual input and generates high-fidelity audio spectrograms well-aligned with their semantic meanings.

We present the classifier-free guidance trade-off curves between CLAP and FID scores in Figure 7. Consistent with the observations in Ho & Salimans (2022), the choice of the classifier guidance weight scales conditional against unconditional synthesis, offering a trade-off between sample faithfulness and realism with respect to the conditioning text.

For a better comparison in audio inpainting, we visualize different masking strategies and synthesis results in Figure 6. As can be seen, given initial audio with undesired content, our model correctly fills and reconstructs the audio, robust to different shapes of masked regions, suggesting that it is capable of a high-level understanding of audio content.

On personalized text-to-audio generation, we explore different t0 ∈ (0, 1) to add Gaussian noise and conduct reverse sampling. As shown in Figure 5, a trade-off between faithfulness (measured by the CLAP score) and realism (measured by the 1 − MSE distance) can be witnessed. We find that t0 ∈ [0.2, 0.5] works well for faithful guidance with realistic generation, suggesting that audio attributes (e.g., speed, timbre, and energy) can easily be destroyed as t0 increases.

Figure 5. We illustrate personalized text-to-audio results with various t0 initializations. t0 = 0 indicates the initial audio itself, whereas t0 = 1 indicates text-to-audio synthesis from scratch. For comparison, realism is measured by the 1 − MSE distance between the generated and initial audio, and faithfulness is measured by the CLAP score between the generated sample and the prompt. Prompt: A clock ticktocks.
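As a companion to Figure 5 and the procedure of Section 4.1, a minimal sketch of the editing loop is given below; `denoise_step` is a placeholder for one reverse step of the shared diffusion model, and the default t0 is illustrative.

```python
# Sketch of SDEdit-style personalized generation (Sec. 4.1): start the reverse
# process from a partially noised version of the initial audio latent instead of
# from pure noise. `denoise_step` is a placeholder for one p_theta(z_{t-1} | z_t) step.
import torch

@torch.no_grad()
def personalized_edit(denoise_step, z0_init, cond, alphas_cumprod, t0=0.4):
    N = alphas_cumprod.shape[0]
    T = int(t0 * N)                                   # smaller t0 -> closer to the input audio
    a_bar = alphas_cumprod[T - 1]
    z = a_bar.sqrt() * z0_init + (1 - a_bar).sqrt() * torch.randn_like(z0_init)
    for t in reversed(range(T)):                      # reverse process from step T down to 0
        z = denoise_step(z, t, cond)
    return z
```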
Figure 6. Qualitative results with our inpainting model: (a) Sample 1, (b) Sample 2, (c) masking strategy.

Figure 7. Classifier-free guidance trade-off curves.

6.3. Analysis and Ablation Studies

To verify the effectiveness of several designs in Make-An-Audio, including pseudo prompt enhancement and the textual and audio representations, we conduct ablation studies and discuss the key findings as follows. More analysis on audio representation is attached in Appendix E.1.

6.3.1. Textual Representation

We compare text encoders, including the large-scale language model T5-Large (Raffel et al., 2020), as well as the multimodal contrastive pre-trained encoders CLIP (Radford et al., 2021) and CLAP (Elizalde et al., 2022). We freeze the weights of the text encoders for T2A generation. For easy comparison, we present the results in Table 1 and have the following observations: 1) Since CLIP is introduced as a scalable approach for learning joint representations between text and images, it can be less useful in deriving semantic representations for T2A, in contrast to Yang et al. (2022). 2) CLAP and T5-Large achieve similar performances on the benchmark dataset, while CLAP can be more computationally efficient (with only 59% of the parameters) and does not require the offline computation of embeddings.

6.3.2. Pseudo Prompt Enhancement

To highlight the effectiveness of the proposed dynamic reprogramming strategy in creating unseen object compositions, we additionally train Make-An-Audio on the static training dataset and attach the results in Table 7 in Appendix E: 1) removing the dynamic reprogramming approach results in a slight drop in evaluation; 2) when migrating to the more challenging Clotho scenario in a zero-shot fashion, a significant degradation can be witnessed, demonstrating its effectiveness in constructing diverse object compositions for better generalization.

7. Conclusion

In this work, we presented Make-An-Audio with a prompt-enhanced diffusion model for text-to-audio generation. Leveraging prompt enhancement with the distill-then-reprogram approach, Make-An-Audio was endowed with various concept compositions from orders of magnitude more unsupervised data. We investigated textual representation and emphasized the advantages of contrastive pre-training for a deep understanding of natural language with computational efficiency. Both objective and subjective evaluations demonstrated that Make-An-Audio achieved new state-of-the-art results in text-to-audio with realistic and faithful synthesis. Make-An-Audio was the first attempt to generate high-definition, high-fidelity audio given a user-defined modality input, opening up a host of applications for personalized transfer and fine-grained control. We envisage that our work will serve as a basis for future audio synthesis studies.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

Huang, R., Ren, Y., Liu, J., Cui, C., and Zhao, Z. Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech synthesis. arXiv preprint arXiv:2205.07211, 2022.

Iashin, V. and Rahtu, E. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021.

Kim, C. D., Kim, B., Lee, H., and Kim, G. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 119–132, 2019.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31:10215–10224, 2018.

Koepke, A. S., Oncescu, A.-M., Henriques, J., Akata, Z., and Albanie, S. Audio retrieval with natural language queries: A benchmark study. IEEE Transactions on Multimedia, 2022.

Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646, 2020.

Koutini, K., Schlüter, J., Eghbal-zadeh, H., and Widmer, G. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.

Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., and Adi, Y. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.

Liu, H., Jiang, B., Song, Y., Huang, W., and Yang, C. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In European Conference on Computer Vision, pp. 725–741. Springer, 2020.

Martı́n-Morató, I. and Mesaros, A. What is the ground truth? reliability of multi-annotator data for audio tagging. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 76–80. IEEE, 2021.

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.

Nazeri, K., Ng, E., Joseph, T., Qureshi, F. Z., and Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Piczak, K. J. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015–1018, 2015.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Springer, 2015.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Salamon, J., Jacoby, C., and Bello, J. P. A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia (ACM-MM'14), pp. 1041–1044, Orlando, FL, USA, Nov. 2014.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S.,
Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-
video: Text-to-video generation without text-video data.
arXiv preprint arXiv:2209.14792, 2022.
Song, J., Meng, C., and Ermon, S. Denoising diffusion
implicit models. In Proc. of ICLR, 2020.
Van Den Oord, A., Vinyals, O., et al. Neural discrete rep-
resentation learning. Advances in neural information
processing systems, 30, 2017.
Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F.,
Feichtenhofer, C., et al. Masked autoencoders that listen.
arXiv preprint arXiv:2207.06405, 2022.
Xu, X., Dinkel, H., Wu, M., and Yu, K. A crnn-gru based
reinforcement learning approach to audio captioning. In
DCASE, pp. 225–229, 2020.
Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y.,
and Yu, D. Diffsound: Discrete diffusion model for text-
to-sound generation. arXiv preprint arXiv:2207.09983,
2022.
Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and
Tagliasacchi, M. Soundstream: An end-to-end neural
audio codec. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 30:495–507, 2021.
Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J.,
Jia, Y., Chen, Z., and Wu, Y. Libritts: A corpus de-
rived from librispeech for text-to-speech. arXiv preprint
arXiv:1904.02882, 2019.
Appendices
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models
A. Dataset

As shown in Table 5, we collect a large-scale audio-text dataset consisting of 1M audio samples with a total duration of
∼3k hours. It contains audio of human activities, natural sounds, and audio effects, consisting of several data sources from
publicly available websites. For audio with text descriptions, we download the parallel audio-text data. For audios without
natural language annotation (or with labels), we discard the corresponding class label (if any) and apply the pseudo prompt
enhancement to construct natural language descriptions aligned well with the audio.
As speech and music are the dominant classes in Audioset, we filter these samples to construct a more balanced dataset.
Overall we are left with 3k hours with 1M audio-text pairs for training data. For evaluating text-to-audio models (Yang
et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is the standard benchmark, which contains 494 samples
with five human-annotated captions for each audio clip. In both training and inference, we pad short clips to 10 seconds
and randomly crop a 624 × 80 mel-spectrogram from the 10-second 16 kHz audio.
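A minimal sketch of this preprocessing with torchaudio is given below; the log-compression step and the exact padding/cropping policy are assumptions, while the sample rate, FFT size, hop size, and mel dimension follow Section 5.1.

```python
# Sketch of the audio preprocessing in Sec. 5.1 / Appendix A: resample to 16 kHz,
# pad to 10 s, compute an 80-bin mel-spectrogram (FFT 1024, hop 256), and crop
# to the 80 x 624 input size. The log compression is an assumed detail.
import torch
import torchaudio

def preprocess(path, sr=16000, seconds=10):
    wav, in_sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), in_sr, sr)
    target = sr * seconds
    wav = torch.nn.functional.pad(wav, (0, max(0, target - wav.shape[-1])))[:, :target]
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80)(wav)
    mel = torch.log(mel.clamp(min=1e-5))   # assumed log compression
    return mel[0, :, :624]                 # crop to the 80 x 624 model input
```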
Table 5. Text-audio alignment CLAP score averaged across the single-label dataset.
B. Model Configurations
We list the model hyper-parameters of Make-An-Audio in Table 6.
Table 6. Hyperparameters of Make-An-Audio.

Spectrogram Autoencoder: Input/Output Channels 1; Hidden Channels 4; Residual Blocks 2; Spectrogram Size 80 × 624; Channel Mult [1, 2, 2, 4].
Denoising U-Net: Input/Output Channels 4; Model Channels 320; Attention Heads 8; Condition Channels 1024; Latent Size 10 × 78; Channel Mult [1, 2].
CLAP Text Encoder: Transformer Embed Channels 768; Output Project Channels 1024; Token Length 77.
Total Number of Parameters: 332M.
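For orientation, the shape-related entries can be tied together with a small sanity check: three downsampling stages of the autoencoder (channel multipliers [1, 2, 2, 4]) reduce the 80 × 624 spectrogram by a factor of 8 per axis to the 10 × 78 latent the U-Net operates on. The dictionary layout below is illustrative and not the released configuration format.

```python
# Illustrative grouping of the Table 6 hyperparameters; the dict layout is an
# assumption, but the shape arithmetic matches the table: three downsampling
# stages (channel mult [1, 2, 2, 4]) map 80 x 624 mels to a 4 x 10 x 78 latent.
autoencoder = {"in_channels": 1, "hidden_channels": 4, "residual_blocks": 2,
               "spec_size": (80, 624), "channel_mult": [1, 2, 2, 4]}
unet = {"in_channels": 4, "model_channels": 320, "attention_heads": 8,
        "condition_channels": 1024, "latent_size": (10, 78), "channel_mult": [1, 2]}

downsample = 2 ** (len(autoencoder["channel_mult"]) - 1)   # 3 halvings -> factor 8
assert tuple(s // downsample for s in autoencoder["spec_size"]) == unet["latent_size"]
```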
C. Evaluation
To probe audio quality, we conduct the MOS (mean opinion score) tests and explicitly instruct the raters to “focus on
examining the audio quality and naturalness.” The testers are presented with the samples, and each tester is asked to evaluate the subjective naturalness on a 20-100 Likert scale.
To probe text-audio alignment, human raters are shown an audio and a prompt and asked “Does the natural language
description align with audio faithfully?”. They must respond with “completely”, “mostly”, or “somewhat” on a 20-100
Likert scale.
Our subjective evaluation tests are crowd-sourced and conducted via Amazon Mechanical Turk. These ratings are obtained
independently for model samples and reference audio, and both are reported. The screenshots of instructions for testers have
been shown in Figure 8. We paid participants $8 per hour and spent about $750 in total on participant compensation. A
small subset of speech samples used in the test is available at https://Text-to-Audio.github.io/.
D. Detailed Formulation of DDPM

Unlike the diffusion process, the reverse process recovers samples from Gaussian noise. The reverse process is a
Markov chain from xT to x0 parameterized by shared θ:
pθ(x0, · · · , xT−1 | xT) = ∏_{t=1}^{T} pθ(xt−1 | xt),   (6)
where each iteration eliminates the Gaussian noise added in the diffusion process.
E. Implementation Details
E.1. Spectrogram Autoencoders
We also investigate the effectiveness of several audio autoencoder variants in Table 7, and find that deeper representations (i.e., 32 or 128 channels) bring relatively more compression, while the resulting information deterioration could burden the U-Net model in generative modeling.
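For reference, a hedged sketch of the autoencoder objective from Section 3.4 (reconstruction, adversarial, and KL terms) is given below; the module interfaces and loss weights are placeholders, not the trained configuration.

```python
# Sketch of the autoencoder objective in Sec. 3.4: reconstruction + adversarial +
# KL penalty on the latent. Modules and weights are illustrative placeholders.
import torch
import torch.nn.functional as F

def autoencoder_loss(encoder, decoder, discriminator, x, w_gan=0.5, w_kl=1e-6):
    mean, logvar = encoder(x)                          # encoder outputs a Gaussian posterior
    z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
    x_rec = decoder(z)
    l_rec = F.l1_loss(x_rec, x)                        # reconstruction loss L_re
    l_gan = -discriminator(x_rec).mean()               # generator side of the GAN loss L_GAN
    l_kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()  # KL penalty L_KL
    return l_rec + w_gan * l_gan + w_kl * l_kl
```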
Table 7. Audio quality comparisons for ablation study with Make-An-Audio BERT. We use PPE to denote pseudo prompt enhancement.
E.2. Text-to-audio

We first encode the text into a sequence of K tokens and utilize the cross-attention mechanism to learn a mapping between language and mel-spectrogram representations in a powerful model. After the initial training run, we fine-tuned our base model to support unconditional generation, with 20% of text token sequences being replaced with the empty sequence. This way, the model retains its ability to generate text-conditional outputs, but can also generate spectrogram representations unconditionally.
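A minimal sketch of this classifier-free guidance recipe, combining the 20% condition dropout at training time with the extrapolation of Eq. (2) at sampling time, is shown below; the model interface and the empty-prompt embedding are assumptions.

```python
# Sketch of classifier-free guidance (Sec. 3.6 / Eq. 2) as used with the
# unconditionally fine-tuned model. `denoiser` and `empty_cond` (the embedding of
# the empty prompt) are assumed interfaces; the dropout rate follows the 20% above.
import torch

def drop_condition(cond, empty_cond, p_uncond=0.2):
    """Training-time condition dropout: replace cond with the empty prompt w.p. p_uncond."""
    keep = (torch.rand(cond.shape[0], device=cond.device) > p_uncond).float()
    keep = keep.view(-1, *([1] * (cond.dim() - 1)))
    return keep * cond + (1.0 - keep) * empty_cond

def guided_eps(denoiser, z_t, t, cond, empty_cond, s=3.0):
    """Sampling-time extrapolation of Eq. (2) with guidance scale s >= 1."""
    eps_cond = denoiser(z_t, t, cond)
    eps_uncond = denoiser(z_t, t, empty_cond)
    return eps_uncond + s * (eps_cond - eps_uncond)
```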
We consider the pre-trained automatic audio captioning (Xu et al., 2020) and audio-text retrieval (Deshmukh et al., 2022;
Koepke et al., 2022) systems as our experts for prompt generation. Regarding automatic audio captioning, the model consists
of a 10-layer convolutional neural network (CNN) encoder and a temporal attentional single-layer gated recurrent unit (GRU)
decoder. The CNN encoder is pre-trained on a large-scale Audioset dataset. As for audio-text retrieval, the model leverages
BERT with a multi-modal transformer encoder for representation learning. It is trained on AudioCaps and Clotho datasets.
E.3. Visual-to-audio
For visual-to-audio (image/video) synthesis, we utilize the CLIP-guided T2A model and leverage global textual represen-
tations to bridge the modality gap between the visual and audio worlds. However, we empirically find that global CLIP
conditions have a limited ability to control faithful synthesis with high text-audio similarity. On that account, we use the
110h FSD50K audios annotated with a class label for training, and this simplification avoids multimodal prediction (a
conditional vector may refer to different concepts) with complex distribution.
We conduct ablation studies to compare various training settings, including datasets and global conditions. The results have
been presented in Table 8, and we have the following observations: 1) Replacing the FSD50K dataset with AudioCaps (Kim et al., 2019) leads to a significant decrease in faithfulness: the dynamic concept compositions confuse the global-condition model, and the multimodal distribution hinders its capacity for controllable synthesis. 2) Removing the normalization of the condition vector degrades realism as measured by FID, demonstrating its effectiveness in reducing variance in the latent space.
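A sketch of the frame pooling described in Section 4.3 and above follows: uniformly pick four frames, encode them with a CLIP image encoder, average, and L2-normalize the pooled condition (the normalization whose removal is ablated in Table 8). The `clip_encode` callable is an assumed interface.

```python
# Sketch of the video-to-condition pooling (Sec. 4.3 / Appendix E.3): uniformly
# sample 4 frames, encode each with a CLIP image encoder, average the features,
# and L2-normalize the pooled vector used as the generation condition.
# `clip_encode` is an assumed interface returning one embedding per frame.
import torch

def video_condition(frames, clip_encode):
    # frames: (num_frames, 3, H, W) video tensor
    idx = torch.linspace(0, frames.shape[0] - 1, steps=4).long()   # 4 uniform frames
    feats = clip_encode(frames[idx])                               # (4, d) CLIP features
    pooled = feats.mean(dim=0)                                     # "averaged" video feature
    return pooled / pooled.norm()                                  # normalized condition
```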
F. Reprogramming Templates

• before v q a n of &, X
• X before v q a n of &,
• in front of v q a n of &, X
• after X, v q a n of &
• after v q a n of &, X
• behind v q a n of &, X
• v q a n of &, then X
• v q a n of &, following X
• v q a n of &, later X
• X after v q a n of &
• before X, v q a n of &
Specifically, we replace X and &, respectively, with the natural language of sampled data and the class label of sampled
events from the database.
For verb (denoted as v), we have {‘hearing’, ‘noticing’, ‘listening to’, ‘appearing’}; for adjective (denoted as a), we
have {‘clear’, ‘noisy’, ‘close-up’, ‘weird’, ‘clean’}; for noun (denoted as n), we have {‘audio’, ‘sound’, ‘voice’}; for
numeral/quantifier (denoted as q), we have {‘a’, ‘the’, ‘some’}.
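Putting the templates together with the sampling strategy of Section 3.2.2, a minimal sketch of the online reprogramming step might look as follows; the event database format, the choice of two example templates, and the audio concatenation are illustrative assumptions.

```python
# Sketch of dynamic reprogramming (Sec. 3.2.2 / Appendix F): sample N in {0, 1, 2}
# single-label events from the database and splice them into a template around the
# original caption X. Database format and audio concatenation are assumptions.
import random

VERBS = ["hearing", "noticing", "listening to", "appearing"]
ADJS = ["clear", "noisy", "close-up", "weird", "clean"]
NOUNS = ["audio", "sound", "voice"]
QUANTS = ["a", "the", "some"]

def reprogram(caption, audio, database):
    """database: list of (class_label, event_audio) pairs annotated with a single label."""
    new_caption, new_audio = caption, list(audio)
    for label, event_audio in random.sample(database, k=random.choice([0, 1, 2])):
        phrase = f"{random.choice(QUANTS)} {random.choice(ADJS)} {random.choice(NOUNS)} of {label}"
        if random.random() < 0.5:        # template "X before v q a n of &"
            new_caption = f"{new_caption} before {random.choice(VERBS)} {phrase}"
            new_audio = new_audio + event_audio          # event audio appended after X
        else:                            # template "v q a n of &, then X"
            new_caption = f"{random.choice(VERBS)} {phrase}, then {new_caption}"
            new_audio = event_audio + new_audio          # event audio placed before X
    return new_caption, new_audio
```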
H. Limitations
Make-An-Audio adopts generative diffusion models for high-quality synthesis, and thus it inherently requires multiple
iterative refinements for better results. Besides, latent diffusion models typically require more computational
resources, and degradation could be witnessed with decreased training data. One of our future directions is to develop
lightweight and fast diffusion models for accelerating sampling.