Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Rongjie Huang * 1  Jiawei Huang * 1  Dongchao Yang * 2  Yi Ren 3  Luping Liu 1  Mingze Li 1  Zhenhui Ye 1
Jinglin Liu 1  Xiang Yin 3  Zhou Zhao 1
2022), it achieves a deep level of language understanding with high-fidelity generation.

While conceptually simple and easy to train, Make-An-Audio yields surprisingly strong results. Both subjective and objective evaluations demonstrate that Make-An-Audio achieves a new state-of-the-art in text-to-audio with natural and controllable synthesis. Make-An-Audio exhibits superior audio quality and text-audio alignment faithfulness on the benchmark AudioCaption dataset, and even generalizes well to the unsupervised Clotho dataset in a zero-shot fashion.

For the first time, we contextualize the need for audio generation with different input modalities. Besides natural language, Make-An-Audio generalizes well to multiple user-defined input modalities (audio, image, and video), which empowers humans to create rich and diverse audio content and opens up a host of applications for personalized transfer and fine-grained control.

Key contributions of the paper include:

• We present Make-An-Audio – an effective method that leverages latent diffusion with a spectrogram autoencoder to model long continuous waveforms.

• We introduce pseudo prompt enhancement with a distill-then-reprogram approach, which covers a large number of concept compositions by opening up the usage of language-free audios to alleviate data scarcity.

• We investigate textual representation and emphasize the advantages of contrastive language-audio pretraining for a deep understanding of natural language with computational efficiency.

• We evaluate Make-An-Audio and present state-of-the-art quantitative results together with a thorough evaluation of qualitative findings.

• We generalize the powerful model to X-to-Audio generation, for the first time unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input.

2. Related Works

2.1. Text-Guided Image/Video Synthesis

With the rapid development of deep generative models, text-guided synthesis has been widely studied in images and videos. The pioneering work of DALL-E (Ramesh et al., 2021) encodes images into discrete latent tokens using VQ-VAE (Van Den Oord et al., 2017) and considers T2I generation as a sequence-to-sequence translation problem. More recently, impressive visual results have been achieved by leveraging large-scale diffusion models. GLIDE (Nichol et al., 2021) trains a T2I upsampling model for cascaded generation. Imagen (Saharia et al., 2022) presents T2I with an unprecedented degree of photorealism and a deep level of language understanding. Stable Diffusion (Rombach et al., 2022) utilizes latent-space diffusion instead of pixel-space diffusion to improve computational efficiency. A large body of work also explores the usage of T2I models for video generation. CogVideo (Hong et al., 2022) is built on top of a CogView2 (Ding et al., 2022) T2I model with a multi-frame-rate hierarchical training strategy. Make-A-Video (Singer et al., 2022) extends a diffusion-based T2I model to T2V through a spatiotemporally factorized diffusion model. Moving beyond visual generation, our approach aims to generate high-fidelity audio from arbitrary natural language, which has been relatively overlooked.

2.2. Text-Guided Audio Synthesis

While there is remarkable progress in text-guided visual generation, text-to-audio (T2A) generation lags behind, mainly for two reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous waveform data. DiffSound (Yang et al., 2022) is the first to explore text-to-audio generation with a discrete diffusion process that operates on audio codes obtained from a VQ-VAE, leveraging masked text generation with CLIP representations. AudioLM (Borsos et al., 2022) introduces the discretized activations of a masked language model pre-trained on audio and generates syntactically plausible speech or music.

Very recently, the concurrent work AudioGen (Kreuk et al., 2022) proposes to generate audio samples autoregressively conditioned on text inputs, while our proposed method differs from it in the following ways: 1) we introduce pseudo prompt enhancement and leverage the power of contrastive language-audio pre-training and diffusion models for high-fidelity generation; 2) we predict continuous spectrogram representations, significantly improving computational efficiency and reducing training costs.

2.3. Audio Representation Learning

Different from modeling fine-grained details of the signal, the usage of high-level self-supervised learning (SSL) (Baevski et al., 2020; Hsu et al., 2021; He et al., 2022) has been shown to effectively reduce the sampling space of generative algorithms. Inspired by vector quantization (VQ) techniques, SoundStream (Zeghidour et al., 2021) presents a hierarchical architecture for high-level representations that carry semantic information. Data2vec (Baevski et al., 2022) uses a fast convolutional decoder and explores contextualized target representations in a self-supervised manner.
[Figure 2 diagram: the audio encoder E maps the mel-spectrogram x to a latent z0; the diffusion process q(xt | xt−1) and the denoising process pθ(xt−1 | xt), realized by a text-conditioned U-Net with cross-attention over the Transformer text encoder output (repeated ×N), recover z0, which the decoder G and the vocoder turn into the generated audio.]
Figure 2. A high-level overview of Make-An-Audio. Note that some modules (marked with a lock) are frozen when training the T2A model.
[Figure 3 diagram: in expert distillation, audio captioning and audio-text retrieval experts produce candidate captions c̃ for a language-free audio clip, and CLAPS selects the best-aligned one (e.g., “Rain falls softly in the distance”); in dynamic reprogramming, events sampled from the database (e.g., birds, footsteps) are composed with the base sample to form “Rain falls softly in the distance before hearing sounds of birds and footsteps”.]
Figure 3. The process of pseudo prompt enhancement. Our semi-parametric diffusion model consists of a fixed expert distillation and a
dynamic reprogramming stage. The database D contains audio examples with a sampling strategy ξ to create unseen object compositions.
We use CLAPS to denote the CLAP selection.
Recently, spectrogram autoencoders (akin to autoencoders over 1-channel 2D images) (Gong et al., 2022; He et al., 2022) with a reconstruction objective as self-supervision have demonstrated the effectiveness of heterogeneous image-to-audio transfer, advancing the field of speech and audio processing on a variety of downstream tasks. Among these approaches, Xu et al. (2022) apply Masked Autoencoders (MAE) (He et al., 2022) to self-supervised representation learning from audio spectrograms. Gong et al. (2022) adopt an audio spectrogram transformer with joint discriminative and generative masked spectrogram modeling. Inspired by these, we inherit the recent success of spectrogram SSL in the frequency domain, which guarantees efficient compression and high-level semantic understanding.

3. Make-An-Audio

In this section, we first overview the Make-An-Audio framework and illustrate pseudo prompt enhancement to better align text and audio semantics, following which we introduce textual and audio representations for multimodal learning. Together with the power of diffusion models with classifier-free guidance, Make-An-Audio exhibits high-fidelity synthesis with superior generalization.

3.1. Overview

Deep generative models have achieved leading performances in text-guided visual synthesis. However, the current development of text-to-audio (T2A) generation is hampered by two major challenges: 1) model training is faced with data scarcity, as human-labeled audios are expensive to create and few audio resources provide natural language descriptions; 2) modeling long continuous waveforms (e.g., typically 16,000 data points for a 1-second 16 kHz waveform) poses a challenge for all high-quality neural synthesizers.

As illustrated in Figure 2, Make-An-Audio consists of the following main components: 1) the pseudo prompt enhancement to alleviate the issue of data scarcity, opening up the usage of orders of magnitude more language-free audios; 2) a spectrogram autoencoder for predicting self-supervised representations instead of long continuous waveforms; 3) a diffusion model that maps natural language to latent representations with the power of contrastive language-audio pretraining (CLAP); and 4) a separately-trained neural vocoder to convert mel-spectrograms to raw waveforms. In the following sections, we describe these components in detail.

3.2. Pseudo Prompt Enhancement: Distill-then-Reprogram

To mitigate data scarcity, we propose to construct prompts aligned well with audios, enabling a better understanding of the text-audio dynamics from orders of magnitude more unsupervised data. As illustrated in Figure 3, it consists of two stages: an expert distillation approach to produce prompts aligned with audio, and a dynamic reprogramming procedure to construct a variety of concept compositions.

3.2.1. Expert Distillation

We consider pre-trained automatic audio captioning (Xu et al., 2020) and audio-text retrieval (Deshmukh et al., 2022; Koepke et al., 2022) systems as our experts for prompt generation. Captioning models aim to generate diverse natural language sentences to describe the content of audio clips. Audio-text retrieval takes a natural language query to retrieve relevant audio files in a database. To this end, the experts jointly distill knowledge to construct a caption aligned with the audio, following which we select from these candidates the one with the highest CLAP (Elizalde et al., 2022) score as the final caption (we include a threshold to selectively consider faithful results). This simple yet effective procedure largely alleviates data scarcity issues and enables explicit generalization to different audio domains; we refer the reader to Section 6.3.2 for a summary of our findings. Details are attached in Appendix E.2.
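For illustration, the CLAP-based selection over expert candidates can be sketched as follows; the precomputed-embedding interface and the threshold value are assumptions rather than the released pipeline.

```python
# Sketch of the expert-distillation selection step (Sec. 3.2.1): candidate captions
# come from the captioning and retrieval experts; the one with the highest CLAP
# audio-text similarity is kept, subject to a minimum-score threshold.
# The embedding inputs and the 0.3 threshold are illustrative assumptions.
import numpy as np

def select_caption(audio_emb, caption_embs, captions, threshold=0.3):
    """Return the candidate caption best aligned with the audio, or None."""
    # Cosine similarity between the audio embedding and each caption embedding.
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ a
    best = int(np.argmax(scores))
    # Treat the clip as unusable (language-free) if no candidate is faithful enough.
    return captions[best] if scores[best] >= threshold else None
```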
3.2.2. Dynamic Reprogramming

To prevent overfitting and enable a better understanding of concept compositions, we introduce a dynamic reprogramming technique that constructs a variety of concept compositions. It proceeds in three steps, as illustrated in Figure 3: 1) we first prepare a sound event database D annotated with single labels; 2) each time, N concepts are sampled from the database D, where N ∈ {0, 1, 2}; 3) the original text-audio pair is randomly concatenated with the sampled events according to a template, constructing a new training example with varied concept compositions. This can be conducted online, significantly reducing the time consumed for data preparation. The reprogramming templates are attached in Appendix F.

3.3. Textual Representation

Text-guided synthesis models need powerful semantic text encoders to capture the meaning of arbitrary natural language inputs, which can be grouped into two major categories: 1) Contrastive pretraining. Similar to CLIP (Radford et al., 2021) pre-trained on image-text data, recent progress on contrastive language-audio pretraining (CLAP) (Elizalde et al., 2022) brings audio and text descriptions into a joint space and demonstrates remarkable zero-shot generalization to multiple downstream domains. 2) Large-scale language modeling (LLM). Saharia et al. (2022) and Kreuk et al. (2022) utilize language models (e.g., BERT (Devlin et al., 2018), T5 (Raffel et al., 2020)) for text-guided generation. Language models are trained on text-only corpora significantly larger than paired multimodal data, and are thus exposed to a rich distribution of text.

Following common practice (Saharia et al., 2022; Ramesh et al., 2022), we freeze the weights of these text encoders. We find that both CLAP and T5-Large achieve similar results on benchmark evaluation, while CLAP can be more efficient without the offline computation of embeddings required by an LLM. We refer the reader to Section 6.3.1 for a summary of our findings.

3.4. Audio Representation

Recently, spectrogram autoencoders (akin to autoencoders over 1-channel 2D images) (Gong et al., 2022; He et al., 2022) with a reconstruction objective as self-supervision have demonstrated the effectiveness of heterogeneous image-to-audio transfer, advancing the field of speech and audio processing on a variety of downstream tasks. The audio signal is a mel-spectrogram sample x ∈ [0, 1]^{Ca×T}, where Ca and T respectively denote the number of mel channels and the number of frames. Our spectrogram autoencoder is composed of 1) an encoder network E, which takes samples x as input and outputs latent representations z; 2) a decoder network G, which reconstructs the mel-spectrogram signal x′ from the compressed representation z; and 3) a multi-window discriminator Dis, which learns to distinguish the generated samples G(z) from real ones over different multi-receptive fields of mel-spectrograms.

The whole system is trained end-to-end to minimize 1) a reconstruction loss Lre, which improves training efficiency and the fidelity of the generated spectrograms; 2) GAN losses LGAN, where the discriminator and generator play an adversarial game; and 3) a KL-penalty loss LKL, which restricts the spectrogram encoder to learn a standard z and avoid arbitrarily high-variance latent spaces.

To this end, Make-An-Audio takes advantage of the spectrogram autoencoder to predict self-supervised representations instead of waveforms. It largely alleviates the challenge of modeling long continuous data and guarantees high-level semantic understanding.

3.5. Generative Latent Diffusion

We implement our method over Latent Diffusion Models (LDMs) (Rombach et al., 2022), a recently introduced class of Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) that operate in the latent space. It is conditioned on the textual representation, breaking the generation process into several conditional diffusion steps. The training loss is defined as the mean squared error in the noise space ε ∼ N(0, I), and efficient training amounts to optimizing a random term of t with stochastic gradient descent:

Lθ = ‖εθ(zt, t, c) − ε‖₂²,   (1)

where εθ denotes the denoising network and c denotes the textual condition. To conclude, the diffusion model can be efficiently trained by optimizing the ELBO without adversarial feedback, ensuring extremely faithful reconstructions that match the ground-truth distribution. The detailed formulation of DDPM is attached in Appendix D.
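For illustration, a minimal PyTorch-style sketch of one training step for the objective in Eq. (1) is given below; the denoiser stands in for the text-conditional U-Net, and the noise schedule is an assumed precomputed cumulative-product table, not the exact released implementation.

```python
# Sketch of one training step for the latent diffusion objective in Eq. (1).
# `denoiser(z_t, t, c)` stands in for the text-conditional U-Net; `alphas_cumprod`
# is a precomputed DDPM noise schedule. Both are assumptions for illustration.
import torch

def diffusion_loss(denoiser, z0, cond, alphas_cumprod):
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)                              # ε ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise    # forward diffusion q(z_t | z_0)
    eps_hat = denoiser(z_t, t, cond)                          # predict the added noise
    return torch.nn.functional.mse_loss(eps_hat, noise)       # ‖ε_θ(z_t, t, c) − ε‖²
```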
3.6. Classifier-Free Guidance

For classifier-free guidance (Dhariwal & Nichol, 2021; Ho & Salimans, 2022), by jointly training a conditional and an unconditional diffusion model, it is possible to combine the conditional and unconditional scores to attain a trade-off between sample quality and diversity. The textual condition in the latent diffusion model εθ(zt, t, c) is replaced by an empty prompt c∅ with a fixed probability during training. During sampling, the output of the model is extrapolated further in the direction of εθ(zt, t, c) and away from εθ(zt, t, c∅) with the guidance scale s ≥ 1:

ε̃θ(zt, t, c) = εθ(zt, t, c∅) + s · (εθ(zt, t, c) − εθ(zt, t, c∅)).   (2)

4. X-To-Audio: No Modality Left Behind

In this section, we generalize our powerful conditional diffusion model to X-To-Audio generation. For the first time, we contextualize the need for audio generation with different conditional modalities, including 1) text, 2) audio (inpainting), and 3) visual input. Make-An-Audio empowers humans to create rich and diverse audio content with unprecedented ease, unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input.

4.1. Personalized Text-To-Audio Generation

Adapting models (Chen et al., 2020b; Huang et al., 2022) to a specific individual or object is a long-standing goal in machine learning research. More recently, personalization (Gal et al., 2022; Benhamdi et al., 2017) efforts can be found in vision and graphics, which allow users to inject unique objects into new scenes, transform them across different styles, and even produce new products. For instance, when asked to generate “baby crying” given the initial sound of “thunder”, our model produces realistic and faithful audio describing “a baby cries in the thunder day”. Distinctly, this has a wide range of uses for audio mixing and tuning, e.g., adding background sound to an existing clip or editing audio by inserting a speaking object.

We investigate personalized text-to-audio generation via stochastic differential editing (Meng et al., 2021), which has been demonstrated to produce realistic samples with high-fidelity manipulation. Given input audio with a user guide (prompt), we select a particular time t0 with total denoising steps N, and add noise to the raw data z0 to obtain zT (T = t0 × N) according to Equation 4. It is then subsequently denoised through a reverse process parameterized by the shared θ to increase its realism according to Equation 6.

A trade-off between faithfulness (text-caption alignment) and realism (audio quality) can be witnessed: as T increases, a larger amount of noise is added to the initial audio, and the generated samples become more realistic but less faithful. We refer the reader to Figure 5 for a summary of our findings.

4.2. Audio Inpainting

Inpainting (Liu et al., 2020; Nazeri et al., 2019) is the task of filling masked regions of an audio clip with new content when parts of the audio are corrupted or undesired. Though diffusion-model inpainting can be performed by adding noise to the initial audio and sampling with SDEdit, it may result in undesired edge artifacts, since there can be an information loss during the sampling process (the model can only see a noised version of the context). To achieve better results, we explicitly fine-tune Make-An-Audio for audio inpainting.

During training, the way masks are generated greatly influences the final performance of the system. As such, we adopt the irregular masks (thick, medium, and thin masks) suggested by LaMa (Suvorov et al., 2022), which uniformly uses polygonal chains dilated by a high random width (wide masks) and rectangles of arbitrary aspect ratios (box masks). In addition, we investigate the frame-based masking strategy commonly adopted in the speech literature (Baevski et al., 2020; Hsu et al., 2021). It is implemented using the algorithm from wav2vec 2.0 (Baevski et al., 2020), where spans of a given length are masked with probability p.
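For illustration, the frame-based strategy can be sketched as follows; the span length and masking probability are illustrative values rather than the exact fine-tuning configuration.

```python
# Sketch of frame-based masking over a mel-spectrogram (Sec. 4.2).
# Each frame is chosen as a span start with probability p and the following
# `span` frames are masked, loosely following wav2vec 2.0. Values are illustrative.
import numpy as np

def frame_mask(num_frames, p=0.065, span=10, rng=None):
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_frames, dtype=bool)
    starts = np.nonzero(rng.random(num_frames) < p)[0]
    for s in starts:
        mask[s:s + span] = True          # mask a contiguous span of frames
    return mask                          # True = masked (to be inpainted)

# Usage: zero out the masked frames of an 80 x 624 spectrogram before fine-tuning.
spec = np.random.rand(80, 624)
m = frame_mask(spec.shape[1])
spec_masked = np.where(m[None, :], 0.0, spec)
```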
4.3. Visual-To-Audio Generation

Recent advances in deep generative models have shown impressive results in visually-induced audio generation (Su et al., 2020; Gan et al., 2020), towards generating realistic audio that describes the content of images or videos: Hsu et al. (2020) show that spoken language can be learned by a visually-grounded generative model of speech. Iashin & Rahtu (2021) propose a multi-class visually guided sound synthesis that relies on a codebook prior-based transformer.

To pursue this research further, we extend Make-An-Audio to visual-to-audio generation. Given the lack of large-scale visual-audio datasets for image-to-audio (I2A) research, our main idea is to utilize contrastive language-image pretraining (CLIP) with a CLIP-guided T2A model and leverage textual representations to bridge the modality gap between the visual and audio worlds. As CLIP encoders embed images and text into a joint latent space, our T2A model provides a unique opportunity to visualize what the CLIP image encoder is seeing. Considering the complexity of V2A generation, it is natural to leverage image priors for videos to simplify the learning process. On this account, we uniformly pick 4 frames from the video and pool their CLIP image features to formulate the “averaged” video representation, which then reduces V2A to I2A generation.

[Figure 4: the visual-to-audio inference pipeline, in which pooled CLIP image features condition the pre-trained T2A model; the legend distinguishes pre-trained models, unused modules, and user-optional flows.]

To conclude, the visual-to-audio inference scheme is formulated in Figure 4. It significantly reduces the requirement for paired visual-audio datasets, and the plug-and-play module with pre-trained Make-An-Audio empowers humans to create rich and diverse audio content from the visual world.

5. Training and Evaluation

5.1. Dataset

We train on a combination of several datasets: AudioSet, BBC Sound Effects, Audiostock, AudioCaps-train, ESC-50, FSD50K, Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, MACS, Epidemic Sound, UrbanSound8K, WavText5Ks, LibriSpeech, and Medley-solos-DB. For audios without natural language annotation, we apply the pseudo prompt enhancement to construct captions aligned well with the audio. Overall, we have ∼3k hours with 1M audio-text pairs of training data. For evaluating text-to-audio models (Yang et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is adopted as the standard benchmark, which contains 494 samples with five human-annotated captions for each audio clip. For a more challenging zero-shot scenario, we also provide results on the Clotho (Drossos et al., 2020) validation set, which contains multiple audio events per clip. A more detailed data setup is attached in Appendix A.

We conduct preprocessing on the text and audio data: 1) convert the sampling rate of audios to 16 kHz and pad short clips to 10 seconds; 2) extract the spectrogram with an FFT size of 1024 and hop size of 256, and crop it to a mel-spectrogram of size 80 × 624; 3) normalize non-standard words (e.g., abbreviations, numbers, and currency expressions) and semiotic classes (Taylor, 2009) (text tokens that represent particular entities that are semantically constrained, such as measure phrases, addresses, and dates).

5.2. Model Configurations

We train a continuous autoencoder to compress the perceptual space with downsampling to a 4-channel latent representation, which balances efficiency and perceptually faithful results. For our main experiments, we train a U-Net (Ronneberger et al., 2015) based text-conditional diffusion model, which is optimized on 18 NVIDIA V100 GPUs for 2M optimization steps. The base learning rate is set to 0.005, and we scale it by the number of GPUs and the batch size following LDM. We utilize HiFi-GAN (Kong et al., 2020) (V1) trained on the VGGSound dataset (Chen et al., 2020a) as the vocoder to synthesize waveforms from the generated mel-spectrograms in all our experiments. Hyperparameters are included in Appendix B.

5.3. Evaluation Metrics

We evaluate models using objective and subjective metrics over audio quality and text-audio alignment faithfulness. Following common practice (Yang et al., 2022; Iashin & Rahtu, 2021), the key automated performance metrics used are melception-based (Koutini et al., 2021) FID (Heusel et al., 2017) and KL divergence to measure audio fidelity. Additionally, we introduce the CLAP score to measure audio-text alignment for this work. The CLAP score is adapted from the CLIP score (Hessel et al., 2021; Radford et al., 2021) to the audio domain and is a reference-free evaluation metric that closely correlates with human perception.

For subjective metrics, we use crowd-sourced human evaluation via Amazon Mechanical Turk, where raters are asked to rate MOS (mean opinion score) on a 20-100 Likert scale. We assess audio quality and text-audio alignment faithfulness by scoring MOS-Q and MOS-F, respectively, which are reported with 95% confidence intervals (CI). More information on evaluation is attached in Appendix C.

6. Results

6.1. Quantitative Results

Automatic Objective Evaluation  The objective evaluation comparison with the baseline DiffSound (the only publicly-available T2A generation model) is presented in Table 1, and we have the following observations: 1) In terms of audio quality, Make-An-Audio achieves the highest perceptual quality on AudioCaption with an FID of 4.61 and a KL of 2.79. For zero-shot generation, it also demonstrates results superior to the baseline model. 2) On text-audio similarity, Make-An-Audio scores the highest CLAP with a gap of 0.037 compared to the ground-truth audio, suggesting Make-An-Audio's ability to generate faithful audio that aligns well with descriptions.
Table 2. Audio inpainting evaluation with various masking strategies.

Table 3. Image/Video-to-audio evaluation.

Subjective Human Evaluation  The evaluation of T2A models is very challenging due to its subjective nature in perceptual quality, and thus we also include a human evaluation in Table 1: Make-An-Audio (CLAP) achieves the highest perceptual quality with a MOS-Q of 72.5 and a MOS-F of 78.6. It indicates that raters prefer our model's synthesis over the baselines in terms of audio naturalness and faithfulness.

For audio inpainting, we compare different masking designs, including the irregular (thick, medium, and thin) strategy from the visual world (Suvorov et al., 2022), as well as the frame-based (with varying p) strategy commonly used in speech (Baevski et al., 2020; Hsu et al., 2021). During evaluation, we randomly mask wide or narrow regions and utilize the FID and KL metrics to measure performance. The results are presented in Table 2, and we have the following observations: 1) For both the frame-based and irregular strategies, larger masked regions during training lead to improved perceptual quality, forcing the network to fully exploit the high receptive field of continuous spectrograms. 2) For a similar size of masked region, the frame-based strategy consistently outperforms the irregular one, suggesting that it is better to mask audio spectrograms in a way that is aligned along the time axis.

We also present our visual-to-audio generation results in Table 3. As can be seen, Make-An-Audio can generalize to a wide variety of images and videos. Leveraging contrastive pre-training, the model provides a high-level understanding of visual input and generates high-fidelity audio spectrograms well-aligned with their semantic meanings.

We present the classifier-free guidance trade-off curves between CLAP and FID scores in Figure 7. Consistent with the observations in Ho & Salimans (2022), the choice of the classifier guidance weight scales conditional against unconditional synthesis, offering a trade-off between sample faithfulness and realism with respect to the conditioning text.

For a better comparison in audio inpainting, we visualize different masking strategies and synthesis results in Figure 6. As can be seen, given initial audio with undesired content, our model correctly fills and reconstructs the audio, robust to different shapes of masked regions, suggesting that it is capable of a high-level understanding of audio content.

On personalized text-to-audio generation, we explore different t0 ∈ (0, 1) to add Gaussian noise and conduct reverse sampling. As shown in Figure 5, a trade-off between faithfulness (measured by the CLAP score) and realism (measured by the 1 − MSE distance) can be witnessed. We find that t0 ∈ [0.2, 0.5] works well for faithful guidance with realistic generation, suggesting that audio attributes (e.g., speed, timbre, and energy) can easily be destroyed as t0 increases.

Figure 5. We illustrate personalized text-to-audio results with various t0 initializations. t0 = 0 indicates the initial audio itself, whereas t0 = 1 indicates text-to-audio synthesis from scratch. For comparison, realism is measured by the 1 − MSE distance between the generated and initial audio, and faithfulness is measured by the CLAP score between the generated sample and the prompt. Prompt: A clock ticktocks.
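As a companion to Figure 5 and the procedure of Section 4.1, a minimal sketch of the editing loop is given below; `denoise_step` is a placeholder for one reverse step of the shared diffusion model, and the default t0 is illustrative.

```python
# Sketch of SDEdit-style personalized generation (Sec. 4.1): start the reverse
# process from a partially noised version of the initial audio latent instead of
# from pure noise. `denoise_step` is a placeholder for one p_theta(z_{t-1} | z_t) step.
import torch

@torch.no_grad()
def personalized_edit(denoise_step, z0_init, cond, alphas_cumprod, t0=0.4):
    N = alphas_cumprod.shape[0]
    T = int(t0 * N)                                   # smaller t0 -> closer to the input audio
    a_bar = alphas_cumprod[T - 1]
    z = a_bar.sqrt() * z0_init + (1 - a_bar).sqrt() * torch.randn_like(z0_init)
    for t in reversed(range(T)):                      # reverse process from step T down to 0
        z = denoise_step(z, t, cond)
    return z
```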
Figure 6. Qualitative results with our inpainting model: (a) Sample 1, (b) Sample 2, (c) masking strategy.

Figure 7. Classifier-free guidance trade-off curves.

6.3. Analysis and Ablation Studies

To verify the effectiveness of several designs in Make-An-Audio, including pseudo prompt enhancement and the textual and audio representations, we conduct ablation studies and discuss the key findings as follows. More analysis on audio representation is attached in Appendix E.1.

6.3.1. Textual Representation

We compare text encoders, including the large-scale language model T5-Large (Raffel et al., 2020), as well as the multimodal contrastive pre-trained encoders CLIP (Radford et al., 2021) and CLAP (Elizalde et al., 2022). We freeze the weights of the text encoders for T2A generation. For easy comparison, we present the results in Table 1 and have the following observations: 1) Since CLIP is introduced as a scalable approach for learning joint representations between text and images, it can be less useful in deriving semantic representations for T2A, in contrast to Yang et al. (2022). 2) CLAP and T5-Large achieve similar performances on the benchmark dataset, while CLAP can be more computationally efficient (with only 59% of the parameters) and does not require the offline computation of embeddings.

6.3.2. Pseudo Prompt Enhancement

To highlight the effectiveness of the proposed dynamic reprogramming strategy in creating unseen object compositions, we additionally train Make-An-Audio on the static training dataset and attach the results in Table 7 in Appendix E: 1) removing the dynamic reprogramming approach results in a slight drop in evaluation; 2) when migrating to the more challenging Clotho scenario in a zero-shot fashion, a significant degradation can be witnessed, demonstrating its effectiveness in constructing diverse object compositions for better generalization.

7. Conclusion

In this work, we presented Make-An-Audio with a prompt-enhanced diffusion model for text-to-audio generation. Leveraging prompt enhancement with the distill-then-reprogram approach, Make-An-Audio was endowed with various concept compositions from orders of magnitude more unsupervised data. We investigated textual representation and emphasized the advantages of contrastive pre-training for a deep understanding of natural language with computational efficiency. Both objective and subjective evaluations demonstrated that Make-An-Audio achieved new state-of-the-art results in text-to-audio with realistic and faithful synthesis. Make-An-Audio was the first attempt to generate high-definition, high-fidelity audio given a user-defined modality input, opening up a host of applications for personalized transfer and fine-grained control. We envisage that our work will serve as a basis for future audio synthesis studies.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

Huang, R., Ren, Y., Liu, J., Cui, C., and Zhao, Z. Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech synthesis. arXiv preprint arXiv:2205.07211, 2022.

Iashin, V. and Rahtu, E. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021.

Kim, C. D., Kim, B., Lee, H., and Kim, G. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 119–132, 2019.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31:10215–10224, 2018.

Koepke, A. S., Oncescu, A.-M., Henriques, J., Akata, Z., and Albanie, S. Audio retrieval with natural language queries: A benchmark study. IEEE Transactions on Multimedia, 2022.

Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646, 2020.

Koutini, K., Schlüter, J., Eghbal-zadeh, H., and Widmer, G. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.

Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., and Adi, Y. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.

Liu, H., Jiang, B., Song, Y., Huang, W., and Yang, C. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In European Conference on Computer Vision, pp. 725–741. Springer, 2020.

Martı́n-Morató, I. and Mesaros, A. What is the ground truth? reliability of multi-annotator data for audio tagging. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 76–80. IEEE, 2021.

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.

Nazeri, K., Ng, E., Joseph, T., Qureshi, F. Z., and Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Piczak, K. J. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015–1018, 2015.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Springer, 2015.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Salamon, J., Jacoby, C., and Bello, J. P. A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia (ACM-MM'14), pp. 1041–1044, Orlando, FL, USA, Nov. 2014.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S.,
Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-
video: Text-to-video generation without text-video data.
arXiv preprint arXiv:2209.14792, 2022.
Song, J., Meng, C., and Ermon, S. Denoising diffusion
implicit models. In Proc. of ICLR, 2020.
Van Den Oord, A., Vinyals, O., et al. Neural discrete rep-
resentation learning. Advances in neural information
processing systems, 30, 2017.
Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F.,
Feichtenhofer, C., et al. Masked autoencoders that listen.
arXiv preprint arXiv:2207.06405, 2022.
Xu, X., Dinkel, H., Wu, M., and Yu, K. A crnn-gru based
reinforcement learning approach to audio captioning. In
DCASE, pp. 225–229, 2020.
Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y.,
and Yu, D. Diffsound: Discrete diffusion model for text-
to-sound generation. arXiv preprint arXiv:2207.09983,
2022.
Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and
Tagliasacchi, M. Soundstream: An end-to-end neural
audio codec. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 30:495–507, 2021.
Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J.,
Jia, Y., Chen, Z., and Wu, Y. Libritts: A corpus de-
rived from librispeech for text-to-speech. arXiv preprint
arXiv:1904.02882, 2019.
Appendices
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models
A. Dataset

As shown in Table 5, we collect a large-scale audio-text dataset consisting of 1M audio samples with a total duration of
∼3k hours. It contains audio of human activities, natural sounds, and audio effects, consisting of several data sources from
publicly available websites. For audio with text descriptions, we download the parallel audio-text data. For audios without
natural language annotation (or with labels), we discard the corresponding class label (if any) and apply the pseudo prompt
enhancement to construct natural language descriptions aligned well with the audio.
As speech and music are the dominant classes in Audioset, we filter these samples to construct a more balanced dataset.
Overall we are left with 3k hours with 1M audio-text pairs for training data. For evaluating text-to-audio models (Yang
et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is the standard benchmark, which contains 494 samples
with five human-annotated captions for each audio clip. In both training and inference, we pad short clips to 10 seconds
and randomly crop a 624 × 80 mel-spectrogram from the 10-second 16 kHz audio.
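A minimal sketch of this preprocessing with torchaudio is given below; the log-compression step and the exact padding/cropping policy are assumptions, while the sample rate, FFT size, hop size, and mel dimension follow Section 5.1.

```python
# Sketch of the audio preprocessing in Sec. 5.1 / Appendix A: resample to 16 kHz,
# pad to 10 s, compute an 80-bin mel-spectrogram (FFT 1024, hop 256), and crop
# to the 80 x 624 input size. The log compression is an assumed detail.
import torch
import torchaudio

def preprocess(path, sr=16000, seconds=10):
    wav, in_sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), in_sr, sr)
    target = sr * seconds
    wav = torch.nn.functional.pad(wav, (0, max(0, target - wav.shape[-1])))[:, :target]
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80)(wav)
    mel = torch.log(mel.clamp(min=1e-5))   # assumed log compression
    return mel[0, :, :624]                 # crop to the 80 x 624 model input
```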
Table 5. Text-audio alignment CLAP score averaged across the single-label dataset.
B. Model Configurations
We list the model hyper-parameters of Make-An-Audio in Table 6.
Table 6. Hyperparameters of Make-An-Audio.

Spectrogram Autoencoder: Input/Output Channels 1; Hidden Channels 4; Residual Blocks 2; Spectrogram Size 80 × 624; Channel Mult [1, 2, 2, 4].
Denoising U-Net: Input/Output Channels 4; Model Channels 320; Attention Heads 8; Condition Channels 1024; Latent Size 10 × 78; Channel Mult [1, 2].
CLAP Text Encoder: Transformer Embed Channels 768; Output Project Channels 1024; Token Length 77.
Total Number of Parameters: 332M.
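For orientation, the shape-related entries can be tied together with a small sanity check: three downsampling stages of the autoencoder (channel multipliers [1, 2, 2, 4]) reduce the 80 × 624 spectrogram by a factor of 8 per axis to the 10 × 78 latent the U-Net operates on. The dictionary layout below is illustrative and not the released configuration format.

```python
# Illustrative grouping of the Table 6 hyperparameters; the dict layout is an
# assumption, but the shape arithmetic matches the table: three downsampling
# stages (channel mult [1, 2, 2, 4]) map 80 x 624 mels to a 4 x 10 x 78 latent.
autoencoder = {"in_channels": 1, "hidden_channels": 4, "residual_blocks": 2,
               "spec_size": (80, 624), "channel_mult": [1, 2, 2, 4]}
unet = {"in_channels": 4, "model_channels": 320, "attention_heads": 8,
        "condition_channels": 1024, "latent_size": (10, 78), "channel_mult": [1, 2]}

downsample = 2 ** (len(autoencoder["channel_mult"]) - 1)   # 3 halvings -> factor 8
assert tuple(s // downsample for s in autoencoder["spec_size"]) == unet["latent_size"]
```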
C. Evaluation
To probe audio quality, we conduct the MOS (mean opinion score) tests and explicitly instruct the raters to “focus on
examining the audio quality and naturalness.” The testers are presented with the samples, and each tester is asked to evaluate the subjective naturalness on a 20-100 Likert scale.
To probe text-audio alignment, human raters are shown an audio and a prompt and asked “Does the natural language
description align with audio faithfully?”. They must respond with “completely”, “mostly”, or “somewhat” on a 20-100
Likert scale.
Our subjective evaluation tests are crowd-sourced and conducted via Amazon Mechanical Turk. These ratings are obtained
independently for model samples and reference audio, and both are reported. The screenshots of instructions for testers have
been shown in Figure 8. We paid participants $8 per hour and spent about $750 in total on participant compensation. A
small subset of speech samples used in the test is available at https://Text-to-Audio.github.io/.
D. Detailed Formulation of DDPM

Unlike the diffusion process, the reverse process recovers samples from Gaussian noise. The reverse process is a
Markov chain from xT to x0 parameterized by shared θ:
pθ(x0, · · · , xT−1 | xT) = ∏_{t=1}^{T} pθ(xt−1 | xt),   (6)
where each iteration eliminates the Gaussian noise added in the diffusion process.
E. Implementation Details
E.1. Spectrogram Autoencoders
We also investigate the effectiveness of several audio autoencoder variants in Table 7, and find that deeper representations (i.e., 32 or 128 channels) bring relatively more compression, while the resulting information deterioration could burden the U-Net model in generative modeling.
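For reference, a hedged sketch of the autoencoder objective from Section 3.4 (reconstruction, adversarial, and KL terms) is given below; the module interfaces and loss weights are placeholders, not the trained configuration.

```python
# Sketch of the autoencoder objective in Sec. 3.4: reconstruction + adversarial +
# KL penalty on the latent. Modules and weights are illustrative placeholders.
import torch
import torch.nn.functional as F

def autoencoder_loss(encoder, decoder, discriminator, x, w_gan=0.5, w_kl=1e-6):
    mean, logvar = encoder(x)                          # encoder outputs a Gaussian posterior
    z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
    x_rec = decoder(z)
    l_rec = F.l1_loss(x_rec, x)                        # reconstruction loss L_re
    l_gan = -discriminator(x_rec).mean()               # generator side of the GAN loss L_GAN
    l_kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()  # KL penalty L_KL
    return l_rec + w_gan * l_gan + w_kl * l_kl
```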
Table 7. Audio quality comparisons for ablation study with Make-An-Audio BERT. We use PPE to denote pseudo prompt enhancement.
E.2. Text-to-audio

We first encode the text into a sequence of K tokens and utilize the cross-attention mechanism to learn a mapping between language and mel-spectrogram representations in a powerful model. After the initial training run, we fine-tuned our base model to support unconditional generation, with 20% of text token sequences being replaced with the empty sequence. This way, the model retains its ability to generate text-conditional outputs, but can also generate spectrogram representations unconditionally.
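A minimal sketch of this classifier-free guidance recipe, combining the 20% condition dropout at training time with the extrapolation of Eq. (2) at sampling time, is shown below; the model interface and the empty-prompt embedding are assumptions.

```python
# Sketch of classifier-free guidance (Sec. 3.6 / Eq. 2) as used with the
# unconditionally fine-tuned model. `denoiser` and `empty_cond` (the embedding of
# the empty prompt) are assumed interfaces; the dropout rate follows the 20% above.
import torch

def drop_condition(cond, empty_cond, p_uncond=0.2):
    """Training-time condition dropout: replace cond with the empty prompt w.p. p_uncond."""
    keep = (torch.rand(cond.shape[0], device=cond.device) > p_uncond).float()
    keep = keep.view(-1, *([1] * (cond.dim() - 1)))
    return keep * cond + (1.0 - keep) * empty_cond

def guided_eps(denoiser, z_t, t, cond, empty_cond, s=3.0):
    """Sampling-time extrapolation of Eq. (2) with guidance scale s >= 1."""
    eps_cond = denoiser(z_t, t, cond)
    eps_uncond = denoiser(z_t, t, empty_cond)
    return eps_uncond + s * (eps_cond - eps_uncond)
```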
We consider the pre-trained automatic audio captioning (Xu et al., 2020) and audio-text retrieval (Deshmukh et al., 2022;
Koepke et al., 2022) systems as our experts for prompt generation. Regarding automatic audio captioning, the model consists
of a 10-layer convolutional neural network (CNN) encoder and a temporal attentional single-layer gated recurrent unit (GRU)
decoder. The CNN encoder is pre-trained on a large-scale Audioset dataset. As for audio-text retrieval, the model leverages
BERT with a multi-modal transformer encoder for representation learning. It is trained on AudioCaps and Clotho datasets.
E.3. Visual-to-audio
For visual-to-audio (image/video) synthesis, we utilize the CLIP-guided T2A model and leverage global textual represen-
tations to bridge the modality gap between the visual and audio worlds. However, we empirically find that global CLIP
conditions have a limited ability to control faithful synthesis with high text-audio similarity. On that account, we use the
110h FSD50K audios annotated with a class label for training, and this simplification avoids multimodal prediction (a
conditional vector may refer to different concepts) with complex distribution.
We conduct ablation studies to compare various training settings, including datasets and global conditions. The results have
been presented in Table 8, and we have the following observations: 1) Replacing the FSD50K dataset with AudioCaps (Kim et al., 2019) leads to a significant decrease in faithfulness: the dynamic concept compositions confuse the global-condition model, and the multimodal distribution hinders its capacity for controllable synthesis. 2) Removing the normalization of the condition vector degrades realism as measured by FID, demonstrating its effectiveness in reducing variance in the latent space.
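A sketch of the frame pooling described in Section 4.3 and above follows: uniformly pick four frames, encode them with a CLIP image encoder, average, and L2-normalize the pooled condition (the normalization whose removal is ablated in Table 8). The `clip_encode` callable is an assumed interface.

```python
# Sketch of the video-to-condition pooling (Sec. 4.3 / Appendix E.3): uniformly
# sample 4 frames, encode each with a CLIP image encoder, average the features,
# and L2-normalize the pooled vector used as the generation condition.
# `clip_encode` is an assumed interface returning one embedding per frame.
import torch

def video_condition(frames, clip_encode):
    # frames: (num_frames, 3, H, W) video tensor
    idx = torch.linspace(0, frames.shape[0] - 1, steps=4).long()   # 4 uniform frames
    feats = clip_encode(frames[idx])                               # (4, d) CLIP features
    pooled = feats.mean(dim=0)                                     # "averaged" video feature
    return pooled / pooled.norm()                                  # normalized condition
```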
F. Reprogramming Templates

• before v q a n of &, X
• X before v q a n of &,
• in front of v q a n of &, X
• after X, v q a n of &
• after v q a n of &, X
• behind v q a n of &, X
• v q a n of &, then X
• v q a n of &, following X
• v q a n of &, later X
• X after v q a n of &
• before X, v q a n of &
Specifically, we replace X and &, respectively, with the natural language of sampled data and the class label of sampled
events from the database.
For verb (denoted as v), we have {‘hearing’, ‘noticing’, ‘listening to’, ‘appearing’}; for adjective (denoted as a), we
have {‘clear’, ‘noisy’, ‘close-up’, ‘weird’, ‘clean’}; for noun (denoted as n), we have {‘audio’, ‘sound’, ‘voice’}; for
numeral/quantifier (denoted as q), we have {‘a’, ‘the’, ‘some’}.
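Putting the templates together with the sampling strategy of Section 3.2.2, a minimal sketch of the online reprogramming step might look as follows; the event database format, the choice of two example templates, and the audio concatenation are illustrative assumptions.

```python
# Sketch of dynamic reprogramming (Sec. 3.2.2 / Appendix F): sample N in {0, 1, 2}
# single-label events from the database and splice them into a template around the
# original caption X. Database format and audio concatenation are assumptions.
import random

VERBS = ["hearing", "noticing", "listening to", "appearing"]
ADJS = ["clear", "noisy", "close-up", "weird", "clean"]
NOUNS = ["audio", "sound", "voice"]
QUANTS = ["a", "the", "some"]

def reprogram(caption, audio, database):
    """database: list of (class_label, event_audio) pairs annotated with a single label."""
    new_caption, new_audio = caption, list(audio)
    for label, event_audio in random.sample(database, k=random.choice([0, 1, 2])):
        phrase = f"{random.choice(QUANTS)} {random.choice(ADJS)} {random.choice(NOUNS)} of {label}"
        if random.random() < 0.5:        # template "X before v q a n of &"
            new_caption = f"{new_caption} before {random.choice(VERBS)} {phrase}"
            new_audio = new_audio + event_audio          # event audio appended after X
        else:                            # template "v q a n of &, then X"
            new_caption = f"{random.choice(VERBS)} {phrase}, then {new_caption}"
            new_audio = event_audio + new_audio          # event audio placed before X
    return new_caption, new_audio
```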
H. Limitations
Make-An-Audio adopts generative diffusion models for high-quality synthesis, and thus it inherently requires multiple
iterative refinements for better results. Besides, latent diffusion models typically require more computational
resources, and degradation could be witnessed with decreased training data. One of our future directions is to develop
lightweight and fast diffusion models for accelerating sampling.