MusicLM: Generating Music From Text

Andrea Agostinelli*1, Timo I. Denk*1, Zalán Borsos1, Jesse Engel1, Mauro Verzetti1, Antoine Caillon2, Qingqing Huang1, Aren Jansen1, Adam Roberts1, Marco Tagliasacchi1, Matt Sharifi1, Neil Zeghidour1, Christian Frank1

arXiv:2301.11325v1 [cs.SD] 26 Jan 2023

*Equal contribution. 1 Google Research. 2 IRCAM - Sorbonne Université (work done while interning at Google). Correspondence to: Christian Frank <[email protected]>.

Abstract

We introduce MusicLM, a model for generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text descriptions. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody, in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

google-research.github.io/seanet/musiclm/examples

1. Introduction

Conditional neural audio generation covers a wide range of applications, ranging from text-to-speech (Zen et al., 2013; van den Oord et al., 2016) to lyrics-conditioned music generation (Dhariwal et al., 2020) and audio synthesis from MIDI sequences (Hawthorne et al., 2022b). Such tasks are facilitated by a certain level of temporal alignment between the conditioning signal and the corresponding audio output. In contrast, and inspired by progress in text-to-image generation (Ramesh et al., 2021; 2022; Saharia et al., 2022; Yu et al., 2022), recent work has explored generating audio from sequence-wide, high-level captions (Yang et al., 2022; Kreuk et al., 2022) such as "whistling with wind blowing". While generating audio from such coarse captions represents a breakthrough, these models remain limited to simple acoustic scenes, consisting of few acoustic events over a period of seconds. Hence, turning a single text caption into a rich audio sequence with long-term structure and many stems, such as a music clip, remains an open challenge.

AudioLM (Borsos et al., 2022) has recently been proposed as a framework for audio generation. Casting audio synthesis as a language modeling task in a discrete representation space, and leveraging a hierarchy of coarse-to-fine audio discrete units (or tokens), AudioLM achieves both high fidelity and long-term coherence over dozens of seconds. Moreover, by making no assumptions about the content of the audio signal, AudioLM learns to generate realistic audio from audio-only corpora, be it speech or piano music, without any annotation. The ability to model diverse signals suggests that such a system could generate richer outputs if trained on the appropriate data.

Besides the inherent difficulty of synthesizing high-quality and coherent audio, another impeding factor is the scarcity of paired audio-text data. This is in stark contrast with the image domain, where the availability of massive datasets contributed significantly to the remarkable image generation quality that has recently been achieved (Ramesh et al., 2021; 2022; Saharia et al., 2022; Yu et al., 2022). Moreover, creating text descriptions of general audio is considerably harder than describing images. First, it is not straightforward to unambiguously capture with just a few words the salient characteristics of either acoustic scenes (e.g., the sounds heard in a train station or in a forest) or music (e.g., the melody, the rhythm, the timbre of vocals and the many instruments used in accompaniment). Second, audio is structured along a temporal dimension, which makes sequence-wide captions a much weaker level of annotation than an image caption.

In this work, we introduce MusicLM, a model for generating high-fidelity music from text descriptions. MusicLM leverages AudioLM's multi-stage autoregressive modeling as the generative component, while extending it to incorporate text conditioning. To address the main challenge of paired data scarcity, we rely on MuLan (Huang et al., 2022), a joint music-text model that is trained to project music and its corresponding text description to representations close to each other in an embedding space. This shared embedding space eliminates the need for captions at training time
altogether, and allows training on massive audio-only corpora. That is, we use the MuLan embeddings computed from the audio as conditioning during training, while we use MuLan embeddings computed from the text input during inference.

When trained on a large dataset of unlabeled music, MusicLM learns to generate long and coherent music at 24 kHz, for text descriptions of significant complexity, such as "enchanting jazz song with a memorable saxophone solo and a solo singer" or "Berlin 90s techno with a low bass and strong kick". To address the lack of evaluation data for this task, we introduce MusicCaps, a new high-quality music caption dataset with 5.5k examples prepared by expert musicians, which we publicly release to support future research.

Our experiments show through quantitative metrics and human evaluations that MusicLM outperforms previous systems such as Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Martiros, 2022), both in terms of quality and adherence to the caption. Furthermore, since describing some aspects of music with words can be difficult or even impossible, we show how our method supports conditioning signals beyond text. Concretely, we extend MusicLM to accept an additional melody in the form of audio (e.g., whistling, humming) as conditioning to generate a music clip that follows the desired melody, rendered in the style described by the text prompt.

We acknowledge the risks associated with music generation, in particular, the potential misappropriation of creative content. In accordance with responsible model development practices, we conduct a thorough study of memorization by adapting and extending the methodology of Carlini et al. (2022) used for text-based large language models. Our findings show that when feeding MuLan embeddings to MusicLM, the sequences of generated tokens significantly differ from the corresponding sequences in the training set.

The key contributions of this work are the following:

1. We introduce MusicLM, a generative model that produces high-quality music at 24 kHz which is consistent over several minutes while being faithful to a text conditioning signal.

2. We extend our method to other conditioning signals, such as a melody that is then synthesized according to the text prompt. Furthermore, we demonstrate long and coherent music generation of up to 5-minute long clips.

3. We release the first evaluation dataset collected specifically for the task of text-to-music generation: MusicCaps is a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians.

2. Background and Related Work

The state-of-the-art in generative modeling for various domains is largely dominated either by Transformer-based autoregressive models (Vaswani et al., 2017) or U-Net-based diffusion models (Ho et al., 2020). In this section, we review the related work with an emphasis on autoregressive generative models operating on discrete tokens, which share similarities with MusicLM.

2.1. Quantization

Modeling sequences of discrete tokens autoregressively has proven to be a powerful approach in natural language processing (Brown et al., 2020; Cohen et al., 2022) and image or video generation (Esser et al., 2021; Ramesh et al., 2021; Yu et al., 2022; Villegas et al., 2022). Quantization is a key component to the success of autoregressive models for continuous signals, including images, videos, and audio. The goal of quantization is to provide a compact, discrete representation, which at the same time allows for high-fidelity reconstruction. VQ-VAEs (Van Den Oord et al., 2017) demonstrated impressive reconstruction quality at low bitrates in various domains and serve as the underlying quantizer for many approaches.

SoundStream (Zeghidour et al., 2022) is a universal neural audio codec capable of compressing general audio at low bitrates, while maintaining a high reconstruction quality. To achieve this, SoundStream uses residual vector quantization (RVQ), allowing scalability to higher bitrate and quality, without a significant computational cost. More specifically, RVQ is a hierarchical quantization scheme composing a series of vector quantizers, where the target signal is reconstructed as the sum of quantizer outputs. Due to the composition of quantizers, RVQ avoids the exponential blowup in the codebook size as the target bitrate increases. Moreover, the fact that each quantizer is fitted to the residual of coarser quantizers introduces a hierarchical structure to the quantizers, where coarser levels are more important for high-fidelity reconstruction. This property is desirable for generation, since the past context can be defined by only attending to the coarse tokens. Recently, SoundStream was extended by EnCodec (Défossez et al., 2022) to higher bitrates and stereophonic audio. In this work, we rely on SoundStream as our audio tokenizer, since it can reconstruct 24 kHz music at 6 kbps with high fidelity.
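To make the residual quantization idea concrete, here is a minimal NumPy sketch of RVQ encoding and decoding with 12 quantizers of vocabulary size 1024, mirroring the configuration above. The random codebooks are purely illustrative stand-ins for trained ones; this is not SoundStream's implementation.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each quantizer picks the codeword
    nearest to the residual left over by the coarser quantizers."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                     # cb has shape (vocab, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]        # the next level sees the residual
    return codes

def rvq_decode(codes, codebooks):
    """The reconstruction is the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [0.5 ** level * rng.standard_normal((1024, 64))   # toy coarse-to-fine scales
             for level in range(12)]
x = rng.standard_normal(64)
codes = rvq_encode(x, codebooks)
print(codes[:4], np.linalg.norm(x - rvq_decode(codes, codebooks)))
```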
2.2. Generative Models for Audio

Despite the challenge of generating high-quality audio with long-term consistency, a series of approaches have recently tackled the problem with some success. Jukebox (Dhariwal et al., 2020), for example, proposes a hierarchy of VQ-VAEs at various time resolutions to achieve high temporal
coherence, but the generated music displays noticeable artifacts. PerceiverAR (Hawthorne et al., 2022a), on the other hand, proposes to model a sequence of SoundStream tokens autoregressively, achieving high-quality audio, but compromising the long-term temporal coherence.

Inspired by these approaches, AudioLM (Borsos et al., 2022) addresses the trade-off between coherence and high-quality synthesis by relying on a hierarchical tokenization and generation scheme. Concretely, the approach distinguishes between two token types: (1) semantic tokens that allow the modeling of long-term structure, extracted from models pretrained on audio data with the objective of masked language modeling; (2) acoustic tokens, provided by a neural audio codec, for capturing fine acoustic details. This allows AudioLM to generate coherent and high-quality speech as well as piano music continuations without relying on transcripts or symbolic music representations.

MusicLM builds on top of AudioLM with three important additional contributions: (1) we condition the generation process on a descriptive text, (2) we show that the conditioning can be extended to other signals such as melody, and (3) we model a large variety of long music sequences beyond piano music (from drum'n'bass over jazz to classical music).

2.3. Conditioned Audio Generation

Generating audio from a text description (such as "whistling with laughter in the background") has recently been tackled by several works. DiffSound (Yang et al., 2022) uses CLIP (Radford et al., 2021) as the text encoder and applies a diffusion model to predict the quantized mel spectrogram features of the target audio based on the text embeddings. AudioGen (Kreuk et al., 2022) uses a T5 (Raffel et al., 2020) encoder for embedding the text, and an autoregressive Transformer decoder for predicting target audio codes produced by EnCodec (Défossez et al., 2022). Both approaches rely on a modest amount of paired training data such as AudioSet (Gemmeke et al., 2017) and AudioCaps (Kim et al., 2019) (totalling less than 5k hours after filtering).

Closer to MusicLM, there are also works focusing on music generation conditioned on text. In Mubert (Mubert-Inc, 2022), the text prompt is embedded by a Transformer; music tags which are close to the encoded prompt are selected and used to query the song generation API. Based on the selected tags, Mubert generates a combination of sounds, which in turn were generated by musicians and sound designers. This is in contrast to Riffusion (Forsgren & Martiros, 2022), which fine-tunes a Stable Diffusion model (Rombach et al., 2022a) on mel spectrograms of music pieces from a paired music-text dataset. We use both Mubert and Riffusion as baselines for our work, showing that we improve the audio generation quality and adherence to the text description.

Symbolic representations of music (e.g., MIDI) can also be used to drive the generative process as a form of strong conditioning, as demonstrated by Huang et al. (2019); Hawthorne et al. (2019); Engel et al. (2020). MusicLM enables a more natural and intuitive way of providing a conditioning signal, for example through a hummed melody, which can also be combined with a text description.

2.4. Text-Conditioned Image Generation

Precursors to text-conditioned audio synthesis are the text-conditioned image generation models, which made significant progress in quality due to architectural improvements and the availability of massive, high-quality paired training data. Prominent Transformer-based autoregressive approaches include Ramesh et al. (2021); Yu et al. (2022), while Nichol et al. (2022); Rombach et al. (2022b); Saharia et al. (2022) present diffusion-based models. The text-to-image approaches have been extended to generating videos from a text prompt (Wu et al., 2022a; Hong et al., 2022; Villegas et al., 2022; Ho et al., 2022).

The closest to our approach among these works is DALL·E 2 (Ramesh et al., 2022). In particular, similarly to the way DALL·E 2 relies on CLIP (Radford et al., 2021) for text encoding, we also use a joint music-text embedding model for the same purpose. In contrast to DALL·E 2, which uses a diffusion model as a decoder, our decoder is based on AudioLM. Furthermore, we also omit the prior model mapping text embeddings to music embeddings, such that the AudioLM-based decoder can be trained on an audio-only dataset and the music embedding is simply replaced during inference by the text embedding.

2.5. Joint Embedding Models for Music and Text

MuLan (Huang et al., 2022) is a music-text joint embedding model consisting of two embedding towers, one for each modality. The towers map the two modalities to a shared embedding space of 128 dimensions using contrastive learning, with a setup similar to (Radford et al., 2021; Wu et al., 2022b). The text embedding network is a BERT (Devlin et al., 2019) pre-trained on a large corpus of text-only data, while we use the ResNet-50 variant of the audio tower. MuLan is trained on pairs of music clips and their corresponding text annotations. Importantly, MuLan imposes only weak requirements on its training data quality, learning cross-modal correspondences even when the music-text pairs are only weakly associated. The ability to link music to unconstrained natural language descriptions makes it applicable for retrieval or zero-shot music tagging. In this work, we rely on the pretrained and frozen model of Huang et al. (2022).
Figure 1. Independent pretraining of the models providing the audio and text representations for MusicLM: SoundStream (Zeghidour et al., 2022), w2v-BERT (Chung et al., 2021), and MuLan (Huang et al., 2022). In the diagram, SoundStream is trained with reconstruction and adversarial losses, w2v-BERT with MLM and contrastive losses, and MuLan with a contrastive loss over <audio, text> pairs.

3. Method

In this section, we describe MusicLM and its components. Section 3.1 describes the models that provide the audio representations. Then, we show in Section 3.2 how we use these representations for text-conditioned music generation.

3.1. Representation and Tokenization of Audio and Text

We use three models for extracting audio representations that will serve for conditional autoregressive music generation, which are illustrated in Figure 1. In particular, by following the approach of AudioLM, we use the self-supervised audio representations of SoundStream (Zeghidour et al., 2022), as acoustic tokens to enable high-fidelity synthesis, and w2v-BERT (Chung et al., 2021), as semantic tokens to facilitate long-term coherent generation. For representing the conditioning, we rely on the MuLan music embedding during training and the MuLan text embedding at inference time. All three of these models are pretrained independently and then frozen, such that they provide the discrete audio and text representations for the sequence-to-sequence modeling.

SoundStream. We use a SoundStream model for 24 kHz monophonic audio with a striding factor of 480, resulting in 50 Hz embeddings. The quantization of these embeddings is learned during training by an RVQ with 12 quantizers, each with a vocabulary size of 1024. This results in a bitrate of 6 kbps, where one second of audio is represented by 600 tokens. We refer to these as acoustic tokens, denoted by A.

w2v-BERT. Similarly to AudioLM, we use an intermediate layer of the masked-language-modeling (MLM) module of a w2v-BERT model with 600M parameters. After pretraining and freezing the model, we extract embeddings from the 7th layer and quantize them using the centroids of a learned k-means over the embeddings. We use 1024 clusters and a sampling rate of 25 Hz, resulting in 25 semantic tokens for every second of audio, denoted by S.
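As a quick sanity check, the token rates and bitrate quoted above follow directly from the stated configuration; the short computation below only restates those numbers.

```python
import math

sample_rate, stride = 24_000, 480
frame_rate = sample_rate / stride                   # 50 Hz SoundStream embeddings
rvq_levels, vocab = 12, 1024
acoustic_tokens_per_s = frame_rate * rvq_levels     # 50 * 12 = 600 acoustic tokens/s
bitrate_kbps = acoustic_tokens_per_s * math.log2(vocab) / 1000   # 6.0 kbps
semantic_tokens_per_s = 25                          # w2v-BERT tokens after k-means
print(frame_rate, acoustic_tokens_per_s, bitrate_kbps, semantic_tokens_per_s)
```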
MuLan. To train MusicLM, we extract the representation of the target audio sequence from the audio-embedding network of MuLan. Note that this representation is continuous and could be directly used as a conditioning signal in Transformer-based autoregressive models. However, we opt for quantizing the MuLan embeddings in such a way that both the audio and the conditioning signal have a homogeneous representation based on discrete tokens, aiding further research into autoregressively modeling the conditioning signal as well.

Since MuLan operates on 10-second audio inputs and we need to process longer audio sequences, we calculate the audio embeddings on 10-second windows with 1-second stride and average the resulting embeddings. We then discretize the resulting embedding by applying an RVQ with 12 vector quantizers, each with a vocabulary size of 1024. This process yields 12 MuLan audio tokens M_A for an audio sequence. During inference, we use as conditioning the MuLan text embedding extracted from the text prompt, and quantize it with the same RVQ as the one used for the audio embeddings, to obtain 12 tokens M_T.
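A small sketch of the windowing and discretization step described above, assuming hypothetical callables mulan_audio_embed (the frozen MuLan audio tower) and rvq_encode (the learned RVQ); both are stand-ins, not a public API.

```python
import numpy as np

def mulan_audio_tokens(audio, sr, mulan_audio_embed, rvq_encode):
    """Average 10-second MuLan embeddings computed with a 1-second stride,
    then discretize the mean embedding into 12 tokens (vocabulary 1024)."""
    win, hop = 10 * sr, 1 * sr
    starts = range(0, len(audio) - win + 1, hop)
    embeddings = np.stack([mulan_audio_embed(audio[s:s + win]) for s in starts])
    return rvq_encode(embeddings.mean(axis=0))       # 12 MuLan audio tokens M_A

# Toy usage with dummy stand-ins (random embedding and random codes).
rng = np.random.default_rng(0)
dummy_embed = lambda clip: rng.standard_normal(128)
dummy_rvq = lambda emb: list(rng.integers(0, 1024, size=12))
tokens = mulan_audio_tokens(rng.standard_normal(30 * 24_000), 24_000,
                            dummy_embed, dummy_rvq)
print(len(tokens))   # 12
```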
Conditioning on M_A during training has two main advantages. First, it allows us to easily scale our training data, since we are not limited by the need of text captions. Second, by exploiting a model like MuLan, trained using a contrastive loss, we increase the robustness to noisy text descriptions.

3.2. Hierarchical Modeling of Audio Representations

We combine the discrete audio representations presented above with AudioLM to achieve text-conditioned music generation. For this, we propose a hierarchical sequence-to-sequence modeling task, where each stage is modeled autoregressively by a separate decoder-only Transformer. The proposed approach is illustrated in Figure 2.

The first stage is the semantic modeling stage, which learns the mapping from the MuLan audio tokens to the semantic tokens S, by modeling the distribution p(S_t | S_{<t}, M_A), where t is the position in the sequence corresponding to a time step. The second stage is the acoustic modeling stage, where the acoustic tokens A_q are predicted conditioned on both the MuLan audio tokens and the semantic tokens, modeling the distribution p(A_t | A_{<t}, S, M_A).

Notably, to avoid long token sequences, AudioLM proposed to further split the acoustic modeling stage into a coarse and a fine modeling stage. We rely on the same approach, where the coarse stage models the first four levels from the output of the SoundStream RVQ, and the fine stage models the remaining eight; we refer to Borsos et al. (2022) for details.
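The two-stage factorization can be summarized with the following sampling sketch. The semantic_model and acoustic_model callables are placeholders for the decoder-only Transformers (random logits are used here so the snippet runs), and the sequence lengths are illustrative.

```python
import numpy as np

def sample_autoregressive(next_token_logits, conditioning, num_steps, temperature, rng):
    """Sample num_steps tokens from p(x_t | x_<t, conditioning)."""
    tokens = []
    for _ in range(num_steps):
        logits = next_token_logits(conditioning + tokens) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def generate(mulan_tokens, semantic_model, acoustic_model, rng,
             semantic_steps=250, acoustic_steps=600):
    # Stage 1: semantic tokens S ~ p(S_t | S_<t, M_A)
    S = sample_autoregressive(semantic_model, mulan_tokens, semantic_steps, 1.0, rng)
    # Stage 2: acoustic tokens A ~ p(A_t | A_<t, S, M_A)
    A = sample_autoregressive(acoustic_model, mulan_tokens + S, acoustic_steps, 1.0, rng)
    return S, A

rng = np.random.default_rng(0)
dummy_model = lambda context: rng.standard_normal(1024)   # placeholder logits
S, A = generate(list(range(12)), dummy_model, dummy_model, rng)
print(len(S), len(A))
```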
Figure 2. Left: During training we extract the MuLan audio tokens, semantic tokens, and acoustic tokens from the audio-only training set. In the semantic modeling stage, we predict semantic tokens using MuLan audio tokens as conditioning. In the subsequent acoustic modeling stage, we predict acoustic tokens, given both MuLan audio tokens and semantic tokens. Each stage is modeled as a sequence-to-sequence task using decoder-only Transformers. Right: During inference, we use MuLan text tokens computed from the text prompt as conditioning signal and convert the generated audio tokens to waveforms using the SoundStream decoder.

4. Experimental Setup

4.1. Models

We use decoder-only Transformers for modeling the semantic stage and the acoustic stages of AudioLM. The models share the same architecture, composed of 24 layers, 16 attention heads, an embedding dimension of 1024, feed-forward layers of dimensionality 4096, dropout of 0.1, and relative positional embeddings (Raffel et al., 2020), resulting in 430M parameters per stage.

4.2. Training and Inference

By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz. Each of the stages is trained with multiple passes over the training data. We use 30 and 10-second random crops of the target audio for the semantic stage and the acoustic stage, respectively. The AudioLM fine acoustic modeling stage is trained on 3-second crops.

During inference, we make use of the joint embedding space between audio and text learned by MuLan, that is, we substitute M_A with M_T. We then follow the stages described above and obtain A given M_T. We use temperature sampling for the autoregressive sampling in all stages, with a temperature of 1.0 for the semantic modeling stage, and 0.95 and 0.4 for the coarse and fine acoustic modeling stages, respectively. These temperature values were chosen based on subjective inspection to provide a good trade-off between diversity and temporal consistency of the generated music.
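To illustrate the effect of the temperatures quoted above on a toy distribution (the logits are made up; only the temperature values come from the text):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # numerical stability
    p = np.exp(z)
    return p / p.sum()

toy_logits = [2.0, 1.0, 0.5, 0.1]         # toy next-token scores
for t in (1.0, 0.95, 0.4):                # semantic, coarse acoustic, fine acoustic
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
# Lower temperature concentrates probability mass on the top tokens,
# trading diversity for temporal consistency.
```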
4.3. Evaluation Dataset

To evaluate MusicLM, we prepare MusicCaps, a high-quality music caption dataset, which we make publicly available.1 This dataset includes 5.5k music clips from AudioSet (Gemmeke et al., 2017), each paired with corresponding text descriptions in English, written by ten professional musicians. For each 10-second music clip, MusicCaps provides: (1) a free-text caption consisting of four sentences on average, describing the music and (2) a list of music aspects, describing genre, mood, tempo, singer voices, instrumentation, dissonances, rhythm, etc. On average, the dataset includes eleven aspects per clip. See Appendix A for a few caption and aspect list examples.

MusicCaps complements AudioCaps (Kim et al., 2019), as they both contain audio clips from AudioSet with corresponding textual descriptions. However, while AudioCaps contains non-music content, MusicCaps focuses exclusively on music and includes highly detailed expert-provided annotations. The examples are extracted from both the train and eval split of AudioSet, covering a diverse distribution of genres, as detailed in Appendix A. MusicCaps also provides a genre-balanced split of the data with 1k examples.

1 kaggle.com/datasets/googleai/musiccaps

4.4. Metrics

We compute different metrics to evaluate MusicLM, capturing two important aspects of music generation: the audio quality and the adherence to the text description.

Fréchet Audio Distance (FAD). The Fréchet Audio Distance (Kilgour et al., 2019) is a reference-free audio quality metric, which correlates well with human perception. Models producing samples with a low FAD score are expected
to generate plausible audio. However, the generated samples might not necessarily adhere to the text description provided as conditioning.

We report the FAD based on two audio embedding models, both of which are publicly available: (1) Trill2 (Shor et al., 2020), which is trained on speech data, and (2) VGGish3 (Hershey et al., 2017), which is trained on the YouTube-8M audio event dataset (Abu-El-Haija et al., 2016). Because of the difference in training data, we expect the models to measure different aspects of the audio quality (speech and non-speech, respectively).

2 tfhub.dev/google/nonsemantic-speech-benchmark/trill/3
3 tfhub.dev/google/vggish/1
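For reference, FAD compares Gaussian statistics of embeddings computed on a reference set and on generated audio. The sketch below follows the standard Fréchet-distance formulation; it is not the official implementation, and the embedding model (Trill or VGGish) is assumed to be applied by the caller.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_embeddings, gen_embeddings):
    """Fréchet distance between Gaussians fitted to reference and generated
    embeddings (rows are clips, columns are embedding dimensions)."""
    mu_r, mu_g = ref_embeddings.mean(axis=0), gen_embeddings.mean(axis=0)
    cov_r = np.cov(ref_embeddings, rowvar=False)
    cov_g = np.cov(gen_embeddings, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g).real          # matrix square root of the product
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2.0 * cov_mean))
```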
KL Divergence (KLD). There is a many-to-many relationship between text descriptions and music clips compatible with them. It is therefore not possible to directly compare the generated music with the reference at the level of the audio waveform. To assess the adherence to the input text description, we adopt a proxy method similar to the one proposed in Yang et al. (2022); Kreuk et al. (2022). Specifically, we use a LEAF (Zeghidour et al., 2021) classifier trained for multi-label classification on AudioSet, to compute class predictions for both the generated and the reference music and measure the KL divergence between probability distributions of class predictions. When the KL divergence is low, the generated music is expected to have similar acoustic characteristics as the reference music, according to the classifier.
probability distributions of class predictions. When the of semantic token counts over the corresponding vocab-
KL-divergence is low, the generated music is expected to ulary {0, . . . , 1023} from both the generated and target
have similar acoustic characteristics as the reference music, tokens, and define a matching cost measure between his-
according to the classifier. tograms as follows. First, we compute the distance matrix
between pairs of semantic tokens, which is populated by the
MuLan Cycle Consistency (MCC). As a joint music- Euclidean distances between the corresponding k-means
text embedding model, MuLan can be used to quantify the centroids used to quantize w2v-BERT to semantic tokens
similarity between music-text pairs. We compute the MuLan (see Section 3.1). Then, we solve an optimal transport prob-
embeddings from the text descriptions in MusicCaps as lem to find the matching cost between a pair of histograms
well as the generated music based on them, and define the using the Sinkhorn algorithm (Cuturi, 2013), considering
MCC metric as the average cosine similarity between these only the sub-matrix corresponding to non-zero token counts
embeddings. in the two histograms. To calibrate the threshold used to
determine whether two sequences might be approximate
matches, we construct negative pairs by permuting the
Qualitative evaluation. Ultimately, we rely on subjective
examples with target tokens and measure the empirical
tests to evaluate the adherence of generated samples to the
distribution of matching costs for such negative pairs. We
text description. We set up an A-vs-B human rating task, in
set the match threshold τ to 0.85, which leads to less than
which raters are presented with the text description and two
0.01% false positive approximate matches.
samples of music generated by two different models, or one
model and the reference music. There are five possible an-
swers: strong or weak preference for A or B, and no prefer- 5. Results
ence. The raters are instructed not to take the music quality
We evaluate MusicLM by comparing it with two recent
into account when making their decision, because this as-
baselines for music generation from descriptive text, namely
pect of the evaluation is already covered by the FAD metric.
Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Mar-
We consider the output of n different models, in addition tiros, 2022). In particular, we generate audio by querying
to the reference music, thus a total of n + 1 conditions and the Mubert API,4 and by running inference on the Riffusion
n(n + 1)/2 pairs. To aggregate the results of the pairwise model.5 We perform our evaluations on MusicCaps, the eval-
tests and rank conditions, we count the number of “wins”, uation dataset we publicly release together with this paper.
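The approximate-match criterion above can be sketched as follows, with a plain Sinkhorn loop standing in for an off-the-shelf optimal-transport solver. The regularization strength, iteration count, and toy centroids are illustrative assumptions, not values from the paper.

```python
import numpy as np

def histogram_matching_cost(gen_tokens, tgt_tokens, centroids,
                            vocab_size=1024, reg=1.0, num_iters=200):
    """Optimal-transport cost between semantic-token count histograms, with
    Euclidean distances between k-means centroids as the ground cost."""
    h_gen = np.bincount(gen_tokens, minlength=vocab_size).astype(float)
    h_tgt = np.bincount(tgt_tokens, minlength=vocab_size).astype(float)
    ia, ib = np.flatnonzero(h_gen), np.flatnonzero(h_tgt)   # non-zero counts only
    a, b = h_gen[ia] / h_gen.sum(), h_tgt[ib] / h_tgt.sum()
    cost = np.linalg.norm(centroids[ia][:, None, :] - centroids[ib][None, :, :], axis=-1)
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(num_iters):                              # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport_plan = u[:, None] * K * v[None, :]
    return float(np.sum(transport_plan * cost))

rng = np.random.default_rng(0)
centroids = rng.standard_normal((1024, 16))                 # toy k-means centroids
gen = rng.integers(0, 1024, size=125)                       # 5 s of semantic tokens
tgt = rng.integers(0, 1024, size=125)
print(histogram_matching_cost(gen, tgt, centroids))
```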
5. Results

We evaluate MusicLM by comparing it with two recent baselines for music generation from descriptive text, namely Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Martiros, 2022). In particular, we generate audio by querying the Mubert API,4 and by running inference on the Riffusion model.5 We perform our evaluations on MusicCaps, the evaluation dataset we publicly release together with this paper.

4 github.com/MubertAI (accessed in Dec 2022 and Jan 2023)
5 github.com/riffusion/riffusion-app (accessed on Dec 27, 2022)
Table 1. Evaluation of generated samples using captions from the MusicCaps dataset. Models are compared in terms of audio quality, by means of Fréchet Audio Distance (FAD), and faithfulness to the text description, using Kullback-Leibler Divergence (KLD) and MuLan Cycle Consistency (MCC), and counts of wins in pairwise human listening tests (Wins).

Model        FAD_Trill ↓   FAD_VGG ↓   KLD ↓   MCC ↑   Wins ↑
Riffusion    0.76          13.4        1.19    0.34    158
Mubert       0.45          9.6         1.58    0.32    97
MusicLM      0.44          4.0         1.01    0.51    312
MusicCaps    -             -           -       -       472

Comparison to baselines. Table 1 reports the main quantitative and qualitative results of this paper. In terms of audio quality, as captured by the FAD metrics, on FAD_VGG MusicLM achieves better scores than Mubert and Riffusion. On FAD_Trill, MusicLM scores similarly to Mubert (0.44 vs. 0.45) and better than Riffusion (0.76). We note that, according to these metrics, MusicLM is capable of generating high-quality music comparable to Mubert, which relies on pre-recorded sounds prepared by musicians and sound designers. In terms of faithfulness to the input text description, as captured by KLD and MCC, MusicLM achieves the best scores, suggesting that it is able to capture more information from the text descriptions compared to the baselines. We further supplement our evaluation of text faithfulness with a human listening test. Participants are presented with two 10-second clips and a text caption, and asked which clip is best described by the text of the caption on a 5-point Likert scale. We collect 1200 ratings, with each source involved in 600 pair-wise comparisons. Table 1 reports the total number of "wins", that is, counting how often the human raters preferred a model in a side-by-side comparison. MusicLM is clearly preferred over both baselines, while there is still a measurable gap to the ground truth reference music. Full details of the listening study can be found in Appendix B. Listening to examples in which the ground truth was preferred over MusicLM reveals the following patterns: (1) captions are extremely detailed, referring to more than five instruments or describing non-musical aspects such as "wind, people talking"; (2) captions describe temporal ordering of the audio being played; (3) negations are used, which are not well captured by MuLan.

Overall, we conclude that: (1) our approach is able to capture fine-grained information from the rich free-text captions of MusicCaps; (2) the KLD and MCC metrics provide a quantitative measure of the faithfulness to the text description, which is in accordance with the human rating study.

Importance of semantic tokens. To understand the usefulness of decoupling semantic modeling from acoustic modeling, we train a Transformer model which directly predicts coarse acoustic tokens from MuLan tokens, by modeling p(A_t | A_{<t}, M_A). We observe that while the FAD metrics are comparable (0.42 FAD_Trill and 4.0 FAD_VGG), KLD and MCC scores worsen when removing the semantic modeling stage. In particular, the KLD score increases from 1.01 to 1.05, and the MCC score decreases from 0.51 to 0.49, indicating that semantic tokens facilitate the adherence to the text description. We also confirm this qualitatively by listening to the samples. In addition, we observe degradation in long-term structure.

Information represented by audio tokens. We conduct additional experiments to study the information captured by the semantic and the acoustic tokens. In the first study, we fix the MuLan text tokens as well as the semantic tokens, running the acoustic modeling stage multiple times to generate several samples. In this case, by listening to the generated music, it is possible to observe that the samples are diverse, yet they tend to share the same genre, rhythmical properties (e.g., drums), and part of the main melody. They differ in terms of specific acoustic properties (e.g., level of reverb, distortion) and, in some cases, different instruments with a similar pitch range can be synthesized in different examples. In the second study, we fix only the MuLan text tokens and generate both the semantic and acoustic tokens. In this case, we observe a much higher level of diversity in terms of melodies and rhythmic properties, still coherent with the text description. We provide samples from this study in the accompanying material.

Memorization analysis. Figure 3 reports both exact and approximate matches when the length of the semantic token prompt is varied between 0 and 10 seconds. We observe that the fraction of exact matches always remains very small (< 0.2%), even when using a 10 second prompt to generate a continuation of 5 seconds. Figure 3 also includes results for approximate matches, using τ = 0.85. We can see a higher number of matches detected with this methodology, also when using only MuLan tokens as input (prompt length T = 0), and the fraction of matching examples increases as the length of the prompt increases. We inspect these matches more closely and observe that those with the lowest matching score correspond to sequences characterized by a low level of token diversity. Namely, the average empirical entropy of a sample of 125 semantic tokens is 4.6 bits, while it drops to 1.0 bits when considering sequences detected as approximate matches with matching score less than 0.5. We include a sample of approximate matches obtained with T = 0 in the accompanying material. Note that the acoustic modeling carried out by the second stage introduces further diversity in the generated samples, also when the semantic tokens match exactly.
Figure 3. Memorization results for the semantic modeling stage. We compare the semantic tokens generated for 5 seconds of audio to corresponding tokens in the training set, considering exact and approximate matches. (The plot shows the fraction of matching examples on a logarithmic scale from 0.1% to 10%, as a function of the prompt length in seconds, from 0.0 to 10.0, with one curve for approximate matches and one for exact matches.)

6. Extensions

Melody conditioning. We extend MusicLM in such a way that it can generate music based on both a text description and a melody, which is provided in the form of humming, singing, whistling, or playing an instrument. This requires extending the conditioning signal in a way that captures the target melody. To this end, we create a synthetic dataset composed of audio pairs with matching melodies but different acoustics. To create such pairs, we use different versions of the same music clip, such as covers, instrumentals, or vocals. Additionally, we acquire data pairs of people humming and singing. We then train a joint embedding model such that when two audio clips contain the same melody, the corresponding embeddings are close to each other. For implementation details we refer to Appendix C.

To extract the melody conditioning for MusicLM, we quantize the melody embeddings with RVQ, and concatenate the resulting token sequences with the MuLan audio tokens M_A. During inference, we compute melody tokens from the input audio clip and concatenate them with the MuLan text tokens M_T. Based on this conditioning, MusicLM can successfully generate music which follows the melody contained in the input audio clip, while adhering to the text description.
ration is autoregressive in the temporal dimension which with responsible model development practices, we con-
makes it possible to generate sequences longer than those ducted a thorough study of memorization, adapting and
used during training. In practice, the semantic modeling extending a methodology used in the context of text-based
stage is trained on sequences of 30 seconds. To generate LLMs, focusing on the semantic modeling stage. We found
longer sequences, we advance with a stride of 15 seconds, that only a tiny fraction of examples was memorized ex-
using 15 seconds as prefix to generate an additional 15 sec- actly, while for 1% of the examples we could identify an ap-
onds, always conditioning on the same text description. proximate match. We strongly emphasize the need for more
With this approach we can generate long audio sequences future work in tackling these risks associated to music gene-
which are coherent over several minutes. ration — we have no plans to release models at this point.

With a small modification, we can generate long audio se-


quences while changing the text description over time. Bor-
rowing from Villegas et al. (2022) in the context of video
generation, we refer to this approach as story mode. Con-
References

Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675, 2016.

Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Teboul, O., Grangier, D., Tagliasacchi, M., and Zeghidour, N. AudioLM: a language modeling approach to audio generation. arXiv:2209.03143, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models, 2020. URL https://arxiv.org/abs/2012.07805.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models, 2022. URL https://arxiv.org/abs/2202.07646.

Chung, Y., Zhang, Y., Han, W., Chiu, C., Qin, J., Pang, R., and Wu, Y. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. arXiv:2108.06209, 2021.

Cohen, A. D., Roberts, A., Molina, A., Butryna, A., Jin, A., Kulshreshtha, A., Hutchinson, B., Zevenbergen, B., Aguera-Arcas, B. H., ching Chang, C., Cui, C., Du, C., Adiwardana, D. D. F., Chen, D., Lepikhin, D. D., Chi, E. H., Hoffman-John, E., Cheng, H.-T., Lee, H., Krivokon, I., Qin, J., Hall, J., Fenton, J., Soraker, J., Meier-Hellstern, K., Olson, K., Aroyo, L. M., Bosma, M. P., Pickett, M. J., Menegali, M. A., Croak, M., Díaz, M., Lamm, M., Krikun, M., Morris, M. R., Shazeer, N., Le, Q. V., Bernstein, R., Rajakumar, R., Kurzweil, R., Thoppilan, R., Zheng, S., Bos, T., Duke, T., Doshi, T., Zhao, V. Y., Prabhakaran, V., Rusch, W., Li, Y., Huang, Y., Zhou, Y., Xu, Y., and Chen, Z. LaMDA: Language models for dialog applications. arXiv:2201.08239, 2022.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.

Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. FMA: A dataset for music analysis. In International Society for Music Information Retrieval Conference (ISMIR), 2017.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. Jukebox: A generative model for music. arXiv:2005.00341, 2020.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Défossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. arXiv:2210.13438, 2022.

Engel, J. H., Hantrakul, L., Gu, C., and Roberts, A. DDSP: differentiable digital signal processing. In International Conference on Learning Representations (ICLR), 2020.

Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

Forsgren, S. and Martiros, H. Riffusion - Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about.

Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C. A., Dieleman, S., Elsen, E., Engel, J. H., and Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In International Conference on Learning Representations (ICLR), 2019.

Hawthorne, C., Jaegle, A., Cangea, C., Borgeaud, S., Nash, C., Malinowski, M., Dieleman, S., Vinyals, O., Botvinick, M. M., Simon, I., Sheahan, H., Zeghidour, N., Alayrac, J., Carreira, J., and Engel, J. H. General-purpose, long-context autoregressive modeling with Perceiver AR. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning (ICML), 2022a.
Hawthorne, C., Simon, I., Roberts, A., Zeghidour, N., Gardner, J., Manilow, E., and Engel, J. H. Multi-instrument music synthesis with spectrogram diffusion. arXiv:2206.05408, 2022b.

Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. CNN architectures for large-scale audio classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 2020.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv:2204.03458, 2022.

Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868, 2022.

Huang, C. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. Music Transformer: Generating music with long-term structure. In International Conference on Learning Representations (ICLR), 2019.

Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. MuLan: A joint embedding of music audio and natural language. In International Society for Music Information Retrieval Conference (ISMIR), 2022.

Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In INTERSPEECH, 2019.

Kim, C. D., Kim, B., Lee, H., and Kim, G. AudioCaps: Generating captions for audios in the wild. In NAACL-HLT, 2019.

Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., and Adi, Y. AudioGen: Textually guided audio generation, 2022.

Mubert-Inc. Mubert. https://mubert.com/, https://github.com/MubertAI/Mubert-Text-to-Music, 2022.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML), 2022.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In Meila, M. and Zhang, T. (eds.), International Conference on Machine Learning (ICML), 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684-10695, June 2022a.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022b.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Shor, J., Jansen, A., Maor, R., Lang, O., Tuval, O., de Chaumont Quitry, F., Tagliasacchi, M., Shavitt, I., Emanuel, D., and Haviv, Y. Towards learning a universal non-semantic representation of speech. In INTERSPEECH, 2020.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In ISCA, 2016.

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in Neural Information Processing Systems (NeurIPS), 2017.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017.

Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., and Erhan, D. Phenaki: Variable length video generation from open domain textual description. arXiv:2210.02399, 2022.

Wu, C., Liang, J., Ji, L., Yang, F., Fang, Y., Jiang, D., and Duan, N. Nüwa: Visual synthesis pre-training for neural visual world creation. In European Conference on Computer Vision (ECCV), 2022a.

Wu, H., Seetharaman, P., Kumar, K., and Bello, J. P. Wav2CLIP: Learning robust audio representations from CLIP. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022b.

Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., and Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. arXiv:2207.09983, 2022.

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., and Wu, Y. Scaling autoregressive models for content-rich text-to-image generation, 2022.

Zeghidour, N., Teboul, O., de Chaumont Quitry, F., and Tagliasacchi, M. LEAF: A learnable frontend for audio classification. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=jM76BCb6F9m.

Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 2022.

Zen, H., Senior, A., and Schuster, M. Statistical parametric speech synthesis using deep neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
A. MusicCaps Dataset
Together with this paper, we release MusicCaps, a high-quality music caption dataset.6 This dataset includes music clips
from AudioSet (Gemmeke et al., 2017), paired with corresponding text descriptions in English. It contains a total of
5,521 examples, out of which 2,858 are from the AudioSet eval and 2,663 from the AudioSet train split. We further tag
1,000 examples as a balanced subset of our dataset, which is balanced with respect to the genres of the music contained. All
examples in the balanced subset are from the AudioSet eval split.
Examples of free text captions:

• “This folk song features a male voice singing the main melody in an emotional mood. This is accompanied by an
accordion playing fills in the background. A violin plays a droning melody. There is no percussion in this song. This
song can be played at a Central Asian classical concert.”
• “This is a live recording of a keyboardist playing a twelve bar blues progression on an electric keyboard. The player
adds embellishments between chord changes and the piece sounds groovy, bluesy and soulful.”
• “A synth is playing an arpeggio pluck with a lot of reverb rising and falling in velocity. Another synth sound is playing
pads and a sub bassline. This song is full of synth sounds creating a soothing and adventurous atmosphere. This song
may be playing at a festival during two songs for a buildup.”
• “A low sounding male voice is rapping over a fast paced drums playing a reggaeton beat along with a bass. Something
like a guitar is playing the melody along. This recording is of poor audio-quality. In the background a laughter can be
noticed. This song may be playing in a bar.”
• “The electronic music features a section that repeats roughly every two seconds. It consists of a beat that’s made of a
kick drum and claps. A buzzing synth sets the pulsation of the music by playing once every two beats. The whole music
sounds like a loop being played over and over. Towards the end of the excerpt a crescendo-like buzzing sound can be
heard, increasing the tension.”

Examples of aspect lists:

• “pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead, soft
female vocal, punchy kick, sustained synth bass, claps, emotional, sad, passionate”
• “amateur recording, finger snipping, male mid range voice singing, reverb”
• “backing track, jazzy, digital drums, piano, e-bass, trumpet, acoustic guitar, digital keyboard song, medium tempo”
• “rubab instrument, repetitive melody on different octaves, no other instruments, plucked string instrument, no voice,
instrumental, fast tempo”
• “instrumental, white noise, female vocalisation, three unrelated tracks, electric guitar harmony, bass guitar, keyboard
harmony, female lead vocalisation, keyboard harmony, slick drumming, boomy bass drops, male voice backup
vocalisation”

6 kaggle.com/datasets/googleai/musiccaps
Figure 4. Genre distribution of all 5.5k examples of MusicCaps, according to an AudioSet classifier. The largest shares are Electronic music (15.6%), Classical music (13.7%), and Rock music (10.5%), followed by Country (5.6%), Blues (5.3%), Music for children (5.0%), New-age music (4.4%), Jazz (4.1%), Music of LatAm (3.8%), Vocal music (3.5%), Hip hop music (3.4%), Traditional music (3.2%), Christian music (2.8%), Music of Asia (2.6%), Pop music (2.1%), Reggae (2.0%), and Funk (1.8%).

Figure 5. Genre distribution of a balanced 1k example subset of MusicCaps, according to an AudioSet classifier. The distribution is close to uniform: Pop music (4.3%), Hip hop music (4.3%), Rock music, Reggae, Country, Funk, Independent music, Traditional music, Jazz, Ska, Classical music, Music of Asia, Electronic music, Christian music, Music of LatAm, Vocal music, Blues, New-age music, and Music for children (4.2% each), Rhythm and blues (4.0%), Middle Eastern (4.0%), Soul music (3.5%), Folk music (3.1%), and Music of Africa (3.0%).
B. Qualitative Evaluation
Participants in the listening test were presented with two 10-second clips and a text caption, and asked which clip is best
described by the text of the caption on a 5-point Likert scale. They were also instructed to ignore audio quality and focus just
on how well the text matches the music (similar to MuLan score). Figure 6 shows the user interface presented to raters.
We collected 1200 ratings, with each source involved in 600 pair-wise comparisons. Figures 7 and 8 show the granular
results of pairwise comparisons between the models. According to a post-hoc analysis using the Wilcoxon signed-rank test
with Bonferroni correction (with p < 0.01/15), the orderings shown in Figure 8 from raters are all statistically significant.

Figure 6. User interface for the human listener study.

Figure 7. Pairwise comparisons from the human listener study. Each pair is compared on a 5-point Likert scale. Raters had a decisive
model preference in all cases except Mubert vs. Riffusion.
System      All   vs. MusicCaps   vs. MusicLM   vs. Riffusion   vs. Mubert
MusicCaps   84    -               73            88              91
MusicLM     60    27              -             75              81
Riffusion   32    12              25            -               66
Mubert      19    9.2             19            34              -

Figure 8. Win percentage from the human listener study. Each row indicates the % of times listeners found the music to better match the caption from that system to those from any other system (first column, N = 1200) and each system individually (other columns, N = 600). The ground truth data (MusicCaps) clearly is the best match to the captions, but followed closely by MusicLM, which even beats the ground truth in 27% of comparisons.

C. Melody Conditioning
We provide here implementation details of the model used for conditioning the music generation on melody. The model is
based on a small ViT (Dosovitskiy et al., 2021) composed of 12 layers, 6 attention heads, embedding dimension of 512
and a feed-forward layer of dimension 1024. The inputs to the model are the temporal frames of the mel spectrogram of the
audio. We use a semi-hard triplet loss (Schroff et al., 2015) to train the melody embedding model to generate 192-dimensional
embeddings for each 4 seconds of audio. The model learns to generate embeddings which are representative of a melody
while being invariant to acoustic properties related to the instruments being played. This is particularly advantageous,
since this representation is complementary to the representation learned by the MuLan embeddings. Hence, our melody
embeddings and the MuLan can be jointly and complementarily used for conditioning the music generation process. During
training, we consider input audio with a duration of 10 seconds. We extract three melody embeddings, with a hop length of
3 seconds, discretize each of them to tokens with residual vector quantization (RVQ) and concatenate the resulting token
sequences with the MuLan audio tokens MA . We use an RVQ composed of 24 quantizers, each with a vocabulary size of 512.
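For concreteness, a minimal NumPy sketch of the triplet loss on melody embeddings; the semi-hard negative mining of Schroff et al. (2015) (choosing in-batch negatives with d(a, p) < d(a, n) < d(a, p) + margin) is omitted, and the margin value is an illustrative assumption.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull embeddings of clips sharing a melody (anchor, positive) together and
    push other clips (negative) away by at least `margin`.
    All inputs have shape (batch, 192) and are assumed L2-normalized."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

rng = np.random.default_rng(0)
e = lambda: rng.standard_normal((8, 192))
a, p, n = (x / np.linalg.norm(x, axis=1, keepdims=True) for x in (e(), e(), e()))
print(triplet_loss(a, p, n))
```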
