MusicLM: Generating Music From Text
google-research.github.io/seanet/musiclm/examples
Adam Roberts¹  Marco Tagliasacchi¹  Matt Sharifi¹  Neil Zeghidour¹  Christian Frank¹
…together, and allows training on massive audio-only corpora. That is, we use the MuLan embeddings computed from the audio as conditioning during training, while we use MuLan embeddings computed from the text input during inference.

When trained on a large dataset of unlabeled music, MusicLM learns to generate long and coherent music at 24 kHz, for text descriptions of significant complexity, such as “enchanting jazz song with a memorable saxophone solo and a solo singer” or “Berlin 90s techno with a low bass and strong kick”. To address the lack of evaluation data for this task, we introduce MusicCaps, a new high-quality music caption dataset with 5.5k examples prepared by expert musicians, which we publicly release to support future research.

Our experiments show through quantitative metrics and human evaluations that MusicLM outperforms previous systems such as Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Martiros, 2022), both in terms of quality and adherence to the caption. Furthermore, since describing some aspects of music with words can be difficult or even impossible, we show how our method supports conditioning signals beyond text. Concretely, we extend MusicLM to accept an additional melody in the form of audio (e.g., whistling, humming) as conditioning to generate a music clip that follows the desired melody, rendered in the style described by the text prompt.

We acknowledge the risks associated with music generation, in particular the potential misappropriation of creative content. In accordance with responsible model development practices, we conduct a thorough study of memorization by adapting and extending the methodology of Carlini et al. (2022) used for text-based large language models. Our findings show that when feeding MuLan embeddings to MusicLM, the sequences of generated tokens significantly differ from the corresponding sequences in the training set.

The key contributions of this work are the following:

1. We introduce MusicLM, a generative model that produces high-quality music at 24 kHz which is consistent over several minutes while being faithful to a text conditioning signal.

2. We extend our method to other conditioning signals, such as a melody that is then synthesized according to the text prompt. Furthermore, we demonstrate long and coherent music generation of up to 5-minute long clips.

3. We release the first evaluation dataset collected specifically for the task of text-to-music generation: MusicCaps is a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians.

2. Background and Related Work

The state-of-the-art in generative modeling for various domains is largely dominated either by Transformer-based autoregressive models (Vaswani et al., 2017) or U-Net-based diffusion models (Ho et al., 2020). In this section, we review the related work with an emphasis on autoregressive generative models operating on discrete tokens, which share similarities with MusicLM.

2.1. Quantization

Modeling sequences of discrete tokens autoregressively has proven to be a powerful approach in natural language processing (Brown et al., 2020; Cohen et al., 2022) and image or video generation (Esser et al., 2021; Ramesh et al., 2021; Yu et al., 2022; Villegas et al., 2022). Quantization is a key component to the success of autoregressive models for continuous signals, including images, videos, and audio. The goal of quantization is to provide a compact, discrete representation, which at the same time allows for high-fidelity reconstruction. VQ-VAEs (Van Den Oord et al., 2017) demonstrated impressive reconstruction quality at low bitrates in various domains and serve as the underlying quantizer for many approaches.

SoundStream (Zeghidour et al., 2022) is a universal neural audio codec capable of compressing general audio at low bitrates, while maintaining a high reconstruction quality. To achieve this, SoundStream uses residual vector quantization (RVQ), allowing scalability to higher bitrate and quality, without a significant computational cost. More specifically, RVQ is a hierarchical quantization scheme composing a series of vector quantizers, where the target signal is reconstructed as the sum of quantizer outputs. Due to the composition of quantizers, RVQ avoids the exponential blowup in the codebook size as the target bitrate increases. Moreover, the fact that each quantizer is fitted to the residual of coarser quantizers introduces a hierarchical structure to the quantizers, where coarser levels are more important for high-fidelity reconstruction. This property is desirable for generation, since the past context can be defined by only attending to the coarse tokens. Recently, SoundStream was extended by EnCodec (Défossez et al., 2022) to higher bitrates and stereophonic audio. In this work, we rely on SoundStream as our audio tokenizer, since it can reconstruct 24 kHz music at 6 kbps with high fidelity.
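To make the RVQ mechanism concrete, the following is a minimal sketch of residual vector quantization with random placeholder codebooks. It illustrates the residual composition described above, not the actual SoundStream quantizer, and the codebook sizes are illustrative.

```python
# Minimal sketch of residual vector quantization (RVQ): each quantizer encodes the
# residual left by the coarser ones, and the signal is reconstructed as the sum of
# the selected codebook entries. Codebooks are random placeholders, not SoundStream's.
import numpy as np

rng = np.random.default_rng(0)
num_quantizers, codebook_size, dim = 4, 256, 8          # illustrative sizes
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))

def rvq_encode(x):
    residual, codes = x, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest centroid
        codes.append(idx)
        residual = residual - cb[idx]                    # pass the residual to the next level
    return codes

def rvq_decode(codes):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)
codes = rvq_encode(x)
print(codes, float(np.linalg.norm(x - rvq_decode(codes))))  # error shrinks as levels are added
```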
2.2. Generative Models for Audio

Despite the challenge of generating high-quality audio with long-term consistency, a series of approaches have recently tackled the problem with some success. Jukebox (Dhariwal et al., 2020), for example, proposes a hierarchy of VQ-VAEs at various time resolutions to achieve high temporal coherence, but the generated music displays noticeable artifacts.
PerceiverAR (Hawthorne et al., 2022a), on the other hand, proposes to model a sequence of SoundStream tokens autoregressively, achieving high-quality audio, but compromising the long-term temporal coherence.

Inspired by these approaches, AudioLM (Borsos et al., 2022) addresses the trade-off between coherence and high-quality synthesis by relying on a hierarchical tokenization and generation scheme. Concretely, the approach distinguishes between two token types: (1) semantic tokens that allow the modeling of long-term structure, extracted from models pretrained on audio data with the objective of masked language modeling; (2) acoustic tokens, provided by a neural audio codec, for capturing fine acoustic details. This allows AudioLM to generate coherent and high-quality speech as well as piano music continuations without relying on transcripts or symbolic music representations.

MusicLM builds on top of AudioLM with three important additional contributions: (1) we condition the generation process on a descriptive text, (2) we show that the conditioning can be extended to other signals such as melody, and (3) we model a large variety of long music sequences beyond piano music (from drum’n’bass over jazz to classical music).

2.3. Conditioned Audio Generation

Generating audio from a text description (such as “whistling with laughter in the background”) has recently been tackled by several works. DiffSound (Yang et al., 2022) uses CLIP (Radford et al., 2021) as the text encoder and applies a diffusion model to predict the quantized mel spectrogram features of the target audio based on the text embeddings. AudioGen (Kreuk et al., 2022) uses a T5 (Raffel et al., 2020) encoder for embedding the text, and an autoregressive Transformer decoder for predicting target audio codes produced by EnCodec (Défossez et al., 2022). Both approaches rely on a modest amount of paired training data such as AudioSet (Gemmeke et al., 2017) and AudioCaps (Kim et al., 2019) (totalling less than 5k hours after filtering).

Closer to MusicLM, there are also works focusing on music generation conditioned on text. In Mubert (Mubert-Inc, 2022), the text prompt is embedded by a Transformer, music tags which are close to the encoded prompt are selected and used to query the song generation API. Based on the selected tags, Mubert generates a combination of sounds, which in turn were generated by musicians and sound designers. This is in contrast to Riffusion (Forsgren & Martiros, 2022), which fine-tunes a Stable Diffusion model (Rombach et al., 2022a) on mel spectrograms of music pieces from a paired music-text dataset. We use both Mubert and Riffusion as baselines for our work, showing that we improve the audio generation quality and adherence to the text description.

Symbolic representations of music (e.g., MIDI) can also be used to drive the generative process as a form of strong conditioning, as demonstrated by Huang et al. (2019); Hawthorne et al. (2019); Engel et al. (2020). MusicLM enables a more natural and intuitive way of providing a conditioning signal, for example through a hummed melody, which can also be combined with a text description.

2.4. Text-Conditioned Image Generation

Precursors to text-conditioned audio synthesis are the text-conditioned image generation models, which made significant progress in quality due to architectural improvements and the availability of massive, high-quality paired training data. Prominent Transformer-based autoregressive approaches include Ramesh et al. (2021); Yu et al. (2022), while Nichol et al. (2022); Rombach et al. (2022b); Saharia et al. (2022) present diffusion-based models. The text-to-image approaches have been extended to generating videos from a text prompt (Wu et al., 2022a; Hong et al., 2022; Villegas et al., 2022; Ho et al., 2022).

The closest to our approach among these works is DALL·E 2 (Ramesh et al., 2022). In particular, similarly to the way DALL·E 2 relies on CLIP (Radford et al., 2021) for text encoding, we also use a joint music-text embedding model for the same purpose. In contrast to DALL·E 2, which uses a diffusion model as a decoder, our decoder is based on AudioLM. Furthermore, we also omit the prior model mapping text embeddings to music embeddings, such that the AudioLM-based decoder can be trained on an audio-only dataset and the music embedding is simply replaced during inference by the text embedding.

2.5. Joint Embedding Models for Music and Text

MuLan (Huang et al., 2022) is a music-text joint embedding model consisting of two embedding towers, one for each modality. The towers map the two modalities to a shared embedding space of 128 dimensions using contrastive learning, with a setup similar to (Radford et al., 2021; Wu et al., 2022b). The text embedding network is a BERT (Devlin et al., 2019) pre-trained on a large corpus of text-only data, while we use the ResNet-50 variant of the audio tower. MuLan is trained on pairs of music clips and their corresponding text annotations. Importantly, MuLan imposes only weak requirements on its training data quality, learning cross-modal correspondences even when the music-text pairs are only weakly associated. The ability to link music to unconstrained natural language descriptions makes it applicable for retrieval or zero-shot music tagging. In this work, we rely on the pretrained and frozen model of Huang et al. (2022).
Figure 2. Left: During training we extract the MuLan audio tokens, semantic tokens, and acoustic tokens from the audio-only training set. In the semantic modeling stage, we predict semantic tokens using MuLan audio tokens as conditioning. In the subsequent acoustic modeling stage, we predict acoustic tokens, given both MuLan audio tokens and semantic tokens. Each stage is modeled as a sequence-to-sequence task using decoder-only Transformers. Right: During inference, we use MuLan text tokens computed from the text prompt as conditioning signal and convert the generated audio tokens to waveforms using the SoundStream decoder.
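The two training stages and the inference path of Figure 2 can be summarized schematically as below. This is a toy illustration, not the authors' implementation: the decoder-only Transformers are replaced by a seeded random sampler, and the acoustic token rate and codebook size are assumptions; only the 25-semantic-tokens-per-second rate is implied by the text (250 tokens per 10 s in the memorization study).

```python
# Toy sketch of the two-stage hierarchy in Figure 2. A real system uses decoder-only
# Transformers; a seeded random sampler stands in for them here.
import random

SEMANTIC_VOCAB = 1024   # w2v-BERT k-means codebook size (see the memorization study)
ACOUSTIC_VOCAB = 1024   # assumed codebook size for the acoustic tokens

def placeholder_transformer(prefix, length, vocab):
    """Stand-in for an autoregressive decoder-only Transformer."""
    rng = random.Random(hash(tuple(prefix)))
    return [rng.randrange(vocab) for _ in range(length)]

def generate(mulan_tokens, seconds=10):
    # Stage 1: semantic modeling, conditioned on MuLan tokens.
    semantic = placeholder_transformer(mulan_tokens, 25 * seconds, SEMANTIC_VOCAB)
    # Stage 2: acoustic modeling, conditioned on MuLan tokens and semantic tokens.
    # 50 acoustic frames/s and one token per frame are simplifying assumptions
    # (SoundStream actually emits several RVQ tokens per frame).
    acoustic = placeholder_transformer(mulan_tokens + semantic, 50 * seconds, ACOUSTIC_VOCAB)
    # The acoustic tokens would then be decoded to a waveform by the SoundStream decoder.
    return semantic, acoustic

# At inference time the conditioning comes from MuLan *text* tokens of the prompt.
semantic, acoustic = generate(mulan_tokens=[3, 141, 59, 26])
print(len(semantic), len(acoustic))   # 250 semantic and 500 acoustic tokens for 10 s
```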
…to generate plausible audio. However, the generated samples might not necessarily adhere to the text description provided as conditioning.

We report the FAD based on two audio embedding models, both of which are publicly available: (1) Trill² (Shor et al., 2020), which is trained on speech data, and (2) VGGish³ (Hershey et al., 2017), which is trained on the YouTube-8M audio event dataset (Abu-El-Haija et al., 2016). Because of the difference in training data, we expect the models to measure different aspects of the audio quality (speech and non-speech, respectively).

KL Divergence (KLD). There is a many-to-many relationship between text descriptions and music clips compatible with them. It is therefore not possible to directly compare the generated music with the reference at the level of the audio waveform. To assess the adherence to the input text description, we adopt a proxy method similar to the one proposed in Yang et al. (2022); Kreuk et al. (2022). Specifically, we use a LEAF (Zeghidour et al., 2021) classifier trained for multi-label classification on AudioSet to compute class predictions for both the generated and the reference music, and measure the KL divergence between the probability distributions of class predictions. When the KL divergence is low, the generated music is expected to have acoustic characteristics similar to the reference music, according to the classifier.
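As a sketch of this proxy metric, the KL divergence between the two class-probability vectors can be computed as below; the LEAF classifier itself is not shown, the probability vectors are toy values, and the direction of the divergence is a choice made here.

```python
# Sketch of the KLD metric: KL divergence between class-probability vectors predicted
# for the reference and the generated audio by an AudioSet classifier (not shown).
import numpy as np

def kld(p_reference, p_generated, eps=1e-8):
    p = np.clip(p_reference, eps, 1.0)
    q = np.clip(p_generated, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))    # KL(reference || generated)

p_ref = np.array([0.6, 0.3, 0.05, 0.05])       # toy AudioSet-style class probabilities
p_gen = np.array([0.7, 0.2, 0.05, 0.05])
print(kld(p_ref, p_gen))                        # low values: similar acoustic characteristics
```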
MuLan Cycle Consistency (MCC). As a joint music-text embedding model, MuLan can be used to quantify the similarity between music-text pairs. We compute the MuLan embeddings from the text descriptions in MusicCaps as well as from the music generated based on them, and define the MCC metric as the average cosine similarity between these embeddings.
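A minimal sketch of the MCC computation is given below, with random vectors standing in for the frozen MuLan text and audio embeddings (the 128-dimensional size follows Section 2.5).

```python
# Sketch of the MCC metric: average cosine similarity between MuLan text embeddings of
# the MusicCaps captions and MuLan audio embeddings of the music generated from them.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mcc(text_embeddings, audio_embeddings):
    return float(np.mean([cosine(t, a) for t, a in zip(text_embeddings, audio_embeddings)]))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 128))    # one embedding per caption (placeholder)
audio_emb = rng.normal(size=(5, 128))   # one embedding per generated clip (placeholder)
print(mcc(text_emb, audio_emb))
```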
Qualitative evaluation. Ultimately, we rely on subjective tests to evaluate the adherence of generated samples to the text description. We set up an A-vs-B human rating task, in which raters are presented with the text description and two samples of music generated by two different models, or by one model and the reference music. There are five possible answers: strong or weak preference for A or B, and no preference. The raters are instructed not to take the music quality into account when making their decision, because this aspect of the evaluation is already covered by the FAD metric.

We consider the output of n different models, in addition to the reference music, thus a total of n + 1 conditions and n(n + 1)/2 pairs. To aggregate the results of the pairwise tests and rank conditions, we count the number of “wins”, that is, how often a condition is strongly or weakly preferred. The samples are selected from the genre-balanced 1k subset of our evaluation data.

Training data memorization. Large language models have the capacity to memorize patterns seen in the training data (Carlini et al., 2020). We adapt the methodology used in Carlini et al. (2022) to study the extent to which MusicLM might memorize music segments. We focus on the first stage, responsible for semantic modeling. We select N examples at random from the training set. For each example, we feed to the model a prompt which includes the MuLan audio tokens M_A followed by a sequence of the first T semantic tokens S, with T ∈ {0, …, 250}, corresponding to up to 10 seconds. We use greedy decoding to generate a continuation of 125 semantic tokens (5 seconds) and we compare the generated tokens to the target tokens in the dataset. We measure exact matches as the fraction of examples for which generated and target tokens are identical over the whole sampled segment.

In addition, we propose a methodology to detect approximate matches, based on the observation that sequences of seemingly different tokens might lead to acoustically similar audio segments. Namely, we compute the histogram of semantic token counts over the corresponding vocabulary {0, …, 1023} from both the generated and the target tokens, and define a matching cost measure between histograms as follows. First, we compute the distance matrix between pairs of semantic tokens, populated by the Euclidean distances between the corresponding k-means centroids used to quantize w2v-BERT to semantic tokens (see Section 3.1). Then, we solve an optimal transport problem to find the matching cost between a pair of histograms using the Sinkhorn algorithm (Cuturi, 2013), considering only the sub-matrix corresponding to non-zero token counts in the two histograms. To calibrate the threshold used to determine whether two sequences might be approximate matches, we construct negative pairs by permuting the examples with target tokens and measure the empirical distribution of matching costs for such negative pairs. We set the match threshold τ to 0.85, which leads to less than 0.01% false positive approximate matches.
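The matching cost can be sketched with a basic Sinkhorn implementation, shown below. The centroids, regularization strength, and vocabulary size are toy placeholders; only the histogram construction, the Euclidean cost between centroids, and the 125-token (5 s) segment length follow the text, and the calibrated threshold τ = 0.85 applies to the authors' cost scale, not to this toy setup.

```python
# Sketch of the approximate-match cost: an entropy-regularized optimal transport
# (Sinkhorn) cost between the semantic-token histograms of a generated and a target
# segment, with token-to-token costs given by Euclidean distances between centroids.
import numpy as np

def sinkhorn_cost(hist_a, hist_b, cost, reg=0.5, n_iters=200):
    a, b = hist_a / hist_a.sum(), hist_b / hist_b.sum()
    K = np.exp(-cost / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = np.outer(u, v) * K
    return float(np.sum(transport * cost))

rng = np.random.default_rng(0)
vocab = 16                                  # toy vocabulary (the paper uses 1024 tokens)
centroids = rng.normal(size=(vocab, 4))     # stand-ins for the w2v-BERT k-means centroids
cost = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)

gen = rng.integers(0, vocab, size=125)      # 5 s of generated semantic tokens
tgt = rng.integers(0, vocab, size=125)      # corresponding target tokens
hist_gen = np.bincount(gen, minlength=vocab).astype(float)
hist_tgt = np.bincount(tgt, minlength=vocab).astype(float)

# The paper restricts the cost matrix to bins with non-zero counts; with dense toy
# histograms the full matrix is used here.
print(sinkhorn_cost(hist_gen, hist_tgt, cost))
```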
5. Results

We evaluate MusicLM by comparing it with two recent baselines for music generation from descriptive text, namely Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Martiros, 2022). In particular, we generate audio by querying the Mubert API,⁴ and by running inference on the Riffusion model.⁵ We perform our evaluations on MusicCaps, the evaluation dataset we publicly release together with this paper.

² tfhub.dev/google/nonsemantic-speech-benchmark/trill/3
³ tfhub.dev/google/vggish/1
⁴ github.com/MubertAI (accessed in Dec 2022 and Jan 2023)
⁵ github.com/riffusion/riffusion-app (accessed on Dec 27, 2022)
Figure 3. Memorization results for the semantic modeling stage. We compare the semantic tokens generated for 5 seconds of audio to corresponding tokens in the training set, considering exact and approximate matches. (Plot x-axis: prompt length [sec], from 0.0 to 10.0.)

6. Extensions

Melody conditioning. We extend MusicLM in such a way that it can generate music based on both a text description and a melody, which is provided in the form of humming, singing, whistling, or playing an instrument. This requires extending the conditioning signal in a way that captures the target melody. To this end, we create a synthetic dataset composed of audio pairs with matching melodies but different acoustics. To create such pairs, we use different versions of the same music clip, such as covers, instrumentals, or vocals. Additionally, we acquire data pairs of people humming and singing. We then train a joint embedding model such that when two audio clips contain the same melody, the corresponding embeddings are close to each other. For implementation details we refer to Appendix C.

To extract the melody conditioning for MusicLM, we quantize the melody embeddings with RVQ, and concatenate the resulting token sequences with the MuLan audio tokens M_A. During inference, we compute melody tokens from the input audio clip and concatenate them with the MuLan text tokens M_T. Based on this conditioning, MusicLM can successfully generate music which follows the melody contained in the input audio clip, while adhering to the text description.

Long generation and story mode. In MusicLM, generation is autoregressive in the temporal dimension, which makes it possible to generate sequences longer than those used during training. In practice, the semantic modeling stage is trained on sequences of 30 seconds. To generate longer sequences, we advance with a stride of 15 seconds, using 15 seconds as prefix to generate an additional 15 seconds, always conditioning on the same text description. With this approach we can generate long audio sequences which are coherent over several minutes.
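A minimal sketch of this sliding-window generation loop is shown below; `generate_semantic` is a placeholder for the semantic modeling stage, and the 25-tokens-per-second rate is taken from the memorization study (250 tokens per 10 s).

```python
# Sketch of long generation with a 15 s stride: generate 30 s of semantic tokens,
# then repeatedly re-use the last 15 s as a prefix to produce 15 s more, always
# conditioning on the same MuLan tokens.
TOKENS_PER_SEC = 25                         # 250 semantic tokens per 10 s
WINDOW = 30 * TOKENS_PER_SEC                # training sequence length of the semantic stage
STRIDE = 15 * TOKENS_PER_SEC

def generate_semantic(mulan_tokens, prefix, length):
    """Placeholder for the semantic modeling stage: returns `length` new tokens."""
    return [0] * length

def generate_long(mulan_tokens, total_seconds):
    tokens = generate_semantic(mulan_tokens, prefix=[], length=WINDOW)      # first 30 s
    while len(tokens) < total_seconds * TOKENS_PER_SEC:
        prefix = tokens[-STRIDE:]                                           # last 15 s as context
        tokens += generate_semantic(mulan_tokens, prefix, length=STRIDE)    # advance by 15 s
    return tokens[: total_seconds * TOKENS_PER_SEC]

print(len(generate_long(mulan_tokens=[1, 2, 3], total_seconds=300)))        # a 5-minute clip
```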
7. Conclusions

We introduce MusicLM, a text-conditioned generative model that produces high-quality music at 24 kHz, consistent over several minutes, while being faithful to the text conditioning signal. We demonstrate that our method outperforms baselines on MusicCaps, a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians.

Some limitations of our method are inherited from MuLan, in that our model misunderstands negations and does not adhere to precise temporal ordering described in the text. Moreover, further improvements of our quantitative evaluations are needed. Specifically, since MCC also relies on MuLan, the MCC scores are favorable to our method.

Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus. Modeling the music at a higher sample rate is an additional goal.

8. Broader Impact

MusicLM generates high-quality music based on a text description, and thus it further extends the set of tools that assist humans with creative music tasks. However, there are several risks associated with our model and the use case it tackles. The generated samples will reflect the biases present in the training data, raising the question of appropriateness for music generation for cultures underrepresented in the training data, while at the same time also raising concerns about cultural appropriation.

We acknowledge the risk of potential misappropriation of creative content associated with the use case. In accordance with responsible model development practices, we conducted a thorough study of memorization, adapting and extending a methodology used in the context of text-based LLMs, focusing on the semantic modeling stage. We found that only a tiny fraction of examples was memorized exactly, while for 1% of the examples we could identify an approximate match. We strongly emphasize the need for more future work in tackling the risks associated with music generation; we have no plans to release models at this point.
Hawthorne, C., Simon, I., Roberts, A., Zeghidour, N., Gardner, J., Manilow, E., and Engel, J. H. Multi-instrument music synthesis with spectrogram diffusion. arXiv:2206.05408, 2022b.

Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. CNN architectures for large-scale audio classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 2020.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv:2204.03458, 2022.

Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868, 2022.

Huang, C. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. Music Transformer: Generating music with long-term structure. In International Conference on Learning Representations (ICLR), 2019.

Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. MuLan: A joint embedding of music audio and natural language. In International Society for Music Information Retrieval Conference (ISMIR), 2022.

Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In INTERSPEECH, 2019.

Kim, C. D., Kim, B., Lee, H., and Kim, G. AudioCaps: Generating captions for audios in the wild. In NAACL-HLT, 2019.

Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., and Adi, Y. AudioGen: Textually guided audio generation, 2022.

Mubert-Inc. Mubert. https://mubert.com/, https://github.com/MubertAI/Mubert-Text-to-Music, 2022.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML), 2022.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In Meila, M. and Zhang, T. (eds.), International Conference on Machine Learning (ICML), 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022a.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Shor, J., Jansen, A., Maor, R., Lang, O., Tuval, O., de Chaumont Quitry, F., Tagliasacchi, M., Shavitt, I., Emanuel, D., and Haviv, Y. Towards learning a universal non-semantic representation of speech. In INTERSPEECH, 2020.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In ISCA, 2016.

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in Neural Information Processing Systems (NeurIPS), 2017.
Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., and Yu, D. DiffSound: Discrete diffusion model for text-to-sound generation. arXiv:2207.09983, 2022.

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., and Wu, Y. Scaling autoregressive models for content-rich text-to-image generation, 2022.

Zeghidour, N., Teboul, O., de Chaumont Quitry, F., and Tagliasacchi, M. LEAF: A learnable frontend for audio classification. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=jM76BCb6F9m.

Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 2022.

Zen, H., Senior, A., and Schuster, M. Statistical parametric speech synthesis using deep neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
A. MusicCaps Dataset
Together with this paper, we release MusicCaps, a high-quality music caption dataset.⁶ This dataset includes music clips from AudioSet (Gemmeke et al., 2017), paired with corresponding text descriptions in English. It contains a total of 5,521 examples, out of which 2,858 are from the AudioSet eval split and 2,663 from the AudioSet train split. We further tag 1,000 examples as a subset that is balanced with respect to the genres of the music contained. All examples in the balanced subset are from the AudioSet eval split.
Examples of free text captions:
• “This folk song features a male voice singing the main melody in an emotional mood. This is accompanied by an
accordion playing fills in the background. A violin plays a droning melody. There is no percussion in this song. This
song can be played at a Central Asian classical concert.”
• “This is a live recording of a keyboardist playing a twelve bar blues progression on an electric keyboard. The player
adds embellishments between chord changes and the piece sounds groovy, bluesy and soulful.”
• “A synth is playing an arpeggio pluck with a lot of reverb rising and falling in velocity. Another synth sound is playing
pads and a sub bassline. This song is full of synth sounds creating a soothing and adventurous atmosphere. This song
may be playing at a festival during two songs for a buildup.”
• “A low sounding male voice is rapping over a fast paced drums playing a reggaeton beat along with a bass. Something
like a guitar is playing the melody along. This recording is of poor audio-quality. In the background a laughter can be
noticed. This song may be playing in a bar.”
• “The electronic music features a section that repeats roughly every two seconds. It consists of a beat that’s made of a
kick drum and claps. A buzzing synth sets the pulsation of the music by playing once every two beats. The whole music
sounds like a loop being played over and over. Towards the end of the excerpt a crescendo-like buzzing sound can be
heard, increasing the tension.”
• “pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead, soft
female vocal, punchy kick, sustained synth bass, claps, emotional, sad, passionate”
• “amateur recording, finger snipping, male mid range voice singing, reverb”
• “backing track, jazzy, digital drums, piano, e-bass, trumpet, acoustic guitar, digital keyboard song, medium tempo”
• “rubab instrument, repetitive melody on different octaves, no other instruments, plucked string instrument, no voice,
instrumental, fast tempo”
• “instrumental, white noise, female vocalisation, three unrelated tracks, electric guitar harmony, bass guitar, keyboard
harmony, female lead vocalisation, keyboard harmony, slick drumming, boomy bass drops, male voice backup
vocalisation”
⁶ kaggle.com/datasets/googleai/musiccaps
Figure 4. Genre distribution of all 5.5k examples of MusicCaps, according to an AudioSet classifier.
Figure 5. Genre distribution of a balanced 1k example subset of MusicCaps, according to an AudioSet classifier.
B. Qualitative Evaluation
Participants in the listening test were presented with two 10-second clips and a text caption, and asked which clip is best described by the text of the caption, on a 5-point Likert scale. They were also instructed to ignore audio quality and focus only on how well the text matches the music (similar to the MuLan score). Figure 6 shows the user interface presented to raters.

We collected 1200 ratings, with each source involved in 600 pairwise comparisons. Figures 7 and 8 show the granular results of pairwise comparisons between the models. According to a post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction (with p < 0.01/15), the orderings shown in Figure 8 are all statistically significant.
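A sketch of this post-hoc test for a single model pair is shown below; the preference scores are synthetic placeholders, not the study data.

```python
# Sketch of the post-hoc test: for one model pair, code each A-vs-B rating on the
# 5-point scale as a signed score (-2 = strong preference for B, ..., +2 = strong
# preference for A) and run a Wilcoxon signed-rank test, using the Bonferroni-corrected
# threshold p < 0.01 / 15 stated above.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
scores = rng.integers(-2, 3, size=200)     # one signed preference score per comparison
stat, p = wilcoxon(scores)                 # zero ("no preference") ratings are dropped by default
alpha = 0.01 / 15
print(f"p = {p:.4f}, significant at the corrected level: {p < alpha}")
```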
Figure 7. Pairwise comparisons from the human listener study. Each pair is compared on a 5-point Likert scale. Raters had a decisive
model preference in all cases except Mubert vs. Riffusion.
              All   MusicCaps   MusicLM   Riffusion   Mubert
MusicCaps      84       –          73        88         91
MusicLM        60      27          –         75         81
Riffusion      32      12         25          –         66
Mubert         19       9.2       19         34          –

Figure 8. Win percentage from the human listener study. Each row indicates the % of times listeners found the music from that system to better match the caption than the music from any other system (first column, N = 1200) and than each system individually (other columns, N = 600). The ground truth data (MusicCaps) is clearly the best match to the captions, followed closely by MusicLM, which even beats the ground truth in 27% of comparisons.
C. Melody Conditioning
We provide here implementation details of the model used for conditioning the music generation on melody. The model is based on a small ViT (Dosovitskiy et al., 2021) composed of 12 layers, 6 attention heads, an embedding dimension of 512, and a feed-forward layer of dimension 1024. The input to the model consists of the temporal frames of the mel spectrogram of the audio. We use a semi-hard triplet loss (Schroff et al., 2015) to train the melody embedding model to generate 192-dimensional embeddings for each 4 seconds of audio. The model learns to generate embeddings which are representative of a melody while being invariant to acoustic properties related to the instruments being played. This is particularly advantageous, since this representation is complementary to the representation learned by the MuLan embeddings. Hence, our melody embeddings and the MuLan embeddings can be jointly and complementarily used for conditioning the music generation process. During training, we consider input audio with a duration of 10 seconds. We extract three melody embeddings, with a hop length of 3 seconds, discretize each of them to tokens with residual vector quantization (RVQ), and concatenate the resulting token sequences with the MuLan audio tokens M_A. We use an RVQ composed of 24 quantizers, each with a vocabulary size of 512.
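The conditioning construction described above can be sketched as follows. The ViT melody encoder is replaced by a random placeholder, and the 16 kHz sample rate and MuLan token values are assumptions; the embedding size (192), window (4 s), hop (3 s), and RVQ configuration (24 quantizers with vocabulary size 512) follow this appendix.

```python
# Sketch of melody conditioning: embed 4 s windows of audio with a 3 s hop, quantize
# each 192-d melody embedding with a 24-level RVQ (vocabulary 512 per level), and
# concatenate the resulting tokens with the MuLan tokens (M_A during training, M_T
# at inference). The encoder and the codebooks are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
NUM_Q, VOCAB, DIM = 24, 512, 192
codebooks = rng.normal(size=(NUM_Q, VOCAB, DIM))

def melody_embed(window):
    """Placeholder for the small ViT melody encoder (192-d embedding per 4 s window)."""
    return rng.normal(size=DIM)

def rvq_tokens(embedding):
    residual, tokens = embedding, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def melody_conditioning(audio, sample_rate=16_000, window_s=4, hop_s=3):
    tokens = []
    for start in range(0, len(audio) - window_s * sample_rate + 1, hop_s * sample_rate):
        tokens.extend(rvq_tokens(melody_embed(audio[start : start + window_s * sample_rate])))
    return tokens

audio = rng.normal(size=10 * 16_000)            # 10 s training clip, as described above
mulan_audio_tokens = [7, 42, 99]                # placeholder MuLan audio tokens M_A
conditioning = melody_conditioning(audio) + mulan_audio_tokens
print(len(conditioning))                        # 3 windows x 24 RVQ tokens + MuLan tokens = 75
```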