
Compositional Generative Modeling: A Single Model is Not All You Need

Yilun Du (MIT), Leslie Kaelbling (MIT). Correspondence to: Yilun Du <[email protected]>.

arXiv:2402.01103v1 [cs.LG] 2 Feb 2024

Abstract

Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. In this paper, we argue that we should instead construct large generative systems by composing smaller generative models together. We show how such a compositional generative approach enables us to learn distributions in a more data-efficient manner, enabling generalization to parts of the data distribution unseen at training time. We further show how composition enables us to program and construct new generative models for tasks entirely unseen at training time. Finally, we show that in many cases we can discover the separate compositional components directly from data.

Figure 1. Rising Size and Cost of Models. (Plot of training FLOPs, roughly 10^24 to 10^26, for GPT-3, Gopher, GPT-3.5, PaLM 540B, PaLM 2, GPT-4, and Gemini Ultra over 2021-2024.) While much of AI research has focused on constructing increasingly larger monolithic models, training costs are rising exponentially by a factor of 3 every year, with current models already costing several hundred million dollars per training run. Data from (Epoch, 2023).

1. Introduction

In the past two years, increasingly large generative models have become a dominant force in AI research, with compelling results in natural language (Brown et al., 2020), computer vision (Rombach et al., 2022), and decision-making (Reed et al., 2022). Much of the AI research field has now focused on scaling and constructing increasingly large generative models (Hoffmann et al., 2022), developing tools to build even larger models (Dao et al., 2022; Kwon et al., 2023), and studying how properties emerge as these models scale in size (Lu et al., 2023; Schaeffer et al., 2023).

Despite significant scaling, existing generative models remain far from intelligent, exhibiting poor reasoning ability (Tamkin et al., 2021), extensive hallucinations (Zhang et al., 2023), and poor understanding of commonsense relationships in images (Figure 2) (Majumdar et al., 2023). Despite this, large models have already been trained on most of the existing data on the Internet and have reached the limits of modern computational hardware, costing hundreds of millions of dollars to train (Figure 1). Inference costs of such gigantic models are also prohibitive, requiring large computational clusters and costing several dollars for longer queries and answers (OpenAI).

In addition, adapting such large models to new task distributions is difficult. Directly fine-tuning larger models is often prohibitively expensive, requiring a large computation cluster and an often difficult-to-acquire fine-tuning dataset. Other works have explored leveraging language and a set of in-context examples to teach models new distributions, but such adaptation is limited to settings that are well expressed using a set of language instructions and that are roughly similar to the distributions already seen during training (Yadlowsky et al., 2023).

In this paper, we argue that as an alternative to studying how to scale and construct increasingly large monolithic generative models, we should instead construct complex generative models compositionally from simpler models. Each constituent model captures the probability distribution over a subset of the variables of the distribution of interest, and the constituents are combined to model the more complex full distribution. Each individual distribution is much simpler than the joint distribution, so it can be modeled with fewer parameters and learned from less data. Furthermore, the combined model can generalize to unseen portions of the data distribution as long as each constituent dimension is locally in distribution.

Such compositional generative modeling enables us to effectively represent the sparsity and symmetry naturally found in the world. Sparsity of interactions, for instance between an agent and external environment dynamics, can be encoded by representing each with a separate generative model. Sources of symmetry can be captured using multiple instances of the same independent generative component to represent each occurrence of the symmetry, for instance by tiling a patch-level generative model over the patches in an image. Such compositional structure is widely used in existing work to tractably represent high-dimensional distributions in Probabilistic Graphical Models (PGMs) (Koller & Friedman, 2009), and is present even in existing generative models, e.g., autoregressive models, which factorize a distribution into a set of conditional probability distributions (represented by a single model).

Compositional generative modeling further enables us to effectively program and construct new generative systems for unseen task distributions. Individual generative models can be composed in new ways, with each model specifying a set of constraints and with probabilistic composition serving as a communication language among models, ensuring that a distribution is constructed so that all constraints are satisfied to form the task distribution of interest. Such programming requires no explicit training or data, enabling generalization at inference time even on distributions for which no data has previously been seen. We illustrate how such recombination enables generalization to new task distributions in decision making, image synthesis, and video synthesis.

The underlying compositional components in generative modeling can in many cases be directly inferred and discovered in an unsupervised manner from data, representing compositional structure such as objects and relations. Such discovered components can then be similarly recombined to form new distributions – for instance, object components discovered by one generative model on one dataset can be combined with components discovered by a separate generative model on another dataset to form hybrid scenes with objects from both datasets. We illustrate the efficacy of such discovered compositional structure across domains in images and trajectory dynamics.

Overall, in this paper, we advocate for the idea that we should construct complex generative systems by representing them as compositional systems of simpler components, and we illustrate the benefits of this approach across various domains.

2. Data Efficient Generative Modeling

The predominant paradigm for training generative models has been to construct increasingly larger monolithic models trained with greater amounts of data and computational power. While language models have demonstrated significant improvements with increased scale (albeit still with difficulty in compositionality (Dziri et al., 2023)), current multimodal models such as DALL-E 3 and GPT-4V remain unable to take advantage of even simple forms of compositionality (Figure 2). Such models may be unable to accurately generate images given combinations of relations rarely seen in training data, or may fail to understand simple spatial relations in images, despite being trained on a very significant portion of the existing Internet.

Figure 2. Limited Compositionality in Multimodal Models. Existing large multimodal models such as GPT-4V and DALL-E 3 still struggle with simple textual queries, often falling back on biases in the data. (Left: asked "Is the blue cup below the red bowl?", GPT-4V answers "No, the blue cup is not below the red bowl. The blue cup is to the right side of the picture, and the red bowl is not visible in the image." Right: DALL-E 3 fails on the prompt "Create a picture of a Spot robot feeding a person.")

One difficulty is that the underlying sample complexity of learning generative models over joint distributions of variables increases dramatically with the number of variables. As an example, consider learning probability distributions by maximizing log-likelihood over a set of random variables A, B, C, D, each of which can take one of K values. Directly learning a distribution over a single variable A requires O(K) samples (Canonne, 2020). The data required to learn distributions over a joint set of variables generally increases exponentially – so that learning a joint distribution p(A, B, C, D) requires O(K^4) samples (Canonne, 2020).

Constructing large multimodal generative models such as GPT-4V or DALL-E 3 runs into the same difficulty – as the number of jointly modeled modalities increases, the number of samples required to see and learn the entire data distribution increases exponentially. This is particularly challenging in the multimodal setting, as the existing Internet data used to train these models is often highly non-uniform, with many combinations of natural language and images unseen.

One approach to significantly reduce the amount of data necessary to learn generative models over complex joint distributions is factorization – if we know that a distribution exhibits an independence structure between variables, such as

    p(A, B, C, D) ∝ p(A) p(B) p(C, D),

we can substantially reduce the data requirements by only needing to learn these factors, composing them together to form the more complex distribution. This also enables our learned joint distribution to generalize to unseen combinations of variables so long as each local variable combination is in distribution (illustrated in Figure 3). Even in settings where distributions are not accurately modeled as a product of independent factors, such a factorization can still lead to a better generative model given limited data by reducing the hypothesis space (Murphy, 2022). This key idea of factorizing probability distributions has underpinned substantial work on probabilistic graphical models (PGMs) (Koller & Friedman, 2009). A minimal numerical illustration of this data efficiency is sketched below.
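Concretely, the following sketch (our illustration, not from the paper) fits both a monolithic table over all K^4 joint outcomes and a factorized p(A)p(B)p(C, D) by count-based maximum likelihood, then compares held-out log-likelihood at a small sample size. The generating distribution and all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_train, n_test = 8, 200, 10000

# Ground-truth distribution with the independence structure p(A)p(B)p(C,D).
pA = rng.dirichlet(np.ones(K))
pB = rng.dirichlet(np.ones(K))
pCD = rng.dirichlet(np.ones(K * K)).reshape(K, K)

def sample(n):
    A = rng.choice(K, size=n, p=pA)
    B = rng.choice(K, size=n, p=pB)
    CD = rng.choice(K * K, size=n, p=pCD.ravel())
    return np.stack([A, B, CD // K, CD % K], axis=1)

train, test = sample(n_train), sample(n_test)

def counts(data, dims):
    """Laplace-smoothed count estimate of the marginal over the given columns."""
    c = np.ones([K] * len(dims))  # add-one smoothing keeps log-likelihoods finite
    np.add.at(c, tuple(data[:, d] for d in dims), 1)
    return c / c.sum()

# Monolithic estimate: one table over all K^4 joint outcomes.
joint = counts(train, [0, 1, 2, 3])
# Compositional estimate: three small factors, multiplied together.
fA, fB, fCD = counts(train, [0]), counts(train, [1]), counts(train, [2, 3])

ll_joint = np.mean(np.log(joint[test[:, 0], test[:, 1], test[:, 2], test[:, 3]]))
ll_factored = np.mean(np.log(fA[test[:, 0]]) + np.log(fB[test[:, 1]])
                      + np.log(fCD[test[:, 2], test[:, 3]]))
print(f"held-out log-likelihood  joint: {ll_joint:.3f}  factored: {ll_factored:.3f}")
```

With 200 training samples spread over K^4 = 4096 joint outcomes, the factored estimate is typically far more accurate, mirroring the sample-complexity argument above.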


Figure 3. Generalizing Outside Training Data. Given a narrow slice of training data, we can learn generative models that generalize outside the data through composition. We learn separate generative models to model each axis of the data – the composition of models can then cover the entire data space. (Panels contrast a monolithic generative model, which covers only the training distribution, with a compositional generative model whose per-axis factors jointly cover the full data distribution.)

Figure 4. Distribution Composition. When modeling simple product (top) or mixture (bottom) compositions of two factors, learning two compositional models on the factors is more data-efficient than learning a single monolithic model on the composed distribution, even though the monolithic model is trained on twice as much data as each individual factor.

Figure 5. Compositional Trajectory Generation. By factorizing a trajectory generative model into a set of components, models are able to more accurately simulate dynamics from limited trajectories (a) and train in fewer training iterations (b).

Figure 6. Compositional Visual Synthesis. By composing a set of generative models modeling conditional image distributions given single-sentence descriptions (e.g., "A couch right next to the windows" AND "A table in front of the couch" AND "A vase of flowers on top of the table"), we can more accurately synthesize images given paragraph-level text descriptions. Figure adapted from (Liu et al., 2022).
Below, we illustrate across three settings how representing a target distribution p(x) in a factorized manner can substantially improve generative modeling performance from a limited amount of data.

Simple Distribution Composition. In Figure 4, we consider modeling a distribution p(x) that is a product p(x) ∝ p1(x) p2(x) or a mixture p(x) ∝ p1(x) + p2(x) of two factors p1(x) and p2(x). We compare training a single model on p(x) against learning two generative models on the factors p1(x) and p2(x). We find that training compositional models leads to more accurate distribution modeling when the same amount of data is used to learn p(x) as is used to learn both p1(x) and p2(x). Even for these simple distributions, the data complexity of modeling each factor is lower than that of representing the composed distribution.

Trajectory Modeling. Next, we consider modeling a probability distribution p(τ) over trajectories τ = (s0, a0, s1, a1, ..., sT, aT), which many recent works have modeled using a single joint distribution p(s0, a0, ..., sT, aT) (Janner et al., 2022; Ajay et al., 2022). In contrast to a monolithic generative distribution, given structural knowledge of the environment – i.e., that it is a Markov Decision Process – a more factorized generative model representing the distribution is the product

    p(τ) ∝ ∏_i p(s_i | s_{i−1}, a_{i−1}).

In Figure 5, we explore the efficacy of compositional and monolithic models in characterizing trajectories in Maze2D, which consists of a 4D state space (2D position and velocity) and a 2D action space (2D forces), using the model in (Janner et al., 2022) (with the compositional model representing trajectory chunks of size 8 to ensure compatibility with the architecture). We plot the accuracy of generated trajectories at unseen start states as a function of the number of agent episodes used to train the models, where each episode has a length of approximately 10000 timesteps. As seen in Figure 5(a), given only a very limited number of agent episodes in an environment, a factorized model can more accurately simulate trajectory dynamics. In addition, we found that training a single joint generative model also took a substantially larger number of iterations than training the factorized model, as illustrated in Figure 5(b). The sketch below illustrates this per-step factorization of a trajectory density.
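As a minimal sketch of the factorization (our illustration; the toy linear-Gaussian dynamics and all names are assumptions, not the Maze2D model of the paper), the log-density of a full trajectory decomposes into a sum of per-step dynamics factors, so only a small one-step model ever needs to be learned:

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, act_dim, T = 4, 2, 50

# Hypothetical learned one-step dynamics factor p(s_i | s_{i-1}, a_{i-1}):
# a linear-Gaussian model standing in for a small learned network.
W_s = np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim))
W_a = 0.1 * rng.standard_normal((state_dim, act_dim))
sigma = 0.05

def log_step_density(s_next, s_prev, a_prev):
    """log p(s_i | s_{i-1}, a_{i-1}) under the toy Gaussian dynamics factor."""
    mean = W_s @ s_prev + W_a @ a_prev
    resid = s_next - mean
    return (-0.5 * np.sum(resid**2) / sigma**2
            - 0.5 * state_dim * np.log(2 * np.pi * sigma**2))

def log_trajectory_density(states, actions):
    """log p(tau) as a sum of per-step factors: the compositional model."""
    return sum(log_step_density(states[i], states[i - 1], actions[i - 1])
               for i in range(1, len(states)))

# Score a rollout sampled from the dynamics itself.
states = [rng.standard_normal(state_dim)]
actions = [rng.standard_normal(act_dim) for _ in range(T)]
for i in range(T):
    states.append(W_s @ states[-1] + W_a @ actions[i]
                  + sigma * rng.standard_normal(state_dim))
print("log p(tau) =", log_trajectory_density(states, actions))
```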
Compositional Visual Generation. Finally, we consider modeling a probability distribution p(x | T) in text-to-image synthesis, where x is an image and T is a complex text description. While this distribution is usually characterized by a single generative model, we can factor the generation as a product of distributions conditioned on the individual sentences t1, t2, and t3 in the description T:

    p(x | T) ∝ p(x | t1) p(x | t2) p(x | t3).

This representation of the distribution is more data efficient: we only need to see the full distribution of images given single sentences. In addition, it enables us to generalize to unseen regions of p(x | T), such as unseen combinations of sentences and longer text descriptions. In Figure 6, we illustrate the efficacy of such an approach, and we sketch the corresponding composed sampler below.
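Concretely, for diffusion models this product can be realized by summing the conditional score estimates for each sentence, in the spirit of (Liu et al., 2022). The sketch below shows only the score composition; `denoiser` is a hypothetical pretrained text-conditional noise predictor, not an API from the paper:

```python
import numpy as np

def composed_score(denoiser, x_t, t, sentences, weight=1.0):
    """Noise estimate for the product p(x|t1)p(x|t2)..., following the
    classifier-free form  eps_uncond + sum_i w * (eps(x, t_i) - eps_uncond)."""
    eps_uncond = denoiser(x_t, t, None)          # unconditional noise estimate
    eps = eps_uncond.copy()
    for s in sentences:
        eps += weight * (denoiser(x_t, t, s) - eps_uncond)
    return eps

# Toy stand-in denoiser so the sketch runs end to end: each "sentence"
# pulls the sample toward a different fixed target image.
def toy_denoiser(x_t, t, sentence):
    target = np.zeros_like(x_t) if sentence is None \
        else np.full_like(x_t, float(hash(sentence) % 7) - 3.0)
    return x_t - target  # predicted noise points away from the target

x = np.random.default_rng(2).standard_normal(16)
eps = composed_score(toy_denoiser, x, t=10,
                     sentences=["A blue bird on a tree",
                                "A red car behind the tree"])
print(eps[:4])
```

As Section 5 explains, plugging a composed score directly into the standard reverse process is only approximate; exact sampling from the product requires the annealed MCMC procedure described there.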
3. Generalization to New Distributions

In the previous section, we illustrated how composition can enable us to effectively model a distribution p(x), including regions in which we have seen no data. In this section, we further illustrate how composition enables generalization, allowing us to re-purpose a generative model p(x) to solve a new task by constructing a new generative model q(x).

Figure 7. Planning through Probability Composition. By composing a probability density ptraj(τ), trained to model dynamics in an environment, with a probability density pgoal(τ, g) that specifies a goal state, we can sample plans from a specified start state to a goal condition. (Rows show the U-Maze, Medium, and Large environments; the horizontal axis illustrates the progression of sampling.) Figure from (Janner et al., 2022).
Consider the task of planning, where we wish to construct a generative model q(τ) that samples plans reaching a goal state g starting from a start state s. Given a generative model p(τ), which samples legal, but otherwise unconstrained, state sequences in an environment, we can construct an additional generative model r(τ, s, g) which has high likelihood when τ has start state s and goal state g, and low likelihood everywhere else. By composing the two distributions,

    q(τ) ∝ p(τ) r(τ, s, g),    (1)

we can construct our desired planning distribution q(τ), exploiting the fact that probability can be treated as a "currency" to combine models, enabling us to selectively choose trajectories that satisfy the constraints in both distributions. A runnable sketch of this product composition appears below.
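Since p(τ) can be sampled and r(τ, s, g) can be evaluated pointwise, one simple (if inefficient) way to realize Equation 1 is self-normalized importance sampling: draw candidate trajectories from the prior and resample them in proportion to the constraint likelihood. This is our minimal sketch of the idea, with a toy random-walk prior and a Gaussian start/goal constraint standing in for learned models:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_candidates = 20, 5000
start, goal = np.array([0.0, 0.0]), np.array([3.0, 2.0])

# Prior p(tau): legal but unconstrained 2D random-walk trajectories.
def sample_prior(n):
    steps = 0.4 * rng.standard_normal((n, T, 2))
    return np.cumsum(steps, axis=1)

# Constraint r(tau, s, g): high likelihood when tau starts at s and ends at g.
def log_r(trajs):
    d_start = np.sum((trajs[:, 0] - start) ** 2, axis=1)
    d_goal = np.sum((trajs[:, -1] - goal) ** 2, axis=1)
    return -(d_start + d_goal) / (2 * 0.1 ** 2)

trajs = sample_prior(n_candidates)
log_w = log_r(trajs)
w = np.exp(log_w - log_w.max())
w /= w.sum()
# Resampling according to w yields approximate draws from q ∝ p(tau) r(tau,s,g).
plan = trajs[rng.choice(n_candidates, p=w)]
print("plan endpoint:", plan[-1])  # should land near the goal (3, 2)
```

Importance resampling degrades quickly in high dimensions, which is why practical systems instead rely on the gradient-based MCMC samplers of Section 5.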
Below, we illustrate a set of applications where we can construct new compositional generative models q(x) to solve tasks in planning, constraint satisfaction, hierarchical decision-making, and image and video generation.

Planning with Trajectory Composition. We first consider constructing q(τ) to represent planning, as described in Equation 1. In Figure 7, we illustrate how sampling from this composed distribution enables successful planning from start to goal states. Quantitatively, this approach also performs well, as illustrated in (Janner et al., 2022).

Figure 8. Manipulation through Constraint Composition. New object manipulation problems can be converted into a graph of constraints between variables (e.g., in(A, Box), close-to(A, B), cfree(A, B), cfree(A, C), together with pose, geometry, and grasp variables, and a valid-traj constraint requiring that the arm trajectory trajA connect A's initial pose poseA0 and the target pose poseA given graspA). (a) Visualization of the environment while placing object A. (b) Visualization of the constraint graph associated with the object placement; there are three decision variables. Each constraint can be represented as a low-dimensional factor of the joint distribution, and sampling from the composition of distributions corresponds to solving the arrangement problem. Figure adapted from (Yang et al., 2023b).
Manipulation through Constraint Satisfaction. We next illustrate how we can construct a generative model q(V) to solve a variety of robotic object arrangement tasks. As illustrated in Figure 8, many object arrangement tasks can be formulated as continuous constraint satisfaction problems over a graph G = ⟨V, U, C⟩, where each v ∈ V is a decision variable (such as the pose of an object), each u ∈ U is a conditioning variable (such as the geometry of an object), and each c ∈ C is a constraint such as collision-freeness. Given such a specification, we can solve the robotics task by sampling from the composed distribution

    q(V) ∝ ∏_{c∈C} p_c(V^c | U^c),

which corresponds to solving the constraint satisfaction problem. Such an approach enables effective generalization to new problems (Yang et al., 2023b), as the sketch below illustrates.
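As a minimal sketch of this constraint composition (our toy example, not the solver of Yang et al., 2023b), each constraint contributes an energy term over the subset of decision variables it touches, and the composed unnormalized log-density is simply the sum of the per-constraint terms:

```python
import numpy as np

# Decision variables V: 2D poses of objects A and B packed into one vector.
# Each constraint is a factor over a subset of V, mirroring q(V) ∝ ∏_c p_c(V^c|U^c).
def in_box(pose, lo=-1.0, hi=1.0):
    """Energy penalizing poses outside the box (conditioning variable: box geometry)."""
    return np.sum(np.clip(lo - pose, 0, None) ** 2 + np.clip(pose - hi, 0, None) ** 2)

def close_to(pose_a, pose_b, dist=0.3):
    return (np.linalg.norm(pose_a - pose_b) - dist) ** 2

def collision_free(pose_a, pose_b, radius=0.2):
    overlap = max(0.0, 2 * radius - np.linalg.norm(pose_a - pose_b))
    return 10.0 * overlap ** 2

def composed_energy(V):
    pose_a, pose_b = V[:2], V[2:]
    return (in_box(pose_a) + in_box(pose_b)
            + close_to(pose_a, pose_b) + collision_free(pose_a, pose_b))

# Crude sampler: keep the best of many broad Gaussian proposals,
# an (inefficient) stand-in for the MCMC samplers of Section 5.
rng = np.random.default_rng(4)
proposals = rng.normal(0, 1.5, size=(20000, 4))
energies = np.array([composed_energy(v) for v in proposals])
best = proposals[np.argmin(energies)]
print("pose A:", best[:2], "pose B:", best[2:], "energy:", energies.min())
```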


Hierarchical Planning with Foundation Models. We further illustrate how we can construct a generative model that functions as a hierarchical planner for long-horizon tasks (e.g., "prepare a sandwich"). We construct q(τtext, τimage, τaction), which jointly models the distribution over a text plan τtext, an image plan τimage, and an action plan τaction given a natural-language goal g and an image observation o, by combining pre-existing foundation models trained on Internet knowledge. We formulate q(τtext, τimage, τaction) through the composition

    pLLM(τtext | g) pVideo(τimage | τtext, o) pAction(τaction | τimage).

This distribution assigns a high likelihood to sequences of natural-language instructions τtext that are plausible ways to reach a final goal g (leveraging textual knowledge embedded in an LLM), that are consistent with visual plans τimage starting from image o (leveraging visual dynamics information embedded in a video model), and that are further consistent with execution through actions τaction (leveraging action information in a large action model). Sampling from this distribution then corresponds to finding sequences τtext, τimage, τaction that are mutually consistent with all constraints, and that thus constitute successful hierarchical plans to accomplish the task. We provide an illustration of this composition in Figure 9, with the efficacy of the approach demonstrated in (Ajay et al., 2023).

Figure 9. Hierarchical Planning through Composition. By composing a set of foundation models trained on Internet data (a task model trained on Internet text, a visual model trained on Internet video, and an action model trained on egocentric images), we can zero-shot construct a hierarchical planning system. Figure adapted from (Ajay et al., 2023).
Controllable Image Synthesis. Composition also allows us to construct a generative model q(x | D) to generate images x from a detailed scene description D consisting of text and bounding-box specifications {texti, bboxi}i=1:N. This compositional distribution is

    q(x | D) ∝ ∏_{i∈{1,...,N}} p(x_{bboxi} | texti),

where each distribution is defined over a bounding-box region of the image. In Figure 10, we illustrate the efficacy of this approach for constructing complex images.

Figure 10. Image Tapestries through Composition. By composing a set of probability distributions defined over different spatial regions in an image (here with prompts such as "Movie still of epic space battle", "Starship Enterprise firing phasers", "Giant mocha robot holding a glowing sword", "Glowing phaser beam", "Sun with lens flare", and "Portion of Mars. Debris from atmosphere"), we can construct detailed image tapestries. Figure adapted from (Du et al., 2023).

Figure 11. Video Stylization through Composition. By composing one video model with a model specifying style, we can stylize video generations (e.g., adapting an original generation to digital art, outdoor video, or storybook-illustration styles). Figure adapted from (Yang et al., 2023a).

Style Adaptation of Video Models. Finally, composition can be used to construct a generative model q(τ) that synthesizes video in new styles. Given a pretrained video model ppretrained(τ | text) and a small video model of a particular style padapt(τ | text), we can sample videos τ from the compositional distribution

    ppretrained(τ | text) padapt(τ | text)

to generate new videos in the specified styles. The efficacy of using composition to adapt the style of a video model is illustrated in (Yang et al., 2023a); a score-level sketch follows below.
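At the score level, sampling from this product again reduces to summing the two models' (weighted) scores, since log-densities of a product add. This is our hedged sketch of the pattern; `pretrained` and `adapter` are toy stand-ins, not the actual models of (Yang et al., 2023a):

```python
import numpy as np

def adapted_score(pretrained, adapter, x_t, t, text, w=0.5):
    """Score of the product p_pretrained(tau|text) * p_adapt(tau|text)^w:
    log-densities add, so the composed score is a weighted sum."""
    return pretrained(x_t, t, text) + w * adapter(x_t, t, text)

# Toy stand-ins: each "model" scores how close x is to its preferred mode.
def pretrained(x_t, t, text):
    return -(x_t - 0.0)          # pulls toward the pretrained model's mode at 0

def adapter(x_t, t, text):
    return -(x_t - 2.0)          # a small style model pulling toward its mode at 2

# Noisy gradient ascent on the composed log-density (a stand-in for the
# samplers of Section 5) drifts toward a compromise between the two modes.
rng = np.random.default_rng(5)
x = rng.standard_normal(512)
for _ in range(500):
    x = x + 0.05 * adapted_score(pretrained, adapter, x, t=0, text="storybook style") \
          + np.sqrt(2 * 0.05) * rng.standard_normal(512)
print(round(float(x.mean()), 2))  # ≈ (0 + w*2)/(1 + w) = 2/3 for w = 0.5
```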
4. Generative Modeling with Learned Compositional Structure

A limitation of the compositional generative modeling discussed in the earlier sections is that it requires a priori knowledge of the independence structure of the distribution we wish to model. However, these compositional components can also be discovered jointly while learning a probability distribution, by formulating maximum likelihood estimation as maximizing the likelihood of the factorized distribution

    pθ(x) ∝ ∏_i p_θ^i(x).

As in the previous two sections, the discovery of the learned components p_θ^i(x) enables more data-efficient learning of the generative model, as well as the ability to generate samples from new task distributions. Here, we illustrate three examples of how different factors can be discovered in an unsupervised manner.

Discovering Factors from an Input Image. Given an input image x of a scene, we can parameterize a probability distribution over the pixel values of the image as a product of compositional generative models

    pθ(x) ∝ ∏_i pθ(x | Enc_i(x)),

where each Enc_i(·) is a learned neural encoder with a low-dimensional latent output, encouraging each component to capture a distinct region of the image. By training models to autoencode images under this likelihood, each component distribution pθ(x | Enc_i(x)) finds an interpretable decomposition of images, corresponding to individual objects in a scene as well as to global factors of variation such as lighting (Du et al., 2021; Su et al., 2024). In Figure 12, we illustrate how discovered components pθ(x | z1) and pθ(x | z2), from a model trained on cubes and spheres, and pϕ(x | z3) and pϕ(x | z4), from a separate model trained on trucks and boots, can be composed together to form the distribution

    pθ(x | z1) pθ(x | z2) pϕ(x | z3) pϕ(x | z4),

constructing hybrid scenes with objects from both datasets. A simplified sketch of the corresponding training objective follows below.
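The sketch below shows the shape of such an objective under strong simplifying assumptions (ours, not the architecture of Du et al., 2021 or Su et al., 2024): linear encoders produce K low-dimensional latents, each latent conditions one component whose decoded contributions are combined to reconstruct the image, and the latent bottleneck is what pressures components to specialize:

```python
import numpy as np

rng = np.random.default_rng(6)
img_dim, latent_dim, K = 64, 4, 3

# K component encoders and decoders; the tiny latent_dim is the bottleneck
# that pushes each component toward a distinct factor of the image.
enc = [rng.standard_normal((latent_dim, img_dim)) / np.sqrt(img_dim) for _ in range(K)]
dec = [rng.standard_normal((img_dim, latent_dim)) / np.sqrt(latent_dim) for _ in range(K)]

def reconstruct(x):
    """Composed reconstruction: each factor p(x | Enc_i(x)) contributes one
    decoded component, and the components are combined by averaging."""
    return sum(D @ (E @ x) for D, E in zip(dec, enc)) / K

def loss(x):
    return np.mean((x - reconstruct(x)) ** 2)  # Gaussian negative log-likelihood

def train_step(x, lr=0.5):
    """One exact gradient step on the decoders (encoders frozen for brevity)."""
    resid = x - reconstruct(x)
    for i in range(K):
        z_i = enc[i] @ x
        dec[i] += lr * (2.0 / (img_dim * K)) * np.outer(resid, z_i)

x = rng.standard_normal(img_dim)
print("loss before:", round(loss(x), 4))
for _ in range(100):
    train_step(x)
print("loss after: ", round(loss(x), 4))
```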


Figure 12. Composition of Discovered Objects. Probabilistic components corresponding to individual objects in a scene are discovered, unsupervised, in two datasets using two separate models. Discovered components (illustrated with yellow boxes) can be multiplied together to form new scenes with a hybrid composition of objects. Figure adapted from (Su et al., 2024).

Figure 13. Composition of Discovered Relation Potentials. In a particle dataset, particles exhibit potentials corresponding to invisible springs between particles (Col. 1) or charges between particles (Col. 2). By swapping discovered probabilistic components between each pair of objects across particle systems, we can recombine the trajectories framed in green with a pair of edge potentials from the trajectories framed in red (Col. 3). Figure adapted from (Comas et al., 2023).
Discovering Relational Potentials. Given a trajectory τ of N particles, we can similarly parameterize a probability distribution over the reconstruction of the particle system as a product of components defined over each pairwise interaction between particles,

    pθ(τ) ∝ ∏_{i,j: j≠i} pθ(τ | Enc_ij(τ)),

where Enc_ij(τ) is a latent encoding of the interaction between particles i and j. In Figure 13, we illustrate how the relational potentials discovered on one particle system can be composed with relational potentials discovered on a separate set of forces, to simulate those forces on the original particle system.
Discovering Object Classes From Image Distributions. Given a distribution of images p(x) drawn from different classes in ImageNet, we can model the likelihood of the distribution as the composition

    pθ(x) ∝ pϕ(w | x) ∏_i p_θ^i(x)^{w_i}.

In Figure 14, we illustrate that the discovered components in this setting represent each of the original ImageNet classes in the input distribution of images. We further illustrate how these discovered components can be composed together to generate images containing multiple classes of objects.

Figure 14. Discovering Image Classes. Given a distribution of images drawn from 5 image classes in ImageNet, discovered components correspond to each image class. Components can further be composed together (e.g., p3(x)p5(x) or p4(x)p5(x)) to form new images. Figure adapted from (Liu et al., 2023).

5. Implementing Compositional Generation

In this section, we discuss some challenges in implementing compositional sampling with common generative model parameterizations, and we describe a generative model parameterization that enables effective compositional generation. We then present some practical implementations of compositional sampling in both continuous and discrete domains.

5.1. Challenges With Sampling from Compositional Distributions

Given two probability densities p1(x) and p2(x), it is often difficult to directly sample from the product density p1(x) p2(x).


Existing generative models typically represent probability distributions in a factorized manner to enable efficient learning and sampling, such as at the token level in autoregressive models (Van Den Oord et al., 2016) or across noise levels in diffusion models (Sohl-Dickstein et al., 2015). However, depending on the form of the factorization, the models may not be straightforward to compose.

For instance, consider two learned autoregressive factorizations p1(xi | x0:i−1) and p2(xi | x0:i−1) over sequences x0:T. The autoregressive factorization of the product distribution pproduct(x) ∝ p1(x) p2(x) corresponds to

    pproduct(xi | x0:i−1) ∝ p1(xi | x0:i−1) p2(xi | x0:i−1) · Σ_{xi+1:T} p1(xi+1:T | x0:i) p2(xi+1:T | x0:i),

where we must marginalize over all possible future values xi+1:T. Since this marginalization depends on the value of xi, pproduct(xi | x0:i−1) is not proportional to p1(xi | x0:i−1) p2(xi | x0:i−1), and therefore autoregressive factorizations are not directly compositional. Similarly, two learned score functions from diffusion models are not directly composable, as their sum does not correspond to the noisy gradient of the product distribution (Du et al., 2023).
While it is often difficult to combine generative models in general, representing the probability density explicitly enables us to combine models by manipulating the density. One such approach is to represent each probability density as an Energy-Based Model (EBM), pi(x) ∝ e^{−Ei(x)} (Hinton, 2002; Du & Mordatch, 2019). Under this parameterization, the product density by definition satisfies

    e^{−(E1(x)+E2(x))} ∝ e^{−E1(x)} e^{−E2(x)},    (2)

corresponding to a new EBM with energy E1(x) + E2(x). It is important to observe that EBMs generally represent probability densities in an unnormalized manner, and the product of two normalized probability densities p1(x) and p2(x) will be an unnormalized probability density as well (where the normalization constant is intractable to compute, as it requires marginalization over the sample space). Additional operations between probability densities, such as mixtures and inversions of distributions, can also be expressed as combinations of energy functions (Du et al., 2020a).

To generate samples from any EBM distribution, it is necessary to run Markov Chain Monte Carlo (MCMC) to iteratively refine a starting sample into one that has high likelihood (low energy) under the EBM. We present practical MCMC algorithms for sampling from composed EBM distributions in continuous spaces in Section 5.2 and in discrete spaces in Section 5.3. Recently, new methods for implementing compositional sampling using separately trained classifiers to efficiently specify each conditioned factor have also been developed (Garipov et al., 2023), which we encourage the reader to read as well.

5.2. Effective Compositional Sampling on Continuous Distributions

Given a composed distribution represented as an EBM E(x) defined over inputs x ∈ R^D, directly finding a low-energy sample through MCMC becomes increasingly inefficient as the data dimension D rises. To more effectively find low-energy samples of EBMs in high-dimensional continuous spaces, we can use the gradient of the energy function to help guide sampling. In Du & Mordatch (2019), Langevin dynamics is used to implement efficient sampling, where a sample is repeatedly refined using the update

    x_t = x_{t−1} − λ ∇x E(x_{t−1}) + ϵ,   ϵ ∼ N(0, σ²),

where x_0 is initialized from uniform noise. By converting different operations, such as products, mixtures, and inversions of probability distributions, into composite energy functions, this sampling procedure allows us to effectively sample from composed distributions (Du et al., 2020a); a runnable sketch is given below.

There has been a substantial body of recent work on improving learning in EBMs (Du & Mordatch, 2019; Nijkamp et al., 2019; Grathwohl et al., 2019; Du et al., 2020b; Grathwohl et al., 2021), but EBMs still lag behind other generative approaches in the efficiency and scalability of training. By leveraging the close connection between diffusion models and EBMs (Song & Ermon, 2019), we can also directly implement the compositional EBM operations with diffusion models (Du et al., 2023), which we briefly describe below.

Given a diffusion model representing a distribution p(x), we can interpret the T learned denoising functions ϵ(x, t) of the diffusion model as representing T separate EBM distributions e^{−E(x,t)}, where ∇x E(x, t) = ϵ(x, t). This sequence of EBM distributions transitions from e^{−E(x,T)}, representing the Gaussian distribution N(0, 1), to e^{−E(x,0)}, representing the target distribution p(x). We can draw samples from this sequence of EBMs using annealed importance sampling (Du et al., 2023), where we initialize a sample from Gaussian noise and sequentially run several steps of MCMC on each EBM distribution, starting at e^{−E(x,T)} and ending at e^{−E(x,0)}.

This EBM interpretation of diffusion models allows them to be composed using operations such as Equation 2 by applying the operation to each intermediate EBM corresponding to the component diffusion distributions, for instance e^{−(E1(x,k)+E2(x,k))}. We can then use the annealed importance sampling procedure on this sequence of composite EBMs. Note that this annealed procedure is necessary for accurate compositional sampling – running the reverse diffusion process directly on the composed score does not sample from the composed distribution (Du et al., 2023).

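Putting Equation 2 together with the Langevin update above, the sketch below samples from the product of two hand-specified 2D Gaussian EBMs. It is a minimal illustration (our toy energies, fixed step size, no Metropolis correction), not the exact setup of the cited works:

```python
import numpy as np

# Two component EBMs: E_i(x) = ||x - mu_i||^2 / (2 sigma^2), i.e. Gaussians.
mu1, mu2, sigma = np.array([-1.0, 0.0]), np.array([1.0, 0.0]), 1.0

def grad_E(x):
    """Gradient of the composed energy E1(x) + E2(x) (Equation 2)."""
    return (x - mu1) / sigma**2 + (x - mu2) / sigma**2

# Langevin dynamics: x_t = x_{t-1} - lam * grad E(x_{t-1}) + noise.
rng = np.random.default_rng(7)
lam, n_steps, n_chains = 0.01, 2000, 1000
x = rng.uniform(-3, 3, size=(n_chains, 2))       # initialized from uniform noise
for _ in range(n_steps):
    x = x - lam * grad_E(x) + np.sqrt(2 * lam) * rng.standard_normal(x.shape)

# The product of the two unit Gaussians is a Gaussian at the midpoint (0, 0)
# with variance 1/2, which the empirical samples should match.
print("mean:", x.mean(axis=0), "var:", x.var(axis=0))
```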

A variety of MCMC samplers, such as ULA, MALA, U-HMC, and HMC, can be used as the intermediate MCMC samplers for this sequence of EBM distributions. One easy-to-implement and easy-to-understand MCMC transition kernel is the standard diffusion reverse sampling kernel applied at a fixed noise level. We illustrate in Appendix A that this is equivalent to running a ULA MCMC sampling step. This allows compositional sampling in diffusion models to be implemented by simply constructing the score function corresponding to the composite distribution we wish to sample from and then running the standard diffusion sampling procedure, but with the diffusion reverse step applied multiple times at each noise level, as sketched below.
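The following sketch shows that overall loop structure: at every noise level, the scores of two component models are summed into a composite score, and the fixed-level kernel is applied several times before moving to the next level. Toy analytic scores make it self-contained; in practice they would be the networks' noise predictions converted to scores, and the schedule constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
T, mcmc_steps_per_level, dim = 50, 5, 2
betas = np.linspace(1e-3, 5e-2, T)               # illustrative noise schedule

# Toy component scores at noise level t: Gaussians widened by the noise level.
def score(x, t, mu):
    var = 1.0 + betas[: t + 1].sum()             # stand-in for the perturbed density
    return -(x - mu) / var

def composed_score(x, t):
    # Product composition (Equation 2): the factor scores simply add.
    return score(x, t, np.array([-1.0, 0.0])) + score(x, t, np.array([1.0, 0.0]))

x = rng.standard_normal((2000, dim))             # initialize from Gaussian noise
for t in reversed(range(T)):                     # anneal from e^{-E(x,T)} to e^{-E(x,0)}
    for _ in range(mcmc_steps_per_level):        # several MCMC steps per level
        # Fixed-level reverse kernel == ULA step with step size beta_t (Appendix A).
        x = x + betas[t] * composed_score(x, t) \
              + np.sqrt(2 * betas[t]) * rng.standard_normal(x.shape)

print("sample mean:", x.mean(axis=0))            # near the product mode at (0, 0)
```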
5.3. Effective Compositional Sampling on Discrete Distributions

Given an EBM representing a composed distribution E(x) over a high-dimensional discrete landscape, we can use Gibbs sampling to sample from the resultant distribution, repeatedly resampling the values of individual dimensions of x using the conditional energy function E(xi | x−i). However, this process becomes increasingly inefficient as the underlying dimensionality of the data increases.

Using a gradient of the energy function E(x) to accelerate sampling is difficult in the discrete landscape, as the gradient operation is not well defined in discrete space (though there are promising discrete analogs of gradient samplers (Grathwohl et al., 2021)). However, we can leverage our learned generative distributions to accelerate sampling by using one generative model as a proposal distribution and the remaining energy functions to implement a Metropolis-Hastings step (Li et al., 2022; Verkuil et al., 2022).

As an example, to sample from an energy function E(x) = E1(x) + E2(x), given a current MCMC sample x_t, we can propose a new sample x_{t+1} by sampling from the learned distribution e^{−E1(x)} and accept the proposal with the Metropolis acceptance probability

    a(x_{t+1}) = min(1, e^{E2(x_t) − E2(x_{t+1})}),

as sketched below.
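A minimal sketch of this Metropolis-Hastings scheme on binary vectors follows (our toy energies; in the cited works the proposal e^{−E1} would be a learned sequence model rather than an exactly sampleable factorized distribution):

```python
import numpy as np

rng = np.random.default_rng(9)
dim, n_steps = 16, 5000

# E1: independent per-bit energies, so its distribution can be sampled exactly
# and serve as the proposal (a stand-in for a learned generative model).
bias = rng.normal(0, 1, size=dim)
p_one = 1.0 / (1.0 + np.exp(-bias))              # P(x_i = 1) under e^{-E1}

def sample_e1(n):
    return (rng.random((n, dim)) < p_one).astype(int)

# E2: a constraint energy favoring vectors whose bits sum to a target count.
def E2(x):
    return 0.5 * (x.sum(axis=-1) - 8) ** 2

# Metropolis-Hastings targeting e^{-(E1 + E2)}: propose from e^{-E1},
# accept with a = min(1, exp(E2(x_t) - E2(x_{t+1}))).
x = sample_e1(1)[0]
samples = []
for _ in range(n_steps):
    prop = sample_e1(1)[0]
    if rng.random() < np.exp(min(0.0, E2(x) - E2(prop))):
        x = prop
    samples.append(x.copy())

counts = np.array([s.sum() for s in samples])
print("mean bit count:", counts.mean())          # concentrates near the target 8
```

Because the proposal is drawn from e^{−E1}, that factor cancels in the acceptance ratio, leaving only the remaining energy E2 to evaluate, exactly as in the expression above.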
6. Discussion and Future Directions

Most recent research on building generative models has focused on increasing the computational scale and data on which models are trained. We have presented an orthogonal direction for constructing complex generative systems: building them compositionally, combining simpler generative models to form more complex ones. We have illustrated how this can be more data- and computation-efficient to learn, how it enables flexible reprogramming, and how the components themselves can be discovered from raw data.

Such compositional systems have additional benefits in terms of both buildability and interpretability. As individual models are responsible for independent subsets of the data, each model can be built separately and modularly by different institutions. Simultaneously, at execution time, it is significantly easier to understand and monitor the execution of each simpler constituent model than that of a single large monolithic model.

In addition, such compositional systems can be more environmentally friendly and easier to deploy than large monolithic models. As the individual models are substantially smaller, they can run efficiently using small amounts of computation. Simultaneously, it is more straightforward to deploy separate models across separate computational machines.

In the setting of constructing an artificially intelligent agent, such a compositional architecture may look like the decentralized decision-making system in Figure 15. In this system, separate generative models are responsible for processing each modality an agent receives, while other models are responsible for decision-making. Sampling from the composed generative distribution of the models corresponds to message passing between models, inducing cross-communication similar to a set of daemons communicating with each other (Selfridge, 1988). Individual generative models in this architecture can be substituted with existing models, such as LLMs for proposing plausible plans of action and text-to-video models for predicting future world states.

Figure 15. Decentralized Decision Making. By composing generative models operating over various modalities (e.g., audio, image, language, video, memory, and action models), we can construct decentralized architectures for intelligent agents. Communication between models is induced by inference over the joint distribution.

Finally, while we have provided a few promising results on applications of compositional generative modeling, much remains to be explored. Future lines of work in this area include: (1) How do we pass messages and sample from joint distributions efficiently? (2) How can we discover the compositional structure inside a generative system? (3) How can we dynamically modify and discover the correct compositional structure of generative models under distribution shift?


References

Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.

Ajay, A., Han, S., Du, Y., Li, S., Gupta, A., Jaakkola, T., Tenenbaum, J., Kaelbling, L., Srivastava, A., and Agrawal, P. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Canonne, C. L. A short note on learning discrete distributions. arXiv preprint arXiv:2002.11457, 2020.

Comas, A., Du, Y., Lopez, C. F., Ghimire, S., Sznaier, M., Tenenbaum, J. B., and Camps, O. Inferring relational potentials in interacting systems. In International Conference on Machine Learning, pp. 6364–6383. PMLR, 2023.

Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Du, Y., Li, S., and Mordatch, I. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020a.

Du, Y., Li, S., Tenenbaum, J., and Mordatch, I. Improved contrastive divergence training of energy based models. arXiv preprint arXiv:2012.01316, 2020b.

Du, Y., Li, S., Sharma, Y., Tenenbaum, B. J., and Mordatch, I. Unsupervised learning of compositional energy concepts. In Advances in Neural Information Processing Systems, 2021.

Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In International Conference on Machine Learning, pp. 8489–8510. PMLR, 2023.

Dziri, N., Lu, X., Sclar, M., Li, X. L., Jian, L., Lin, B. Y., West, P., Bhagavatula, C., Bras, R. L., Hwang, J. D., et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023.

Epoch. Key trends and figures in machine learning, 2023. URL https://epochai.org/trends. Accessed: 2024-01-23.

Garipov, T., De Peuter, S., Yang, G., Garg, V., Kaski, S., and Jaakkola, T. Compositional sculpting of iterative generative processes. arXiv preprint arXiv:2309.16115, 2023.

Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.

Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D., and Maddison, C. Oops I took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pp. 3831–3841. PMLR, 2021.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.

Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.

Li, S., Du, Y., Tenenbaum, J. B., Torralba, A., and Mordatch, I. Composing ensembles of pre-trained models via iterative consensus. arXiv preprint arXiv:2210.11522, 2022.

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.

Liu, N., Du, Y., Li, S., Tenenbaum, J. B., and Torralba, A. Unsupervised compositional concepts discovery with text-to-image generative models. arXiv preprint arXiv:2306.05357, 2023.

Lu, S., Bigoulaeva, I., Sachdeva, R., Madabushi, H. T., and Gurevych, I. Are emergent abilities in large language models just in-context learning? arXiv preprint arXiv:2309.01809, 2023.

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, S., and Rajeswaran, A. OpenEQA: Embodied question answering in the era of foundation models. 2023.

Murphy, K. P. Probabilistic machine learning: an introduction. MIT Press, 2022.

Nijkamp, E., Hill, M., Han, T., Zhu, S.-C., and Wu, Y. N. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.

OpenAI. Pricing. URL https://openai.com/pricing.

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2022.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.

Selfridge, O. G. Pandemonium: A paradigm for learning. In Neurocomputing: Foundations of Research, pp. 115–122. 1988.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Su, J., Liu, N., Tenenbaum, J. B., and Du, Y. Compositional image decomposition with diffusion models, 2024. URL https://openreview.net/forum?id=88FcNOwNvM.

Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.

Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. PMLR, 2016.

Verkuil, R., Kabeli, O., Du, Y., Wicky, B. I., Milles, L. F., Dauparas, J., Baker, D., Ovchinnikov, S., Sercu, T., and Rives, A. Language models generalize beyond natural proteins. bioRxiv, pp. 2022–12, 2022.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Yadlowsky, S., Doshi, L., and Tripuraneni, N. Pretraining data mixtures enable narrow model selection capabilities in transformer models. arXiv preprint arXiv:2311.00871, 2023.

Yang, M., Du, Y., Dai, B., Schuurmans, D., Tenenbaum, J. B., and Abbeel, P. Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023a.

Yang, Z., Mao, J., Du, Y., Wu, J., Tenenbaum, J. B., Lozano-Pérez, T., and Kaelbling, L. P. Compositional diffusion-based continuous constraint solvers. arXiv preprint arXiv:2309.00966, 2023b.

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.

Appendix

A. Implementing ULA Transitions as Multiple Reverse Diffusion Steps

We illustrate how a step of reverse sampling from a diffusion model at a fixed noise level is equivalent to a ULA MCMC sampling step at that same noise level. We use the αt and βt formulation from (Ho et al., 2020). The reverse sampling step on an input xt at a fixed noise level t is given by a Gaussian with mean

    µθ(xt, t) = xt − (βt / √(1 − ᾱt)) ϵθ(xt, t)

and variance βt (using the "small" variance schedule in (Ho et al., 2020)). This corresponds to the sampling update

    xt+1 = xt − (βt / √(1 − ᾱt)) ϵθ(xt, t) + √βt ξ,   ξ ∼ N(0, 1).

Note that the expression ϵθ(xt, t) / √(1 − ᾱt) corresponds to the gradient ∇x Et(x) of an EBM energy, through the denoising score matching objective (Vincent, 2011), where the EBM e^{−Et(x)} corresponds to the data distribution perturbed with t steps of noise. The reverse sampling step can therefore be equivalently written as

    xt+1 = xt − βt ∇x Et(xt) + √βt ξ,   ξ ∼ N(0, 1).    (A1)

The ULA sampler draws an MCMC sample from the EBM probability distribution e^{−Et(x)} using the update

    xt+1 = xt − η ∇x Et(xt) + √(2η) ξ,   ξ ∼ N(0, 1),    (A2)

where η is the sampling step size. Substituting η = βt into the ULA sampler, the update becomes

    xt+1 = xt − βt ∇x Et(xt) + √(2βt) ξ,   ξ ∼ N(0, 1).    (A3)

Note the similarity between the ULA update in Eqn. A3 and the reverse sampling step in Eqn. A1: the only difference is a factor of √2 scaling on the added Gaussian noise. This means we can implement ULA sampling by running the standard reverse process but scaling the noise added at each timestep by a factor of √2. Alternatively, we can directly use the reverse sampling step in Eqn. A1 to run ULA, which then corresponds to sampling from a tempered variant of pt(x) with temperature 1/√2 (yielding less stochastic samples from the composed probability distribution). A small numerical check of this equivalence is sketched below.
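The equivalence is easy to check numerically. In this sketch (ours; a one-dimensional ideal Gaussian model stands in for the learned network, so the perturbed score is known in closed form), the reverse-step kernel with √2-scaled noise and the ULA kernel with η = βt reach the same stationary variance, while the plain reverse kernel reaches the tempered (halved) variance:

```python
import numpy as np

rng = np.random.default_rng(10)
beta_t, alpha_bar_t, n, steps = 0.02, 0.5, 100000, 500

# For data x0 ~ N(0, 1), the noise-perturbed marginal p_t is also a unit
# Gaussian here, so the ideal noise prediction is known in closed form.
def grad_E(x):
    eps = np.sqrt(1.0 - alpha_bar_t) * x         # ideal eps_theta(x, t)
    return eps / np.sqrt(1.0 - alpha_bar_t)      # energy gradient of Eqn. A1

def reverse_step(x, noise_scale=1.0):
    """Fixed-noise-level reverse kernel (Eqn. A1); noise_scale=sqrt(2) gives ULA."""
    return x - beta_t * grad_E(x) \
             + noise_scale * np.sqrt(beta_t) * rng.standard_normal(x.shape)

def ula_step(x, eta=beta_t):
    """ULA kernel (Eqn. A3) with step size eta = beta_t."""
    return x - eta * grad_E(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

x_scaled = rng.standard_normal(n)                # reverse step with sqrt(2) noise
x_ula = rng.standard_normal(n)                   # explicit ULA
x_plain = rng.standard_normal(n)                 # plain reverse step (tempered)
for _ in range(steps):
    x_scaled = reverse_step(x_scaled, noise_scale=np.sqrt(2.0))
    x_ula = ula_step(x_ula)
    x_plain = reverse_step(x_plain)

print("var, reverse + sqrt(2) noise: ", round(x_scaled.var(), 3))   # ≈ 1.0
print("var, ULA:                     ", round(x_ula.var(), 3))      # ≈ 1.0
print("var, plain reverse (tempered):", round(x_plain.var(), 3))    # ≈ 0.5
```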
