Compositional Generative Modeling: A Single Model is Not All You Need
The underlying compositional components in generative modeling can in many cases be directly inferred and discovered in an unsupervised manner from data, representing compositional structure such as objects and relations. Such discovered components can then be similarly recombined to form new distributions – for instance, object components discovered by one generative model on one dataset can be combined with components discovered by a separate generative model on another dataset to form hybrid scenes with objects from both datasets. We illustrate the efficacy of such discovered compositional structure across domains in images and trajectory dynamics. Overall, in this paper, we advocate for the idea that we should construct complex generative systems compositionally, combining simpler generative models to form more complex ones.

One approach to significantly reduce the amount of data necessary to learn generative models over complex joint distributions is factorization – if we know that a distribution exhibits an independence structure between variables, such as

p(A, B, C, D) ∝ p(A) p(B) p(C, D),

we can substantially reduce the data requirements by only needing to learn these factors, composing them together to form a more complex distribution. This also enables our learned joint distribution to generalize to unseen combinations of variables so long as each local variable combination is in distribution (illustrated in Figure 3).
Even in settings where distributions are not accurately modeled as a product
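As a toy numerical illustration of this data-efficiency argument (our own sketch, not an experiment from the paper; the K = 10 toy distribution is hypothetical), we can fit the three factors of p(A, B, C, D) ∝ p(A) p(B) p(C, D) independently from samples and recompose them into the full joint:

import numpy as np

rng = np.random.default_rng(0)
K = 10  # each variable takes K values

# Ground-truth factors of a hypothetical toy distribution.
pA = rng.dirichlet(np.ones(K))
pB = rng.dirichlet(np.ones(K))
pCD = rng.dirichlet(np.ones(K * K)).reshape(K, K)

def sample(n):
    A = rng.choice(K, size=n, p=pA)
    B = rng.choice(K, size=n, p=pB)
    cd = rng.choice(K * K, size=n, p=pCD.ravel())
    return A, B, cd // K, cd % K

A, B, C, D = sample(5_000)

# Estimate each factor independently: O(K) or O(K^2) parameters each,
# instead of the O(K^4) parameters of the full joint table.
est_A = np.bincount(A, minlength=K) / len(A)
est_B = np.bincount(B, minlength=K) / len(B)
est_CD = np.histogram2d(C, D, bins=(K, K), range=[[0, K], [0, K]])[0] / len(C)

# Compose the learned factors back into the full joint distribution.
est_joint = est_A[:, None, None, None] * est_B[None, :, None, None] \
    * est_CD[None, None, :, :]
true_joint = pA[:, None, None, None] * pB[None, :, None, None] \
    * pCD[None, None, :, :]
print("total variation distance:", 0.5 * np.abs(est_joint - true_joint).sum())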
Figure 3. Generalizing Outside Training Data. Given a narrow slice of training data, we can learn generative models that generalize outside the data through composition. We learn separate generative models to model each axis of the data (Factor 1 and Factor 2) – the composition of models can then cover the entire data space.

Figure 5. Compositional Trajectory Generation. By factorizing a trajectory generative model into a set of components, models are able to more accurately simulate dynamics from limited trajectories (a) and train in fewer training iterations (b).

Figure 6 (example prompts). Products of sentence-conditioned distributions, e.g., "A couch right next to the windows" AND "A table in front of the couch" AND "A vase of flowers on top of the table"; "A blue bird on a tree" AND "A red car behind the tree" AND "A green forest in the background"; "A green tree swaying in the wind" AND "A red brick house located behind a tree" AND "A lawn in front of the house"; "A pink sky" AND "A blue mountain in the horizon" AND "Cherry Blossoms in front of the mountain".
We can represent text-conditional image generation as a product of distributions given sentences t1, t2, and t3 in the description T:

p(x | T) ∝ p(x | t1) p(x | t2) p(x | t3).

This representation of the distribution is more data efficient: we only need to see the full distribution of images given single sentences. In addition, it enables us to generalize to unseen regions of p(x | T), such as unseen combinations of sentences and longer text descriptions. In Figure 6, we illustrate the efficacy of such an approach.
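A minimal sketch of sampling from this product with a diffusion model (our illustration; score_model is a hypothetical text-conditional noise predictor). Since the score of a product of densities is the sum of the component scores, the per-sentence predictions can be summed at each noise level, though, as Section 5 discusses, running the plain reverse process on a composed score is only an approximation without additional MCMC correction:

import torch

def composed_eps(score_model, x, t, sentences):
    # Sum of per-sentence noise predictions: at a fixed noise level this
    # corresponds to summing the component scores of p(x | t_i).
    # (Practical systems often use the variant eps_u + sum_i(eps_i - eps_u)
    # with an additional unconditional model eps_u.)
    return sum(score_model(x, t, s) for s in sentences)

@torch.no_grad()
def sample(score_model, sentences, shape, betas):
    # betas: 1-D tensor of diffusion noise levels.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        eps_hat = composed_eps(score_model, x, t, sentences)
        # Standard DDPM-style reverse update with the composed prediction.
        mean = (x - betas[t] * eps_hat / (1 - alpha_bars[t]).sqrt()) \
            / alphas[t].sqrt()
        x = mean + (betas[t].sqrt() * torch.randn_like(x) if t > 0 else 0)
    return x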
3. Generalization to New Distributions
In the previous section, we illustrated how composition can enable us to effectively model a distribution p(x), including areas where we have not seen any data. In this section, we further illustrate how composition enables generalization, allowing us to re-purpose a generative model p(x) to solve a new task by constructing a new generative model q(x).

Figure 7. Planning through Probability Composition. By composing a probability density trained on modeling dynamics in an environment, ptraj(τ), with a probability density pgoal(τ, g) which specifies a specific goal state, we can sample plans from a specified start to a goal condition (shown on U-Maze, Medium, and Large environments). Figure from (Janner et al., 2022), where the horizontal axis illustrates the progression of sampling.
Consider the task of planning, where we wish to construct a generative model q(τ) which samples plans that reach a goal state g starting from a start state s. Given a generative model p(τ), which samples legal, but otherwise unconstrained, state sequences in an environment, we can construct an additional generative model r(τ, s, g) which has high likelihood when τ has start state s and goal state g, and low likelihood everywhere else. By composing the two distributions,

q(τ) ∝ p(τ) r(τ, s, g),    (1)

we can construct our desired planning distribution q(τ), exploiting the fact that probability can be treated as a "currency" to combine models, enabling us to selectively choose trajectories that satisfy the constraints in both distributions.

Figure 8. Manipulation through Constraint Composition. New object manipulation problems can be converted into a graph of constraints between variables, where each constraint can be represented as a low-dimensional factor of the joint distribution, with sampling from the composition of distributions corresponding to solving the arrangement problem. (a) Visualization of the environment while placing object A. (b) Visualization of the constraint graphs associated with the object placement; there are three decision variables. Figure adapted from (Yang et al., 2023b).
Below, we illustrate a set of applications where we can construct new compositional generative models q(x) to solve tasks in planning, constraint satisfaction, hierarchical decision-making, and image and video generation.

Planning with Trajectory Composition. We first consider constructing q(τ) representing planning as described in Equation 1. In Figure 7, we illustrate how sampling from this composed distribution enables successful planning from start to goal states. Quantitatively, this approach also performs well, as illustrated in (Janner et al., 2022).
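To make Equation 1 concrete, here is a hedged toy sketch (ours, not the implementation of Janner et al. (2022)): E_traj is a hypothetical learned trajectory energy with p(τ) ∝ e^(−E_traj(τ)), the constraint density r(τ, s, g) is realized as a quadratic endpoint penalty, and we sample q(τ) by Langevin dynamics on the summed energy (see Section 5.2):

import torch

def constraint_energy(traj, start, goal, weight=100.0):
    # Low energy (high likelihood) only when endpoints match s and g.
    return weight * ((traj[0] - start) ** 2).sum() + \
           weight * ((traj[-1] - goal) ** 2).sum()

def sample_plan(E_traj, start, goal, horizon=64, dim=2,
                steps=200, step_size=1e-2):
    # E_traj: callable mapping a (horizon, dim) trajectory to a scalar energy.
    traj = torch.randn(horizon, dim, requires_grad=True)
    for _ in range(steps):
        # Energy of the product q(τ) ∝ p(τ) r(τ, s, g) is the sum of energies.
        energy = E_traj(traj) + constraint_energy(traj, start, goal)
        (grad,) = torch.autograd.grad(energy, traj)
        with torch.no_grad():
            noise = torch.randn_like(traj)
            traj += -step_size * grad + (2 * step_size) ** 0.5 * noise
    return traj.detach()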
Manipulation through Constraint Satisfaction. We next illustrate how we can construct a generative model q(V) to solve a variety of robotic object arrangement tasks. As illustrated in Figure 8, many object arrangement tasks can be formulated as continuous constraint satisfaction problems consisting of a graph G = ⟨V, U, C⟩, where each v ∈ V is a decision variable (such as the pose of an object), each u ∈ U is a conditioning variable (such as the geometry of an object), and each c ∈ C is a constraint such as collision-free. Given such a specification, we can solve the robotics task by sampling from the composed distribution

q(V) ∝ ∏_{c∈C} pc(V^c | U^c),

corresponding to solving the constraint satisfaction problem. Such an approach enables effective generalization to new problems (Yang et al., 2023b).

Hierarchical Planning with Foundation Models. We further illustrate how we can construct a generative model that functions as a hierarchical planner for long-horizon tasks. We construct q(τtext, τimage, τaction), which jointly models the distribution over a text plan τtext, an image plan τimage, and an action plan τaction given a natural-language goal g and an image observation o, by combining pre-existing foundation models trained on Internet knowledge. We formulate q(τtext, τimage, τaction) through the composition

pLLM(τtext | g) pVideo(τimage | τtext, o) pAction(τaction | τimage).
This distribution assigns a high likelihood to sequences of natural-language instructions τtext that are plausible ways to reach a final goal g (leveraging textual knowledge embedded in an LLM), which are consistent with visual plans τimage starting from image o (leveraging visual dynamics information embedded in a video model), which are further consistent with execution with actions τaction (leveraging action information in a large action model). Sampling from this distribution then corresponds to finding sequences τtext, τimage, τaction that are mutually consistent with all constraints, and thus constitute successful hierarchical plans to accomplish the task. We provide an illustration of this composition in Figure 9, with the efficacy of this approach demonstrated in (Ajay et al., 2023).
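As a rough sketch of how such a composition can be used (our simplification, not the method of Ajay et al. (2023); the llm, video, and action objects with sample/logp interfaces are hypothetical stubs), one can draw candidate plans ancestrally and rank them by their joint likelihood under all three models:

def hierarchical_plan(goal, obs, llm, video, action, n_candidates=8):
    # Crude approximation of sampling from the product
    # pLLM(t_text | g) pVideo(t_img | t_text, o) pAction(t_act | t_img):
    # draw candidates, then keep the jointly most likely one.
    best, best_logp = None, float("-inf")
    for _ in range(n_candidates):
        text_plan = llm.sample(goal)             # τ_text ~ pLLM(· | g)
        img_plan = video.sample(text_plan, obs)  # τ_image ~ pVideo(· | τ_text, o)
        act_plan = action.sample(img_plan)       # τ_action ~ pAction(· | τ_image)
        logp = (llm.logp(text_plan, goal)
                + video.logp(img_plan, text_plan, obs)
                + action.logp(act_plan, img_plan))
        if logp > best_logp:
            best, best_logp = (text_plan, img_plan, act_plan), logp
    return best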
Figure 9. Hierarchical Planning through Composition. By composing a set of foundation models trained on Internet data (a task/language model, a visual model, and an action model), we can construct a hierarchical planner for long-horizon tasks such as "Prepare Sandwich".

Figure 10. Image Tapestries through Composition. By composing a set of probability distributions defined over different spatial regions in an image (prompt excerpts such as "Portion of Mars. Debris from atmosphere..."), we can construct detailed image tapestries. Figure adapted from (Du et al., 2023).
Controllable Image Synthesis. Composition also allows us to construct a generative model q(x | D) to generate images x from a detailed scene description D consisting of text and bounding-box descriptions {texti, bboxi}i=1:N. This compositional distribution is

q(x | D) ∝ ∏_{i∈{1,...,N}} p(x_bboxi | texti),

where each distribution is defined over a bounding box in the image. In Figure 10, we illustrate the efficacy of this approach for constructing complex images.
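A minimal sketch of this regional composition (our illustration; eps_model is a hypothetical text-conditional noise predictor): each p(x_bboxi | texti) contributes a prediction only inside its bounding-box mask, and the masked predictions are summed into one composite score for the canvas:

import torch

def region_mask(bbox, H, W):
    x0, y0, x1, y1 = bbox
    mask = torch.zeros(1, 1, H, W)
    mask[..., y0:y1, x0:x1] = 1.0
    return mask

def composed_region_eps(eps_model, x, t, regions):
    # regions: list of (text, bbox) pairs; x: (1, C, H, W) noisy image.
    # Pixels outside every box get a zero prediction in this toy version;
    # real systems typically add a background or unconditional component.
    _, _, H, W = x.shape
    eps = torch.zeros_like(x)
    for text, bbox in regions:
        eps = eps + region_mask(bbox, H, W) * eps_model(x, t, text)
    return eps  # plug into a standard diffusion sampler (see Section 5)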
Style Adaptation of Video Models. Finally, composition can be used to construct a generative model q(τ) that synthesizes video in new styles. Given a pretrained video model ppretrained(τ | text) and a small video model of a particular style padapt(τ | text), we can sample videos τ from the product of the two distributions to generate new videos in different specified styles (e.g., digital art, outdoor video, or storybook illustration). The efficacy of using composition to adapt the style of a video model is illustrated in (Yang et al., 2023a).

4. Generative Modeling with Learned Compositional Structure

A limitation of the compositional generative modeling discussed in the earlier sections is that it requires a priori knowledge about the independence structure of the distribution we wish to model. However, these compositional components can also be discovered jointly while learning a probability distribution, by formulating maximum likelihood estimation as maximizing the likelihood of the factorized distribution

pθ(x) ∝ ∏i pθi(x).

Similar to the previous two sections, the discovery of the learned components pθi(x) enables more data-efficient learning of the generative model as well as the ability to generate samples from new task distributions. Here, we illustrate three examples of how different factors can be discovered in an unsupervised manner.
Discovering Factors from an Input Image. Given an input image x of a scene, we can parameterize a probability distribution over the pixel values of the image as a product of compositional generative models

pθ(x) ∝ ∏i pθ(x | Enci(x)),

where Enc(·) is a learned neural encoder with a low-dimensional latent output, to encourage each component to capture distinct regions of an image. By training models to autoencode images with this likelihood expression, each component distribution pθ(x | Enci(x)) finds an interpretable decomposition of images corresponding to individual objects in a scene, as well as global factors of variation in the scene such as lighting (Du et al., 2021; Su et al., 2024). In Figure 12, we illustrate how these discovered components – pθ(x | z1) and pθ(x | z2) from a model trained on cubes and spheres, and pϕ(x | z3) and pϕ(x | z4) from a separate model trained on trucks and boots – can be composed together to form the distribution

pθ(x | z1) pθ(x | z2) pϕ(x | z3) pϕ(x | z4),

to construct hybrid scenes with objects from both datasets.

Figure 12. Composition of Discovered Objects. Probabilistic components corresponding to individual objects in a scene are discovered unsupervised in two datasets using two separate models. Discovered components (illustrated with yellow boxes) can be multiplied together to form new scenes with a hybrid composition of objects. Figure adapted from (Su et al., 2024).
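A schematic training objective for this setup (our sketch, assuming a denoising-diffusion parameterization in the spirit of Su et al. (2024); encoder, eps_model, and noise_schedule are hypothetical): the encoder splits an image into k low-dimensional latents, and the image is reconstructed under the product of the k components, implemented by summing their noise predictions:

import torch

def factor_discovery_loss(encoder, eps_model, x, noise_schedule):
    # encoder(x): (batch, k, z_dim) latents; noise_schedule: 1-D tensor of
    # cumulative alpha-bar values indexed by timestep.
    zs = encoder(x)
    t = torch.randint(0, len(noise_schedule), (x.shape[0],))
    noise = torch.randn_like(x)
    a_bar = noise_schedule[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise
    # Score of the product of components = sum of per-component scores.
    eps_hat = sum(eps_model(x_t, t, zs[:, i]) for i in range(zs.shape[1]))
    return ((eps_hat - noise) ** 2).mean()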
Discovering Relational Potentials. Given a trajectory τ of N particles, we can similarly parameterize a probability distribution over the reconstruction of the particle system as a product of components defined over each pairwise interaction between particles,

pθ(τ) ∝ ∏_{i,j: j≠i} pθ(τ | Encij(τ)),

where Encij(τ) corresponds to a latent encoding of the interaction between particles i and j. In Figure 13, we illustrate how the relational potentials discovered on one particle system can be composed with relational potentials discovered on a separate set of forces to simulate those forces on the particle system.

Figure 13. Composition of Discovered Relation Potentials. In a particle dataset, particles exhibit potentials corresponding to invisible springs between particles (Col. 1) or charges between particles (Col. 2). By swapping the discovered probabilistic components between a pair of objects across particle systems, we can recombine the trajectories framed in green with a pair of edge potentials from the trajectories framed in red (Col. 3). Figure adapted from (Comas et al., 2023).
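A schematic version of this pairwise factorization (our sketch; pair_encoder and eps_model are hypothetical stand-ins for the learned interaction encoder Encij and a conditional denoiser), again assuming a diffusion-style parameterization where the product over pairs becomes a sum of per-pair noise predictions:

import torch

def pairwise_eps(eps_model, pair_encoder, traj_t, t, traj_clean, n_particles):
    # traj_t: noised trajectory; traj_clean: input trajectory to encode.
    eps_hat = torch.zeros_like(traj_t)
    for i in range(n_particles):
        for j in range(n_particles):
            if i == j:
                continue
            z_ij = pair_encoder(traj_clean, i, j)   # Enc_ij(τ)
            eps_hat = eps_hat + eps_model(traj_t, t, z_ij)
    return eps_hat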
Discovering Object Classes from Image Distributions. Given a distribution of images p(x) representing images drawn from different classes in ImageNet, we can model the likelihood of the distribution as a composition

pθ(x) ∝ pϕ(w | x) ∏i pθi(x)^wi.

In Figure 14, we illustrate that the discovered components in this setting represent each of the original ImageNet classes in the input distribution of images. We further illustrate how these discovered components can be composed together to generate images with multiple classes of objects.

Figure 14. Discovering Image Classes. Given a distribution of images drawn from 5 image classes in ImageNet, discovered components correspond to each image class. Components can further be composed together (e.g., p3(x)p5(x) or p4(x)p5(x)) to form new images. Figure adapted from (Liu et al., 2023).

5. Implementing Compositional Generation

In this section, we discuss some challenges with implementing compositional sampling under common generative model parameterizations, and discuss a generative model parameterization that enables effective compositional generation. We then present some practical implementations of compositional sampling in both continuous and discrete domains.

5.1. Challenges With Sampling from Compositional Distributions

Given two probability densities p1(x) and p2(x), it is often difficult to directly sample from the product density p1(x)p2(x).
Existing generative models typically represent probability distributions in a factorized manner to enable efficient learning and sampling, such as at the token level in autoregressive models (Van Den Oord et al., 2016) or across various noise levels in diffusion models (Sohl-Dickstein et al., 2015). However, depending on the form of the factorization, the models may not be straightforward to compose.

For instance, consider two learned autoregressive factorizations p1(xi | x0:i−1) and p2(xi | x0:i−1) over sequences x0:T. The autoregressive factorization of the product distribution pproduct(x) ∝ p1(x)p2(x) corresponds to

pproduct(xi | x0:i−1) = Σ_{xi+1:T} p1(xi+1:T | x0:i) p1(xi | x0:i−1) p2(xi+1:T | x0:i) p2(xi | x0:i−1),

where we need to marginalize over all possible future values of xi+1:T. Since this marginalization depends on the value of xi, pproduct(xi | x0:i−1) is not equivalent to p1(xi | x0:i−1) p2(xi | x0:i−1), and therefore autoregressive factorizations are not directly compositional. Similarly, two learned score functions from diffusion models are not directly composable, as they do not correspond to the noisy gradient of the product distribution (Du et al., 2023).
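This failure of naive autoregressive composition can be checked numerically. In the following toy example (ours), sequences consist of two binary tokens; the first-token marginal of the product distribution differs from the normalized product of the two models' first-token marginals, because the future token must be marginalized out:

import numpy as np

rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(4)).reshape(2, 2)   # p1(x0, x1)
p2 = rng.dirichlet(np.ones(4)).reshape(2, 2)   # p2(x0, x1)

prod = p1 * p2
prod /= prod.sum()                              # p_product(x0, x1)

true_first = prod.sum(axis=1)                   # p_product(x0)
naive_first = p1.sum(axis=1) * p2.sum(axis=1)   # p1(x0) p2(x0), unnormalized
naive_first /= naive_first.sum()
print(true_first, naive_first)                  # differ in general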
While it is often difficult to combine generative models, representing the probability density explicitly enables us to combine models by manipulating the density. One such approach is to represent the probability density as an Energy-Based Model (EBM), pi(x) ∝ e^(−Ei(x)) (Hinton, 2002; Du & Mordatch, 2019). Under this factorization, by definition, we can construct the product density corresponding to

e^(−(E1(x)+E2(x))) ∝ e^(−E1(x)) e^(−E2(x)),    (2)

corresponding to a new EBM E1(x) + E2(x). It is important to observe that EBMs generally represent probability densities in an unnormalized manner, and the product of two normalized probability densities p1(x) and p2(x) will be an unnormalized probability density as well (where the normalization constant is intractable to compute, as it requires marginalization over the sample space). Additional operations between probability densities, such as mixtures and inversions of distributions, can also be expressed as combinations of energy functions (Du et al., 2020a).

To generate samples from any EBM distribution, it is necessary to run Markov Chain Monte Carlo (MCMC) to iteratively refine a starting sample into one that has high likelihood (low energy) under the EBM. We present practical MCMC algorithms for sampling from composed EBM distributions in continuous spaces in Section 5.2 and in discrete spaces in Section 5.3. Recently, new methods have also been developed for implementing compositional sampling using separately trained classifiers to efficiently specify each conditioned factor (Garipov et al., 2023), which we encourage the reader to read as well.

5.2. Effective Compositional Sampling on Continuous Distributions

Given a composed distribution represented as an EBM E(x) defined over inputs x ∈ R^D, directly finding a low-energy sample through MCMC becomes increasingly inefficient as the data dimension D rises. To more effectively find low-energy samples in EBMs in high-dimensional continuous spaces, we can use the gradient of the energy function to help guide sampling. In Du & Mordatch (2019), Langevin dynamics is used to implement efficient sampling, where a sample is repeatedly optimized using the expression

xt = xt−1 − λ∇x E(x) + ϵ,    ϵ ∼ N(0, σ),

where x0 is initialized from uniform noise. By converting different operations such as products, mixtures, and inversions of probability distributions into composite energy functions, the above sampling procedure allows us to effectively sample from composed distributions (Du et al., 2020a).

There has been a substantial body of recent work on improving learning in EBMs (Du & Mordatch, 2019; Nijkamp et al., 2019; Grathwohl et al., 2019; Du et al., 2020b; Grathwohl et al., 2021), but EBMs still lag behind other generative approaches in the efficiency and scalability of training. By leveraging the close connection of diffusion models with EBMs (Song & Ermon, 2019), we can also directly implement the compositional operations of EBMs with diffusion models (Du et al., 2023), which we briefly describe below.

Given a diffusion model representing a distribution p(x), we can interpret the T learned denoising functions ϵ(x, t) of the diffusion model as representing T separate EBM distributions e^(−E(x,t)), where ∇x E(x, t) = ϵ(x, t). This sequence of EBM distributions transitions from e^(−E(x,T)), representing the Gaussian distribution N(0, 1), to e^(−E(x,0)), representing the target distribution p(x). We can draw samples from this sequence of EBMs using annealed importance sampling (Du et al., 2023), where we initialize a sample from Gaussian noise and sequentially run several steps of MCMC on each EBM distribution, starting at e^(−E(x,T)) and ending at e^(−E(x,0)).

This EBM interpretation of diffusion models allows them to be composed using operations such as Equation 2 by applying the operation to each intermediate EBM corresponding to the component diffusion distributions, for instance e^(−(E1(x,k)+E2(x,k))). We can then use an annealed importance sampling procedure on this sequence of composite EBMs. Note that this annealed importance procedure is necessary for accurate compositional sampling – using the reverse diffusion process directly on the composed score does not sample from the composed distribution (Du et al., 2023).
the reader to also read. A variety of different MCMC samplers such as ULA,
MALA, U-HMC, and HMC can be used as intermediate MCMC samplers for this sequence of EBM distributions. One easy-to-implement and easy-to-understand MCMC transition kernel is the standard diffusion reverse sampling kernel at a fixed noise level. We illustrate in Appendix A that this is equivalent to running a ULA MCMC sampling step. This allows compositional sampling in diffusion models to be easily implemented by simply constructing the score function corresponding to the composite distribution we wish to sample from and then using the standard diffusion sampling procedure, but with the diffusion reverse step applied multiple times at each noise level.
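A sketch of the resulting procedure (our illustration, assuming an ε-prediction DDPM parameterization with a betas schedule; eps1 and eps2 are two hypothetical trained noise predictors whose product distribution we wish to sample):

import torch

@torch.no_grad()
def composed_annealed_sample(eps1, eps2, shape, betas, mcmc_steps=4):
    # betas: 1-D tensor of noise levels; eps1/eps2: noise predictors.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        # Repeated reverse steps at a fixed level act as ULA-style MCMC on
        # the composite EBM e^(-(E1(x,t)+E2(x,t))) (see Appendix A).
        for _ in range(mcmc_steps):
            eps = eps1(x, t) + eps2(x, t)   # composed prediction (Equation 2)
            mean = (x - betas[t] * eps / (1 - alpha_bars[t]).sqrt()) \
                / alphas[t].sqrt()
            x = mean + (betas[t].sqrt() * torch.randn_like(x) if t > 0 else 0)
    return x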
Figure 15. Decentralized Decision Making. By composing generative models operating over various modalities (audio, image, video, language, memory, and action models), we can construct decentralized architectures for intelligent agents. Communication between models is induced by inference over the joint distribution.

5.3. Effective Compositional Sampling on Discrete Distributions

Given an EBM representing a composed distribution E(x) on a high-dimensional discrete landscape, we can use Gibbs sampling to sample from the resultant distribution, where we repeatedly resample values of individual dimensions of x using the energy function E(xi | x−i). However, this process becomes increasingly inefficient as the underlying dimensionality of the data increases.

Using the gradient of the energy function E(x) to accelerate sampling in the discrete landscape is difficult, as the gradient operation is not well defined in discrete space (though there are promising discrete analogs of gradient samplers (Grathwohl et al., 2021)). However, we can leverage our learned generative distributions to accelerate sampling, by using one generative model as a proposal distribution and the remaining energy functions to implement a Metropolis-Hastings step (Li et al., 2022; Verkuil et al., 2022).

As an example, to sample from an energy function E(x) = E1(x) + E2(x), given an initial MCMC sample xt, we can draw a new sample xt+1 by sampling from the learned distribution e^(−E1(x)), and accept the new sample xt+1 with Metropolis acceptance rate

a(xt+1) = clip(e^(E2(xt)−E2(xt+1)), 0, 1).
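A minimal sketch of this sampler (our toy version; sample_from_p1 and energy2 are hypothetical stand-ins for the learned proposal distribution e^(−E1(x)) and the energy function E2):

import math, random

def mh_step(x_t, sample_from_p1, energy2):
    x_prop = sample_from_p1()                   # proposal ~ e^(-E1(x))
    # min(1, ·) implements the clip above; the exponential is always > 0.
    accept = min(1.0, math.exp(energy2(x_t) - energy2(x_prop)))
    return x_prop if random.random() < accept else x_t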
6. Discussion and Future Directions

Most recent research on building generative models has focused on increasing the computational scale and data on which models are trained. We have presented an orthogonal direction to constructing complex generative systems: building systems compositionally, combining simpler generative models to form more complex ones. We have illustrated how this can be more data- and computation-efficient to learn, enable flexible reprogramming, and how such components can be discovered from raw data.

Such compositional systems have additional benefits in terms of both buildability and interpretability. As individual models are responsible for independent subsets of data, each model can be built separately and modularly by different institutions. Simultaneously, at execution time, it is significantly easier to understand and monitor the execution of each simpler constituent model than a single large monolithic model.

In addition, such compositional systems can be more environmentally friendly and easier to deploy than large monolithic models. As individual models are substantially smaller, they can run efficiently using small amounts of computation. Simultaneously, it is more straightforward to deploy separate models across separate computational machines.

In the setting of constructing an artificially intelligent agent, such a compositional architecture may look like the decentralized decision-making system in Figure 15. In this system, separate generative models are responsible for processing each modality an agent receives, with other models responsible for decision-making. Sampling from the composed generative distribution of the models corresponds to message passing between models, inducing cross-communication between models similar to a set of daemons communicating with each other (Selfridge, 1988). Individual generative models in this architecture can be substituted with existing models, such as LLMs for proposing plausible plans for actions and text-to-video models for presenting future world states.

Finally, while we have provided a few promising results on applications of compositional generative modeling, there remains much to be explored. Future lines of work in this area include: (1) How do we pass messages and sample from joint distributions? (2) How can we discover the compositional structure inside a generative system? (3) How can we dynamically modify and discover the correct compositional structure of generative models given distribution shift?
Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.

Liu, N., Du, Y., Li, S., Tenenbaum, J. B., and Torralba, A. Unsupervised compositional concepts discovery with text-to-image generative models. arXiv preprint arXiv:2306.05357, 2023.

Lu, S., Bigoulaeva, I., Sachdeva, R., Madabushi, H. T., and Gurevych, I. Are emergent abilities in large language models just in-context learning? arXiv preprint arXiv:2309.01809, 2023.

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, S., and Rajeswaran, A. OpenEQA: Embodied question answering in the era of foundation models. 2023.

Murphy, K. P. Probabilistic machine learning: An introduction. MIT Press, 2022.

Nijkamp, E., Hill, M., Han, T., Zhu, S.-C., and Wu, Y. N. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.

OpenAI. URL https://openai.com/pricing.

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2022.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.

Su, J., Liu, N., Tenenbaum, J. B., and Du, Y. Compositional image decomposition with diffusion models, 2024. URL https://openreview.net/forum?id=88FcNOwNvM.

Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.

Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. PMLR, 2016.

Verkuil, R., Kabeli, O., Du, Y., Wicky, B. I., Milles, L. F., Dauparas, J., Baker, D., Ovchinnikov, S., Sercu, T., and Rives, A. Language models generalize beyond natural proteins. bioRxiv, pp. 2022–12, 2022.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Yadlowsky, S., Doshi, L., and Tripuraneni, N. Pretraining data mixtures enable narrow model selection capabilities in transformer models. arXiv preprint arXiv:2311.00871, 2023.

Yang, M., Du, Y., Dai, B., Schuurmans, D., Tenenbaum, J. B., and Abbeel, P. Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023a.

Yang, Z., Mao, J., Du, Y., Wu, J., Tenenbaum, J. B., Lozano-Pérez, T., and Kaelbling, L. P. Compositional diffusion-based continuous constraint solvers. arXiv preprint arXiv:2309.00966, 2023b.

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
Appendix

The ULA sampler draws an MCMC sample from the EBM probability distribution pt(x) using the expression

xt+1 = xt − η∇x Et(xt) + √(2η) ξ,    ξ ∼ N(0, 1),    (A2)

where η is the step size of sampling.
By substituting η = βt into the ULA sampler, the sampler becomes

xt+1 = xt − βt ∇x Et(xt) + √(2βt) ξ,    ξ ∼ N(0, 1).    (A3)

Note the similarity of ULA sampling in Eqn A3 to the reverse sampling procedure in Eqn A1, where there is an extra factor of √2 scaling the added Gaussian noise in the ULA sampling procedure. This means that we can implement ULA sampling by running the standard reverse process, but scaling the noise added in each timestep by a factor of √2. Alternatively, we can directly use the reverse sampling procedure in Eqn A1 to run ULA, where this then corresponds to sampling a tempered variant of pt(x) with temperature 1/√2 (corresponding to less stochastic samples from the composed probability distribution).
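A minimal sketch of this equivalence (our illustration, assuming the standard ε-prediction DDPM reverse kernel for Eqn A1; eps_model and the schedule tensors are hypothetical): one ULA step at noise level t is the usual reverse step with its injected noise scaled by √2:

import torch

@torch.no_grad()
def ula_reverse_step(x, eps_model, t, betas, alphas, alpha_bars):
    # Standard DDPM reverse mean at level t (assumed form of Eqn A1) ...
    eps = eps_model(x, t)
    mean = (x - betas[t] * eps / (1 - alpha_bars[t]).sqrt()) / alphas[t].sqrt()
    # ... but with the injected noise scaled by sqrt(2), giving one ULA
    # step on the EBM e^(-E(x, t)) instead of a step down the noise levels.
    return mean + (2 * betas[t]).sqrt() * torch.randn_like(x)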