Diffusion Models & Representation Learning
Abstract—Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can
be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey
explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential
aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches
related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned
from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and
self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy connecting
diffusion models and representation learning, identifying key open issues and areas for further exploration. GitHub link:
https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.
Index Terms—deep generative modeling, diffusion models, denoising diffusion models, score-based models, image generation,
representation learning.
1 INTRODUCTION
Diffusion Models [68, 151, 154] have recently emerged as
the state-of-the-art of generative modeling, demonstrating
remarkable results in image synthesis [43, 67, 68, 141] and
across other modalities including natural language [9, 70, 77,
101], computational chemistry [6, 71] and audio synthesis
[80, 92, 109]. The remarkable generative capabilities of Diffusion Models suggest that they learn both low- and high-level features of their input data, potentially making them well-suited for general representation learning.
Unlike other generative models like Generative Adversarial
Networks (GANs) [22, 53, 84] and Variational Autoencoders
(VAEs) [88, 137], diffusion models do not contain fixed archi-
tectural components that capture data representations [124].
This makes diffusion model-based representation learning
challenging. Nevertheless, approaches leveraging diffusion
models for representation learning have seen increasing
interest, simultaneously driven by advancements in training and sampling of Diffusion Models.

Current state-of-the-art self-supervised representation learning approaches [8, 24, 33, 55] have demonstrated great scalability. It is thus likely that diffusion models exhibit similar scaling properties [159]. Controlled generation approaches like Classifier Guidance [43] and Classifier-free Guidance [67] used to obtain state-of-the-art generation results rely on annotated data, which represents a bottleneck for scaling up diffusion models. Guidance approaches that leverage representation learning and that are thus annotation-free offer a solution, potentially enabling diffusion models to train on much larger, annotation-free datasets.

This survey paper aims to elucidate the relationship and interplay between diffusion models and representation learning. We highlight two central perspectives: using diffusion models themselves for representation learning and using representation learning for improving diffusion models. We introduce a taxonomy of current approaches and derive generalized frameworks that demonstrate commonalities among current approaches.

Fig. 1. Shows yearly numbers of both published and preprint papers on diffusion models and representation learning. For 2024, the green bar indicates the number of papers collected up to and including June 2024, and the dashed grey bar indicates the projected number for the whole year.

• Michael Fuest is a Master's student at the Technical University of Munich. Pingchuan Ma, Ming Gui, and Johannes Schusterbauer are PhD students at LMU Munich. Vincent Tao Hu, a PostDoc from LMU Munich, is also the corresponding author. E-mail: [email protected]
• Björn Ommer is a full professor at LMU where he heads the Computer Vision & Learning Group (previously Computer Vision Group Heidelberg).

Interest in exploring the representation learning capabilities of diffusion models has been growing since the original formulation of diffusion models by Ho et al. [68], Sohl-Dickstein et al. [151], Song et al. [154]. As demonstrated
Fig. 2. Left: Shows qualitative generation results from diffusion models conditioned using self-supervised guidance signals. Right: Shows qualitative
results of downstream image tasks that leverage representations learned in training diffusion models. Adapted from Li et al. [100], Hu et al. [73],
Pan et al. [130], Baranchuk et al. [15], Yang and Wang [173].
[68] suggest fixing the covariance Σθ(xt, t) to a constant value, which enables rewriting the parametrized reverse mean as a function of the added noise ϵ(xt, t) instead of x0:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1 - \alpha_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t). \qquad (6)

This reparametrization allows for the derivation of a simplification of the objective Lvlb, which we denote Lsimple, that measures the distance between the predicted noise ϵθ(xt, t) and the actual noise ϵt as follows:

L_{\text{simple}} = \mathbb{E}_{t \sim [1,T]}\, \mathbb{E}_{x_0 \sim p(x_0)}\, \mathbb{E}_{\epsilon_t \sim \mathcal{N}(0,I)} \left[ \lVert \epsilon_t - \epsilon_\theta(x_t, t) \rVert^2 \right]. \qquad (7)

Instead of predicting the mean and covariance directly, the network is now parametrized to predict the added noise for a diffusion timestep and noisy image input. The reverse mean is obtained using Equation 6, and the covariance is fixed. Noise prediction networks have the benefit of being able to recover xt−1 from xt in the final sampling stages by predicting zero noise [79]. This is more difficult for direct parametrizations of x̂0. There is therefore a tradeoff between the two, where direct parametrizations can be more beneficial for very noisy inputs in the initial sampling stages, and noise prediction parametrization can be beneficial in the latter sampling stages [27].
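To make Equation 7 concrete, the following is a minimal PyTorch-style sketch of one training step under the noise-prediction parametrization; the denoising network eps_model, the noise schedule alpha_bar, and the tensor shapes are illustrative assumptions rather than details of any particular implementation discussed in this survey.

```python
import torch

def ddpm_training_loss(eps_model, x0, alpha_bar, T):
    """Minimal sketch of the simplified DDPM objective L_simple (Equation 7).

    eps_model: network predicting the added noise, eps_model(x_t, t)
    x0:        batch of clean images, shape (B, C, H, W)
    alpha_bar: precomputed cumulative products bar{alpha}_t, shape (T,)
    """
    B = x0.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)          # uniformly sampled diffusion step (0-indexed)
    eps = torch.randn_like(x0)                                # eps ~ N(0, I)
    a_bar = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps      # forward-diffused sample x_t
    eps_pred = eps_model(x_t, t)                              # predict the added noise
    return torch.mean((eps - eps_pred) ** 2)                  # || eps - eps_theta(x_t, t) ||^2
```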
In efforts to improve sampling efficiency, Salimans and Ho [143] introduce velocity prediction as a further alternative parametrization. Velocity is a linear combination of the denoised input and the added noise, commonly defined as:

v = ᾱt ϵ − (1 − ᾱt) xt. (8)

This parametrization combines benefits of both data and noise parametrizations, allowing the denoising network to flexibly learn noise prediction as well as reconstruction dynamics based on the signal-to-noise ratio. This parametrization has led to stable results in diffusion distillation approaches [143], and can speed up generation [19].

Recently, several works [32, 133, 153, 154] further propose to think of the noise in terms of continuous instead of discrete timesteps. Here, the diffusion process is expressed as a continuous time-dependent function σ(t). Noise is gradually added whenever a sample x moves forward in time, and gradually removed if the image follows the reverse trajectory. More specifically, the diffusion process can be expressed using an Itô Stochastic Differential Equation (SDE) [83], where the vector-valued drift coefficient f(·, t): R^d → R^d and the scalar-valued diffusion coefficient g(·): R → R need to be selected when implementing a diffusion model:

dx = f(x, t)dt + g(t)dw, (9)

where w is the standard Wiener process. There are two widely used choices of the SDE formulation used to model the diffusion process. The first is the Variance-Preserving (VP) SDE, used in the work of Ho et al. [68], which is given by f(x, t) = −½β(t)x and g(t) = √β(t), where β(t) = βt as T goes to infinity. Note that this is equivalent to the continuous formulation of the DDPM parametrization in Equation 1. The second is the Variance-Exploding (VE) SDE [153], resulting from a choice of f(x, t) = 0 and g(t) = √(2σ(t) dσ(t)/dt). The VE SDE gets its name since the variance continually increases with increasing t, whereas the variance in the VP SDE is bounded [154]. Anderson [7] derives an SDE that reverses a diffusion process, which results in the following when applied to the Variance Exploding SDE:

dx = -2\sigma(t)\,\frac{d\sigma(t)}{dt}\, \nabla_x \log p(x; \sigma(t))\, dt + \sqrt{2\sigma(t)\,\frac{d\sigma(t)}{dt}}\; dw. \qquad (10)

∇x log p(x; σ(t)) is known as the score function. This score function is generally not known, so it needs to be approximated using a neural network. A neural network D(x; σ) that minimizes the L2-denoising error can be used to extract the score function, since ∇x log p(x; σ(t)) = (D(x; σ) − x)/σ². This idea is known as Denoising Score Matching [161].
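The relation between an L2-optimal denoiser and the score function can be written down compactly; below is a hedged PyTorch-style sketch, where denoiser is an assumed pre-trained network D(x; σ) following the VE convention above, and the second function is a probability-flow-style Euler update added purely for illustration rather than the exact reverse SDE of Equation 10.

```python
import torch

def score_from_denoiser(denoiser, x, sigma):
    """Estimate the score of the noise-perturbed density (VE formulation).

    Implements nabla_x log p(x; sigma) = (D(x; sigma) - x) / sigma^2,
    the Denoising Score Matching relation quoted above.
    """
    denoised = denoiser(x, sigma)
    return (denoised - x) / (sigma ** 2)

def euler_reverse_step(denoiser, x, sigma, sigma_next):
    """One Euler step of a probability-flow-style reverse process (illustrative).

    The sample moves toward the denoised prediction as sigma shrinks;
    step-size handling and noise injection are deliberately omitted.
    """
    score = score_from_denoiser(denoiser, x, sigma)
    d = -sigma * score                      # dx/dsigma for the deterministic flow
    return x + d * (sigma_next - sigma)     # Euler update from sigma to sigma_next
```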
Fig. 3. Left: An exemplary visualization of the U-Net architecture [140], consisting of an encoder and a decoder with residual connections that preserve gradient flow and low-level input details. Adapted from [135]. Right: An exemplary visualization of the DiT architecture, showing the high-level architecture as well as a breakdown of the adaLN-Zero DiT block. Adapted from Peebles and Xie [132].
to regulate the strength of the conditioning signal within the model. Diffusion model guidance has recently emerged as an approach to more precisely trade off generation quality and diversity.

Dhariwal and Nichol [42] use classifier guidance, a compute-efficient method leveraging a pre-trained noise-robust classifier to improve sample quality. Classifier guidance is based on the observation that a pre-trained diffusion model can be conditioned using the gradients of a classifier parametrized by ϕ outputting pϕ(c|xt, t). The gradients of the log-likelihood of this classifier, ∇xt log pϕ(c|xt, t), can be used to guide the diffusion process towards generating an image belonging to class label y. The score estimator for p(x|c) can be written as

∇xt log (pθ(xt) pϕ(c|xt)) = ∇xt log pθ(xt) + ∇xt log pϕ(c|xt). (11)

By using Bayes' theorem, the noise prediction network can then be rewritten to estimate:

ϵ̂θ(xt, c) = ϵθ(xt, c) − wσt ∇xt log pϕ(c|xt), (12)

where the parameter w modulates the strength of the conditioning signal. Classifier guidance is a versatile approach that increases sample quality, but it is heavily reliant on the availability of a noise-robust pre-trained classifier, which in turn relies on the availability of annotated data, which is not available in many applications.

To address this limitation, Classifier-free guidance (CFG) [67] eliminates the need for a pre-trained classifier. CFG works by training an unconditional diffusion model parametrized by ϵθ(xt, t, ϕ) together with a conditional model parametrized by ϵθ(xt, t, c). For the unconditional model, a null input token ϕ is used as a conditioning signal c. The network is trained by randomly dropping out the conditioning signal with probability puncond. Sampling is then performed using a weighted combination of conditional and unconditional score estimates:

ϵ̃θ(xt, c) = (1 + w)ϵθ(xt, c) − wϵθ(xt, ϕ). (13)

This sampling method does not rely on the gradients of a pre-trained classifier but still requires an annotated dataset to train the conditional denoising network. Fully unconditional approaches have yet to match classifier-free guidance, though recent works using diffusion model representations for self-supervised guidance show promise [73, 100]. These methods do not need annotated data, allowing the use of larger unlabelled datasets.
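As a concrete illustration of Equation 13, the following is a minimal sketch of classifier-free guidance at sampling time; eps_model, the null-token convention, and the guidance weight are illustrative assumptions rather than a specific implementation.

```python
import torch

def classifier_free_guidance_eps(eps_model, x_t, t, cond, null_cond, w=2.0):
    """Combine conditional and unconditional noise predictions (Equation 13).

    eps_model: denoising network epsilon_theta(x_t, t, c)
    cond:      conditioning signal (e.g. class embedding or cluster assignment)
    null_cond: the "empty" conditioning token used for the unconditional model
    w:         guidance weight; w = 0 recovers purely conditional sampling
    """
    eps_cond = eps_model(x_t, t, cond)         # epsilon_theta(x_t, c)
    eps_uncond = eps_model(x_t, t, null_cond)  # epsilon_theta(x_t, phi)
    return (1.0 + w) * eps_cond - w * eps_uncond
```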
Table 1 shows the requirements of current guidance methods. While classifier and classifier-free guidance improve generation results, they require annotated training data. Self-guidance and online guidance are fully self-supervised alternatives that achieve competitive performance without annotations.

Classifier and classifier-free guidance are controlled generation methods that rely on conditional training. Training-free approaches modify the generation process of a pre-trained model by binding multiple diffusion processes [14] or using time-independent energy functions [179]. Other controlled generation methods take a variational perspective [54, 119, 146, 164], treating controlled generation as a source point optimization problem [17]. The goal is to find samples x that minimize a loss function L(x) and are likely under the model's distribution p. The optimization is formulated as min_{x0} L(x), where x0 is the source noise point. The loss function L(x) can be modified for conditional sampling to generate a sample belonging to a particular class y.

3 METHODS

Having covered the main preliminaries for diffusion models, we outline a series of methods related to diffusion models and representation learning in the following section. In subsection 3.1 we describe and categorize current frameworks utilizing representations learned by pre-trained diffusion models for downstream recognition tasks. In subsection 3.2, we describe methods that leverage advances in representation learning to improve diffusion models themselves.

3.1 Diffusion Models for Representation Learning

Learning useful representations is one of the main motivations for designing architectures like VAEs [88, 89] and GANs [22, 84]. Contrastive learning approaches, where the goal is to learn a feature space in which representations of similar images are very close together, and vice versa for dissimilar images (e.g. SimCLR [34], MoCo [60]), have also led to significant advances in representation learning. These contrastive methods are not fully self-supervised however, since they require supervision in the form of augmentations that preserve the original content of the image.

Diffusion models offer a promising alternative to these approaches. While diffusion models are primarily designed for generation tasks, the denoising process encourages the learning of semantic image representations [15] that can be used for downstream recognition tasks. The diffusion model learning process is similar to the learning process of Denoising Autoencoders (DAE) [18, 162], which are trained to reconstruct images corrupted by adding noise. The main difference is that diffusion models additionally take the diffusion timestep t as input, and can thus be viewed as multi-level DAEs with different noise scales [169]. Since DAEs learn meaningful representations in the compressed latent space, it is intuitive that diffusion models exhibit similar representation learning capabilities. We outline and discuss current approaches in the following section.

TABLE 2
Summary of the methods using diffusion models for representation learning.

3.1.1 Leveraging intermediate activations

Baranchuk et al. [15] investigate the intermediate activations from the U-Net network that approximates the Markov step of the reverse diffusion process in DDPMs [42]. They show that for certain diffusion timesteps, these intermediate activations capture semantic information that can be used for downstream semantic segmentation. The authors take a noise-predictor network ϵθ(xt, t) trained on the LSUN-Horse [177] and FFHQ-256 [84] datasets and extract feature maps produced by one of the network's 18 decoder blocks for label-efficient downstream segmentation tasks. Selecting the ideal diffusion timestep and decoder block activation to extract is non-trivial. To understand the efficacy of pixel-level representations of different decoder blocks, the authors train a multi-layer perceptron (MLP) to predict the semantic label from features produced by different decoder blocks on a specific diffusion step t.
The representations from a fixed set of blocks B of the pre-trained U-Net decoder and higher diffusion timesteps are upsampled to the image size using bilinear interpolation and concatenated. The obtained feature vectors are then used to train an ensemble of independent MLPs which predict a semantic label for each pixel. The final prediction is obtained by majority voting. This method, denoted DDPM-Seg, outperforms baselines that exploit alternative generative models and achieves segmentation results competitive with MAE [61], illustrating that intermediate denoising network activations contain semantic image features.
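A minimal sketch of this extraction pipeline is given below; the hook-based feature grabbing, the block indices, and the classifier head dimensions are illustrative assumptions rather than the exact DDPM-Seg implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def extract_decoder_features(unet, x0, t, alpha_bar, blocks=(5, 6, 8, 12)):
    """Noise an image to step t, run one U-Net pass, and collect decoder activations."""
    feats = []
    hooks = [unet.decoder_blocks[b].register_forward_hook(          # decoder_blocks is an assumed attribute
        lambda m, i, o: feats.append(o)) for b in blocks]
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    with torch.no_grad():
        unet(x_t, torch.full((x0.shape[0],), t, device=x0.device))
    for h in hooks:
        h.remove()
    # Upsample every feature map to the input resolution and concatenate channel-wise.
    feats = [F.interpolate(f, size=x0.shape[-2:], mode="bilinear") for f in feats]
    return torch.cat(feats, dim=1)

# Pixel-wise MLP head trained on the concatenated features (labels required only here).
# 2816 and 21 are example channel/class counts, not values from the original paper.
pixel_classifier = nn.Sequential(nn.Linear(2816, 256), nn.ReLU(), nn.Linear(256, 21))
```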
Xiang et al. [169] extend this approach to further architectures and image recognition on CIFAR-10 and Tiny-ImageNet. They investigate the discriminative efficacy of extracted features for different backbones (U-Net and DiT [132]) under different frameworks (DDPM and EDM [85]). The relationship between feature quality and layer-noise combinations is evaluated through grid search, where the quality of feature representations is determined using linear probing. The best-performing features lie in the middle of up-sampling using relatively small noising levels, which is in line with conclusions drawn in DDPM-Seg [15]. Benchmark comparisons against diffusion-based methods like HybViT [174] and SBGC [190] on CIFAR-10 and Tiny-ImageNet [41] show that EDM-based Denoising Diffusion Autoencoders (DDAEs) outperform previous supervised and unsupervised diffusion-based methods on both generation and recognition, especially after fine-tuning. Benchmarking against contrastive learning methods shows that the EDM-based DDAE is comparable with SimCLR when model sizes are taken into account, and outperforms SimCLR variants with comparable parameter counts on CIFAR-10 and Tiny-ImageNet.

ODISE [170] is a related approach that unites text-to-image diffusion models with discriminative models to perform panoptic segmentation [90, 91], a segmentation approach unifying instance and semantic segmentation into a common framework for comprehensive scene understanding. ODISE extracts the internal features of a pre-trained text-to-image diffusion model. These features are input to a mask generator trained on annotated masks. A mask classification module then categorizes each generated binary mask into an open-vocabulary category by relating the predicted mask's diffusion features with text embeddings of object category names. The authors use the Stable Diffusion U-Net DDPM backbone and extract features by computing a single forward pass and extracting the intermediate activations f = UNet(xt, τ(s), t), where τ(s) is an encoded representation of the image caption s obtained by leveraging a pre-trained text encoder τ.
Interestingly, the authors obtain the best results using t = 0, whereas previous methods obtain better results using higher diffusion timesteps. To overcome reliance on available image captions, Xu et al. [170] additionally train an MLP-based implicit captioner that computes an implicit text embedding from the image itself. ODISE establishes a new state-of-the-art in open-vocabulary segmentation and is a further example of the rich semantic representations learned by denoising diffusion models.

Mukhopadhyay et al. [125] also propose leveraging intermediate activations from the unconditional ADM U-Net architecture [42] for ImageNet classification. The methodology for layer and timestep selection is similar to previous approaches. Additionally, the impact of different sizes for feature map pooling is evaluated and several different lightweight architectures for classification (including linear, MLP, CNN, and attention-based classification heads) are used. Feature quality is found to be mostly insensitive to pooling size, and is mostly dependent on time steps and the selected block number. Their approach, which we term guided diffusion classification (GDC), achieves competitive performance against other unified models, namely BigBiGAN [44] and MAGE [99]. The attention-based classification heads perform best on ImageNet-50, but perform poorly on Fine-Grained Visual Classification datasets, indicating their reliance on a large amount of available data.

In a continuation of their previous work, Mukhopadhyay et al. [126] extend this approach by introducing two methods for more fine-grained block and denoising time step selection. The first is DifFormer [126], an attention mechanism replacing the fixed pooling and linear classification head from [125] with an attention-based feature fusion head. This fusion head is designed to replace the fixed flattening and pooling operation required to generate vector feature representations from the U-Net CNN used in the GDC approach with a learnable pooling mechanism. The second mechanism is DifFeed [126], a dynamic feedback mechanism that decouples the feature extraction process into two forward passes. In the first forward pass, only the selected decoder feature maps are stored. These are fed to an auxiliary feedback network that learns to map decoder features to a feature space suitable for adding them to the corresponding encoder blocks. In the second forward pass, the feedback features are added to the encoder features, and the DifFeed attention head is used on top of those second forward pass features. These additional improvements further increase the quality of learned representations and improve ImageNet and fine-grained visual classification performance.

The previously described diffusion representation learning methods focus on segmentation and classification, which are only a subset of downstream recognition tasks. Correspondence tasks are another subset that generally involves identifying and matching points or features between different images. The problem setting is as follows: Consider two images I1 and I2 and a pixel location p1 in I1. A correspondence task involves finding the corresponding pixel location p2 in I2. The relationship between p1 and p2 can be semantic (pixels that contain similar semantics), geometrical (pixels that contain different views of an object) or temporal (pixels that contain the same object deforming over time). DIFT (Diffusion Features) [157] is an approach leveraging pre-trained diffusion model representations for correspondence tasks. DIFT also relies on extracting diffusion model features. Similarly to previous approaches, the diffusion timestep and network layer numbers used for extraction are an important consideration. The authors observe more semantically meaningful features for large diffusion timesteps and earlier network layer combinations, whereas lower-level features are captured in smaller diffusion timesteps and later denoising network layers. DIFT is shown to outperform other self-supervised and weakly-supervised methods across a range of correspondence tasks, showing on-par performance with state-of-the-art methods on semantic correspondence specifically.
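To make the correspondence use of such features concrete, here is a hedged sketch of DIFT-style matching via cosine similarity between extracted feature maps; the feature maps are assumed to come from a frozen diffusion model (e.g. via a helper like the extraction sketch above), and the coordinate handling is illustrative.

```python
import torch
import torch.nn.functional as F

def match_point(feat_a, feat_b, p_a):
    """Find the pixel in image B whose diffusion feature is closest to the feature at p_a in image A.

    feat_a, feat_b: feature maps of shape (C, H, W) extracted from a frozen diffusion model
    p_a:            (row, col) query location in image A, in feature-map coordinates
    """
    C, H, W = feat_b.shape
    query = F.normalize(feat_a[:, p_a[0], p_a[1]], dim=0)   # (C,)
    keys = F.normalize(feat_b.reshape(C, -1), dim=0)          # (C, H*W), unit-norm per pixel
    sim = query @ keys                                        # cosine similarities
    idx = int(sim.argmax())
    return divmod(idx, W)                                     # (row, col) in image B
```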
Zhang et al. [183] evaluate how learned diffusion features relate across multiple images, instead of focusing on downstream tasks for single images. To investigate this, they employ Stable Diffusion features for semantic correspondence as well. The authors observe that Stable Diffusion features have a strong sense of spatial layout, but sometimes provide inaccurate semantic matches. DINOv2 [128], a method for self-supervised representation learning using knowledge distillation and vision transformers, produces sparser features that provide more accurate matches. Zhang et al. [183] therefore propose to combine the two features and employ zero-shot evaluation of nearest neighbor search on the combined features to achieve state-of-the-art performance on several semantic correspondence datasets like SPair-71k and TSS.

SD4Match [103] builds on this approach by using various prompt tuning and conditioning techniques. One method, SD4Match-Class, fine-tunes a prompt embedding Θ for each semantic class using a semantic matching loss [102]. Given images I_t^A and I_t^B, the Stable Diffusion U-Net f(·) extracts feature maps F_t^A and F_t^B by F_t = f(I_t, t, Θ). Correspondence points are predicted by normalizing feature maps and computing a correlation map, which is converted to a probability distribution using a softmax operation. Additionally, Li et al. [103] propose conditioning prompts on input images using a Conditional Prompting Module (CPM), which includes a DINOv2 feature extractor, linear layers, and an adaptive MaxPooling layer. The conditioning embedding Θcond is formed by concatenating feature representations and projecting them to the prompt embedding dimension. The final prompt ΘAB is obtained by appending Θcond to a global prompt Θglobal. This method sets new benchmark accuracies on SPair-71k [122], PF-Willow, and PF-Pascal [59], surpassing methods like DIFT [157] and SD+DINO [183].

Luo et al. [116] introduce Diffusion Hyperfeatures, a framework designed to consolidate multiple intermediate activation maps across diffusion timesteps for downstream recognition. Activations are consolidated using an interpretable aggregation network that takes the collection of intermediate feature maps as input and produces a single descriptive feature map as output. While other approaches manually select fixed diffusion timesteps and activations from a pre-determined number of intermediate network layers, Diffusion Hyperfeatures caches all feature maps across all layers and timesteps in the diffusion process
to generate a dense set of activations. This high-dimensional set of activations is upsampled, passed through a bottleneck layer B and weighed with a unique learnable mixing weight w_{l,s} for each layer and timestep combination. The final diffusion hyperfeatures take on the form

\sum_{s=0}^{S} \sum_{l=1}^{L} w_{l,s}\, B_l(r_{l,s}), \qquad (14)

where L is the number of layers, S is a subsample of the number of diffusion timesteps and r is an activation feature map. Bottleneck layers and mixing weights are finetuned on the specific downstream task. Similar to previous approaches, Diffusion Hyperfeatures is used for semantic correspondence. The authors extract activations from Stable Diffusion and tune the aggregation network on a subset of SPair-71k. Diffusion Hyperfeatures outperforms models that use self-supervised descriptors or supervised hypercolumns on the SPair-71k and CUB datasets.
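A compact sketch of the aggregation in Equation 14 might look as follows; the per-layer bottlenecks and per-(layer, timestep) mixing weights are the learnable parts, and all shapes, initializations, and the sharing of bottlenecks across timesteps are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HyperfeatureAggregator(nn.Module):
    """Learned weighted sum of cached feature maps over layers l and timesteps s (Equation 14)."""

    def __init__(self, channels_per_layer, out_channels, num_timesteps):
        super().__init__()
        L = len(channels_per_layer)
        # One bottleneck B_l per layer (shared across timesteps in this sketch).
        self.bottlenecks = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in channels_per_layer])
        # One mixing weight w_{l,s} per (timestep, layer) pair.
        self.mix = nn.Parameter(torch.full((num_timesteps, L), 1.0 / (num_timesteps * L)))

    def forward(self, feats):
        # feats[s][l]: upsampled activation r_{l,s} of shape (B, C_l, H, W)
        out = 0.0
        for s, per_layer in enumerate(feats):
            for l, r in enumerate(per_layer):
                out = out + self.mix[s, l] * self.bottlenecks[l](r)
        return out
```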
Hedlin et al. [62] focus on optimizing the prompt embeddings by exploiting intermediate attention maps specifically. Given a certain input text prompt, these attention activation maps correspond to the semantics of the prompt. Instead of optimizing a global or a class-dependent prompt embedding Θ using the semantic loss, Hedlin et al. [62] optimize the embedding to maximize the cross-attention at the location of interest. Locating corresponding points in a second image then comes down to conditioning on the optimized prompt, and selecting the point with the pixel attaining the maximum attention map value within the target image. Note that this approach does not utilize supervised training specific to semantic correspondence. However, it requires test-time optimization, which is costly. Text prompts are optimized using an off-the-shelf diffusion model without fine-tuning. Several further works building on the aforementioned approaches [120, 184] exist, showing that exploiting pre-trained diffusion models for semantic correspondence remains a promising application of diffusion models.

Zhao et al. [187] propose Visual Perception with a pre-trained Diffusion Model (VDM), a framework closely related to USCSD [62] that employs a text feature refinement network as well as an additional recognition encoder for semantic segmentation and depth estimation. Here, the denoising network is fed with refined text representations as well as an input image, and the resulting feature maps as well as the cross-attention maps between the text and image features are used to provide guidance for a decoder. To achieve this, the prediction model is written as pϕ(y|x, S), where S represents the set of all category labels of the downstream task. The prediction model is implemented as the following:

pϕ(y|x, S) = pϕ3(y|F) pϕ2(F|x, C) pϕ1(C|S), (15)

where F denotes the set of feature maps and C denotes the text features. Here, pϕ1(C|S) denotes a text adapter consisting of a two-layer MLP that refines the text features obtained by applying the CLIP text encoder to a text template of "a photo of a [CLS]". pϕ2(F|x, C) extracts the feature maps from the denoising network given the input image x and the set of refined text features C. The authors use t = 0 when feeding the denoising network the latent representation of the input image generated by using the VQGAN encoder [47] to obtain feature maps F. Finally, pϕ3(y|F) serves as a light-weight prediction head implemented as a semantic feature pyramid network [90] that is adapted to the downstream task. VDM is evaluated on semantic segmentation and depth estimation, and achieves highly competitive performance and fast convergence compared to methods with other pre-training paradigms.

A more indirect application of text-to-image diffusion model representations is instructional image editing [23, 51, 98], where the desired image edit is described by a natural language instruction rather than a description of the desired new image [81]. Prompt-based image editing is challenging since small changes in the textual prompt can lead to vastly different generation outcomes. Hertz et al. [65] propose a textual editing method for pre-trained text-conditioned diffusion models that leverages the semantic strength of the intermediate cross-attention layers in the denoising backbone. This approach is based on a key observation also employed in [62, 187]: cross-attention maps contain rich information on the spatial layout and geometry of the generated image. Injecting the cross-attention layers obtained when generating an image I into the generation process of the edited image I* ensures that the edited image preserves the original spatial layout. Hertz et al. [65] use Imagen [141] to conduct experiments and demonstrate promising results on text-only localized editing, global editing, and real image editing. Following works like Plug-and-Play Diffusion Features [160] further improve upon this by leveraging all intermediate activation maps to enable instructional image editing. Other techniques like TokenFlow [52] and work by Yatim et al. [175] have extended this idea to the video space, using diffusion features to enable prompt-based video editing and text-driven motion transfer.

3.1.2 A general representation extraction framework

Many of the methods outlined in the previous section follow a similar procedure in leveraging learned representations of pre-trained diffusion models for downstream vision tasks. In this section, we aim to consolidate these approaches into a common three-step framework. We do this to provide clarity on the relationship between diffusion models and their use for downstream predictive tasks. To leverage intermediate activations for downstream tasks, a selection methodology must be applied that outputs the ideal diffusion timestep input as well as the intermediate layer number(s) whose activation maps have the highest predictive performance when upsampled and linearly probed. This can be a trainable model [116], a grid search procedure [169] or a learning agent [173]. The goal of this methodology is generally to select a timestep t ∈ T and a set of decoder block numbers B that maximize predictive performance on a downstream task. Given a set of possible timesteps T and a set of decoder blocks B, the goal is to find:

(t*, B*) = arg min_{t ∈ T, B ⊆ B} L_discr(t, B), (16)

where L_discr(t, B) represents the discriminative loss at timestep t when the blocks in B are used for downstream prediction. Generally, discriminative tasks will require more
Fig. 4. A high-level overview of a framework for extracting representations from pre-trained diffusion models for downstream tasks.
high-level features corresponding to structural elements and shapes, whereas generative tasks mapping random noise to images will require the computation of lower-level features. The ideal intermediate layer number as well as the optimal diffusion timestep will largely depend on the exact downstream prediction task, the dataset, and the architecture of the diffusion model used.

Once the ideal timestep and layer number are determined, an input image and the selected diffusion timestep are passed to the diffusion model, and the intermediate activations in the selected decoder blocks computed in the forward pass are extracted and generally concatenated and pre-processed depending on the downstream task (e.g. through upsampling, pooling, etc.). Finally, a classification head is trained on the annotated dataset, taking the preprocessed features extracted from the diffusion model as input. This classification head can be an MLP, a CNN, or an attention-based network depending on the availability of labeled data and predictive performance on the dataset. The diffusion model weights are usually frozen in this probing process, but additional fine-tuning regimes can increase discriminative performance for certain datasets and architectures (see e.g., Xiang et al. [169]). Fig. 4 shows an overview of the generalized framework.
Aside from leveraging intermediate activations from pre- conducting a single denoising step to extract fi from the
trained diffusion models directly as inputs to a recognition intermediate layers of the U-Net backbone. The extracted
network, several recent approaches propose a more indirect features are distilled using a feature regressor module with
method of reusing learned representations for downstream a top-down architecture containing lateral skip connections
tasks. We summarize these under the term knowledge transfer that aligns the image backbone features with the generative
methods. This reflects the common idea of distilling repre- features. Intermediate CNN encoder features fle at layers
sentations from pre-trained diffusion models and then trans- l and regressor outputs flr are used to compute an MSE
ferring them to auxiliary networks in a way that is distinct feature regression loss inspired by FitNet [139]:
from simply providing aggregated feature activation maps L
as input. Several of these approaches are discussed in the 1X r 2
LMSE = ∥fl − W(fle )∥2 (17)
following section. L
l=1
Yang and Wang [173] propose RepFusion, a knowledge
W is a non-learnable operator implemented as Lay-
distillation approach that dynamically extracts intermediate
erNorm [11]. This loss is combined with the activation-
representations at different time steps using a reinforcement
based Attention Transfer (AT) objective [181], which distills
learning framework, and uses the extracted representations
a one dimensional ”attention map” for each spatial feature.
as auxiliary supervision for student networks. Given an
DreamTeacher is evaluated on a range of downstream recog-
input x with label y, the authors extract a pair of features,
nition tasks by fine-tuning the pre-trained backbone with
one from the diffusion probabilistic model (DPM) and one
additional classification heads for each task. DreamTeacher
from the student model, where z(t) is the diffusion model
outperforms existing contrastive and masking-based self-supervised methods on the COCO [106], ADE20k [189] and BDD100K [178] benchmarks.
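A minimal sketch of the feature regression objective in Equation 17 is shown below; the regressor and backbone features are stand-ins, and applying LayerNorm over the flattened spatial dimensions is one possible reading of the whitening operator W.

```python
import torch
import torch.nn.functional as F

def feature_regression_loss(regressor_feats, encoder_feats):
    """MSE between regressor outputs f_l^r and normalized backbone features f_l^e (Equation 17).

    regressor_feats, encoder_feats: lists of tensors, one per layer l, each of shape (B, C_l, H_l, W_l)
    """
    losses = []
    for f_r, f_e in zip(regressor_feats, encoder_feats):
        B, C, H, W = f_e.shape
        # Non-learnable whitening of the target features, here LayerNorm over (C, H, W).
        f_e_norm = F.layer_norm(f_e, normalized_shape=(C, H, W))
        losses.append(F.mse_loss(f_r, f_e_norm))
    return sum(losses) / len(losses)
```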
Both RepFusion and DreamTeacher are inspired by earlier works on knowledge distillation [66, 139]. Li et al. [95] propose a slightly different knowledge transfer approach: Diffusion Classifier, a method for zero-shot classification that leverages conditional density estimates from text-to-image diffusion models. This classifier converts the diffusion model into a classifier by computing class-conditional likelihoods pθ(x|ci) and using Bayes' theorem to obtain predicted class probabilities p(ci|x). Since direct computation of pθ(x|ci) is intractable, they use the Evidence Lower Bound (ELBO) in its place. The classifier is derived by adding noise repeatedly and estimating noise reconstruction losses for each class using Monte Carlo methods. While Diffusion Classifier suffers from high inference time, it generally outperforms DDPM-Seg [15] on most datasets and is competitive with CLIP ResNet-50 [136] and OpenCLIP ViT-H/14 [36].
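The core of this zero-shot classifier can be sketched as a Monte Carlo comparison of per-class denoising errors; the conditioning interface of eps_model, the class conditions, and the number of samples are assumptions, not details of the original method.

```python
import torch

@torch.no_grad()
def diffusion_classify(eps_model, x0, class_conds, alpha_bar, T, n_samples=64):
    """Pick the class whose conditioning yields the lowest expected denoising error.

    Approximates argmax_i p(c_i | x0) via an ELBO-style surrogate: the class minimizing
    E_{t, eps} || eps - eps_theta(x_t, t, c_i) ||^2 is predicted.
    """
    errors = []
    for cond in class_conds:                       # e.g. text embeddings of class prompts
        err = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, T, (1,), device=x0.device)
            eps = torch.randn_like(x0)
            a_bar = alpha_bar[t].view(1, 1, 1, 1)
            x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
            err += torch.mean((eps - eps_model(x_t, t, cond)) ** 2).item()
        errors.append(err / n_samples)
    return int(torch.tensor(errors).argmin())      # index of the predicted class
```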
Zhang et al. [186] observe that there is a gap between
3.1.4 Reconstructing diffusion models the true and the predicted posterior mean of xt−1 when
Previous diffusion representation learning techniques do predicting from xt in the diffusion reverse process. Clas-
not propose making fundamental modifications to diffu- sifier guidance can be viewed as reconstructing informa-
sion model architectures and training methodologies. While tion lost in the diffusion forward process by shifting the
these techniques often show encouraging performance for posterior mean to fill that gap. They propose Pre-trained
downstream tasks, they fail to generate deep insights into DPM AutoEncoding (PDAE), a method for adapting DPMs
the architectural components and techniques required to to decoders for image reconstruction. Instead of using a
learn useful representations. It remains largely unclear for class label y to fill this information gap, PDAE employs a
example whether the representation learning abilities of model to predict mean shift according to encoded repre-
diffusion models are driven by the diffusion process, or sentations z, ensuring that z contains as much information
by the model’s denoising capabilities. It is also unclear as possible from x0 . Specifically, Zhang et al. [186] employ
what architectural and optimization choices can improve an encoder Ephi (x0 ) = z along with a gradient estimator
diffusion models’ representation learning capabilities. Gψ (xt , z, t) that simulates ∇xt log(p(z|xt ) to modify the
Chen et al. [35] investigate these questions by decon- conditional DPM training objective. This modified objective
structing a denoising diffusion model (DDM), modifying forces the predicted mean shift to fill the aforementioned
individual model components to turn a DDM into a De- posterior mean gap. With a trained Gψ (xt , z, t), the score
noising Autoencoder. The deconstruction process consists of the implicit classifier p(z|xt ) can be used analogously to
of three stages. In the first stage, the DDM is reoriented classifier-guided sampling. PDAE is evaluated using similar
for self-supervised learning. This entails the removal of experiments as used in [134] and exhibits improved training
class conditioning and a reconstruction of the VQGAN efficiency and performance.
tokenizer [47] used in the DiT baseline. Both the perceptual Pan et al. [130] propose a different method for DDM
and adversarial loss terms rely on annotated data and are reconstruction. They introduce a masked diffusion model
thus removed. This essentially converts the VQGAN to a (MDM), designed for self-supervised semantic segmenta-
VAE. The second stage consists of simplifying the VAE tion. MDM substitutes the conventional diffusion process
tokenizer even further, replacing it with different autoen- with a masking mechanism inspired by the masked autoen-
coder variants. Surprisingly, the authors find that using coder [61]. The representations learned by the pre-trained
simpler autoencoder variants, like patch-wise PCA, does MDM are extracted following Baranchuk et al. [15]. The
not degrade performance substantially. The authors con- proposed MDM is a variant of a time-dependent denoising
clude that the dimensionality per token of the latent space autoencoder, that takes a masked input image and subse-
has a much larger impact on probing accuracy than the quently reconstructs the uncorrupted image. While other
chosen autoencoder. The final deconstruction step includes DDMs and MAE use an MSE reconstruction loss, Pan et al.
converting the DDM to predict the denoised input instead [130] propose using the structural similarity index (SSIM)
of the added noise and removing input scaling, as well as loss. This is done to narrow the gap between reconstruction
changing the diffusion model to operate directly in the pixel and subsequent segmentation tasks. MDM is pre-trained
space. This final stage results in what the authors call the on a set of unlabeled images using the described self-
latent Denoising Autoencoder (l-DAE). They conclude that supervised approach. The learned representations are then
representation learning abilities are largely driven by the extracted to train an MLP-based classification head on a
denoising-driven process rather than the diffusion process. smaller labeled dataset. Features based on specific block
l-DAE is inspired by the observation that diffusion setting B are extracted by selecting the activation maps from
models resemble hierarchical autoencoders with varying each of the specified blocks, upsampling activation maps
to match the image size, and concatenating the activations. The method achieves state-of-the-art results against existing supervised segmentation methods on multiple benchmark datasets even when only 10% of labels are available. DiffMAE [166] is a similar approach that uses a conditional generative objective, where the distribution of the masked pixels x0^m conditioned on the visible pixels x0^v is modeled, and diffusion is only applied to masked regions.

Hudson et al. [82] introduce a novel view generation learning goal as well as a bottleneck layer to aid representation learning. They present SODA, a self-supervised diffusion model that consists of an encoder and a denoising decoder. The encoder produces a concise latent representation, which is used for denoising decoder guidance by modulation of the decoder activations. The encoder E(x) converts an input view x into a compressed latent representation z, which is used to generate a novel output view x′ relating to the input x. x′ is created through a diffusion process conditioned on the latent representation z via feature modulation. In addition to this, the authors use layer modulation, where the latent representation is partitioned, with each partition zi modulating a specific pair of layer activations. This enables further specialization among the latent subvectors, where some are optimized to capture finer levels of granularity than others. During training, Hudson et al. [82] opt to randomly zero out a subset of the latent subvectors, effectively implementing a layer-wise generalization of classifier-free guidance. This further increases control over the generative process, since the trained model can then be conditioned using a curated subset of latent subvectors.

SmoothDiffusion [58] is a work focusing on improving the smoothness of the latent space of diffusion models, which refers to the consistency of perturbations in the latent and the image space. SmoothDiffusion enforces smoothness over its latent space by proposing a novel step-wise variation regularization method in training. The resulting smoothed latents benefit a wide range of image interpolation, image inversion and image editing tasks.

3.1.5 Joint diffusion models

Many current diffusion-based representation learning methods focus on using the diffusion model's latent variables to benefit the training of a separate recognition network. These frameworks are conceptually equivalent to constructing hybrid models that solely concentrate on synthesis in the pre-training stage, and on downstream recognition in the post-training/fine-tuning phase. The recognition head and the diffusion denoising network do not share a parametrization, and the recognition head is often trained separately while keeping the weights of the denoising network frozen. A natural question that arises is whether this separation is necessary and whether approaches that optimize a generative and a discriminative objective simultaneously in a shared parametrization can improve representation learning.

HybViT [174] is an approach that establishes a direct connection between diffusion models and vision transformers by training a single hybrid model for both image classification and image generation. This hybrid model uses a shared parametrization for image classification and reconstruction. The authors use a ViT backbone to train a model with a combined loss L consisting of a standard cross-entropy loss to train p(y|x) and the simple denoising loss to train p(x). HybViT provides stable training and outperforms previous hybrid models on both generative and discriminative tasks, but lags behind generative-only models in generation quality. HybViT also requires more training iterations to achieve high classification performance, and the sampling speed during inference is slow.

Joint Diffusion Models (JDM) [40] is a related work that produces meaningful representations across generative and discriminative tasks. Using a U-Net backbone, JDM consists of an encoder eν, a decoder dψ, and a classifier gω. The encoder maps an input xt to feature vectors Zt = eν(xt). The decoder reconstructs these into a denoised sample xt−1 = dψ(Zt), and the classifier predicts the target class ŷ = gω(Zt). The combined training objective includes the cross-entropy loss Lclass and the noise prediction network's simplified objective Lt,diff(ν, ψ), resulting in the following loss:

L(\nu, \psi, \omega) = L_{\text{class}}(\nu, \omega) - L_0(\nu, \psi) - \sum_{t=2}^{T} L_{t,\text{diff}}(\nu, \psi) - L_T(\nu, \psi).

JDM also enables a simplification of classifier guidance. By applying the classifier to noisy images xt, the classifier is effectively augmented to be robust to noise. To guide the generated sample towards a target label, the representations Zt are optimized according to the classifier gradient, giving Z′t = Zt − α∇Zt log gω(y|Zt). JDM achieves state-of-the-art performance for joint models on CIFAR and CelebA datasets, outperforming HybViT.
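A hedged sketch of such a joint objective is given below: a shared encoder feeds both a denoising decoder and a classification head, and the two losses are simply summed. The sign conventions, weighting, and the use of noise prediction in the decoder branch are illustrative simplifications, not the exact JDM formulation.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_step(encoder, decoder, classifier, x0, y, alpha_bar, T, lam=1.0):
    """One training step of a joint generative/discriminative diffusion model (sketch)."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    z_t = encoder(x_t, t)                      # shared representation Z_t
    eps_pred = decoder(z_t, t)                 # denoising branch
    logits = classifier(z_t)                   # classification branch

    loss_diff = F.mse_loss(eps_pred, eps)      # simplified denoising objective
    loss_class = F.cross_entropy(logits, y)    # discriminative objective
    return loss_class + lam * loss_diff
```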
Tian et al. [158] propose the Alternating Denoising Diffusion Process (ADDP). ADDP alternately denoises pixels and VQ tokens. Given an image x0, a pre-trained VQ Encoder [26] maps the image to VQ tokens z0. The alternating diffusion process masks regions of z0 with a Markov chain according to diffusion timestep t, producing zt. Unreliable tokens z̄t are generated by a token predictor and fed into a VQ Decoder to synthesize xt, replacing the masked regions of z0. A pixel-to-token generation network is then trained to approximate the distribution of z̄t−1. During sampling, ADDP starts with a representation of pure unreliable tokens z̄T and iteratively denoises the token sequence by predicting z̄t−1. For recognition, the representations learned by the pixel-to-token generation network can be forwarded to different task-specific recognition heads. ADDP with the VQGAN tokenizer [47], MAGE-Large [99] token predictor and ViT-Large [45] pixel-to-token encoder outperforms previous unified models in image classification, object detection, semantic segmentation, and unconditional generation.

3.1.6 Generative augmentation

Many state-of-the-art representation learning methods [33, 55, 60] rely on a fixed set of data augmentations to define positive labels for learning representations. This approach encourages encoders to learn to map the original and the augmented image to similar embedding space representations [10]. These augmentations should not alter the semantics of the image, and they should not render the image unrealistic in a real-world setting. A set of standard transformations might not adequately capture the distribution of real-world data, raising the question of how to design transformations that create diverse images and improve the generalization of learned representations.
Ayromlou et al. [10] propose using latent diffusion models [138] to generate novel views of the original image that preserve the semantic content, while closely following the distribution of real images. This augmentation method is denoted by:

T_0(x) = \begin{cases} G(z; \phi(x)) & \text{if } p \le p_0 \\ x & \text{otherwise,} \end{cases} \qquad (18)

where G denotes a conditional generative model taking a noise vector z ∼ N(0, I) and a condition vector ϕ(x) as inputs, ϕ is a pre-trained image encoder such as CLIP [136], p ∈ [0, 1] is a random number and p0 is a hyperparameter specifying the probability of applying the augmentation. Ayromlou et al. [10] show that using generative augmentation leads to consistent improvements in learned representations over standard transformations across other representation learning techniques.
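Equation 18 amounts to a stochastic wrapper around a conditional generator; a minimal sketch is shown below, where the generator, the encoder, and the latent_dim attribute are placeholders rather than components of the original method.

```python
import random
import torch

def generative_augment(x, generator, encoder, p0=0.5):
    """Replace an image batch by generated novel views with probability p0 (Equation 18).

    generator: conditional generative model G(z; phi(x)), e.g. a latent diffusion model
    encoder:   pre-trained image encoder phi (e.g. a CLIP image encoder)
    """
    if random.random() <= p0:
        cond = encoder(x)                                   # semantic condition phi(x)
        z = torch.randn(x.shape[0], generator.latent_dim, device=x.device)
        return generator(z, cond)                           # semantics-preserving novel view
    return x                                                # fall back to the original image
```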
Shipard et al. [150] take this approach one step further, using Stable Diffusion to generate a fully synthetic dataset to improve model-agnostic zero-shot classification (MA-ZSC). They use Stable Diffusion, employing several variations of prompts designed to increase the diversity of the synthetic dataset. An image classifier is subsequently trained on this synthetic dataset, and zero-shot classification results on CIFAR10, CIFAR100, and EuroSAT [64] are evaluated. Shipard et al. [150] observe substantial classification-architecture-agnostic improvements on the aforementioned datasets, achieving comparable performance to state-of-the-art zero-shot classification methods like CLIP.

Moving beyond classification, Schnell et al. [148] apply similar ideas to scribble-supervised segmentation [104, 129], a weakly-supervised form of semantic segmentation that uses sparse annotations in the form of scribbles drawn over the images. They introduce ScribbleGen, a diffusion model conditioned on semantic scribbles that generates synthetic training images for data augmentation. ScribbleGen utilizes a ControlNet [185] denoising diffusion model for noise prediction given xt and a conditioning signal c. Classes are denoted by scribbles of different colors in the RGB images, and the conditioning signal c is supplemented by a text prompt stating all classes in the image. Schnell et al. [148] trade off photorealism and image diversity by introducing an encode ratio λ ∈ [0, 1]. This diffusion parameter controls the number of noise-adding forward diffusion steps, where λ = 1 leads to no change but λ < 1 leads to λ·T steps, meaning less noise is added to the input image. The authors evaluate both a fixed and an adaptive λ, where the encoding ratio is gradually increased to provide increasingly diverse synthetic images during training. ScribbleGen achieves state-of-the-art performance on the PASCAL VOC12 segmentation dataset [48] using scribbles from ScribbleSup [104].

DiffuMask [167] is another generative augmentation method designed to improve downstream semantic segmentation tasks. The idea here is to exploit cross-attention maps between text prompts and generated images to extend image synthesis to semantic mask generation. Synthetically generated masks are used for data augmentation to improve downstream segmentation performance. Individual token attention maps of all layers are averaged and converted to binary masks using an adaptive threshold mechanism based on an AffinityNet [4]. Additionally, a noise-learning module prunes low-quality segmentation masks, and the authors employ several prompt engineering and static image transformations to further enhance the diversity of the generated images and corresponding segmentation masks.

3.2 Representation Learning for Diffusion Model Guidance

Despite the remarkable performance of generative models, there exists a gap in quality between conditional and unconditional image generation approaches [25]. This is especially the case for GANs [53], which suffer from mode collapse when trained in a fully unsupervised setting [110]. Unconditional GANs often fail to accurately model multi-modal distributions, e.g. not being able to generate all digits for MNIST [110]. Class-conditional GANs [22, 123] mitigate this issue, but require labeled data. Recent approaches like self-conditioned GANs [110] and instance-conditioned GANs [25] attempt to train conditional GANs without requiring labeled data, and are able to achieve competitive generation results.

Diffusion models have since surpassed the image generation capabilities of GANs [42], but suffer from a similar performance discrepancy between conditional and fully self-supervised approaches. Current state-of-the-art diffusion models are conditional models that rely on guidance approaches that also require annotated data. Self-supervised guidance approaches can leverage much larger unlabeled datasets for pre-training, and thus have the potential to transcend current image generation approaches. One intuitive approach for leveraging representation learning to facilitate these guidance methods is to explore methods that assign labels to unlabeled data, e.g. through clustering and classification approaches. We introduce several approaches in the following section. Fig. 5 shows a proposed taxonomy of representation learning techniques for diffusion guidance.

3.2.1 Assignment-based guidance

Sheynin et al. [149] propose kNN-Diffusion, an efficient text-to-image diffusion model trained without large-scale image-text pairings. To facilitate text-guided image generation without paired text-image data, a shared text-image encoder mapping text-image pairs into the same latent space is required. The authors use CLIP to achieve this, a pre-trained encoder trained using a contrastive loss on a large-scale text-image pair dataset. kNN-Diffusion leverages k-Nearest-Neighbors search to generate k embeddings from a retrieval model. The retrieval model uses the input image representation during training, and the text prompt representation during inference. This approach eliminates the need for annotated data but still requires a pre-trained encoder like CLIP, which in turn requires a large-scale dataset of text-image embeddings for pre-training.

Blattmann et al. [20] propose retrieval-augmented diffusion models (RDM), which equip diffusion models with an image database for composing new scenes based on retrieved images. Inspired by advances in retrieval-augmented NLP [21, 168], RDM enhances performance with
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13
Fig. 5. A hierarchical overview of current diffusion model training frameworks that leverage representation learning techniques for conditional
generation and guidance.
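The retrieval step can be summarized with a short sketch. The following is our simplified illustration of cosine-similarity kNN search over a bank of pre-computed CLIP image embeddings; it is not the authors' implementation, and the surrounding training code is omitted.

import torch
import torch.nn.functional as F

def knn_retrieve(query_emb, bank_embs, k=10):
    # query_emb: (d,) CLIP embedding; bank_embs: (N, d) pre-computed image embeddings.
    q = F.normalize(query_emb, dim=-1)
    bank = F.normalize(bank_embs, dim=-1)
    sims = bank @ q                          # cosine similarities, shape (N,)
    idx = sims.topk(k).indices
    return bank_embs[idx]                    # (k, d) embeddings used as conditioning

# Training:  cond = knn_retrieve(clip.encode_image(x), bank)    (image as query)
# Inference: cond = knn_retrieve(clip.encode_text(prompt), bank) (text as query),
# which works because CLIP maps text and images into the same embedding space.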
Blattmann et al. [20] propose retrieval-augmented diffusion models (RDM), which equip diffusion models with an image database for composing new scenes based on retrieved images. Inspired by advances in retrieval-augmented NLP [21, 168], RDM enhances performance with fewer parameters and computational resources. Despite being trained only on images, RDM allows conditional synthesis due to the shared image-text feature space of CLIP [136]. RDM includes a trainable conditional latent diffusion model pθ, an external image database D, and a fixed sampling strategy ξk that selects a subset M_D^(k) of D based on a query x. One strategy ξk(x, D) is to retrieve the k nearest neighbors using a distance function d(·, x). The retrieved data is processed through a frozen image encoder ϕ and used to condition pθ. During training, ξk retrieves the k nearest neighbors for a query image x using cosine similarity in CLIP's image feature space as the distance function d(x, y). This approach ensures that retrieved image representations are useful for generation tasks and allows for text conditioning due to CLIP's shared feature space. The dataset D and retrieval strategy ξk can be changed at test time, adding flexibility for different conditioning modalities and adaptability to other data distributions.
Hu et al. [75] propose a method also motivated by eliminating the need for annotated data. Self-guided diffusion is a framework encompassing a feature extraction function gϕ and a self-annotation function fψ. The feature extraction function is a self-supervised feature extractor that maps the input data x ∈ D to a feature space H, where D denotes the dataset. The resulting feature representation gϕ(x; D) is passed to fψ, which maps it to a guidance signal k. This framework can be applied to achieve self-labeled guidance, where k is a one-hot embedding derived by using k-means clustering as the self-annotation function f on compacted features generated by gϕ. More fine-grained spatial guidance is achieved by self-boxed guidance, which uses a mapping from feature space H to a bounding box as the self-annotation function f, as well as by self-segmented guidance, which uses a mapping to a segmentation mask to generate guidance signals by clustering. Self-guidance significantly outperforms unconditional diffusion models, and even outperforms classifier-free guided diffusion models that use ground-truth annotations on image generation. This suggests that the clusters are potentially more aligned with the visual similarity of the images, and are better guidance signals than ground-truth labels alone. While this approach is self-supervised, it still relies on an external pre-trained feature extractor to generate feature representations for clustering.
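A minimal sketch of self-labeled guidance is given below: self-supervised features are clustered with k-means and the resulting one-hot cluster assignments replace ground-truth class labels as conditioning. This is our own illustration of the general idea, not the reference implementation of [75].

import numpy as np
from sklearn.cluster import KMeans

def self_label(features, num_clusters=1000):
    # features: (N, d) self-supervised embeddings g_phi(x; D) of the training set.
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(features)
    assignments = kmeans.labels_                   # self-annotation f: feature -> cluster id
    one_hot = np.eye(num_clusters)[assignments]    # guidance signals k
    return one_hot, kmeans

# The diffusion model is then trained with one_hot[i] in place of a class label;
# at sampling time a cluster index is picked and its one-hot vector conditions
# generation, e.g. in combination with classifier-free guidance.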
For this reason, Hu et al. [73] extend their work to propose an online feature clustering method using the Sinkhorn-Knopp algorithm. This is challenging since the idea requires obtaining conditioning signals for clustering during training from a diffusion model that itself depends on this conditioning. This issue is solved by introducing a zero vector into the conditional diffusion model for the signals used to identify the clustering. For each image, features of the conditional diffusion model conditioned on this zero vector are passed through a fully-connected feature prediction head, and the resulting features are mapped to a set of learnable prototypes denoted M. The method uses a combination of the diffusion training loss and a Sinkhorn-Knopp loss to obtain guidance signals c that are based on clustering features using M. The promise of this method is high, with self-guided diffusion outperforming related unconditional generation baselines on ImageNet256 and LSUN-Churches while being competitive with class-guidance methods that rely on ground-truth labels. The online approach specifically does not rely on ground-truth labels or any external pre-trained models. Adaloglou et al. [2] build on the aforementioned cluster-based guidance approaches by utilizing EDM [85], TEMI clustering [1], and a method for deriving an upper cluster bound for feature-based clustering.
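The Sinkhorn-Knopp step can be written compactly. The sketch below shows the standard balanced-assignment normalization between predicted features and the learnable prototypes M; the exact losses and scheduling used in [73] may differ.

import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    # scores: (B, K) similarities between predicted features and the K prototypes in M.
    Q = torch.exp(scores / eps).t()              # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K  # balance prototypes (rows)
        Q /= Q.sum(dim=0, keepdim=True); Q /= B  # balance samples (columns)
    return (Q * B).t()                           # (B, K) soft cluster assignments

# The resulting assignments act as online clustering targets and are combined with
# the standard diffusion training loss, so no external pre-trained feature
# extractor is needed.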
Other approaches to diffusion model guidance rely on generating pseudo-labels for unlabeled data. You et al. [176] propose dual pseudo training (DPT), which uses a classifier trained on limited labeled data to generate pseudo-labels. These are then used to condition a diffusion model to generate pseudo images, which are in turn used as data augmentation to retrain a classifier on a mix of pseudo and real images. DPT involves three stages. First, a semi-supervised classifier is trained on partially labeled data to predict pseudo-labels ŷ for all images x ∈ X. Second, a conditional generative model is trained on the pseudo-labeled dataset S1 = {(x, ŷ) | x ∈ X}. Finally, the classifier is retrained on real data that is augmented by the generated data. DPT achieves highly competitive performance on ImageNet classification and generation with as little as five labels per class, outperforming several supervised diffusion model benchmarks like ADM [42] and LDM [138].
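The three DPT stages can be summarized as follows. The sketch is schematic: the training routines are passed in as callables, and the labeling of the final training mix is simplified relative to [176].

def dual_pseudo_training(images, partial_labels, train_classifier, train_cond_diffusion,
                         num_classes, n_synth_per_class):
    # partial_labels: list aligned with images, a label or None (e.g. 5 labels per class).
    # Stage 1: a semi-supervised classifier predicts pseudo-labels y_hat for all images.
    clf = train_classifier(images, partial_labels)
    pseudo = [clf.predict(x) for x in images]
    # Stage 2: a conditional diffusion model is trained on S1 = {(x, y_hat)}.
    gen = train_cond_diffusion(images, pseudo)
    # Stage 3: the classifier is retrained on real data augmented with samples
    # generated from the pseudo-label conditioned model.
    synth = [gen.sample(label=y) for y in range(num_classes) for _ in range(n_synth_per_class)]
    synth_labels = [y for y in range(num_classes) for _ in range(n_synth_per_class)]
    return train_classifier(list(images) + synth, list(pseudo) + synth_labels)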
3.2.2 A generalized framework for assignment-based guidance
Assignment-based guidance approaches all rely on assigning annotations to inputs during training, which enables controlled generation during inference when conditioning on these annotations. We therefore propose a generalized framework that encapsulates all assignment-based guidance approaches discussed here. The framework consists of three main components. The first is a self-supervised image encoder E(x) that maps inputs to a low-dimensional feature representation z. Using a multi-modal feature extractor like CLIP has the advantage of enabling text-based as well as image-based conditioning, but other feature extractors can be used, provided they generate semantically meaningful image representations.
The second is a self-annotation function f(z), which uses the image representation to produce an annotation c for input image x. In the simplest case, this self-annotation function is an external pre-trained image classifier that generates pseudo-class labels from image representations, similar to the approach employed in DPT [176], where the external classifier is subsequently re-trained on the conditionally generated images. In other cases, the self-annotation function is a retrieval model, which uses a distance function d to retrieve images similar to the training image and uses representations of the retrieved images to generate the guidance signal c.
The final component is a denoising network Dθ(xt, c, t), which takes the noisy image xt, the diffusion timestep t, and the guidance signal c as input, and denoises the image. During inference, controlled generation is enabled by passing an initial guidance signal k (which can be multi-modal as long as the embedding space of the encoder E is shared between modalities) through the encoder to obtain a representation z = E(k). The conditioning signal c is then generated by passing z to the self-annotation function f, i.e. c = f(z). Passing xt, c, and t to the denoising network Dθ then enables the synthesis of novel images semantically similar to the initial guidance signal k.
One of the main motivations behind the design of assignment-based guidance methods is the reliance of existing methods on labeled data. While it could be argued that the aforementioned assignment-based guidance approaches are indirectly reliant on annotated data through the pre-trained image encoder, it is important to note that this encoder can be replaced with a fully self-supervised encoder. CLIP relies on the availability of a large-scale dataset of image-caption pairs and is thus not fully self-supervised, but other representation learning methods are also able to generate semantic representations. CLIP is used in many approaches to facilitate both text prompt-based and image-based conditioning during inference, which may no longer be possible when using purely image-based feature extractors. A summary of the training and inference methodology can be found in Fig. 6.
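The framework can also be summarized in a few lines of pseudo-PyTorch. The sketch below is our own distillation of the three components; E, f, and D_theta are placeholders for a concrete encoder, self-annotation function, and denoising network, and the sampler is assumed to exist.

import torch

def training_step(x0, E, f, D_theta, T, alphas_cumprod):
    # Self-annotation replaces a ground-truth label: c = f(E(x)).
    c = f(E(x0))
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)    # (B, 1, 1, 1) cumulative alphas
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return ((D_theta(x_t, c, t) - noise) ** 2).mean()  # standard eps-prediction loss

def guided_sample(k, E, f, sampler):
    # Inference: any guidance signal k that E can embed (an image, or text when a
    # multi-modal encoder such as CLIP is used) is mapped to a conditioning signal.
    c = f(E(k))
    return sampler(cond=c)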
3.2.3 Representation-based guidance
Li et al. [100] present Representation-Conditioned Image Generation (RCG), a framework conditioning diffusion models on a self-supervised representation distribution mapped from the image distribution using a pre-trained encoder. The idea is to train a Representation Diffusion Model (RDM) on the representations generated by a pre-trained encoder to generate low-dimensional image representations. After this, a pixel generator conditioned on the representation is trained to map noise distributions to image distributions. RCG consists of three main components. The first is a pre-trained image encoder, which converts the original image distribution into a representation distribution. The authors propose using self-supervised contrastive learning methods (e.g. MoCo v3) for generating this representation distribution. The second is a representation generator in the form of an RDM, which learns to generate representations from Gaussian noise following the DDIM [152] sampling process. The final component is a pixel generator that crafts image pixels conditioned on image representations. RCG can easily incorporate classifier-free guidance for unconditional generation tasks, since the pixel generator is conditioned on self-supervised representations. RCG emerges as a highly promising method for bridging the gap between conditional and unconditional image generation, outperforming pre-existing unconditional generation approaches on ImageNet and exhibiting competitive performance with current state-of-the-art class-conditional approaches.
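Sampling with RCG proceeds in two stages, sketched below. Method names such as ddim_sample are placeholders; the sketch only reflects the structure described above, not the released implementation.

import torch

@torch.no_grad()
def rcg_sample(rep_diffusion, pixel_generator, rep_dim, image_shape, batch=16):
    # Stage 1: the representation diffusion model denoises Gaussian noise in the
    # low-dimensional representation space (DDIM-style sampling).
    z = torch.randn(batch, rep_dim)
    rep = rep_diffusion.ddim_sample(z)
    # Stage 2: the pixel generator maps image-space noise to images conditioned on
    # the sampled representation; classifier-free guidance can be applied on rep
    # even though no class labels are involved.
    x = torch.randn(batch, *image_shape)
    return pixel_generator.sample(x, cond=rep)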
Readout Guidance (RG) [117] makes use of auxiliary readout heads trained on top of a frozen diffusion model to extract properties of the generated image that can be used for guidance. These properties can include human pose, depth maps, edges, and even higher-order properties like similarity to another image. During sampling, the properties extracted by the readout heads can be compared to user-defined control targets, and used in a methodology similar to classifier guidance [43] to guide generation.
Lin and Yang [105] identified a novel self-perceptual objective that enhances diffusion models, enabling them to generate more realistic samples. Contrary to the conventional approach of training or employing an image encoder, the authors demonstrate that a pre-trained diffusion model inherently functions as a perceptual network and can be used to generate perceptual representations. The perceptual loss facilitates the model's ability to generate more realistic images even with unconditional synthesis.
Also inspired by the downsides of classifier guidance and classifier-free guidance, Hong et al. [69] introduce Self-Attention Guidance (SAG). SAG adversarially blurs regions that contain salient information by leveraging intermediate self-attention activation maps, using the residual information as guidance. This increases generation quality without requiring external information or additional training. The self-attention mechanism, contained in both U-Net and DiT diffusion backbones, allows the noise predictor to attend to the most informative features of the input. The self-attention maps A_t^S ∈ R^(N×(HW)×(HW)) are aggregated and reshaped to dimension R^(H×W) using global average pooling and nearest-neighbor upsampling to match the resolution of xt. The difference between the blurred image x̃t and xt is used as conditioning, thereby retaining the information masked in this process.
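A simplified sketch of how the self-attention maps can be reduced to a spatial saliency mask is shown below. The aggregation follows the description above; the upsampling factor and threshold are illustrative assumptions rather than the exact values used in [69].

import torch
import torch.nn.functional as F

def saliency_mask_from_attention(attn, h, w, upsample=8, rel_threshold=1.0):
    # attn: (N, H*W, H*W) self-attention maps A_t^S from the denoising network.
    scores = attn.mean(dim=(0, 1))                     # global average pooling -> (H*W,)
    mask = scores.reshape(1, 1, h, w)
    mask = F.interpolate(mask, scale_factor=upsample, mode="nearest")  # match x_t size
    return (mask > rel_threshold * mask.mean()).float()                # 1 = salient region

# SAG blurs x_t only inside the masked regions and uses the difference between the
# blurred and the original input as an additional guidance signal during sampling.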
Fig. 6. A generalization of assignment-based guidance training and sampling pipelines. Samples are conditioned on annotations generated by a
self-annotation function f , using features extracted by a pre-trained image encoder (e.g., CLIP [136]).
3.2.4 Objective-based guidance
Many of the previously outlined approaches focus on eliminating the need for pre-trained classifiers, encoders, and dataset annotations for training conditional diffusion models. Other recent works [46, 86] have demonstrated that internal diffusion model representations can be used to improve generation control over the structural and semantic composition of generated images.
One such approach is Self-guidance for Controllable Image Generation [46] (which we denote SGCIG to distinguish it from [75]). SGCIG is a zero-shot method designed to increase user control over structural and semantic elements of objects in images generated by text-to-image diffusion models. Incorporating ideas similar to [65], the authors of SGCIG leverage representations from intermediate activations and attention maps to steer the generation process. SGCIG works by adding a set of guidance terms to the objective of the denoising network, each of which defines a property that can be used to perform image manipulations. Image edits can then be carried out by guiding these properties to change in the pixel generation process. While the method is limited to the manipulation of objects explicitly stated in the conditioning text prompt, it represents a promising first step towards increased control over generated images. Diffusion Handles [131] extend this to 3D object editing, using manipulated diffusion model activations to produce plausible edits.
Depth-aware guidance (DAG) [86] is a related method that uses semantic information from intermediate denoising network layers for improved depth-aware image synthesis. Kim et al. [86] propose training depth predictors with limited depth-labeled data using internal U-Net backbone representations, similar to DDPM-Seg [15]. The depth predictors are pixel-wise shallow MLP regressors estimating depth values from intermediate U-Net features ft at timestep t. Features are concatenated across layers to form gt, and depth maps dt = MLP(gt, t) are generated using an appended time-embedding block. This depth predictor is trained using a limited depth-labeled dataset. To guide the diffusion process toward depth-aware generation, two guidance strategies are introduced: Depth consistency guidance uses pseudo-labels with a consistency loss Ldc between weak and strong depth predictors, guiding the generation process using the gradient of Ldc with respect to xt in a methodology similar to [42]. Depth prior guidance employs an additional small-resolution diffusion U-Net on the depth domain, adding noise to depth predictions and using a denoising objective Ldp. The gradient of Ldp is treated like an external classifier gradient and added to the image generation objective. Combining both methods during training results in enhanced depth semantics in generated images.
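The depth predictor itself is small. The sketch below shows a per-pixel regressor over concatenated U-Net features with an appended time embedding; layer sizes are illustrative and do not correspond to the exact configuration in [86].

import torch
import torch.nn as nn

class DepthHead(nn.Module):
    # Pixel-wise MLP regressor producing d_t = MLP(g_t, t).
    def __init__(self, feat_dim, time_dim, hidden=128):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(time_dim, feat_dim), nn.SiLU())
        self.mlp = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=1), nn.SiLU(),  # 1x1 convs act per pixel
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, g_t, t_emb):
        # g_t: (B, feat_dim, H, W) U-Net features concatenated across layers and
        # upsampled to a common resolution; t_emb: (B, time_dim) timestep embedding.
        h = g_t + self.time_embed(t_emb)[:, :, None, None]
        return self.mlp(h)                                          # (B, 1, H, W) depth map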
Perturbed Attention Guidance (PAG) [3] is a sampling guidance method that improves generation quality in both conditional and unconditional settings. PAG does not require additional training or external pre-trained models. Instead, Ahn et al. [3] introduce an implicit discriminator D that differentiates between desirable and undesirable samples during the diffusion process, where y denotes a desirable and ŷ an undesirable sample. The diffusion sampling process is then redefined to incorporate the derivative of the discriminator loss LD. The score with undesirable label ŷ cannot be approximated using the existing denoising network ϵθ(xt); it is therefore estimated by perturbing the forward pass of a pre-trained denoising network, denoted by ϵ̂θ. PAG perturbs the self-attention maps in the diffusion U-Net, replacing them with an identity matrix to guide the sampling process away from degraded samples. The final noise prediction ϵ̃θ is obtained by feeding xt into both ϵθ(·) and ϵ̂θ(·) and combining the two predictions. PAG improves generation quality in both conditional and unconditional settings and can be combined with existing guidance methods like classifier guidance.
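The combination of the regular and the perturbed prediction can be sketched as follows. We assume a classifier-free-guidance-style combination and a hypothetical switch for replacing the self-attention maps with the identity; the exact formulation in [3] may differ in detail.

import torch

@torch.no_grad()
def pag_noise_prediction(eps_net, x_t, t, cond=None, scale=3.0):
    eps = eps_net(x_t, t, cond)                          # regular forward pass
    with eps_net.perturb_self_attention(identity=True):  # hypothetical context manager
        eps_hat = eps_net(x_t, t, cond)                  # prediction with degraded attention
    # Guide away from the perturbed prediction: eps_tilde = eps + s * (eps - eps_hat).
    return eps + scale * (eps - eps_hat)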
4 CHALLENGES & FUTURE DIRECTIONS
4.1 General Challenges
Diffusion model-based representation learning is a novel research field with substantial potential for theoretical and practical improvements. Improving synergies between representation learning and generative models is akin to a chicken-and-egg problem, where better diffusion models simultaneously lead to higher-quality image representations, and better representation learning methods improve the generative quality of diffusion models when applied to self-supervised guidance.
REFERENCES
by retrieving from trillions of tokens,” in ICLR. PMLR, 2022, pp. 2206–2240.
[22] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” in ICLR, 2019.
[23] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in CVPR, 2023, pp. 18392–18402.
[24] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in ICCV, 2021, pp. 9650–9660.
[25] A. Casanova, M. Careil, J. Verbeek, M. Drozdzal, and A. Romero Soriano, “Instance-conditioned gan,” in NeurIPS, 2021.
[26] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” in CVPR, 2022.
[27] Z. Chang, G. A. Koulieris, and H. P. H. Shum, “On the Design Fundamentals of Diffusion Models: A Survey,” arXiv, 2023.
[28] H. Chefer, O. Lang, M. Geva, V. Polosukhin, A. Shocher, M. Irani, I. Mosseri, and L. Wolf, “The Hidden Language of Diffusion Models,” in ICLR, 2024.
[29] G. Chen, Y. Huang, J. Xu, B. Pei, Z. Chen, Z. Li, J. Wang, K. Li, T. Lu, and L. Wang, “Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding,” arXiv, 2024.
[30] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs,” arXiv, 2016.
[31] ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[32] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating Gradients for Waveform Generation,” in ICLR, 2021.
[33] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML. PMLR, 2020, pp. 1597–1607.
[34] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big self-supervised models are strong semi-supervised learners,” in NeurIPS, 2020.
[35] X. Chen, Z. Liu, S. Xie, and K. He, “Deconstructing Denoising Diffusion Models for Self-Supervised Learning,” arXiv, 2024.
[36] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in CVPR, 2023, pp. 2818–2829.
[37] F. Croitoru, V. Hondru, R. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 45, no. 09, pp. 10850–10869, 2023.
[38] Q. Dao, H. Phung, B. Nguyen, and A. Tran, “Flow matching in latent space,” arXiv, 2023.
[39] A. Davtyan, S. Sameni, and P. Favaro, “Efficient video prediction via sparsely conditioned flow matching,” in ICCV, 2023, pp. 23263–23274.
[40] K. Deja, T. Trzciński, and J. M. Tomczak, “Learning data representations with joint diffusion models,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2023, pp. 543–559.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[42] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” in NeurIPS, 2021.
[43] ——, “Diffusion Models Beat GANs on Image Synthesis,” in NeurIPS, vol. 34, 2021.
[44] J. Donahue and K. Simonyan, “Large scale adversarial representation learning,” in NeurIPS, 2019.
[45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.
[46] D. Epstein, A. Jabri, B. Poole, A. Efros, and A. Holynski, “Diffusion self-guidance for controllable image generation,” in NeurIPS, vol. 36, 2023, pp. 16222–16239.
[47] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021, pp. 12873–12883.
[48] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.
[49] Z. Fei, M. Fan, C. Yu, and J. Huang, “Scalable Diffusion Models with State Space Backbone,” arXiv, 2024.
[50] J. Schusterbauer, M. Gui, P. Ma, N. Stracke, S. A. Baumann, and B. Ommer, “Boosting latent diffusion with flow matching,” arXiv, 2023.
[51] Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Li, H. Hu et al., “Instructdiffusion: A generalist modeling interface for vision tasks,” in CVPR, 2024, pp. 12709–12720.
[52] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel, “Tokenflow: Consistent diffusion features for consistent video editing,” in ICLR, 2024.
[53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, vol. 27, 2014.
[54] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” in NeurIPS, 2022.
[55] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning,” in NeurIPS, vol. 33, 2020, pp. 21271–21284.
[56] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” arXiv, 2024.
[57] M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer, “Depthfm: Fast monocular depth estimation with flow matching,” arXiv, 2024.
[58] J. Guo, X. Xu, Y. Pu, Z. Ni, C. Wang, M. Vasu, S. Song, G. Huang, and H. Shi, “Smooth diffusion: Crafting smooth latent spaces in diffusion models,” in CVPR, 2024.
[59] B. Ham, M. Cho, C. Schmid, and J. Ponce, “Proposal flow: Semantic correspondences from object proposals,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 7, pp. 1711–1725, 2017.
[60] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
[61] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in CVPR, 2022.
[62] E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, and K. M. Yi, “Unsupervised semantic correspondence using stable diffusion,” in NeurIPS, 2023.
[63] J. Heek, E. Hoogeboom, and T. Salimans, “Multistep consistency models,” arXiv, 2024.
[64] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
[65] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-Prompt Image Editing with Cross Attention Control,” arXiv, 2022.
[66] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv, 2015.
[67] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS Workshop, 2021.
[68] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
[69] S. Hong, G. Lee, W. Jang, and S. Kim, “Improving Sample Quality of Diffusion Models Using Self-Attention Guidance,” in ICCV, 2023.
[70] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,” in NeurIPS, 2021.
[71] E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling, “Equivariant Diffusion for Molecule Generation in 3D,” in Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022, pp. 8867–8887.
[72] E. Hoogeboom, J. Heek, and T. Salimans, “simple diffusion: End-to-end diffusion for high resolution images,” in Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023, pp. 13213–13232.
[73] V. T. Hu, Y. Chen, M. Caron, Y. M. Asano, C. G. M. Snoek, and B. Ommer, “Guided Diffusion from Self-Supervised Diffusion Features,” arXiv, 2023.
[74] V. T. Hu, W. Yin, P. Ma, Y. Chen, B. Fernando, Y. M. Asano, E. Gavves, P. Mettes, B. Ommer, and C. G. M. Snoek, “Motion flow matching for human motion synthesis and editing,” arXiv, 2023.
[75] V. T. Hu, D. W. Zhang, Y. M. Asano, G. J. Burghouts, and C. G. Snoek, “Self-guided diffusion models,” in CVPR, 2023, pp. 18413–18422.
[76] V. T. Hu, S. A. Baumann, M. Gui, O. Grebenkova, P. Ma, J. Schusterbauer, and B. Ommer, “ZigMa: A DiT-style Zigzag Mamba Diffusion Model,” arXiv, 2024.
[77] V. T. Hu, D. Wu, Y. M. Asano, P. Mettes, B. Fernando, B. Ommer, and C. G. M. Snoek, “Flow matching for conditional text generation in a few sampling steps,” in EACL, 2024.
[78] V. T. Hu, W. Zhang, M. Tang, P. Mettes, D. Zhao, and C. Snoek, “Latent space editing in transformer-based flow matching,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2247–2255.
[79] C.-W. Huang, J. H. Lim, and A. Courville, “A variational perspective on diffusion-based generative models and score matching,” in NeurIPS, 2021.
[80] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models,” in Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023, pp. 13916–13932.
[81] Y. Huang, J. Huang, Y. Liu, M. Yan, J. Lv, J. Liu, W. Xiong, H. Zhang, S. Chen, and L. Cao, “Diffusion Model-Based Image Editing: A Survey,” arXiv, 2024.
[82] D. A. Hudson, D. Zoran, M. Malinowski, A. K. Lampinen, A. Jaegle, J. L. McClelland, L. Matthey, F. Hill, and A. Lerchner, “Soda: Bottleneck diffusion models for representation learning,” in CVPR, 2024.
[83] K. Itô, “Stochastic differential equations in a differentiable manifold,” Nagoya Mathematical Journal, vol. 1, pp. 35–47, 1950.
[84] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401–4410.
[85] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” in NeurIPS, 2022.
[86] G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, and S. Kim, “Depth-aware guidance with self-estimated depth representations of diffusion models,” Pattern Recognition, vol. 153, p. 110474, 2024.
[87] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational Diffusion Models,” in NeurIPS, vol. 34, 2021.
[88] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
[89] ——, “An Introduction to Variational Autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
[90] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in CVPR, 2019, pp. 6399–6408.
[91] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in CVPR, 2019, pp. 9404–9413.
[92] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A Versatile Diffusion Model for Audio Synthesis,” arXiv, 2021.
[93] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in ICLR, 2023.
[94] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” arXiv, 2023.
[95] A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, and D. Pathak, “Your diffusion model is secretly a zero-shot classifier,” in ICCV, 2023.
[96] D. Li, H. Ling, A. Kar, D. Acuna, S. W. Kim, K. Kreis, A. Torralba, and S. Fidler, “Dreamteacher: Pretraining image backbones with deep generative models,” in ICCV, 2023, pp. 16698–16708.
[97] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao, “VideoMamba: State Space Model for Efficient Video Understanding,” arXiv, 2024.
[98] S. Li, C. Chen, and H. Lu, “MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers,” arXiv, 2024.
[99] T. Li, H. Chang, S. Mishra, H. Zhang, D. Katabi, and D. Krishnan, “Mage: Masked generative encoder to unify representation learning and image synthesis,” in CVPR, 2023, pp. 2142–2152.
[100] T. Li, D. Katabi, and K. He, “Return of Unconditional Generation: A Self-supervised Representation Generation Method,” arXiv, 2024.
[101] X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto, “Diffusion-LM Improves Controllable Text Generation,” arXiv, 2022.
[102] X. Li, K. Han, X. Wan, and V. A. Prisacariu, “SimSC: A Simple Framework for Semantic Correspondence with Temperature Learning,” arXiv, 2023.
[103] X. Li, J. Lu, K. Han, and V. A. Prisacariu, “Sd4match: Learning to prompt stable diffusion model for semantic matching,” in CVPR, 2024, pp. 27558–27568.
[104] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in CVPR, 2016, pp. 3159–3167.
[105] S. Lin and X. Yang, “Diffusion Model with Perceptual Loss,” arXiv, 2024.
[106] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
[107] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in ICLR, 2023.
[108] H. Liu, M. Zaharia, and P. Abbeel, “Ringattention with blockwise transformers for near-infinite context,” in ICLR, 2024.
[109] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism,” in AAAI, vol. 36, no. 10, 2022, pp. 11020–11028.
Michael Fuest is a research intern in the Computer Vision & Learning Group at Ludwig-Maximilians-Universität München (LMU). He recently received his master’s degree in Management & Technology with a major in Computer Science from the Technical University of Munich, and is currently a visiting researcher at the MIT Laboratory for Information & Decision Systems.

Pingchuan Ma is a Ph.D. student in the Computer Vision & Learning Group at Ludwig Maximilian University of Munich (LMU) and a Munich Center for Machine Learning (MCML) member. He previously received his master’s degree in Applied Computer Science from Heidelberg University, where he developed an interest in deep metric learning and style transfer. He served as a reviewer for CVPR 2024 and NeurIPS 2024. His current research focuses on leveraging generative models for tasks beyond generation and exploring multi-modality representation learning.

Björn Ommer is a professor at LMU and leads the Computer Vision & Learning Group. He was previously with Heidelberg University’s Department of Mathematics and Computer Science, IWR, and HCI. He studied computer science and physics at the University of Bonn, completed his Ph.D. at ETH Zurich where his dissertation received the ETH Medal, and held a post-doc position with Jitendra Malik at UC Berkeley. He is a member of the Bavarian AI Council, an editor for IEEE T-PAMI, an ELLIS Fellow, faculty of ELLIS unit Munich, a PI at the Munich Center for Machine Learning (MCML), and has held various roles at numerous CVPR, ICCV, ECCV, and NeurIPS conferences. He delivered the opening keynote at NeurIPS’23.