0% found this document useful (0 votes)

53 views21 pages

Diffusion Models & Representation Learning

Uploaded by

arian aghamohseni

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

53 views21 pages

Diffusion Models & Representation Learning

Uploaded by

arian aghamohseni

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1

Diffusion Models and Representation Learning:

A Survey
Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, Björn Ommer

Abstract—Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can
be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey
explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential
aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches
related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned
from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and
arXiv:2407.00783v1 [cs.CV] 30 Jun 2024

self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between
diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link:
https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.

Index Terms—deep generative modeling, diffusion models, denoising diffusion models, score-based models, image generation,
representation learning.

1 I NTRODUCTION
Diffusion Models [68, 151, 154] have recently emerged as
the state-of-the-art of generative modeling, demonstrating
remarkable results in image synthesis [43, 67, 68, 141] and
across other modalities including natural language [9, 70, 77,
101], computational chemistry [6, 71] and audio synthesis
[80, 92, 109]. The remarkable generative capabilities of Dif-
fusion Models suggest that Diffusion Models learn both low
and high-level features of their input data, potentially mak-
ing them well-suited for general representation learning.
Unlike other generative models like Generative Adversarial
Networks (GANs) [22, 53, 84] and Variational Autoencoders
(VAEs) [88, 137], diffusion models do not contain fixed archi-
tectural components that capture data representations [124].
This makes diffusion model-based representation learning
challenging. Nevertheless, approaches leveraging diffusion
models for representation learning have seen increasing
interest, simultaneously driven by advancements in training Fig. 1. Shows yearly numbers of both published and preprint papers on
and sampling of Diffusion Models. diffusion models and representation learning. For 2024, the green bar
Current state-of-the-art self-supervised representation indicates the number of papers collected up to and including June 2024,
and the dashed grey bar indicates the projected number for the whole
learning approaches [8, 24, 33, 55] have demonstrated great year.
scalability. It is thus likely that diffusion models exhibit
similar scaling properties [159]. Controlled generation ap-
proaches like Classifier Guidance [43] and Classifier-free fusion models to train on much larger, annotation-free
Guidance [67] used to obtain state-of-the-art generation datasets.
results rely on annotated data, which represents a bottle- This survey paper aims to elucidate the relationship
neck for scaling up diffusion models. Guidance approaches and interplay between diffusion models and representation
that leverage representation learning and that are thus learning. We highlight two central perspectives: Using diffu-
annotation-free offer a solution, potentially enabling dif- sion models themselves for representation learning and us-
ing representation learning for improving diffusion models.
• Michael Fuest is a Master’s student at the Technical University of We introduce a taxonomy of current approaches and derive
Munich. Pingchuan Ma, Ming Gui, and Johannes Schusterbauer are generalized frameworks that demonstrate commonalities
PhD students at LMU Munich. Vincent Tao Hu, a PostDoc from LMU
Munich, is also the corresponding author.
among current approaches.
E-mail: [email protected] Interest in exploring the representation learning capabil-
• Björn Ommer is a full professor at LMU where he heads the Computer Vi- ities of diffusion models has been growing since the original
sion & Learning Group (previously Computer Vision Group Heidelberg). formulation of diffusion models by Ho et al. [68], Sohl-
Dickstein et al. [151], Song et al. [154]. As demonstrated
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2

in Fig. 1, we expect this trend to continue this year. The

increased volume of published works on diffusion models p
p(xt |xt−1 ) = N xt ; 1 − βt xt−1 , βt I ,
and representation learning makes it more difficult for re- (1)
searchers to identify state-of-the-art approaches and stay on ∀t ∈ {1, . . . , T },
top of current developments. This can hinder progress in the
where T denotes the number of diffusion timesteps, βt
space, which is why we feel a comprehensive overview and
is a time-dependent variance schedule and I is an identity
categorization is required.
matrix with dimensionality equal to x0 [37]. Note that other
Research on representation learning and diffusion mod-
parametrizations of the transition kernel p(xt |xt−1 ) are also
els is in its infancy. Many of the current approaches rely
applicable in the same manner [87, 188]. We proceed with
on using diffusion models solely trained for generative syn-
the parametrization used in DDPMs [68] to simplify the dis-
thesis for representation learning. We therefore hypothesize
cussion moving forward. A noisy image xt can be sampled
that there are significant opportunities for further progress
directly from x0 with the help of a reparametrization trick
in this area in the future and that diffusion models can
[151] as follows:
increasingly challenge the current state-of-the-art in rep-
resentation learning. Fig. 2 shows qualitative results from
√
p(xt |x0 ) = N xt ; ᾱt x0 ; (1 − ᾱt )I , (2)
existing methods. We hope that this survey can contribute
to advances in diffusion-based representation learning, by Qt
where αt := 1 − βt and ᾱt := i=1 αi . Given the
clarifying commonalities and differences among current ap-
original input image x0 , we can now obtain xt in one step
proaches. In summary, the main contributions of this paper
by sampling Gaussian vector ϵt ∼ N (0, I) and applying:
are the following:
√ p
• Comprehensive Overview: Offers a thorough survey xt = ᾱt x0 + (1 − ᾱt )ϵt . (3)
of the interplay between diffusion models and repre-
We can generate novel samples from p(x0 ) starting
sentation learning, providing clarity on how diffu-
from a pure noise image xT ∼ π(xT ) = N (0, I)
sion models can be used for representation learning
with dimensionality equivalent to the data and sequen-
and vice versa.
tially denoise it such that at every step, pθ (xt−1 |xt ) =
• Taxonomy of Approaches: We introduce a taxonomy
N (xt−1 ; µθ (xt , t), Σθ (xt , t)). In practice, this requires train-
of current approaches in diffusion-based represen-
ing a neural network pθ (xt−1 |xt ) that predicts the mean
tation learning, categorizing and highlighting com-
µ0 (xt , t) and the covariance Σθ (xt , t) given a diffusion
monalities and differences among them.
timestep t and the noisy input image xt [172]. Training
• Generalized Frameworks: The paper derives gener-
this neural network with a maximum likelihood objective
alized frameworks for both diffusion model feature
is intractable [37], so the objective is amended to minimize
extraction and assignment-based guidance, offering
a Variational Lower-Bound of the Negative Log-Likelihood
a structured view on a large number of works on
instead [68, 151]:
diffusion models and representation learning.
• Future Directions: We identify key opportunities
for further progress in the field, encouraging the Lvlb = − log pθ (x0 |x1 ) + DKL (p(xT |x0 )∥π(xT ))
(4)
X
exploration of diffusion models and flow matching + DKL (p(xt−1 |xt , x0 )∥pθ (xt−1 |xt )) ,
as a new state-of-the-art in representation learning. t>1

where DKL is the Kullback-Leibler divergence. This

objective ensures that the neural network is trained to
2 BACKGROUND
minimize the distance between pθ (xt−1 |xt ) and the true
The following section outlines the required mathematical posterior of the forward process when conditioned on x0 .
foundations of diffusion models. We also highlight current The denoising network is generally applied to parametrize
architecture backbones of diffusion models and provide a the reverse mean µθ (x, t) of the distribution of the reverse
brief overview of sampling methods and conditional gener- transition pθ (xt−1 |xt ) := N (xt−1 ; µθ (xt , t), Σθ (xt , t)) [27].
ation approaches. The true value of the reverse mean is a function of x0 , which
is unknown in the reverse process and must therefore be
2.1 Mathematical Foundations estimated using input timestep t and the noisy data xt .
Consider a set of training examples drawn from an under- Specifically, the reverse mean is formulated as the following:
√ √
lying probability distribution p(x). The idea behind gener- ᾱt−1 (1 − ᾱt−1 )xt + ᾱt−1 (1 − αt )x0
ative diffusion models is to learn a denoising process that µ(xt , t) := , (5)
1 − ᾱt
maps samples of random noise to novel images sampled
where the original data x0 is unavailable in the reverse
from p(x) [133]. To achieve this, images are corrupted by
process and must therefore be estimated. We denote the
gradually adding different levels of Gaussian noise. Given
denoising network’s prediction of the original data as x̂0 .
an uncorrupted training sample x0 ∼ p(x), where index
This prediction x̂0 can then be used to obtain µθ (xt , t) using
0 denotes the fact that the sample is not corrupted, the
Equation 5. Parametrizing with x̂0 directly is beneficial
corrupted samples x1 , x2 ..., xT are generated according to
at the beginning of sampling, since predicting x̂0 directly
a Markovian process. One common choice for the transition
helps the denoising network to learn higher-level structural
kernel p(xt |xt−1 ) is the following:
features [115].
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3

Fig. 2. Left: Shows qualitative generation results from diffusion models conditioned using self-supervised guidance signals. Right: Shows qualitative
results of downstream image tasks that leverage representations learned in training diffusion models. Adapted from Li et al. [100], Hu et al. [73],
Pan et al. [130], Baranchuk et al. [15], Yang and Wang [173].

[68] suggest fixing the covariance Σθ (xt , t) to a constant discrete timesteps. Here, the diffusion process is expressed
value, which enables rewriting the parametrized reverse as a continuous time-dependent function σ(t). Noise is
mean as a function of the added noise ϵ(xt , t) instead of gradually added whenever a sample x moves forward
x0 : in time, and gradually removed if the image follows the
1 1 − αt reverse trajectory. More specifically, the diffusion process
µθ (xt , t) = √ xt − √ ϵθ (xt , t). (6)
αt 1 − ᾱt can be expressed using an Itô Stochastic Differential Equa-
tion (SDE) [83], where the vector-valued drift coefficient
This reparametrization allows for the derivation of a
simplification of the objective Lvlb which we denote Lsimple
f (·, t) : Rd → Rd and the scalar-valued diffusion coefficient
that measures the distance between the predicted noise
g(·) : R → R need to be selected when implementing a
diffusion model:
ϵθ (xt , t) and the actual noise ϵt as follows:
2 dx = f (x, t)dt + g(t)dw, (9)
Lsimple = Et∼[1,T ] Ex0 ∼p(x0 ) Eϵt ∼N (0,I) ∥ϵt − ϵθ (xt , t)∥ .
(7) where w is the standard Wiener process. There are two
Instead of predicting the mean and covariance directly, widely used choices of the SDE formulation used to model
the network is now parametrized to predict the added noise the diffusion process. The first is the Variance-Preserving
for a diffusion timestep and noisy image input. The reverse (VP) SDE, used in the work of Ho et al. p [68] which is
mean is obtained using Equation 6, and the covariance is given by f (x, t) = − 21 β(t)x and g(t) = β(t), where
fixed. Noise prediction networks have the benefit of being β(t) = βt as T goes to infinity. Note that this is equivalent to
able to recover xt−1 from xt in the final sampling stages the continuous formulation of the DDPM parametrization
by predicting zero noise [79]. This is more difficult for direct in Equation 1. The second is the Variance-Exploding (VE)
parametrizations of x̂0 . There is therefore a tradeoff between
q resulting from a choice of f (x, t) = 0 and
SDE [153],
the two, where direct parametrizations can be more bene-
g(t) = 2σ(t) dσ(t) dt . The VE SDE gets its name since the
ficial for very noisy inputs in the initial sampling stages,
variance continually increases with increasing t, whereas the
and noise prediction parametrization can be beneficial in
variance in the VP SDE is bounded [154]. Anderson [7] de-
the latter sampling stages [27].
rives an SDE that reverses a diffusion process, which results
In efforts to improve sampling efficiency, Salimans and
in the following when applied to the Variance Exploding
Ho [143] introduce velocity prediction as a further alterna-
SDE:
tive parametrization. Velocity is a linear combination of the r
denoised input and the added noise, commonly defined as: dσ(t) dσ(t)
dx = −2σ(t) ∇x log p(x; σ(t)) dt + 2σ(t) dw.
v = ᾱt ϵ − (1 − ᾱt )xt . (8) dt dt
(10)
This parametrization combines benefits of both data and ∇x log p(x; σ(t)) is known as the score function. This
noise parametrizations, allowing the denoising network to score function is generally not known, so it needs to be
flexibly learn noise prediction as well as reconstruction dy- approximated using a neural network. A neural network
namics based on the signal-to-noise ratio. This parametriza- D(x; σ) that minimizes the L2-denoising error can be used
tion has led to stable results in diffusion distillation ap- to extract the score function since ∇x log p(x; σ(t)) =
D(x;σ)−x
proaches [143], and can speed up generation [19]. σ2 . This idea is known as Denoising Score Matching
Recently, several works [32, 133, 153, 154] further pro- [161].
pose to think of the noise in terms of continuous instead of
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 4

Fig. 3. Left: An exemplary visualization of the U-Net architecture [140]. Consists of an encoder and a decoder, with residual connections that
preserve gradient flow and low-level input details. Adapted from [135]. Right: An exemplary visualization of the DiT architecture. Shows the high-
level architecture, as well as a breakdown of the adaLN-Zero DiT block. Adapted from Peebles and Xie [132].

2.2 Backbone Architectures TABLE 1

An overview of different diffusion model guidance approaches.
We outline the mathematical foundations of diffusion mod- Self-guidance [75] and [73] are both classifier and annotation-free, and
els in Section 2.1. Since denoising prediction networks are online guidance facilitates online learning.
generally parametrized by parameters θ, we discuss the
formulation of θ by several neural network architectures Approach Classifier-Free Annotation-Free Online Learning
in the following section. All of these network architectures Classifier Guidance [42] ✗ ✗ ✗
Classifier-free Guidance [67] ✓ ✗ ✗
map from the same input space to the same output space. Self-guidance [75, 100] ✓ ✓ ✗
Ho et al. [68] use a U-Net backbone similar to an un- Online guidance [73] ✓ ✓ ✓

masked PixelCNN++ [144] to approximate the score func-

tion. This U-Net architecture, originally used in semantic
segmentation approaches [30, 31, 113, 140], is based on into a sequence of tokens using a ”patchify” layer. After
a Wide ResNet [182] and takes a noisy image and the adding ViT-style positional embeddings to all input tokens,
diffusion timestep t as input, encodes the image to a lower- the tokens are fed through a series of transformer blocks.
dimensional representation, and outputs the noise predic- These blocks are equivalent to standard ViT blocks that
tion for that image and noise level. The U-Net consists take additional conditional information such as the diffusion
of an encoder and a decoder with residual connections timestep t and a conditioning signal c as inputs. A detailed
between blocks that preserve gradient flow and help re- overview of their structure can be seen in Fig 3.
cover fine-grained details lost in the compressed represen- U-ViTs [12] combine the U-Net and ViT backbones into a
tation. The encoder consists of a series of residual and self- unified backbone. U-ViTs follow the design methodology
attention blocks and downsamples the input image to a of transformers in tokenizing time, conditioning and im-
low-dimensional representation. The decoder mirrors this age inputs, but additionally employ long skip connections
structure, gradually upsampling the low-dimensional rep- between shallow and deep layers. These skip connections
resentation to match the input dimensionality. The diffusion provide shortcuts for low-level features and therefore sta-
timestep t is specified by adding a sinusoidal positional em- bilize training of the denoising network [12]. Works utiliz-
bedding in each residual block [68] that scales and shifts the ing U-ViT-based backbones [13, 72] achieve results on par
input features, enhancing the network’s ability to capture with U-Net CNN-based architectures, demonstrating their
temporal dependencies. potential as a viable alternative to other denoising network
DDPMs operate in the pixel space, making their training backbones.
and inference computationally expensive. Rombach et al.
[138] address this by proposing Latent Diffusion Models 2.3 Diffusion Model Guidance
(LDMs), which operate in the latent space of a pre-trained Recent improvements in image generation results have
variational autoencoder. The diffusion process is applied to largely been driven by improved guidance approaches. The
the generated representation as opposed to the image di- ability to control generation by passing user-defined con-
rectly, leading to computational benefits without sacrificing ditions is an important property of generative models, and
generation quality. While the authors introduce additional guidance describes the modulation of the strength of the
cross-attention mechanisms to allow for more flexible condi- conditioning signal within the model. Conditioning signals
tioned generation, the denoising network backbone remains can have a wide range of modalities, ranging from class
very close to the DDPM U-Net architecture. labels, to text embeddings to other images. A simple method
Recent advances in the use of transformer architectures to pass spatial conditioning signals to diffusion models
for vision tasks like ViT [45] have led to the adoption of is to simply concatenate the conditioning signal with the
transformer-based architectures for diffusion models. Pee- denoising targets and then pass the signal through the
bles and Xie [132] propose Diffusion Transformers (DiT), denoising network [12, 75]. Another effective approach uses
a diffusion model backbone architecture that is largely in- cross-attention mechanisms, where a conditioning signal c
spired by ViTs, and demonstrates state-of-the-art generation is preprocessed by an encoder to an intermediate projection
performance on ImageNet when combined with the LDM E(c), and then injected into the intermediate layer of the
framework. Following ViT, DiTs work by transforming in- denoising network using cross-attention [76, 142]. These
put images into a sequence of patches, which are converted conditioning approaches alone do not leave the possibility
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 5

to regulate the strength of the conditioning signal within the samples x that minimize a loss function L(x) and are likely
model. Diffusion model guidance has recently emerged as under the model’s distribution p. The optimization is formu-
an approach to more precisely trade-off generation quality lated as minx0 L(x), where x0 is the source noise point. The
and diversity. loss function L(x) can be modified for conditional sampling
Dhariwal and Nichol [42] use classifier guidance, a to generate a sample belonging to a particular class y.
compute-efficient method leveraging a pre-trained noise-
robust classifier to improve sample quality. Classifier guid-
3 M ETHODS
ance is based on the observation that a pre-trained diffusion
model can be conditioned using the gradients of a classifier Having covered the main preliminaries for diffusion mod-
parametrized by ϕ outputting pϕ (c|xt , t). The gradients of els, we outline a series of methods related to diffusion
the log-likelihood of this classifier ∇xt log pϕ (c|xt , t) can be models and representation learning in the following sec-
used to guide the diffusion process towards generating an tion. In subsection 3.1 we describe and categorize current
image belonging to class label y. The score estimator for frameworks utilizing representations learned by pre-trained
p(x|c) can be written as diffusion models for downstream recognition tasks. In sub-
section 3.2, we describe methods that leverage advances in
∇xt log (pθ (xt )pϕ (c|xt )) = ∇xt log pθ (xt )+∇xt log pϕ (c|xt ). representation learning to improve diffusion models them-
(11) selves.
By using Bayes’ theorem, the noise prediction network
can then be rewritten to estimate:
3.1 Diffusion Models for Representation Learning
ϵ̂θ (xt , c) = ϵθ (xt , c) − wσt ∇xt log pϕ (c|xt ), (12) Learning useful representations is one of the main moti-
where the parameter w modulates the strength of the vations for designing architectures like VAEs [88, 89] and
conditioning signal. Classifier guidance is a versatile ap- GANs [22, 84]. Contrastive learning approaches, where the
proach that increases sample quality, but it is heavily reliant goal is to learn a feature space in which representations of
on the availability of a noise-robust pre-trained classifier, similar images are very close together, and vice versa for
which in turn relies on the availability of annotated data, dissimilar images (e.g. SimCLR [34], MoCo [60]), have also
which is not available in many applications. led to significant advances in representation learning. These
To address this limitation, Classifier-free guidance contrastive methods are not fully self-supervised however,
(CFG) [67] eliminates the need for a pre-trained classifier. since they require supervision in the form of augmentations
CFG works by training an unconditional diffusion model that preserve the original content of the image.
parametrized by ϵθ (xt , t, ϕ) together with a conditional Diffusion models offer a promising alternative to these
model parametrized by ϵθ (xt , t, c). For the unconditional approaches. While diffusion models are primarily designed
model, a null input token ϕ is used as a conditioning signal for generation tasks, the denoising process encourages the
c. The network is trained by randomly dropping out the learning of semantic image representations [15], that can
conditioning signal with probability puncond . Sampling is be used for downstream recognition tasks. The diffusion
then performed using a weighted combination of condi- model learning process is similar to the learning process of
tional and unconditional score estimates: Denoising Autoencoders (DAE) [18, 162], which are trained
to reconstruct images corrupted by adding noise. The main
ϵ̃θ (xt , c) = (1 + w)ϵθ (xt , c) − wϵθ (xt , ϕ). (13) difference is that diffusion models additionally take the
diffusion timestep t as input, and can thus be viewed as
This sampling method does not rely on the gradients of a
multi-level DAEs with different noise scales [169]. Since
pre-trained classifier but still requires an annotated dataset
DAEs learn meaningful representations in the compressed
to train the conditional denoising network. Fully uncondi-
latent space, it is intuitive that diffusion models exhibit
tional approaches have yet to match classifier-free guidance,
similar representation learning capabilities. We outline and
though recent works using diffusion model representations
discuss current approaches in the following section.
for self-supervised guidance show promise [73, 100]. These
methods do not need annotated data, allowing the use of
3.1.1 Leveraging intermediate activations
larger unlabelled datasets.
Table 1 shows the requirements of current guidance Baranchuk et al. [15] investigate the intermediate activations
methods. While classifier and classifier-free guidance im- from the U-Net network that approximates the Markov
prove generation results, they require annotated training step of the reverse diffusion process in DDPMs [42]. They
data. Self-guidance and online guidance are fully self- show that for certain diffusion timesteps, these intermediate
supervised alternatives that achieve competitive perfor- activations capture semantic information that can be used
mance without annotations. for downstream semantic segmentation. The authors take
Classifier and classifier-free guidance are controlled gen- a noise-predictor network ϵθ (xt , t) trained on the LSUN-
eration methods that rely on conditional training. Training- Horse [177] and FFHQ-256 [84] datasets and extract feature
free approaches modify the generation process of a pre- maps produced by one of the network’s 18 decoder blocks
trained model by binding multiple diffusion processes [14] for label-efficient downstream segmentation tasks. Selecting
or using time-independent energy functions [179]. Other the ideal diffusion timestep and decoder block activation
controlled generation methods take a variational perspec- to extract is non-trivial. To understand the efficacy of pixel-
tive [54, 119, 146, 164], treating controlled generation as a level representations of different decoder blocks, the authors
source point optimization problem [17]. The goal is to find train a multi-layer perceptron (MLP) to predict the semantic
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 6

TABLE 2
Summary of the methods using diffusion models for representation learning.

Paradigm Downstream Task Method

Generative Augmentation [10]
Classification
MA-ZSC [150]
Generative Augmentation
Semantic Segmentation ScribbleGen [148]
GDC [125]
Classification DifFormer [126]
DDAE [169]
DDPM-Seg [15]
Semantic Segmentation
VDM [187]
Panoptic Segmentation ODISE [170]

Leveraging Intermediate Activations DIFT [157]

SD+DINO [183]
Semantic Correspondence Diffusion Hyperfeatures [116]
SD4Match [103]
USCSD [62]
Depth Estimation VDM [187]
P2PCAC [65]
Image Editing
Plug-and-Play Diffusion Features [160]
SODA [82]
Classification l-DAE [35]
DiffMAE [166]
Semantic Segmentation MDM [130]
Diffusion Model Reconstruction
DiffAE [134]
Image Editing
PDAE [186]
InfoDiffusion [165]
Image Interpolation
SmoothDiffusion [58]
DiffusionClassifier [95]
Diffusion Model Knowledge Transfer Classification RepFusion [173]
DreamTeacher [96]
JDM [40]
Classification
Joint Diffusion Models HybViT [174]
Semantic Segmentation ADDP [158]

label from features produced by different decoder blocks mark comparisons against diffusion-based methods like Hy-
on a specific diffusion step t. The representations from a bViT [174] and SBGC [190] on CIFAR-10 and Tiny-ImageNet
fixed set of blocks B of the pre-trained U-Net decoder and [41] show that EDM-based Denoising Diffusion Autoen-
higher diffusion timesteps are upsampled to the image size coders (DDAEs) outperform previous supervised and un-
using bilinear interpolation and concatenated. The obtained supervised diffusion-based methods on both generation
feature vectors are then used to train an ensemble of in- and recognition, especially after fine-tuning. Benchmarking
dependent MLPs which predict a semantic label for each against contrastive learning methods shows that the EDM-
pixel. The final prediction is obtained by majority voting. based DDAE is comparable with Sim-CLRs considering
This method, denoted DDPM-Seg, outperforms baselines model sizes, and outperforms SimCLRs with comparable
that exploit alternative generative models and achieves seg- parameters on CIFAR-10 and Tiny-ImageNet.
mentation results competitive with MAE [61], illustrating ODISE [170] is a related approach that unites text-
that intermediate denoising network activations contain to-image diffusion models with discriminative models to
semantic image features. perform panoptic segmentation [90, 91], a segmentation
Xiang et al. [169] extend this approach to further ar- approach unifying instance and semantic segmentation into
chitectures and image recognition on CIFAR-10 and Tiny- a common framework for comprehensive scene understand-
ImageNet. They investigate the discriminative efficacy of ing. ODISE extracts the internal features of a pre-trained
extracted features for different backbones (U-Net and DiT text-to-image diffusion model. These features are input to
[132]) under different frameworks (DDPM and EDM [85]). a mask generator trained on annotated masks. A mask
The relationship between feature quality and layer-noise classification module then categorizes each generated bi-
combinations is evaluated through grid search, where the nary mask into an open vocabulary category by relating the
quality of feature representations is determined using linear predicted mask’s diffusion features with text embeddings of
probing. The best-performing features lie in the middle of object category names. The authors use the Stable Diffusion
up-sampling using relatively small noising levels, which is U-Net DDPM backbone and extract features by computing
in line with conclusions drawn in DDPM-Seg [15]. Bench- a single forward pass and extracting the intermediate ac-
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 7

tivations f = UNet(xt , τ (s), t) where τ (s) is an encoded p2 can be semantic (pixels that contain similar semantics),
representation of the image caption s obtained leveraging a geometrical (pixels that contain different views of an object)
pre-trained text encoder τ . Interestingly, the authors obtain or temporal (pixels that contain the same object deforming
the best results using t = 0, whereas previous methods over time). DIFT (Diffusion Features) [157] is an approach
obtain better results using higher diffusion timesteps. To leveraging pre-trained diffusion model representations for
overcome reliance on available image captions, Xu et al. correspondence tasks. DIFT also relies on extracting dif-
[170] additionally train an MLP-based implicit captioner fusion model features. Similarly to previous approaches,
that computes an implicit text embedding from the image diffusion timestep and network layer numbers used for
itself. ODISE establishes a new state-of-the-art in open- extraction are an important consideration. The authors ob-
vocabulary segmentation and is a further example of the serve more semantically meaningful features for large dif-
rich semantic representations learned by denoising diffusion fusion timesteps and earlier network layer combinations,
models. whereas lower-level features are captured in smaller dif-
Mukhopadhyay et al. [125] also propose leveraging in- fusion timesteps and later denoising network layers. DIFT
termediate activations from the unconditional ADM U-Net is shown to outperform other self-supervised and weakly-
architecture [42] for ImageNet classification. The method- supervised methods across a range of correspondence tasks,
ology for layer and timestep selection is similar to previ- showing on-par performance with state-of-the-art methods
ous approaches. Additionally, the impact of different sizes on semantic correspondence specifically.
for feature map pooling is evaluated and several different Zhang et al. [183] evaluate how learned diffusion fea-
lightweight architectures for classification (including linear, tures relate across multiple images, instead of focusing on
MLP, CNN, and attention-based classification heads) are downstream tasks for single images. To investigate this, they
used. Feature quality is found to be mostly insensitive to employ Stable Diffusion features for semantic correspon-
pooling size, and is mostly dependent on time steps and dence as well. The authors observe that Stable Diffusion
the selected block number. Their approach, which we term features have a strong sense of spatial layout, but some-
guided diffusion classification (GDC), achieves competitive times provide inaccurate semantic matches. DINOv2 [128],
performance against other unified models, namely BigBi- a method for self-supervised representation learning using
GAN [44] and MAGE [99]. The attention-based classification knowledge distillation and vision transformers, produces
heads perform best on ImageNet-50, but perform poorly on more sparse features that provide more accurate matches.
Fine-Grained Visual Classification datasets, indicating their Zhang et al. [183] therefore propose to combine the two fea-
reliance on a large amount of available data. tures and employ zero-shot evaluation of nearest neighbor
In a continuation of their previous work, Mukhopad- search on the combined features to achieve state-of-the-art
hyay et al. [126] extend this approach by introducing two performance on several semantic correspondence datasets
methods for more fine-grained block and denoising time like SPair-71k and TSS.
step selection. The first is DifFormer [126], an attention SD4Match [103] builds on this approach by using various
mechanism replacing the fixed pooling and linear classi- prompt tuning and conditioning techniques. One method,
fication head from [125] with an attention-based feature SD4Match-Class, fine-tunes prompt embedding Θ for each
fusion head. This fusion head is designed to replace the semantic class using a semantic matching loss [102]. Given
fixed flattening and pooling operation required to generate images IA B
t and It , the Stable Diffusion U-Net f (·) extracts
vector feature representations from the U-Net CNN used feature maps Ft and FB
A
t by Ft = f (It , t, Θ). Correspon-
in the GDC approach with a learnable pooling mechanism. dence points are predicted by normalizing feature maps and
The second mechanism is DifFeed [126], a dynamic feedback computing a correlation map, which is converted to a prob-
mechanism that decouples the feature extraction process ability distribution using a softmax operation. Additionally,
into two forward passes. In the first forward pass, only Li et al. [103] propose conditioning prompts on input im-
the selected decoder feature maps are stored. These are ages using a Conditional Prompting Module (CPM), which
fed to an auxiliary feedback network that learns to map includes a DINOv2 feature extractor, linear layers, and an
decoder features to a feature space suitable for adding them adaptive MaxPooling layer. The conditioning embedding
to the encoder blocks of corresponding blocks. In the second Θcond is formed by concatenating feature representations
forward pass, the feedback features are added to the encoder and projecting them to the prompt embedding dimension.
features, and the DifFeed attention head is used on top The final prompt ΘAB is obtained by appending Θcond to
of those second forward pass features. These additional a global prompt Θglobal . This method sets new benchmark
improvements further increase the quality of learned rep- accuracies on SPair-71k [122], PF-Willow, and PF-Pascal [59],
resentations and improve ImageNet and fine-grained visual surpassing methods like DIFT [157] and SD+DINO [183]
classification performance. Luo et al. [116] introduce Diffusion Hyperfeatures, a
The previously described diffusion representation learn- framework designed to consolidate multiple intermediate
ing methods focus on segmentation and classification, activation maps across diffusion timesteps for downstream
which are only a subset of downstream recognition tasks. recognition. Activations are consolidated using an inter-
Correspondence tasks are another subset that generally in- pretable aggregation network, that takes the collection of
volves identifying and matching points or features between intermediate feature maps as input and produces a single
different images. The problem setting is as follows: Consider feature descriptive feature map as output. While other
two images I1 and I2 and a pixel location p1 in I1 . A approaches manually select fixed diffusion timesteps and
correspondence task involves finding the corresponding activations from a pre-determined number of intermediate
pixel location p2 in I2 . The relationship between p1 and network layers, Diffusion Hyperparameters cache all feature
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 8

maps across all layers and timesteps in the diffusion process when feeding the denoising network the latent represen-
to generate a dense set of activations. This high dimensional tation of the input image generated by using the VQGAN
set of activations is upsampled, passed through a bottleneck encoder [47] to obtain feature maps F . Finally, pϕ3 (y|F)
layer B and weighed with a unique learnable mixing weight serves as a light-weight prediction head implemented as
wl,s for each layer and timestep combination. The final a semantic feature pyramid network [90] that is adapted
diffusion hyperfeatures take on the form to the downstream task. VDM is evaluated on semantic
S X
L
segmentation and depth estimation, and achieves highly
competitive performance and fast convergence compared to
X
wl,s Bl (rl,s ), (14)
s=0 l=1
methods with other pre-training paradigms.
A more indirect application of text-to-image diffusion
where L is the number of layers, S is a subsample model representations is instructional image editing [23, 51,
of the number of diffusion timesteps and r is an activa- 98], where the desired image edit is described by a natural
tion feature map. Bottleneck layers and mixing weights language instruction rather than a description of the desired
are finetuned on the specific downstream task. Similar to new image [81]. Prompt-based image editing is challenging
previous approaches, Diffusion Hyperfeatures is used for since small changes in the textual prompt can lead to vastly
semantic correspondence. The authors extract activations different generation outcomes. [65] propose a textual editing
from Stable-Diffusion and tune the aggregation network on method for pre-trained text-conditioned diffusion models
a subset of SPair-71k. Diffusion Hyperfeatures outperforms that leverages the semantic strength of the intermediate
models that use self-supervised descriptors or supervised cross-attention layers in the denoising backbone. This ap-
hypercolumns on the SPair-71k and CUB datasets. proach is based on a key observation also employed in
Hedlin et al. [62] focus on optimizing the prompt embed- [62, 187]: Cross-attention maps contain rich information on
dings by exploiting intermediate attention maps specifically. the spatial layout and geometry of the generated image. In-
Given a certain input text prompt, these attention activation jecting the cross-attention layers obtained when generating
maps correspond to the semantics of the prompt. Instead an image I into the generation process of the edited image
of optimizing a global or a class-dependent prompt embed- I ∗ ensures that the edited image preserves the original
ding Θ using the semantic loss, Hedlin et al. [62] optimize spatial layout. Hertz et al. [65] use Imagen [141] to conduct
the embedding to maximize the cross-attention at the loca- experiments and demonstrate promising results on text-only
tion of interest. Locating corresponding points in a second localized editing, global editing, and real image editing. Fol-
image then comes down to conditioning on the optimized lowing works like Plug-and-play Diffusion Features [160]
prompt, and selecting the point with the pixel attaining further improve upon this by leveraging all intermediate
the maximum attention map value within the target image. activation maps to enable instructional image editing. Other
Note that this approach does not utilize supervised training techniques like TokenFlow [52] and work by Yatim et al.
specific to semantic correspondence. However, they require [175] have extended this idea to the video space, using
test-time optimization which is costly. Text prompts are diffusion features to enable prompt-based video editing
optimized using an off-the-shelf diffusion model without text-driven motion transfer.
fine-tuning. Several further works building on aforemen-
tioned approaches [120, 184] exist, showing that exploiting 3.1.2 A general representation extraction framework
pre-trained diffusion models for semantic correspondence
Many of the methods outlined in the previous section follow
remains a promising application of diffusion models.
a similar procedure in leveraging learned representations of
Zhao et al. [187] propose Visual Perception with a pre-
pre-trained diffusion models for downstream vision tasks.
trained Diffusion Model (VDM), a framework closely re-
In this section, we aim to consolidate these approaches to a
lated to USCSD [62] that employs a text feature refinement
common three-step framework. We do this to provide clarity
network as well as an additional recognition encoder for
on the relationship between diffusion models and their use
semantic segmentation and depth estimation. Here, the de-
for downstream predictive tasks. To leverage intermediate
noising network is fed with refined text representations as
activations for downstream tasks, a selection methodology
well as an input image, and the resulting feature maps as
that outputs the ideal diffusion timestep input as well as the
well as the cross-attention maps between the text and image
intermediate layer number(s) whose activation maps have
features are used to provide guidance for a decoder. To
the highest predictive performance when upsampled and
achieve this, the prediction model is written as pϕ (y|x, S),
linearly probed must be applied. This can be a trainable
where S represents the set of all category labels of the
model [116], a grid search procedure [169] or a learning
downstream task. The prediction model is implemented as
agent [173]. The goal of this methodology is generally to
the following:
select timestep t ∈ T and a set of decoder block numbers
pϕ (y|x, S) = pϕ3 (y|F)pϕ2 (F|x, C)pϕ1 (C|S), (15) B that maximize predictive performance on a downstream
task. Given a set of possible timesteps T and a set of decoder
where F denotes the set of feature maps and C denotes blocks B , the goal is to find:
the text features. Here, pϕ1 (C|S) denotes a text adapter con-
sisting of a two-layer MLP that refines the text features ob- (t∗ , B ∗ ) = arg min Ldiscr (t, B) (16)
t∈T,B⊆B
tained by applying the CLIP text encoder to a text template
of "a photo of a [CLS]". pϕ2 (F|x) extracts the feature where Ldiscr (t, B) represents the discriminative loss at
maps from the denoising network given the input image x timestep t when the blocks in B are used for downstream
and the set of refined text features C . The authors use t = 0 prediction. Generally, discriminative tasks will require more
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 9

Fig. 4. A high-level overview of a framework for extracting representations from pre-trained diffusion models for downstream tasks.

high-level features corresponding to structural elements and representation and z is the student model representation.
shapes, whereas generative tasks mapping random noise to The distance between the two is minimized during training
images will require the computation of lower-level features. using a loss function Lkd . After the distillation, the student
The ideal intermediate layer number as well as the optimal network is reapplied as a feature extractor and fine-tuned
diffusion timestep will largely depend on the exact down- on the available task labels. Previous approaches for using
stream prediction task, the dataset, and the architecture of diffusion model representations rely on grid-search to deter-
the diffusion model used. mine which diffusion timestep to use for feature extraction.
Once the ideal timestep and layer number are deter- Here, the authors formulate a reinforcement learning envi-
mined, an input image and the selected diffusion timestep ronment where the action space is the set of all possible
are passed to the diffusion model, and the intermediate acti- timesteps t available for selection, and the reward function
vations in the selected decoder blocks computed in the for- is the negative task loss −Ltask (y, g(z(t) ; θg )). Given the
ward pass are extracted and generally concatenated and pre- input x, a policy network πθπ (t|x) is trained to determine
processed depending on the downstream task (e.g. through which timestep t to use for representation extraction. Once
upsampling, pooling, etc.). Finally, a classification head is the timestep is selected, the authors use the feature represen-
trained on the annotated dataset, taking the preprocessed tations in the mid-block of the DPM for the selected timestep
∗
features extracted from the diffusion model as input. This t∗ to obtain z(t ) . After the distillation phase, the student
classification head can be an MLP, a CNN, or an attention- network is used as a feature extractor and subsequently fine-
based network depending on the availability of labeled data tuned on the task label y.
and predictive performance on the dataset. The diffusion Li et al. [96] introduce DreamTeacher, a knowledge distil-
model weights are usually frozen in this probing process, lation method using a feature regressor module that distills
but additional fine-tuning regimes can increase discrimina- the learned representations of a generative model G into a
tive performance for certain datasets and architectures (see target image recognition backbone f . Given a feature dataset
e.g., Xiang et al. [169]). Fig. 4 shows an overview of the D = {xi , fig }N i=1 consisting of images x and extracted
g g
generalized framework. features fi , f is trained by distilling fi into the intermediate
features of f (xi ). The features are extracted from G by
3.1.3 Knowledge transfer running a forward diffusion process for T timesteps and
g
Aside from leveraging intermediate activations from pre- conducting a single denoising step to extract fi from the
trained diffusion models directly as inputs to a recognition intermediate layers of the U-Net backbone. The extracted
network, several recent approaches propose a more indirect features are distilled using a feature regressor module with
method of reusing learned representations for downstream a top-down architecture containing lateral skip connections
tasks. We summarize these under the term knowledge transfer that aligns the image backbone features with the generative
methods. This reflects the common idea of distilling repre- features. Intermediate CNN encoder features fle at layers
sentations from pre-trained diffusion models and then trans- l and regressor outputs flr are used to compute an MSE
ferring them to auxiliary networks in a way that is distinct feature regression loss inspired by FitNet [139]:
from simply providing aggregated feature activation maps L
as input. Several of these approaches are discussed in the 1X r 2
LMSE = ∥fl − W(fle )∥2 (17)
following section. L
l=1
Yang and Wang [173] propose RepFusion, a knowledge
W is a non-learnable operator implemented as Lay-
distillation approach that dynamically extracts intermediate
erNorm [11]. This loss is combined with the activation-
representations at different time steps using a reinforcement
based Attention Transfer (AT) objective [181], which distills
learning framework, and uses the extracted representations
a one dimensional ”attention map” for each spatial feature.
as auxiliary supervision for student networks. Given an
DreamTeacher is evaluated on a range of downstream recog-
input x with label y, the authors extract a pair of features,
nition tasks by fine-tuning the pre-trained backbone with
one from the diffusion probabilistic model (DPM) and one
additional classification heads for each task. DreamTeacher
from the student model, where z(t) is the diffusion model
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 10

outperforms existing contrastive and masking-based self- noise scales. This insight is also applied in DiffAE [134],
supervised methods on the COCO [106], ADE20k [189] and which uses diffusion models for representation learning
BDD100K [178] benchmarks. via autoencoding. Preechakul et al. [134] separate latent
Both RepFusion and DreamTeacher are inspired by ear- representations into a compact semantic representation and
lier works on knowledge distillation [66, 139]. Li et al. [95] a stochastic representation. DiffAE consists of a semantic
propose a slightly different knowledge transfer approach: encoder, that generates a semantic representation zsem , as
Diffusion Classifier, a method for zero-shot classification well as a conditional DDIM [152]. This DDIM acts both
that leverages conditional density estimates from text-to- as the stochastic encoder, which maps x0 to xT , and as
image diffusion models. This classifier converts the diffu- the decoder, which maps xT to x0 . xT represents the
sion model into a classifier by computing class conditional stochastic representation and captures low-level variation,
likelihoods pθ (x|ci ) and using Bayes’ theorem to obtain whereas zsem encodes higher-level semantics. During infer-
predicted class probabilities p(ci |x). Since direct computa- ence, [134] fit a second latent DDIM to zsem , and sample
tion of pθ (x|ci ) is intractable, they use the Evidence Lower from this DDIM and xT to facilitate unconditional sampling.
Bound (ELBO) in its place. The classifier is derived by Variations in xT with fixed zsem result in minor changes
adding noise repeatedly and estimating noise reconstruction in generated images, while varying z leads to different
losses for each class using Monte Carlo methods. While reconstructions, showing DiffAE’s efficiency in generating
Diffusion Classifier suffers from high inference time, it gen- semantically meaningful and decodable representations. In-
erally outperforms DDPM-Seg Baranchuk et al. [15] on most foDiffusion [165] extends DiffAE, supporting custom priors
datasets and is competitive with CLIP ResNet-50 [136] and and improving latent representations zsem via mutual infor-
OpenCLIP ViT-H/14 [36]. mation regularization.
Zhang et al. [186] observe that there is a gap between
3.1.4 Reconstructing diffusion models the true and the predicted posterior mean of xt−1 when
Previous diffusion representation learning techniques do predicting from xt in the diffusion reverse process. Clas-
not propose making fundamental modifications to diffu- sifier guidance can be viewed as reconstructing informa-
sion model architectures and training methodologies. While tion lost in the diffusion forward process by shifting the
these techniques often show encouraging performance for posterior mean to fill that gap. They propose Pre-trained
downstream tasks, they fail to generate deep insights into DPM AutoEncoding (PDAE), a method for adapting DPMs
the architectural components and techniques required to to decoders for image reconstruction. Instead of using a
learn useful representations. It remains largely unclear for class label y to fill this information gap, PDAE employs a
example whether the representation learning abilities of model to predict mean shift according to encoded repre-
diffusion models are driven by the diffusion process, or sentations z, ensuring that z contains as much information
by the model’s denoising capabilities. It is also unclear as possible from x0 . Specifically, Zhang et al. [186] employ
what architectural and optimization choices can improve an encoder Ephi (x0 ) = z along with a gradient estimator
diffusion models’ representation learning capabilities. Gψ (xt , z, t) that simulates ∇xt log(p(z|xt ) to modify the
Chen et al. [35] investigate these questions by decon- conditional DPM training objective. This modified objective
structing a denoising diffusion model (DDM), modifying forces the predicted mean shift to fill the aforementioned
individual model components to turn a DDM into a De- posterior mean gap. With a trained Gψ (xt , z, t), the score
noising Autoencoder. The deconstruction process consists of the implicit classifier p(z|xt ) can be used analogously to
of three stages. In the first stage, the DDM is reoriented classifier-guided sampling. PDAE is evaluated using similar
for self-supervised learning. This entails the removal of experiments as used in [134] and exhibits improved training
class conditioning and a reconstruction of the VQGAN efficiency and performance.
tokenizer [47] used in the DiT baseline. Both the perceptual Pan et al. [130] propose a different method for DDM
and adversarial loss terms rely on annotated data and are reconstruction. They introduce a masked diffusion model
thus removed. This essentially converts the VQGAN to a (MDM), designed for self-supervised semantic segmenta-
VAE. The second stage consists of simplifying the VAE tion. MDM substitutes the conventional diffusion process
tokenizer even further, replacing it with different autoen- with a masking mechanism inspired by the masked autoen-
coder variants. Surprisingly, the authors find that using coder [61]. The representations learned by the pre-trained
simpler autoencoder variants, like patch-wise PCA, does MDM are extracted following Baranchuk et al. [15]. The
not degrade performance substantially. The authors con- proposed MDM is a variant of a time-dependent denoising
clude that the dimensionality per token of the latent space autoencoder, that takes a masked input image and subse-
has a much larger impact on probing accuracy than the quently reconstructs the uncorrupted image. While other
chosen autoencoder. The final deconstruction step includes DDMs and MAE use an MSE reconstruction loss, Pan et al.
converting the DDM to predict the denoised input instead [130] propose using the structural similarity index (SSIM)
of the added noise and removing input scaling, as well as loss. This is done to narrow the gap between reconstruction
changing the diffusion model to operate directly in the pixel and subsequent segmentation tasks. MDM is pre-trained
space. This final stage results in what the authors call the on a set of unlabeled images using the described self-
latent Denoising Autoencoder (l-DAE). They conclude that supervised approach. The learned representations are then
representation learning abilities are largely driven by the extracted to train an MLP-based classification head on a
denoising-driven process rather than the diffusion process. smaller labeled dataset. Features based on specific block
l-DAE is inspired by the observation that diffusion setting B are extracted by selecting the activation maps from
models resemble hierarchical autoencoders with varying each of the specified blocks, upsampling activation maps
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 11

to match the image size, and concatenating the activations. combined loss L consisting of a standard cross-entropy loss
The method achieves state-of-the-art results against existing to train p(y|x) and the simple denoising loss to train p(x).
supervised segmentation methods on multiple benchmark HybViT provides stable training and outperforms previous
datasets even when only 10% of labels are available. Diff- hybrid models on both generative and discriminative tasks,
MAE [166] is a similar approach that uses a conditional but lags behind generative-only models in generation qual-
generative objective, where the distribution of the masked ity. HybViT also requires more training iterations to achieve
pixels xm v
0 conditioned on the visible pixels x0 is modeled, high classification performance, and the sampling speed
and diffusion is only applied to masked regions. during inference is slow.
Hudson et al. [82] introduce a novel view generation Joint Diffusion Models (JDM) [40] is a related work that
learning goal as well as a bottleneck layer to aid repre- produces meaningful representations across generative and
sentation learning. They present SODA, a self-supervised discriminative tasks. Using a U-Net backbone, JDM consists
diffusion model that consists of an encoder and a denoising of an encoder eν , a decoder dψ , and a classifier gω . The
decoder. The encoder produces a concise latent represen- encoder maps an input xt to feature vectors Zt = eν (xt ).
tation, which is used for denoising decoder guidance by The decoder reconstructs these into a denoised sample
modulation of the decoder activations. The encoder E(x) xt−1 = dψ (Zt ), and the classifier predicts the target class
converts an input view x into a compressed latent rep- ŷ = gω (Zt ). The combined training objective includes cross-
resentation z, which is used to generate a novel output entropy loss Lclass and the noise prediction network’s sim-
view x′ relating to the input x. x′ is created through a plified objective Lt,diff (ν, ψ), resulting in the following loss:
diffusion process conditioned on the latent representation T
z via feature modulation. In addition to this, the authors L(ν, ψ, ω) = Lclass (ν, ω)−L0 (ν, ψ)−
X
Lt,diff (ν, ψ)−LT (ν, ψ).
use layer modulation, where the latent representation is
t=2
partitioned, with each partition zi modulating a specific
pair of layer activations. This enables further specialization JDM also enables a simplification of classifier guidance.
among the latent subvectors, where some are optimized By applying the classifier to noisy images xt , the classifier
to capture finer levels of granularity than others. During is effectively augmented to be robust to noise. To guide the
training, Hudson et al. [82] opt to randomly zero out a generated sample towards a target label, representations Zt
subset of the latent subvectors, effectively implementing are optimized according to the classifier gradient, giving
a layer-wise generalization of classifier-free guidance. This Z′t = Zt − α∇zt log gω (y|Zt ). JDM achieves state-of-the-
further increases control over the generative process since art performance for joint models on CIFAR and CelebA
the trained model can then be conditioned using a curated datasets, outperforming HybViT.
subset of latent subvectors. Tian et al. [158] propose the Alternating Denoising Diffu-
SmoothDiffusion [58] is a work focusing on improving sion Process (ADDP). ADDP alternately denoises pixels and
the smoothness of the latent space of diffusion models, VQ tokens. Given an image x0 , a pre-trained VQ Encoder
which refers to the consistency of perturbations in the latent [26] maps time image to VQ tokens z0 . The alternating
and the image space. SmoothDiffusion enforces smoothdiffusion process masks regions of z0 with a Markov chain
ness over its latent space by proposing a novel step-wise according to diffusion timestep t, producing zt . Unreliable
variation regularization method in training. The resulting tokens z̄t are generated by a token predictor and fed into a
smoothed latents benefit a wide range of image interpola- VQ Decoder to synthesize xt , replacing the masked regions
tion, image inversion and image editing tasks. of z0 . A pixel-to-token generation network is then trained
to approximate the distribution of z̄t−1 . During sampling,
3.1.5 Joint diffusion models ADDP starts with a representation of pure unreliable tokens
Many current diffusion-based representation learning meth- z̄T and iteratively denoises the token sequence by predicting
ods focus on using the diffusion model’s latent variables to z̄t−1 . For recognition, the representations learned by the
benefit the training of a separate recognition network. These pixel-to-token generation network can be forwarded to dif-
frameworks are conceptually equivalent to constructing hy- ferent task-specific recognition heads. ADDP with the VQ-
brid models that solely concentrate on synthesis in the pre- GAN tokenizer [47] MAGE-Large [99] token predictor and
training stage, and on downstream recognition in the post- ViT-Large [45] pixel-to-token encoder, outperforms previous
training/fine-tuning phase. The recognition head and the unified models in image classification, object detection, se-
diffusion denoising network do not share a parametrization, mantic segmentation, and unconditional generation.
and the recognition head is often trained separately while
keeping the weights of the denoising network frozen. A 3.1.6 Generative augmentation
natural question that arises is whether this separation is nec- A lot of state-of-the-art representation learning methods
essary and whether approaches that optimize a generative [33, 55, 60] rely on a fixed set of data augmentations to define
and a discriminative objective simultaneously in a shared positive labels for learning representations. This approach
parametrization can improve representation learning. encourages encoders to learn to map the original and the
HybViT [174] is an approach that establishes a direct con- augmented image to similar embedding space representa-
nection between diffusion models and vision transformers tions [10]. These augmentations should not alter the seman-
by training a single hybrid model for both image classifica- tics of the image, and they should not render the image
tion and image generation. This hybrid model uses a shared unrealistic in a real-world setting. A set of standard trans-
parametrization for image classification and reconstruction. formations might not adequately capture the distribution
The authors use a ViT backbone to train a model with a of real-world data, raising the question of how to design
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 12

transformations that create diverse images and improve the downstream segmentation performance. Individual token
generalization of learned representations. attention maps of all layers are averaged and converted to
Ayromlou et al. [10] propose using latent diffusion mod- binary masks using an adaptive threshold mechanism based
els [138] to generate novel views of the original image that on an AffinityNet [4]. Additionally, a noise-learning module
preserve the semantic content, while closely following the prunes low-quality segmentation masks, and the authors
distribution of real images. This augmentation method is employ several prompt engineering and static image trans-
denoted by: formations to further enhance the diversity of the generated
( images and corresponding segmentation masks.
G(z; ϕ(x)) if p ≤ p0
T0 (x) = (18)
x otherwise, 3.2 Representation Learning for Diffusion Model Guid-
ance
where G denotes a conditional generative model taking
noise vector z ∼ N (0, I) and condition vector ϕ(x) as Despite the remarkable performance of generative mod-
inputs. ϕ is a pre-trained image encoder such as CLIP els, there exists a gap in quality between conditional and
[136], p ∈ [0, 1] is a random number and p0 is a hyper- unconditional image generation approaches [25]. This is
parameter specifying the probability of applying the aug- especially the case for GANs [53], which suffer from mode
mentation. Ayromlou et al. [10] show that using generative collapse when trained in a fully unsupervised setting [110].
augmentation leads to consistent improvements in learned Unconditional GANs often fail to accurately model multi-
representations over standard transformations across other modal distributions, e.g. not being able to generate all digits
representation learning techniques. for MNIST [110]. Class-conditional GANs [22] [123] miti-
Shipard et al. [150] take this approach one step further, gate this issue, but require labeled data. Recent approaches
using Stable Diffusion to generate a fully synthetic dataset to like self-conditioned GANs [110] and instance-conditioned
improve model-agnostic zero-shot classification (MA-ZSC). GANs [25] attempt to train conditional GANs without re-
They use Stable Diffusion, employing several variations of quiring labeled data, and are able to achieve competitive
prompts designed to increase the diversity of the synthetic generation results.
dataset. An image classifier is subsequently trained on this Diffusion models have since surpassed the image gen-
synthetic dataset, and zero-shot classification results on CI- eration capabilities of GANs [42], but suffer from a simi-
FAR10, CIFAR100, and EuroSAT [64] are evaluated. Shipard lar performance discrepancy between conditional and fully
et al. [150] observe substantial classification architecture- self-supervised approaches. Current state-of-the-art diffu-
agnostic improvements on the aforementioned datasets, sion models are conditional models that rely on guidance
achieving comparable performance to state-of-the-art zero- approaches that also require annotated data. Self-supervised
shot classification methods like CLIP. guidance approaches can leverage much larger unlabeled
Moving beyond classification, Schnell et al. [148] apply datasets for pre-training, and thus have the potential to tran-
similar ideas to scribble-supervised segmentation [104, 129], scend current image generation approaches. One intuitive
a weakly-supervised form of semantic segmentation that approach for leveraging representation learning to facilitate
uses sparse annotations in the form of scribbles drawn these guidance methods is to explore methods that assign
over the images. They introduce ScribbleGen, a diffusion labels to unlabeled data, e.g. through clustering and clas-
model conditioned on semantic scribbles that generates syn- sification approaches. We introduce several approaches in
thetic training images for data augmentation. ScribbleGen the following section. Fig. 5 shows a proposed taxonomy of
utilizes a ControlNet [185] denoising diffusion model for representation learning techniques for diffusion guidance.
noise prediction given xt and conditioning signal c. The 3.2.1 Assignment-based guidance
number of classes is denoted by different color scribbles in
Sheynin et al. [149] propose kNN-Diffusion, an efficient
RGB images, and the conditioning signal c is supplemented
text-to-image diffusion model trained without large-scale
by a text prompt stating all classes in the image. Schnell
image text pairings. To facilitate text-guided image gener-
et al. [148] trade-off photorealism and image diversity by
ation without paired text-image data, a shared text-image
introducing an encode ratio λ ∈ [0, 1]. This diffusion param-
encoder mapping text-image pairs into the same latent space
eter controls the number of noise-adding forward diffusion
is required. The authors use CLIP to achieve this, a pre-
steps, where λ = 1 leads to no change but λ < 1 leads
trained encoder trained using contrastive loss on a large-
to λ · T steps, meaning less noise is added to the input
scale text-image pair dataset. kNN-Diffusion leverages k -
image. The authors evaluate both a fixed and an adaptive
Nearest-Neighbors search to generate k embeddings from a
λ, where the encoding ratio is gradually increased to pro-
retrieval model. The retrieval model uses the input image
vide increasingly diverse synthetic images during training.
representation during training, and the text prompt rep-
ScribbleGen achieves state-of-the-art performance on the
resentation curing inference. This approach eliminates the
PASCAL VOC12 segmentation dataset [48] using scribbles
need for annotated data but still requires a pre-trained en-
from Scribblesup [104].
coder like CLIP, which in turn requires a large-scale dataset
DiffuMask [167] is another generative augmentation
of text-image embeddings for pre-training.
method designed to improve downstream semantic seg-
Blattmann et al. [20] propose retrieval-augmented dif-
mentation tasks. The idea here is to exploit cross-attention
fusion models (RDM), which equip diffusion models with
maps between text prompts and generated images to extend
an image database for composing new scenes based
image synthesis to semantic mask generation. Synthetically
on retrieved images. Inspired by advances in retrieval-
generated masks are used for data augmentation to improve
augmented NLP [21, 168], RDM enhances performance with
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13

Fig. 5. A hierarchical overview of current diffusion model training frameworks that leverage representation learning techniques for conditional
generation and guidance.

fewer parameters and computational resources. Despite be- trained feature extractor to generate feature representations
ing trained only on images, RDM allows conditional synthe- for clustering.
sis due to the shared image-text feature space of CLIP [136]. For this reason, Hu et al. [73] extend their work to
RDM includes a trainable conditional latent diffusion model propose an online feature clustering method using the
pθ , an external image database D, and a fixed sampling Sinkhorn-Knopp algorithm. This is challenging since the
(k)
strategy ξk that selects a subset MD of D based on a idea requires obtaining conditioning signals for clustering
query x. One strategy ξk (x, D) is to retrieve the k nearest during training from a diffusion model that is dependent on
neighbors using a distance function d(·, x). The retrieved this conditioning. This issue is solved by introducing a zero
data is processed through a frozen image encoder ϕ and vector into the conditional diffusion model for the signals
used to condition pθ . During training, ξk retrieves k nearest used to identify the clustering. For each image example, the
neighbors for a query image x using cosine similarity in conditional diffusion model conditioned on this zero vector
CLIP’s image feature space as the distance function d(x, y). undergoes a fully-connected feature prediction head used to
This approach ensures that retrieved image representations compute features that are mapped to a set of learnable pro-
are useful for generation tasks and allows for text condi- totypes denoted M . This method uses a combination of the
tioning due to CLIP’s shared feature space. The dataset diffusion training loss and a Sinkhorn-Knopp loss to achieve
D and retrieval strategy ξk can be changed at test time, guidance signals c that are based on clustering features us-
adding flexibility for different conditioning modalities and ing M . The promise of this method is high, with self-guided
adaptability to other data distributions. diffusion outperforming related unconditional generation
Hu et al. [75] propose a method also motivated by elim- baseline comparisons on ImageNet256 and LSUN-Churches
inating the need for annotated data. Self-guided diffusion while being competitive with class guidance methods that
is a framework encompassing a feature extraction function rely on ground truth labels. The online approach specifically
gϕ and a self-annotation function fψ . The feature extraction does not rely on ground truth labels or any external pre-
function is a self-supervised feature extractor that maps the trained models. Adaloglou et al. [2] build on the aforemen-
input data x ∈ D to a feature space H, where D denotes tioned cluster-based guidance approaches by utilizing EDM
the dataset. This feature representation is an input of fψ , [85], TEMI clustering [1] and a method for deriving an upper
which maps feature representation gϕ (x; D) to a guidance cluster bound for feature-based clustering.
signal k . This framework can be applied to achieve self- Other approaches to diffusion model guidance rely on
labeled guidance, where k is a one-hot embedding derived generating pseudo-labels for unlabeled data. You et al. [176]
using k -means clustering as the self-annotation function f propose dual pseudo training (DPT), which uses a classifier
on compacted features generated by gϕ . More fine-grained trained on limited labeled data to generate pseudo-labels.
spatial guidance is achieved by self-boxed guidance, which These are then used to condition a diffusion model to
uses a mapping from feature space H to a bounding box generate pseudo images, which are in turn used as data
as the self-annotation function f , as well as self-segmented augmentation to retrain a classifier on a mix of pseudo
guidance, which uses a mapping to a segmentation mask and real images. DPT involves three stages. First, a semi-
to generate guidance signals by clustering. Self-guidance supervised classifier is trained on partially labeled data to
significantly outperforms unconditional diffusion models, predict pseudo-labels ŷ for all images x ∈ X . Second,
and even outperforms classifier-free guided diffusion mod- a conditional generative model is trained on the dataset
els that use ground-truth annotations on image generation. S1 = {(x, y)|x ∈ X } with pseudo-labels. Finally, the classi-
This suggests that the clusters are potentially more aligned fier is retrained on real data that is augmented by the gener-
with the visual similarity of the images, and are better ated data. DPT achieves highly competitive performance on
guidance signals than ground-truth labels alone. While this ImageNet classification and generation with as little as five
approach is self-supervised, it still relies on an external pre- labels per class, outperforming several supervised diffusion
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 14

model benchmarks like ADM [42] and LDM [138]. models on a self-supervised representation distribution
mapped from the image distribution using a pre-trained en-
3.2.2 A generalized framework for assignment-based guid- coder. The idea is to train a Representation Diffusion Model
ance (RDM) on the representations generated by a pre-trained
Assignment-based guidance approaches all rely on assign- encoder to generate low-dimensional image representations.
ing annotation to inputs during training, which enables After this, a pixel generator conditioned on the representa-
controlled generation during inference when conditioning tion is trained to map noise distributions to image distri-
on this annotation. We therefore propose to formulate a butions. RCG consists of three main components. The first
generalized framework that encapsulates all assignment- is a pre-trained image encoder, which converts the original
based guidance approaches discussed here. This framework image distribution into a representation distribution. The
consists of three main components. The first is a self- authors propose using self-supervised contrastive learning
supervised image encoder E(x), that maps inputs to a low- methods (e.g. MoCo v3) for generating this representation
dimensional feature representation z. Using a multi-modal distribution. The second is a representation generator in the
feature extractor like CLIP has the advantage of enabling form of an RDM, which learns to generate representations
text-based as well as image-based conditioning, but other from Gaussian noise following the DDIM [152] sampling
feature extractors can be used, provided they generate se- process. The final component is a pixel generator that crafts
mantically meaningful image representations. image pixels conditioned on image representations. RCG
The second is a self-annotation function f (z), which uses can easily incorporate classifier-free guidance for uncondi-
the image representation to produce annotation c for input tional generation tasks, since the pixel generator is condi-
image x. In the simplest case, this self-annotation function tioned on self-supervised representations. RCG emerges as
is an external pre-trained image classifier that generates a highly promising method for bridging the gap between
pseudo-class labels from image representations, similar to conditional and unconditional image generation, outper-
the approach employed in DPT [176], where the external forming pre-existing unconditional generation approaches
classifier is subsequently re-trained on the conditionally on ImageNet, and exhibiting competitive performance with
generated images. In other cases, the self-annotation func- current state-of-the-art class-conditional approaches.
tion is a retrieval model, which uses a distance function d Readout Guidance (RG) [117] makes use of auxiliary
to retrieve images similar to the training image, and uses readout heads trained on top of a frozen diffusion model to
representations of the retrieved images for generating the extract properties of the generated image that can be used
guidance signal c. for guidance. These properties can include human pose,
The final component is a denoising network Dθ (xt , c, t), depth maps, edges, and even higher-order properties like
which takes the noisy image xt , the diffusion timestep t and similarity to another image. During sampling, the properties
the guidance signal c as input, and denoises the image. Dur- extracted by the readout heads can be compared to user-
ing inference, controlled generation is enabled by passing an defined control targets, and used in a methodology similar
initial guidance signal k (which can be multi-modal as long to classifier guidance [43] to guide generation.
as the embedding space of the encoder E is shared between Lin and Yang [105] identified a novel self-perceptual
modalities) through the encoder to generate representation objective that enhances diffusion models, enabling them to
z = E(k). The conditioning signal c is then generated by generate more realistic samples. Contrary to the conven-
passing z to the self-annotation function f where c = f (z). tional approach of training or employing an image encoder,
Passing xt , c and t to the denoising network Dθ now enables the authors demonstrate that a pre-trained diffusion model
synthesis of novel images semantically similar to the initial inherently functions as a perceptual network and can be
guidance signal k. used to generate perceptual representations. The perceptual
One of the main motivations behind the design of loss facilitates the model’s ability to generate more realistic
assignment-based guidance methods is the reliance on exist- images even with unconditional synthesis.
ing methods on labeled data. While it could be argued that Also inspired by the downsides of classifier guidance
the aforementioned assignment-based guidance approaches and classifier-free guidance, Hong et al. [69] introduce Self-
are indirectly reliant on annotated data through the pre- Attention Guidance (SAG). SAG adversarially blurs regions
trained image encoder, it is important to note that this that contain salient information by leveraging intermediate
encoder can be replaced with a fully self-supervised encoder self-attention activation maps, using the residual informa-
as well. CLIP relies on the availability of a large-scale dataset tion as guidance. This increases the generation quality with-
of image-caption pairs and is thus not fully self-supervised, out requiring external information or additional training.
but other representation learning methods are also able to The self-attention mechanism, contained in both U-Net and
generate semantic representations. CLIP is used in many DiT diffusion backbones, allows the noise predictor to at-
approaches to facilitate both text prompt-based and image tend to the most informative features of the input. The self-
N ×(HW )×(HW )
conditioning during inference, which may no longer be pos- attention maps AS t ∈ R are aggregated and
sible when using primarily image-based feature extractors. reshaped to dimension RH×W using global average pooling
A summary of the training and inference methodology can and nearest-neighbor upsampling to match the resolution
be found in Fig. 6. of xt . The difference between the blurred image x̃t and xt
is used as conditioning, thereby retaining the information
3.2.3 Representation-based guidance masked in this process.
Li et al. [100] present Representation-Conditioned Image
Generation (RCG), a framework conditioning diffusion
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15

Fig. 6. A generalization of assignment-based guidance training and sampling pipelines. Samples are conditioned on annotations generated by a
self-annotation function f , using features extracted by a pre-trained image encoder (e.g., CLIP [136]).

3.2.4 Objective-based guidance between weak and strong depth predictors, guiding the
Many of the previous outlined approaches focus on elim- generation process using the gradient of Ldc with respect to
inating the need for pre-trained classifiers, encoders and xt in a methodology similar to [42]. Depth prior guidance
dataset annotations for training conditional diffusion mod- employs an additional small-resolution diffusion U-Net on
els. Other recent works [46, 86] have demonstrated that the depth domain, adding noise to depth predictions and
internal diffusion model representations can be used to using a denoising objective Ldp . The gradient of Ldp is
improve generation control over the structural and semantic treated like an external classifier gradient and added to
composition of generated images. the image generation objective. Combining both methods
One such approach is Self-guidance for Controllable Im- during training results in enhanced depth semantics in
age Generation [46] (which we denote SGCIG to distinguish generated images.
it from [75]). SGCIG is a zero-shot method designed to Perturbed Attention Guidance (PAG) [3] is a sampling
increase user control over structural and semantic elements guidance method that improves generation quality for both
of objects in images generated by text-to-image diffusion conditional and unconditional settings. PAG does not re-
models. Incorporating similar ideas as [65], the authors of quire additional training or external pre-trained models.
SGCIG leverage representations from intermediate activa- Instead, Ahn et al. [3] introduce an implicit discriminator
tions and attention maps to steer the generation process. SG- D that differentiates between desirable and undesirable
CIG works by adding a series of guidance terms to the ob- samples during the diffusion process, where y is a desirable
jective of the denoising network that each define a series of and ŷ is an undesirable sample. The diffusion sampling
properties that can be used to perform image manipulations. process is then redefined to incorporate the derivative of
Image edits can then be carried out by guiding properties to the discriminator loss LD . The score with undesirable label
change in the pixel generation process. While the method ŷ cannot be approximated using the existing denoising net-
is limited to the manipulation of objects explicitly stated work ϵθ (xt ). Thus the score is estimated by perturbing the
in the conditioning text prompt, it represents a promising forward pass of a pre-trained denoising network, denoted
first step towards increased control over generated images. by ϵ̂θ . PAG works by perturbing the self-attention maps in
Diffusion Handles [131] extend this to 3D object editing, the diffusion U-Net, replacing them with an identity matrix
using manipulated diffusion model activations to produce to guide the sampling process away from degraded samples.
plausible edits. The final noise prediction is obtained by feeding xt into
Depth-aware guidance (DAG) [86] is a related method both ϵθ (·) and ϵ̂θ (·) to get the final noise prediction ϵ̃θ .
that uses semantic information from intermediate denoising PAG improves generation quality in both conditional and
network layers for improved depth-aware image synthe- unconditional settings, and can be combined with existing
sis. Kim et al. [86] propose training depth predictors with guidance methods like classifier guidance.
limited depth-labeled data using internal U-Net backbone
representations, similar to DDPM-Seg [15]. The used depth 4 C HALLENGES & F UTURE D IRECTIONS
predictors are pixel-wise shallow MLP regressors estimat- 4.1 General Challenges
ing depth values from intermediate U-Net features ft at
timestep t. Features are concatenated across layers to form Diffusion model-based representation learning is a novel
gt , with depth maps dt = MLP(gt , t) generated using research field with a lot of potential for theoretical and prac-
an appended time-embedding block. This depth predictor tical improvements. Improving synergies between represen-
is trained using a limited depth-labeled dataset. To now tation learning and generative models is akin to a chicken-
guide the diffusion process toward depth-aware generation, and-egg problem, where better diffusion models simulta-
two guidance strategies are introduced: Depth consistency neously lead to higher quality image representations, and
guidance uses pseudo-labels with a consistency loss Ldc better representation learning methods improve generative
quality of diffusion models when applied to self-supervised
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 16

guidance methods. Improved online-bootstrapping meth- ACKNOWLEDGMENTS

ods that provide guidance to diffusion models during train- We would like to thank Yuki Asano, Stefan Andreas Bau-
ing can be beneficial here. mann, Timy Phan, and Frank Fundel for providing addi-
To conserve computation in diffusion models [114], the tional related literature.
sampling process has been significantly reduced to just a
few steps [143, 145] or even a single step [63, 118, 155].
However, maintaining the potential of representation learn- B IBLIOGRAPHY
ing with few sampling steps presents a challenge. [1] N. Adaloglou, F. Michels, H. Kalisch, and M. Kollmann,
“Exploring the limits of deep image clustering using
pretrained models,” in BMVC. BMVA, 2023.
4.2 Potential Future Directions [2] N. Adaloglou, T. Kaiser, F. Michels, and M. Koll-
In many works discussed, the quality of representations mann, “Rethinking cluster-conditioned diffusion mod-
els,” arXiv, 2024.
learned by diffusion models is evaluated indirectly using
[3] D. Ahn, H. Cho, J. Min, W. Jang, J. Kim, S. Kim, H. H.
task-specific metrics from auxiliary models. Interpretability Park, K. H. Jin, and S. Kim, “Self-Rectifying Diffusion
and disentanglement are other important ways to evaluate Sampling with Perturbed-Attention Guidance,” arXiv,
representation efficacy and are currently underexplored. 2024.
Methods enhancing the interpretability of the latent space [4] J. Ahn and S. Kwak, “Learning pixel-level semantic affin-
can improve generation control and benefit a wide range ity with image-level supervision for weakly supervised
of recognition tasks. We look towards methods on inter- semantic segmentation,” in CVPR, 2018, pp. 4981–4990.
[5] M. S. Albergo and E. Vanden-Eijnden, “Building normal-
pretable direction discovery as have been proposed for izing flows with stochastic interpolants,” in ICLR, 2023.
GANs in [163] for inspiration, and see similar approaches [6] N. Anand and T. Achim, “Protein Structure and Sequence
for diffusion models as promising. While there are some re- Generation with Equivariant Denoising Diffusion Proba-
cent works focusing on disentangled and interpretable rep- bilistic Models,” arXiv, 2022.
resentation learning in diffusion models (e.g., [28, 93, 180]), [7] B. D. O. Anderson, “Reverse-time diffusion equation
we feel that this area remains underserved. models,” Stochastic Processes and their Applications, vol. 12,
no. 3, pp. 313–326, 1982.
Current diffusion-based representation learning frame-
[8] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-
works use U-Net and DiT backbones, which were primarily labelling via simultaneous clustering and representation
designed for generative tasks. Developing novel architec- learning,” arXiv, 2019.
tures tailored for representation learning is a promising [9] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den
area of research. Current transformer-based backbones are Berg, “Structured Denoising Diffusion Models in Discrete
popular due to their scalability and performance, but their State-Spaces,” in NeurIPS, vol. 34, 2021.
inability for parallel inference and the quadratic complexity [10] S. Ayromlou, A. Afkanpour, V. R. Khazaie, and
F. Forghani, “Can Generative Models Improve Self-
of the attention mechanism are significant downsides, lim- Supervised Representation Learning?” arXiv, 2024.
iting their use for high-resolution images and long videos [11] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normaliza-
[76]. Techniques like windowing [112], sliding [16], and ring tion,” arXiv, 2016.
attention [108] help mitigate these issues, but complexity [12] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All
limitations remain. Recent works [49, 76, 171] have begun to are worth words: A vit backbone for diffusion models,”
utilize state-space diffusion models [56, 127], which offer in CVPR, 2023, pp. 22 669–22 679.
[13] F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao,
linear complexity with respect to token sequence length,
H. Su, and J. Zhu, “One transformer fits all distributions
and are thus well suited to long token sequence mod- in multi-modal diffusion at scale,” in ICML. PMLR, 2023,
eling for both text [121] and images/video [29, 97]. The pp. 1692–1717.
representation-learning capabilities of these models are yet [14] O. Bar-Tal, L. Yariv, Y. Lipman, and T. Dekel, “Multi-
to be fully analyzed, but we expect that conclusions drawn Diffusion: Fusing diffusion paths for controlled image
from diffusion models can also be applied to state-space generation,” in ICML, vol. 202. PMLR, 2023, pp. 1737–
models and their representation learning capabilities. 1752.
[15] D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and
We also see significant room for further research in using A. Babenko, “Label-efficient semantic segmentation with
other generative models for representation learning. Flow diffusion models,” in ICLR, 2022.
Matching models [5, 107, 111] have recently gained promi- [16] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The
nence for their ability to maintain straight trajectories during Long-Document Transformer,” arXiv, 2020.
generation. This characteristic results in faster inference, [17] H. Ben-Hamu, O. Puny, I. Gat, B. Karrer, U. Singer, and
making Flow Matching a suitable alternative for addressing Y. Lipman, “D-Flow: Differentiating through Flows for
Controlled Generation,” arXiv, 2024.
trajectory issues encountered in diffusion models. Their ver-
[18] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “General-
satility has been demonstrated across various applications, ized denoising auto-encoders as generative models,” in
including image [38, 78], video [39], depth [57], human NeurIPS, vol. 26, 2013.
motion [74], audio [94], boosting diffusion models [147, 156? [19] Y. Benny and L. Wolf, “Dynamic dual-output diffusion
], and even text generation [77]. The close relationship models,” in CVPR, 2022, pp. 11 482–11 491.
between Diffusion and Flow Matching models suggests that [20] A. Blattmann, R. Rombach, K. Oktay, and B. Ommer,
many of the diffusion representation learning frameworks “Retrieval-augmented diffusion models,” arXiv, 2022.
[21] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Ruther-
can also be applied to Flow Matching models. ford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau,
B. Damoc, A. Clark et al., “Improving language models
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 17

by retrieving from trillions of tokens,” in ICLR. PMLR, [43] ——, “Diffusion Models Beat GANs on Image Synthesis,”
2022, pp. 2206–2240. in NeurIPS, vol. 34, 2021.
[22] A. Brock, J. Donahue, and K. Simonyan, “Large scale [44] J. Donahue and K. Simonyan, “Large scale adversarial
gan training for high fidelity natural image synthesis,” representation learning,” in NeurIPS, 2019.
in ICLR, 2019. [45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
[23] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
Learning to follow image editing instructions,” in CVPR, G. Heigold, S. Gelly et al., “An image is worth 16x16
2023, pp. 18 392–18 402. words: Transformers for image recognition at scale,”
[24] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, ICLR, 2021.
P. Bojanowski, and A. Joulin, “Emerging properties in [46] D. Epstein, A. Jabri, B. Poole, A. Efros, and A. Holynski,
self-supervised vision transformers,” in ICCV, 2021, pp. “Diffusion self-guidance for controllable image genera-
9650–9660. tion,” in NeurIPS, vol. 36, 2023, pp. 16 222–16 239.
[25] A. Casanova, M. Careil, J. Verbeek, M. Drozdzal, [47] P. Esser, R. Rombach, and B. Ommer, “Taming transform-
and A. Romero Soriano, “Instance-conditioned gan,” in ers for high-resolution image synthesis,” in CVPR, 2021,
NeurIPS, 2021. pp. 12 873–12 883.
[26] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Free- [48] M. Everingham, L. Van Gool, C. K. Williams, J. Winn,
man, “Maskgit: Masked generative image transformer,” and A. Zisserman, “The pascal visual object classes (voc)
in CVPR, 2022. challenge,” International journal of computer vision, vol. 88,
[27] Z. Chang, G. A. Koulieris, and H. P. H. Shum, “On the pp. 303–338, 2010.
Design Fundamentals of Diffusion Models: A Survey,” [49] Z. Fei, M. Fan, C. Yu, and J. Huang, “Scalable Diffusion
arXiv, 2023. Models with State Space Backbone,” arXiv, 2024.
[28] H. Chefer, O. Lang, M. Geva, V. Polosukhin, A. Shocher, [50] J. Schusterbauer, M. Gui, P. Ma, N. Stracke, S. A. Bau-
M. Irani, I. Mosseri, and L. Wolf, “The Hidden Language mann, and B. Ommer, “Boosting latent diffusion with
of Diffusion Models,” in ICLR, 2024. flow matching,” arXiv, 2023.
[29] G. Chen, Y. Huang, J. Xu, B. Pei, Z. Chen, Z. Li, J. Wang, [51] Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang,
K. Li, T. Lu, and L. Wang, “Video Mamba Suite: State J. Bao, Z. Zhang, H. Li, H. Hu et al., “Instructdiffusion: A
Space Model as a Versatile Alternative for Video Under- generalist modeling interface for vision tasks,” in CVPR,
standing,” arXiv, 2024. 2024, pp. 12 709–12 720.
[30] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and [52] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel, “Token-
A. L. Yuille, “Semantic Image Segmentation with Deep flow: Consistent diffusion features for consistent video
Convolutional Nets and Fully Connected CRFs,” arXiv, editing,” in ICLR, 2024.
2016. [53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
[31] ——, “Deeplab: Semantic image segmentation with deep D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio,
convolutional nets, atrous convolution, and fully con- “Generative adversarial nets,” in NeurIPS, vol. 27, 2014.
nected crfs,” IEEE Transactions on Pattern Analysis & Ma- [54] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffu-
chine Intelligence, vol. 40, no. 4, pp. 834–848, 2017. sion models as plug-and-play priors,” in NeurIPS, 2022.
[32] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and [55] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond,
W. Chan, “WaveGrad: Estimating Gradients for Wave- E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo,
form Generation,” in ICLR, 2021. M. Gheshlaghi Azar, B. Piot, k. kavukcuoglu, R. Munos,
[33] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A and M. Valko, “Bootstrap Your Own Latent - A New Ap-
simple framework for contrastive learning of visual rep- proach to Self-Supervised Learning,” in NeurIPS, vol. 33,
resentations,” in ICML. PMLR, 2020, pp. 1597–1607. 2020, pp. 21 271–21 284.
[34] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. [56] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Mod-
Hinton, “Big self-supervised models are strong semi- eling with Selective State Spaces,” arXiv, 2024.
supervised learners,” in NeurIPS, 2020. [57] M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko,
[35] X. Chen, Z. Liu, S. Xie, and K. He, “Deconstructing De- O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer,
noising Diffusion Models for Self-Supervised Learning,” “Depthfm: Fast monocular depth estimation with flow
arXiv, 2024. matching,” arXiv, 2024.
[36] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, [58] J. Guo, X. Xu, Y. Pu, Z. Ni, C. Wang, M. Vasu, S. Song,
G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, G. Huang, and H. Shi, “Smooth diffusion: Crafting
and J. Jitsev, “Reproducible scaling laws for contrastive smooth latent spaces in diffusion models,” in CVPR, 2024.
language-image learning,” in CVPR, 2023, pp. 2818–2829. [59] B. Ham, M. Cho, C. Schmid, and J. Ponce, “Proposal flow:
[37] F. Croitoru, V. Hondru, R. Ionescu, and M. Shah, “Dif- Semantic correspondences from object proposals,” IEEE
fusion models in vision: A survey,” IEEE Transactions on Transactions on Pattern Analysis & Machine Intelligence,
Pattern Analysis & Machine Intelligence, vol. 45, no. 09, pp. vol. 40, no. 7, pp. 1711–1725, 2017.
10 850–10 869, 2023. [60] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Mo-
[38] Q. Dao, H. Phung, B. Nguyen, and A. Tran, “Flow match- mentum contrast for unsupervised visual representation
ing in latent space,” arXiv, 2023. learning,” in CVPR, 2020.
[39] A. Davtyan, S. Sameni, and P. Favaro, “Efficient video [61] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick,
prediction via sparsely conditioned flow matching,” in “Masked autoencoders are scalable vision learners,” in
ICCV, 2023, pp. 23 263–23 274. CVPR, 2022.
[40] K. Deja, T. Trzciński, and J. M. Tomczak, “Learning [62] E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar,
data representations with joint diffusion models,” in Joint A. Tagliasacchi, and K. M. Yi, “Unsupervised semantic
European Conference on Machine Learning and Knowledge correspondence using stable diffusion,” in NeurIPS, 2023.
Discovery in Databases. Springer, 2023, pp. 543–559. [63] J. Heek, E. Hoogeboom, and T. Salimans, “Multistep
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, consistency models,” arXiv, 2024.
“Imagenet: A large-scale hierarchical image database,” in [64] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat:
CVPR, 2009. A novel dataset and deep learning benchmark for land
[42] P. Dhariwal and A. Nichol, “Diffusion models beat gans use and land cover classification,” IEEE Journal of Selected
on image synthesis,” in NeurIPS, 2021. Topics in Applied Earth Observations and Remote Sensing,
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 18

vol. 12, no. 7, pp. 2217–2226, 2019. “Depth-aware guidance with self-estimated depth repre-
[65] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, sentations of diffusion models,” Pattern Recognition, vol.
Y. Pritch, and D. Cohen-Or, “Prompt-to-Prompt Image 153, p. 110474, 2024.
Editing with Cross Attention Control,” arXiv, 2022. [87] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational
[66] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowl- Diffusion Models,” in NeurIPS, vol. 34, 2021.
edge in a Neural Network,” arXiv, 2015. [88] D. P. Kingma and M. Welling, “Auto-encoding variational
[67] J. Ho and T. Salimans, “Classifier-free diffusion guid- bayes,” in ICLR, 2014.
ance,” in NeurIPS Workshop, 2021. [89] ——, “An Introduction to Variational Autoencoders,”
[68] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion proba- Foundations and Trends® in Machine Learning, vol. 12, no. 4,
bilistic models,” in NeurIPS, 2020. pp. 307–392, 2019.
[69] S. Hong, G. Lee, W. Jang, and S. Kim, “Improving Sample [90] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic
Quality of Diffusion Models Using Self-Attention Guid- feature pyramid networks,” in CVPR, 2019, pp. 6399–
ance,” in ICCV, 2023. 6408.
[70] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and [91] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár,
M. Welling, “Argmax flows and multinomial diffusion: “Panoptic segmentation,” in CVPR, 2019, pp. 9404–9413.
Learning categorical distributions,” in NeurIPS, 2021. [92] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro,
[71] E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling, “DiffWave: A Versatile Diffusion Model for Audio Syn-
“Equivariant Diffusion for Molecule Generation in 3D,” thesis,” arXiv, 2021.
in Proceedings of the 39th International Conference on Ma- [93] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already
chine Learning. PMLR, 2022, pp. 8867–8887. have a semantic latent space,” in ICLR, 2023.
[72] E. Hoogeboom, J. Heek, and T. Salimans, “simple diffu- [94] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz,
sion: End-to-end diffusion for high resolution images,” in M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar et al.,
Proceedings of the 40th International Conference on Machine “Voicebox: Text-guided multilingual universal speech
Learning. PMLR, 2023, pp. 13 213–13 232. generation at scale,” arXiv, 2023.
[73] V. T. Hu, Y. Chen, M. Caron, Y. M. Asano, C. G. M. Snoek, [95] A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, and
and B. Ommer, “Guided Diffusion from Self-Supervised D. Pathak, “Your diffusion model is secretly a zero-shot
Diffusion Features,” arXiv, 2023. classifier,” in ICCV, 2023.
[74] V. T. Hu, W. Yin, P. Ma, Y. Chen, B. Fernando, Y. M. Asano, [96] D. Li, H. Ling, A. Kar, D. Acuna, S. W. Kim, K. Kreis,
E. Gavves, P. Mettes, B. Ommer, and C. G. M. Snoek, A. Torralba, and S. Fidler, “Dreamteacher: Pretraining
“Motion flow matching for human motion synthesis and image backbones with deep generative models,” in ICCV,
editing,” arXiv, 2023. 2023, pp. 16 698–16 708.
[75] V. T. Hu, D. W. Zhang, Y. M. Asano, G. J. Burghouts, and [97] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and
C. G. Snoek, “Self-guided diffusion models,” in CVPR, Y. Qiao, “VideoMamba: State Space Model for Efficient
2023, pp. 18 413–18 422. Video Understanding,” arXiv, 2024.
[76] V. T. Hu, S. A. Baumann, M. Gui, O. Grebenkova, P. Ma, [98] S. Li, C. Chen, and H. Lu, “MoEController: Instruction-
J. Schusterbauer, and B. Ommer, “ZigMa: A DiT-style based Arbitrary Image Manipulation with Mixture-of-
Zigzag Mamba Diffusion Model,” arXiv, 2024. Expert Controllers,” arXiv, 2024.
[77] V. T. Hu, D. Wu, Y. M. Asano, P. Mettes, B. Fernando, [99] T. Li, H. Chang, S. Mishra, H. Zhang, D. Katabi, and
B. Ommer, and C. G. M. Snoek, “Flow matching for D. Krishnan, “Mage: Masked generative encoder to unify
conditional text generation in a few sampling steps,” in representation learning and image synthesis,” in CVPR,
EACL, 2024. 2023, pp. 2142–2152.
[78] V. T. Hu, W. Zhang, M. Tang, P. Mettes, D. Zhao, and [100] T. Li, D. Katabi, and K. He, “Return of Unconditional
C. Snoek, “Latent space editing in transformer-based Generation: A Self-supervised Representation Generation
flow matching,” in Proceedings of the AAAI Conference on Method,” arXiv, 2024.
Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2247–2255. [101] X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B.
[79] C.-W. Huang, J. H. Lim, and A. Courville, “A variational Hashimoto, “Diffusion-LM Improves Controllable Text
perspective on diffusion-based generative models and Generation,” arXiv, 2022.
score matching,” in NeurIPS, 2021. [102] X. Li, K. Han, X. Wan, and V. A. Prisacariu, “SimSC: A
[80] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li, Simple Framework for Semantic Correspondence with
Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio: Text- Temperature Learning,” arXiv, 2023.
To-Audio Generation with Prompt-Enhanced Diffusion [103] X. Li, J. Lu, K. Han, and V. A. Prisacariu, “Sd4match:
Models,” in Proceedings of the 40th International Conference Learning to prompt stable diffusion model for semantic
on Machine Learning. PMLR, 2023, pp. 13 916–13 932. matching,” in CVPR, 2024, pp. 27 558–27 568.
[81] Y. Huang, J. Huang, Y. Liu, M. Yan, J. Lv, J. Liu, W. Xiong, [104] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup:
H. Zhang, S. Chen, and L. Cao, “Diffusion Model-Based Scribble-supervised convolutional networks for semantic
Image Editing: A Survey,” arXiv, 2024. segmentation,” in CVPR, 2016, pp. 3159–3167.
[82] D. A. Hudson, D. Zoran, M. Malinowski, A. K. Lampinen, [105] S. Lin and X. Yang, “Diffusion Model with Perceptual
A. Jaegle, J. L. McClelland, L. Matthey, F. Hill, and A. Ler- Loss,” arXiv, 2024.
chner, “Soda: Bottleneck diffusion models for representa- [106] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona,
tion learning,” in CVPR, 2024. D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco:
[83] K. Itô, “Stochastic differential equations in a differentiable Common objects in context,” in ECCV, 2014.
manifold,” Nagoya Mathematical Journal, vol. 1, pp. 35–47, [107] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel,
1950. and M. Le, “Flow matching for generative modeling,” in
[84] T. Karras, S. Laine, and T. Aila, “A style-based genera- ICLR, 2023.
tor architecture for generative adversarial networks,” in [108] H. Liu, M. Zaharia, and P. Abbeel, “Ringattention with
ICCV, 2019, pp. 4401–4410. blockwise transformers for near-infinite context,” in
[85] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating ICLR, 2024.
the design space of diffusion-based generative models,” [109] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “DiffSinger:
in NeurIPS, 2022. Singing Voice Synthesis via Shallow Diffusion Mecha-
[86] G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, and S. Kim, nism,” in AAAI, vol. 36, no. 10, 2022, pp. 11 020–11 028.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 19

[110] S. Liu, T. Wang, D. Bau, J.-Y. Zhu, and A. Torralba, 7416–7425.

“Diverse Image Generation via Self-Conditioned GANs,” [130] Z. Pan, J. Chen, and Y. Shi, “Masked Diffusion as Self-
in CVPR, 2020, pp. 14 274–14 283. supervised Representation Learner,” arXiv, 2024.
[111] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: [131] K. Pandey, P. Guerrero, M. Gadelha, Y. Hold-Geoffroy,
Learning to generate and transfer data with rectified K. Singh, and N. J. Mitra, “Diffusion handles enabling 3d
flow,” in ICLR, 2023. edits for diffusion models by lifting activations to 3d,” in
[112] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and CVPR, 2024.
B. Guo, “Swin Transformer: Hierarchical Vision Trans- [132] W. Peebles and S. Xie, “Scalable diffusion models with
former using Shifted Windows,” in ICCV, 2021, pp. 9992– transformers,” in ICCV, 2023.
10 002. [133] R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron,
[113] J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski,
tional networks for semantic segmentation,” in CVPR, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner,
2015, pp. 3431–3440. B. Ommer, C. Theobalt, P. Wonka, and G. Wetzstein,
[114] A. S. Luccioni, Y. Jernite, and E. Strubell, “Power hungry “State of the Art on Diffusion Models for Visual Com-
processing: Watts driving the cost of ai deployment?” puting,” arXiv, 2023.
arXiv, 2023. [134] K. Preechakul, N. Chatthee, S. Wizadwongsa, and
[115] C. Luo, “Understanding Diffusion Models: A Unified S. Suwajanakorn, “Diffusion autoencoders: Toward a
Perspective,” arXiv, 2022. meaningful and decodable representation,” in CVPR,
[116] G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Dar- 2022.
rell, “Diffusion hyperfeatures: Searching through time [135] S. J. Prince, Understanding Deep Learning. The MIT Press,
and space for semantic correspondence,” in NeurIPS, 2023.
2023. [136] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh,
[117] G. Luo, T. Darrell, O. Wang, D. B. Goldman, and S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark
A. Holynski, “Readout Guidance: Learning Control from et al., “Learning transferable visual models from natural
Diffusion Features,” in CVPR, 2024. language supervision,” in ICML, 2021.
[118] W. Luo, T. Hu, S. Zhang, J. Sun, Z. Li, and Z. Zhang, [137] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochas-
“Diff-instruct: A universal approach for transferring tic backpropagation and approximate inference in deep
knowledge from pre-trained diffusion models,” NeurIPS, generative models,” in ICML, 2014.
vol. 36, 2024. [138] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
[119] M. Mardani, J. Song, J. Kautz, and A. Vahdat, “A vari- B. Ommer, “High-resolution image synthesis with latent
ational perspective on solving inverse problems with diffusion models,” in CVPR, 2022.
diffusion models,” in ICLR, 2024. [139] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,
[120] O. Mariotti, O. Mac Aodha, and H. Bilen, “Improving se- and Y. Bengio, “FitNets: Hints for Thin Deep Nets,” arXiv,
mantic correspondence with viewpoint-guided spherical 2015.
maps,” in CVPR, 2024, pp. 19 521–19 530. [140] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolu-
[121] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, tional networks for biomedical image segmentation,” in
“Long range language modeling via gated state spaces,” MICCAI. Springer, 2015, pp. 234–241.
in ICLR, 2023. [141] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Den-
[122] J. Min, J. Lee, J. Ponce, and M. Cho, “SPair-71k: A Large- ton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan,
scale Benchmark for Semantic Correspondence,” arXiv, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photore-
2019. alistic text-to-image diffusion models with deep language
[123] M. Mirza and S. Osindero, “Conditional generative ad- understanding,” in NeurIPS, vol. 35, 2022, pp. 36 479–
versarial nets,” arXiv, 2014. 36 494.
[124] S. Mittal, K. Abstreiter, S. Bauer, B. Schölkopf, and [142] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Den-
A. Mehrjou, “Diffusion based representation learning,” ton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan,
in ICML. PMLR, 2023. T. Salimans et al., “Photorealistic text-to-image diffusion
[125] S. Mukhopadhyay, M. Gwilliam, V. Agarwal, N. Pad- models with deep language understanding,” in NeurIPS,
manabhan, A. Swaminathan, S. Hegde, T. Zhou, and 2022.
A. Shrivastava, “Diffusion Models Beat GANs on Image [143] T. Salimans and J. Ho, “Progressive distillation for fast
Classification,” arXiv, 2023. sampling of diffusion models,” in ICLR, 2022.
[126] S. Mukhopadhyay, M. Gwilliam, Y. Yamaguchi, V. Agar- [144] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma,
wal, N. Padmanabhan, A. Swaminathan, T. Zhou, and “Pixelcnn++: A pixelcnn implementation with discretized
A. Shrivastava, “Do text-free diffusion models learn dis- logistic mixture likelihood and other modifications,” in
criminative visual representations?” arXiv, 2023. ICLR, 2017.
[127] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, [145] T. Salimans, T. Mensink, J. Heek, and E. Hoogeboom,
S. Baccus, and C. Ré, “S4ND: Modeling images and “Multistep distillation of diffusion models via moment
videos as multidimensional signals with state spaces,” in matching,” arXiv, 2024.
NeurIPS, 2022. [146] D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and
[128] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, G. Chechik, “Generating images of rare concepts using
M. Szafraniec, V. Khalidov, P. Fernandez, D. HAZIZA, pre-trained diffusion models,” in AAAI, vol. 38, no. 5,
F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, 2024, pp. 4695–4703.
R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, [147] A. Sauer, F. Boesel, T. Dockhorn, A. Blattmann, P. Esser,
V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. La- and R. Rombach, “Fast high-resolution image synthesis
batut, A. Joulin, and P. Bojanowski, “DINOv2: Learning with latent adversarial diffusion distillation,” arXiv, 2024.
robust visual features without supervision,” Transactions [148] J. Schnell, J. Wang, L. Qi, V. T. Hu, and M. Tang, “Scribble-
on Machine Learning Research, 2024. Gen: Generative Data Augmentation Improves Scribble-
[129] Z. Pan, P. Jiang, Y. Wang, C. Tu, and A. G. Cohn, supervised Semantic Segmentation,” arXiv, 2024.
“Scribble-supervised semantic segmentation by uncer- [149] S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni,
tainty reduction on neural representation and self- E. Nachmani, and Y. Taigman, “kNN-diffusion: Image
supervision on neural eigenspace,” in ICCV, 2021, pp. generation via large-scale retrieval,” in ICLR, 2023.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 20

[150] J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, and learner,” in ICCV, 2023.

C. Fookes, “Diversity is definitely needed: Improving [174] X. Yang, S.-M. Shih, Y. Fu, X. Zhao, and S. Ji, “Your ViT
model-agnostic zero-shot classification via stable diffu- is Secretly a Hybrid Discriminative-Generative Diffusion
sion,” in CVPR, 2023, pp. 769–778. Model,” arXiv, 2022.
[151] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and [175] D. Yatim, R. Fridman, O. Bar-Tal, Y. Kasten, and
S. Ganguli, “Deep unsupervised learning using nonequi- T. Dekel, “Space-Time Diffusion Features for Zero-Shot
librium thermodynamics,” in ICML, 2015. Text-Driven Motion Transfer,” in CVPR, 2024.
[152] J. Song, C. Meng, and S. Ermon, “Denoising diffusion [176] Z. You, Y. Zhong, F. Bao, J. Sun, C. Li, and J. Zhu,
implicit models,” in ICLR, 2021. “Diffusion models and semi-supervised learners benefit
[153] Y. Song and S. Ermon, “Generative Modeling by Esti- mutually with few labels,” in NeurIPS, 2023.
mating Gradients of the Data Distribution,” in NeurIPS, [177] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “Lsun:
vol. 32, 2019. Construction of a large-scale image dataset using deep
[154] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Er- learning with humans in the loop,” arXiv, 2015.
mon, and B. Poole, “Score-based generative modeling [178] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu,
through stochastic differential equations,” in ICLR, 2021. V. Madhavan, and T. Darrell, “Bdd100k: A diverse driv-
[155] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consis- ing dataset for heterogeneous multitask learning,” in
tency models,” arXiv, 2023. CVPR, 2020, pp. 2636–2645.
[156] Y. Song, A. Keller, N. Sebe, and M. Welling, “Flow factor- [179] J. Yu, Y. Wang, C. Zhao, B. Ghanem, and J. Zhang, “Free-
ized representation learning,” in NeurIPS, vol. 36, 2024. dom: Training-free energy-guided conditional diffusion
[157] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariha- model,” ICCV, 2023.
ran, “Emergent correspondence from image diffusion,” [180] Z. Yue, J. Wang, Q. Sun, L. Ji, E. I. Chang, and H. Zhang,
in NeurIPS, vol. 36, 2023. “Exploring diffusion time-steps for unsupervised repre-
[158] C. Tian, C. Tao, J. Dai, H. Li, Z. Li, L. Lu, X. Wang, H. Li, sentation learning,” in ICLR, 2024.
G. Huang, and X. Zhu, “ADDP: Learning general rep- [181] S. Zagoruyko and N. Komodakis, “Paying more attention
resentations for image recognition and generation with to attention: Improving the performance of convolutional
alternating denoising diffusion process,” in ICLR, 2024. neural networks via attention transfer,” in ICLR, 2017.
[159] K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual [182] ——, “Wide residual networks,” in BMVC, 2016.
autoregressive modeling: Scalable image generation via [183] J. Zhang, C. Herrmann, J. Hur, L. P. Cabrera, V. Jampani,
next-scale prediction,” arXiv, 2024. D. Sun, and M.-H. Yang, “A tale of two features: Stable
[160] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug- diffusion complements DINO for zero-shot semantic cor-
and-play diffusion features for text-driven image-to- respondence,” in NeurIPS, 2023.
image translation,” in CVPR, 2023, pp. 1921–1930. [184] J. Zhang, C. Herrmann, J. Hur, E. Chen, V. Jampani,
[161] P. Vincent, “A connection between score matching and D. Sun, and M.-H. Yang, “Telling left from right: Identify-
denoising autoencoders,” Neural computation, 2011. ing geometry-aware semantic correspondence,” in CVPR,
[162] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, 2024, pp. 3076–3085.
“Extracting and composing robust features with denois- [185] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional
ing autoencoders,” in ICML, 2008, pp. 1096–1103. control to text-to-image diffusion models,” in ICCV, 2023,
[163] A. Voynov and A. Babenko, “Unsupervised discovery of pp. 3836–3847.
interpretable directions in the gan latent space,” in ICML. [186] Z. Zhang, Z. Zhao, and Z. Lin, “Unsupervised represen-
PMLR, 2020, pp. 9786–9796. tation learning from pre-trained diffusion probabilistic
[164] B. Wallace, A. Gokul, S. Ermon, and N. Naik, “End- models,” in NeurIPS, 2022.
to-end diffusion latent optimization improves classifier [187] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu,
guidance,” in ICCV, 2023, pp. 7246–7256. “Unleashing text-to-image diffusion models for visual
[165] Y. Wang, Y. Schiff, A. Gokaslan, W. Pan, F. Wang, C. De Sa, perception,” in ICCV, 2023, pp. 5729–5739.
and V. Kuleshov, “Infodiffusion: Representation learn- [188] K. Zheng, C. Lu, J. Chen, and J. Zhu, “Improved tech-
ing using information maximizing diffusion models,” in niques for maximum likelihood estimation for diffusion
ICML. PMLR, 2023. odes,” in ICML. PMLR, 2023.
[166] C. Wei, K. Mangalam, P.-Y. Huang, Y. Li, H. Fan, H. Xu, [189] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso,
H. Wang, C. Xie, A. Yuille, and C. Feichtenhofer, “Diffu- and A. Torralba, “Semantic understanding of scenes
sion models as masked autoencoders,” in ICCV, 2023. through the ade20k dataset,” IJCV, vol. 127, pp. 302–321,
[167] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Dif- 2019.
fumask: Synthesizing images with pixel-level annotations [190] R. S. Zimmermann, L. Schott, Y. Song, B. A. Dunn,
for semantic segmentation using diffusion models,” in and D. A. Klindt, “Score-based generative classifiers,”
ICCV, 2023, pp. 1206–1217. in NeurIPS 2021 Workshop on Deep Generative Models and
[168] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, “Mem- Downstream Applications, 2021.
orizing transformers,” in ICLR, 2022.
[169] W. Xiang, H. Yang, D. Huang, and Y. Wang, “Denoising
diffusion autoencoders are unified self-supervised learn-
ers,” in ICCV, 2023.
[170] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and
S. De Mello, “Open-vocabulary panoptic segmentation
with text-to-image diffusion models,” in CVPR, 2023, pp.
2955–2966.
[171] J. N. Yan, J. Gu, and A. M. Rush, “Diffusion models
without attention,” in CVPR, 2024, pp. 8239–8249.
[172] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao,
W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models:
A comprehensive survey of methods and applications,”
ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.
[173] X. Yang and X. Wang, “Diffusion model as representation
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 21

Michael Fuest is a research intern in the Björn Ommer is a professor at LMU and leads
Computer Vision & Learning Group at Ludwig- the Computer Vision & Learning Group. He was
Maximilians-Universität München (LMU). He re- previously with Heidelberg University’s Depart-
cently received his master’s degree in Manage- ment of Mathematics and Computer Science,
ment & Technology with a major in Computer IWR, and HCI. He studied computer science and
Science from the Technical University of Munich, physics at the University of Bonn, completed
and is currently a visiting researcher at the MIT his Ph.D. at ETH Zurich where his dissertation
Laboratory for Information & Decision Systems. received the ETH Medal, and held a post-doc po-
sition with Jitendra Malik at UC Berkeley. He is a
member of the Bavarian AI Council, an editor for
IEEE T-PAMI, an ELLIS Fellow, faculty of ELLIS
unit Munich, a PI at the Munich Center for Machine Learning (MCML),
and has held various roles at numerous CVPR, ICCV, ECCV, and
NeurIPS conferences. He delivered the opening keynote at NeurIPS’23.
Pingchuan Ma is a Ph.D. student in the Com-
puter Vision & Learning Group at Ludwig Maxi-
milian University of Munich (LMU) and a Munich
Center for Machine Learning (MCML) member.
He previously received his master’s degree in
Applied Computer Science from Heidelberg Uni-
versity, where he developed an interest in deep
metric learning and style transfer. He served as
a reviewer for CVPR 2024 and NeurIPS 2024.
His current research focuses on leveraging gen-
erative models for tasks beyond generation and
exploring multi-modality representation learning.

Ming Gui is currently a PhD researcher at the

Computer Vision & Learning Group at Ludwig
Maximilian University of Munich (LMU). He re-
ceived his bachelor’s and master’s degree in
electrical and computational engineering from
the Technical University of Munich, where he de-
veloped interest in deep learning and computer
vision. His research is presently centered around
the development and enhancement of scalable
generative models.

Johannes S. Fischer is a PhD Student at the

University of Munich’s Computer Vision & Learn-
ing Group. He did his undergraduate studies
in Psychology and Computer Science, both at
the University of Munich, followed by a Master’s
degree in Intelligent Interactive Systems from
Pompeu Fabra University in Barcelona. Besides
works in the field of generative modeling, Jo-
hannes’ research focuses on the adaptability of
Diffusion and Flow-based models for general im-
age understanding.

Tao Hu is a postdoctoral research fellow at

Ludwig Maximilian University of Munich (LMU)
where they are developing next generation of
Stable Diffusion models. He obtained his Ph.D.
degree in computer science from the University
of Amsterdam in 2023 under the supervision
of Cees Snoek. He was an intern at Megvii,
Amazon AWS in 2017 and 2020. His Ph.D. re-
search has been selected for the CVPR2023
and ICCV2023 Doctoral Consortium. He co-
organized the ECCV 2024 Workshop on Audio-
Visual Generative Learning. He has also served as the Area Chair for
the CVPR AI4CC workshop and the CVPR 2024 Efficient Large Vision
Model workshop. He has been a reviewer for top-tier computer vision
and machine learning conferences for several years. His current re-
search interests lie in scalable, flexible, and efficient generative models.

Diffusion: by Aryan Jain
100% (1)
Diffusion: by Aryan Jain
55 pages
A Survey on Generative Diffusion Model
No ratings yet
A Survey on Generative Diffusion Model
25 pages
A Guide To Diffusion Models 1727598967
No ratings yet
A Guide To Diffusion Models 1727598967
5 pages
A Survey On Generative Diffusion Models
No ratings yet
A Survey On Generative Diffusion Models
26 pages
Diffusion Models in Vision a Survey
No ratings yet
Diffusion Models in Vision a Survey
20 pages
2209.04747v6
No ratings yet
2209.04747v6
25 pages
Diffusion Models in Deep Learning
No ratings yet
Diffusion Models in Deep Learning
14 pages
2402.16369v1
No ratings yet
2402.16369v1
12 pages
Diffusion Based Representation Learning
No ratings yet
Diffusion Based Representation Learning
20 pages
Wei 2023 Diffusion Model As Mae
No ratings yet
Wei 2023 Diffusion Model As Mae
18 pages
Generalization of VAE
No ratings yet
Generalization of VAE
30 pages
An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization
No ratings yet
An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization
39 pages
Diffusion Models: A Comprehensive Survey of Methods and Applications
No ratings yet
Diffusion Models: A Comprehensive Survey of Methods and Applications
54 pages
NeurIPS 2021 Diffusion Models Beat Gans On Image Synthesis Paper
No ratings yet
NeurIPS 2021 Diffusion Models Beat Gans On Image Synthesis Paper
15 pages
Diffusion Model 5
No ratings yet
Diffusion Model 5
51 pages
Efficient Diffusion Models For Vision A Survey
No ratings yet
Efficient Diffusion Models For Vision A Survey
16 pages
DF综述
No ratings yet
DF综述
49 pages
Diffusion Models in Vision: A Survey: IEEE Transactions On Pattern Analysis and Machine Intelligence March 2023
No ratings yet
Diffusion Models in Vision: A Survey: IEEE Transactions On Pattern Analysis and Machine Intelligence March 2023
26 pages
Diffusion Models A Concise Perspective
No ratings yet
Diffusion Models A Concise Perspective
8 pages
Introduction To VAE
No ratings yet
Introduction To VAE
5 pages
Kingma 等 - 2023 - Variational Diffusion Models
No ratings yet
Kingma 等 - 2023 - Variational Diffusion Models
27 pages
Improved Denoising Diffusion Probabilistic Models
No ratings yet
Improved Denoising Diffusion Probabilistic Models
17 pages
d3pm
No ratings yet
d3pm
33 pages
GNN-Foundations-Frontiers-and-Applications-chapter1
No ratings yet
GNN-Foundations-Frontiers-and-Applications-chapter1
13 pages
2309.16948v3
No ratings yet
2309.16948v3
26 pages
Diffusion
No ratings yet
Diffusion
55 pages
Deep Gen Models Tutorial
No ratings yet
Deep Gen Models Tutorial
96 pages
Prompt Diffusion in Context Learning For Generative Models
No ratings yet
Prompt Diffusion in Context Learning For Generative Models
5 pages
Deep Generative Models
No ratings yet
Deep Generative Models
55 pages
2312.14977diffusion Models For Generative Artificial
No ratings yet
2312.14977diffusion Models For Generative Artificial
23 pages
Denoising Diffusion Implicit Models
No ratings yet
Denoising Diffusion Implicit Models
22 pages
Li Your Diffusion Model is Secretly a Zero-Shot Classifier ICCV 2023 Paper
No ratings yet
Li Your Diffusion Model is Secretly a Zero-Shot Classifier ICCV 2023 Paper
12 pages
An Overview of Deep Neural Networks for Few-Shot Learning
No ratings yet
An Overview of Deep Neural Networks for Few-Shot Learning
44 pages
diffusion-csail-lecture-notes
No ratings yet
diffusion-csail-lecture-notes
56 pages
Diffusion Models For Time Series Applications: A Survey
No ratings yet
Diffusion Models For Time Series Applications: A Survey
25 pages
Diffusion Based Causal Representation Learning
No ratings yet
Diffusion Based Causal Representation Learning
17 pages
A comprehensive survey and analysis of generative models in machine learning--csr 2020 (1)
No ratings yet
A comprehensive survey and analysis of generative models in machine learning--csr 2020 (1)
29 pages
Diffusion
100% (5)
Diffusion
62 pages
paper8
No ratings yet
paper8
26 pages
Zhu 2018
No ratings yet
Zhu 2018
8 pages
gan_diffusion
No ratings yet
gan_diffusion
9 pages
2024 - Denoising Autoregressive Representation Learning - Li Et Al
No ratings yet
2024 - Denoising Autoregressive Representation Learning - Li Et Al
22 pages
New Denoising Diffusion Model
No ratings yet
New Denoising Diffusion Model
13 pages
Intro To Vae
No ratings yet
Intro To Vae
89 pages
AAI Material and Syllabus
No ratings yet
AAI Material and Syllabus
4 pages
Generalization
No ratings yet
Generalization
10 pages
Generalization
No ratings yet
Generalization
10 pages
Neural Network Diffusion: Forward Process
No ratings yet
Neural Network Diffusion: Forward Process
17 pages
2312.08825
No ratings yet
2312.08825
22 pages
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
No ratings yet
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
14 pages
Diffusion Models Abstract 2
No ratings yet
Diffusion Models Abstract 2
1 page
R A G: T D T I E T Y T: Epresentation Lignment For Eneration Raining Iffusion Ransformers S Asier HAN OU Hink
No ratings yet
R A G: T D T I E T Y T: Epresentation Lignment For Eneration Raining Iffusion Ransformers S Asier HAN OU Hink
36 pages
Diffusion model
No ratings yet
Diffusion model
16 pages
Final Term Paper Draft 2
No ratings yet
Final Term Paper Draft 2
33 pages
Foundational Models and Architectures S1: Generative AI, #1
From Everand
Foundational Models and Architectures S1: Generative AI, #1
Leaster Startx
No ratings yet
An Introduction To Variational Autoencoders: Foundations and Trends in Machine Learning
No ratings yet
An Introduction To Variational Autoencoders: Foundations and Trends in Machine Learning
89 pages
22-0687
No ratings yet
22-0687
76 pages
Representation Learning: A Review and New Perspectives
No ratings yet
Representation Learning: A Review and New Perspectives
30 pages
A Practical Guide to Mixed Research Methodology: For research students, supervisors, and academic authors
From Everand
A Practical Guide to Mixed Research Methodology: For research students, supervisors, and academic authors
Farhad Daneshgar PhD
No ratings yet
Self-Supervised Learning: Teaching AI with Unlabeled Data
From Everand
Self-Supervised Learning: Teaching AI with Unlabeled Data
Robert Johnson
No ratings yet
Inductive Moment Matching
No ratings yet
Inductive Moment Matching
36 pages
LLM_usecase
No ratings yet
LLM_usecase
31 pages
DiffNet
No ratings yet
DiffNet
22 pages
Opt. DRformer RUL prediction
No ratings yet
Opt. DRformer RUL prediction
15 pages
Generalized Diffusion RUL
No ratings yet
Generalized Diffusion RUL
21 pages
DiffBat
No ratings yet
DiffBat
15 pages
banglabert_language_model_pret
No ratings yet
banglabert_language_model_pret
11 pages
Vectors
No ratings yet
Vectors
22 pages
genaifile
No ratings yet
genaifile
39 pages
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
No ratings yet
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
219 pages
hw3 (1)
No ratings yet
hw3 (1)
35 pages
The Subtle Art of Offensive Prompt Injection
No ratings yet
The Subtle Art of Offensive Prompt Injection
128 pages
CLASSIFICATION D'IMAGES POUR LA DÉTECTION DU CSSVD DANS LES PLANTS DE CACAO
No ratings yet
CLASSIFICATION D'IMAGES POUR LA DÉTECTION DU CSSVD DANS LES PLANTS DE CACAO
5 pages
LLM_FINE_TUNE
No ratings yet
LLM_FINE_TUNE
11 pages
Open Elective 6th Sem External Exam (GAN Specialization)
No ratings yet
Open Elective 6th Sem External Exam (GAN Specialization)
38 pages
Detecting fake news through deep learning: a current systematic review
No ratings yet
Detecting fake news through deep learning: a current systematic review
11 pages
Yan et al, 2023 - Practical and ethical challenges of large language models in education A systematic (1)
No ratings yet
Yan et al, 2023 - Practical and ethical challenges of large language models in education A systematic (1)
23 pages
Learning Data Mining with Python Robert Layton instant download
100% (1)
Learning Data Mining with Python Robert Layton instant download
69 pages
Large Language Models in Finance a Survey
No ratings yet
Large Language Models in Finance a Survey
9 pages
Amharic Information Retrieval 2025
No ratings yet
Amharic Information Retrieval 2025
18 pages
Transformer
No ratings yet
Transformer
10 pages
zhou-et-al-2023-automatic-diagnosis-of-diabetic-retinopathy-using-vision-transformer-based-on-wide-field-optical
No ratings yet
zhou-et-al-2023-automatic-diagnosis-of-diabetic-retinopathy-using-vision-transformer-based-on-wide-field-optical
10 pages
2025-Divide-and-Contrast- A Text-based Method for Firm Market Risk Prediction
No ratings yet
2025-Divide-and-Contrast- A Text-based Method for Firm Market Risk Prediction
18 pages
LLM Market Report
No ratings yet
LLM Market Report
44 pages
OceanofPDF.com Programming Large Language Models - Francesco Esposito
No ratings yet
OceanofPDF.com Programming Large Language Models - Francesco Esposito
605 pages
Research Paper Riya
No ratings yet
Research Paper Riya
8 pages
Efficient Transformer Survey-dual
No ratings yet
Efficient Transformer Survey-dual
56 pages
Fast_and_Efficient_Lung_Abnormality_Identification_With_Explainable_AI_A_Comprehensive_Framework_for_Chest_CT_Scan_and_X-Ray_Images
No ratings yet
Fast_and_Efficient_Lung_Abnormality_Identification_With_Explainable_AI_A_Comprehensive_Framework_for_Chest_CT_Scan_and_X-Ray_Images
19 pages
soma-final
No ratings yet
soma-final
30 pages
Gen AI Unit 1
100% (1)
Gen AI Unit 1
86 pages
AI Open
No ratings yet
AI Open
9 pages
Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation
No ratings yet
Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation
12 pages
Building Generative Ai-powered Apps (Aarushi Kansal) (Z-Library)
No ratings yet
Building Generative Ai-powered Apps (Aarushi Kansal) (Z-Library)
150 pages
TAFormer a Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes
No ratings yet
TAFormer a Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes
16 pages
Project Report b
No ratings yet
Project Report b
31 pages
cs336_spring2025_assignment1_basics
No ratings yet
cs336_spring2025_assignment1_basics
50 pages

Diffusion Models & Representation Learning

Uploaded by

Diffusion Models & Representation Learning

Uploaded by

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1

Diffusion Models and Representation Learning:

in Fig. 1, we expect this trend to continue this year. The

where DKL is the Kullback-Leibler divergence. This

2.2 Backbone Architectures TABLE 1

masked PixelCNN++ [144] to approximate the score func-

Paradigm Downstream Task Method

Leveraging Intermediate Activations DIFT [157]

guidance methods. Improved online-bootstrapping meth- ACKNOWLEDGMENTS

[110] S. Liu, T. Wang, D. Bau, J.-Y. Zhu, and A. Torralba, 7416–7425.

[150] J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, and learner,” in ICCV, 2023.

Ming Gui is currently a PhD researcher at the

Johannes S. Fischer is a PhD Student at the

Tao Hu is a postdoctoral research fellow at

You might also like