This document presents a novel conditional latent diffusion framework (LDSeg) for medical image segmentation that addresses inefficiencies in traditional diffusion probabilistic models (DPMs) by enabling faster sampling and reducing memory consumption. The end-to-end training strategy of LDSeg allows for robust representation learning in latent space, achieving state-of-the-art segmentation accuracy across multiple medical imaging datasets. The proposed model demonstrates improved robustness to noise compared to conventional deterministic segmentation methods.

Latent Diffusion for Medical Image Segmentation: End to end learning for fast sampling and accuracy
Fahim Ahmed Zaman, Mathews Jacob, Amanda Chang, Kan Liu, Milan Sonka and Xiaodong Wu

arXiv:2407.12952v2 [cs.CV] 18 Jan 2025

This research was supported in part by NIH Grants R01HL171624, R01AG067078 and R01EB019961.
Fahim Ahmed Zaman, Milan Sonka and Xiaodong Wu are with the Department of Electrical and Computer Engineering, University of Iowa, IA 52240, USA (email: {fahim-zaman, milan-sonka, xiaodong-wu}@uiowa.edu).
Mathews Jacob is with the Department of Electrical and Computer Engineering, University of Virginia, VA 22904, USA (email: [email protected]).
Amanda Chang is with the Department of Medicine, University of Iowa, IA 52240, USA (email: [email protected]).
Kan Liu is with the Washington University School of Medicine, MO 63110, USA (email: [email protected]).

Abstract: Diffusion Probabilistic Models (DPMs) suffer from inefficient inference due to their slow sampling and high memory consumption, which limits their applicability to various medical imaging applications. In this work, we propose a novel conditional diffusion modeling framework (LDSeg) for medical image segmentation, utilizing the learned inherent low-dimensional latent shape manifolds of the target objects and the embeddings of the source image with an end-to-end framework. Conditional diffusion in latent space not only ensures accurate image segmentation for multiple interacting objects, but also tackles the fundamental issues of traditional DPM-based segmentation methods: (1) high memory consumption, (2) time-consuming sampling process, and (3) unnatural noise injection in the forward and reverse processes. The end-to-end training strategy enables robust representation learning in the latent space related to segmentation features, ensuring significantly faster sampling from the posterior distribution for segmentation generation in the inference phase. Our experiments demonstrate that LDSeg achieved state-of-the-art segmentation accuracy on three medical image datasets with different imaging modalities. In addition, we showed that our proposed model was significantly more robust to noise compared to traditional deterministic segmentation models. The code is available at https://github.com/FahimZaman/LDSeg.git.

Index Terms: Diffusion in latent space, Diffusion probabilistic model, Medical image segmentation

I. INTRODUCTION

In the field of medical imaging, image segmentation is a crucial step in identifying and monitoring disease-related pathologies, clinical decisions in treatment, and the evaluation of disease progression [1]. Traditional deep learning (DL) based segmentation models have achieved impressive accuracy in various imaging modalities, often matching or outperforming field-level experts [2], [3]. These DL-based models, mostly convolutional neural networks (CNNs) and vision transformers (ViTs), are generally trained end-to-end in a discriminative manner. Recently, generative models have emerged as powerful image segmentation tools, taking advantage of learning the underlying statistics of target objects, conditioned on the source image. These conditional generative models include diffusion probabilistic models (DPMs) [4]–[7] and generative adversarial networks (GANs).

In computer vision, DPMs have achieved remarkable results for image generation, outperforming other generative models [8]. Standard DPMs have two major steps: a forward process that perturbs the image with added Gaussian noise and a reverse process that starts with Gaussian noise and iteratively denoises the image to generate a clean image of the original data distribution. The learned denoiser is trained with noisy images for different noise variances, thus modeling the gradient of the smoothed log prior (score) of the images.

Recently, intensive efforts have been made to extend DPMs for medical image segmentation [9]–[14]. The DPMs used for segmentation differ from the ones used for image generation in the sense that the forward and reverse processes involve the noise addition and denoising of the segmentation mask instead of the source image. In particular, the denoising network models the gradient of the smoothed conditional distribution of the segmentation labels given the source image. The objective of the segmentation DPM is thus to sample a segmentation mask from the conditional distribution.

To fully harness the power of DPMs for medical image segmentation, some fundamental challenges must be addressed. First, the original DPM formulation assumes that the pixel values are continuous variables, whereas the labels in the segmentation setting are binary/integer values. The above denoising strategy using Gaussian noise may not learn the true score in our setting. This mismatch (continuous values versus semantic labels) results in discrepancies during inference. For example, additional thresholding is needed to obtain the final segmentation mask by filling in hole-shaped structures that are caused by high-frequency noise [11]. Wu et al. proposed the use of frequency parser blocks in the hidden layers of the denoiser to modulate high-frequency noise [9], but it does not guarantee clean results after sampling and may need post-processing. Bogensperger et al. proposed to transform the discrete segmentation mask into a signed distance function (SDF-DDPM), where each pixel represents the signed Euclidean distance from the boundary of the closest object [11]. A limitation of this approach is that the distance map for multi-class images is ambiguous. Zaman et al. proposed to re-parameterize the segmentation masks with a graph structure which guarantees naturally continuous perturbations for surface positions on the graph columns [15]. However, this model suffers from the multiclass mask representation problem, as it poses great challenges to have a common graph column structure to define surface positions for different objects. There is a compelling need to have a proper reparameterization technique that can be implemented for multiclass objects and guarantees smooth state transitions between classes.
The second key challenge of DPMs is the time-consuming iterative sampling process, which makes segmentation DPMs significantly slower than their deterministic counterparts. Various approaches have been developed to improve the sampling efficiency for natural image generation [16]–[19]. Recently, latent diffusion models, which perform the sampling in a low-dimensional latent space, have been used to speed up the diffusion process in natural image generation [20] and segmentation [21]. In both works, the latent representation of the source image was used. Liu et al. proposed a latent diffusion model that uses label rectification in latent space for semi-supervised medical image segmentation [22]. Latent diffusion has also been used in audiovisual segmentation for robust audio recognition [23]. Recently, Vu Quoc et al. proposed a latent diffusion model for image segmentation [24]. They proposed a two-step training strategy in which a variational autoencoder (VAE) is first trained to learn the latent distribution of the label image, followed by training a conditional denoiser of the latent codes; the embedding of the source image is used as the condition. This latent diffusion model is computationally efficient in comparison to traditional DPMs. However, since the score/denoiser model is trained separately by minimizing the mean square error (MSE) between the recovered and true latents, the MSE loss in the latent domain may not capture the segmentation errors well, which may compromise the accuracy of segmentation. Vahdat et al. proposed to use an end-to-end training strategy by jointly learning latent embeddings and a denoiser of latent codes for image generation [25]. They observed improved accuracy and faster sampling for the end-to-end framework.

We introduce a novel conditional latent diffusion-based generative framework (LDSeg) for medical image segmentation, which capitalizes on the inherent advantages of latent diffusion models. Unlike the two-step training strategy in [24], we jointly train the encoder, decoder, and the score model in an end-to-end fashion. We use a combination of denoiser loss in the latent domain and segmentation loss in the label domain. The proposed LDSeg learns the standardized latent representation of the target object shape manifolds, enabling smooth state transitions between object classes. Moreover, unlike traditional DPMs, LDSeg learns to sample from the posterior distribution, which is significantly simpler and more concentrated than the prior distribution. Clearly, sampling from a narrow/concentrated distribution would be faster. The major contributions of this work are summarized as follows.

• To the best of our knowledge, this is the first work to leverage the jointly learned latent representation of the image and target object's shape manifolds with a diffusion denoiser in latent space, for robust posterior prediction. Our experiments show that the end-to-end training strategy offers improved performance and faster sampling, compared to the two-step training scheme.
• The label encoder maps the continuous domain latent variables to a standardized latent space ($\mu = 0$, $\sigma = 1$). Hence, standard diffusion theory can be applied to the latents, unlike in the case of discrete labels. For instance, this mitigates the need for post-processing of the estimated labels to remove hole-shaped structures. In addition, the standardized distribution of the latents (having a distribution similar to that of the added Gaussian noise) will ensure faster convergence and improved gradient flow during training.
• The diffusion in the latent space ensures less memory consumption and much faster training and inference, enabling efficient application of DPMs to 3- and higher-dimensional medical image segmentation.
• The LDSeg model is significantly more robust to noise in the source images compared to the deterministic segmentation models due to the low-dimensional image embeddings, which mitigates the segmentation challenges in medical images with noisy acquisition.

II. BACKGROUND

A. Denoising Diffusion Probabilistic Models (DDPM)

DDPMs are designed to learn a data distribution by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed-length Markov chain of length $T$. More precisely, this denoising reverse process can be modeled as $p_\theta(x_{0:T})$, which is a Markov chain with learned Gaussian transitions starting at $p(x_T) = \mathcal{N}(x_T; 0, I)$:

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \tag{1}$$

$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big) \tag{2}$$

where $x_0 \sim q(x_0)$ is a sample from a real data distribution, and $x_1, \ldots, x_T$ are transitional states for timesteps $t = 1, \ldots, T$.

The forward process in a DDPM is also a Markov chain, which gradually adds noise to the image. Given data $x_0 \sim q(x_0)$ sampled from the real distribution, the forward process at time $t \in [1, T]$ can be defined as $q(x_t \mid x_{t-1})$, where Gaussian noise is gradually added given a noise variance schedule $\beta_t \in [\beta_1, \beta_T]$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \tag{3}$$

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\big) \tag{4}$$

The choice of Gaussian provides a closed-form solution to generate a transitional state $x_t$ using

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{5}$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\epsilon \sim \mathcal{N}(0, I)$. Training is usually performed by optimizing the variational bound on the negative log likelihood of $p_\theta(x_0)$:

$$L := \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \geq \mathbb{E}\left[-\log p_\theta(x_0)\right] \tag{6}$$

However, with re-parameterization, Ho et al. [4] simplified the training objective by proposing a variant of the variational
Fig. 1. The proposed LDSeg model. The label encoder $f_{\text{label-enc}}$ and image encoder $f_{\text{image-enc}}$ are used to obtain corresponding low-dimensional latent representations $z_{l(0)}$ and $z_i$ for a given ground truth label/mask image $y$ and source image $X$, respectively. A denoiser $f_{\text{denoiser}}$, conditioned on the source image embedding $z_i$, is used to learn the noise distributions of $z_{l(t)}$ for timesteps $t = 1, \ldots, T$, where $T$ is the total number of diffusion steps. $z_{l(t)}$ is obtained by perturbing $z_{l(0)}$ with a Gaussian block $G(\cdot)$ for a given noise variance scheduler $\alpha$ and $\beta$. The cleaned latent space $z_{dn}$ is obtained by subtracting the predicted noise $z_{n(t)}$ from the perturbed one $z_{l(t)}$. Finally, a label decoder $f_{\text{label-dec}}$ is used to obtain the segmentation $\hat{y}$ of the semantic labels in the original image from $z_{dn}$. The model is trained in an end-to-end fashion, where our objective is to learn $q(\hat{y} \mid X) = \mathbb{E}_{q_i(z_i \mid X)}[q_s(\hat{y} \mid z)]$, where $q_l(z \mid y, X) \sim \mathcal{N}(z_{dn}, \sigma^2 I)$. In the inference phase, starting with a random Gaussian $\tilde{z}_{l(T)} \sim \mathcal{N}(0, I)$, the denoiser is iterated for timesteps $t = T, \ldots, 1$ to obtain $\tilde{z}_{l(0)}$ with $z_i$ as the condition. The final segmentation $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$ is obtained using the trained label decoder.
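The end-to-end objective referred to in the caption combines a segmentation loss in the label domain with a denoiser loss in the latent domain (Eqs. (14) to (18) in Section III-A). The NumPy sketch below illustrates that combination under stated assumptions and is not the authors' implementation: the function names are hypothetical, the cross-entropy is averaged over pixels, and the squared error is averaged over latent elements. The default weights γ = 2 and λ = 1 are the values reported in the experimental setup.

```python
import numpy as np

def cross_entropy(y_hat, y, eps=1e-7):
    """Eq. (14), averaged over pixels (assumption). Arrays have shape (..., C):
    y is one-hot, y_hat holds softmax probabilities."""
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y * np.log(y_hat), axis=-1)))

def dice_loss(y_hat, y, eps=1e-7):
    """Eq. (15): 1 - 2*sum(y_hat*y) / (sum(y_hat) + sum(y))."""
    inter = np.sum(y_hat * y)
    return float(1.0 - 2.0 * inter / (np.sum(y_hat) + np.sum(y) + eps))

def ldseg_loss(y_hat, y, eps_pred, eps_true, gamma=2.0, lam=1.0):
    """Eqs. (16)-(18) for one sample: L = (L_CE + gamma * L_DSC) + lam * L_2.
    The mean (rather than sum) of squared errors is an assumption."""
    seg = cross_entropy(y_hat, y) + gamma * dice_loss(y_hat, y)
    den = float(np.mean((eps_pred - eps_true) ** 2))
    return seg + lam * den

# Toy inputs: a 4x4 two-class label image and a 2x2 latent noise map.
y = np.zeros((4, 4, 2))
y[..., 0] = 1.0                      # every pixel belongs to class 0
uniform = np.full((4, 4, 2), 0.5)    # maximally uncertain prediction
noise = np.ones((2, 2))
loss_perfect = ldseg_loss(y, y, noise, noise)
loss_uniform = ldseg_loss(uniform, y, np.zeros((2, 2)), noise)
```

In this toy case a perfect prediction with a perfect noise estimate gives a loss of essentially zero, while the uniform prediction is penalized by all three terms.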

bound that improves the quality of generated samples while being easier to implement,

$$L_{\text{DDPM}} := \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right] \tag{7}$$

where $\epsilon_\theta$ is a function approximator intended to predict $\epsilon$ from $x_t$ by a trained denoiser. With a trained denoiser, the data can be generated with the reverse process by iterating through $t = T, \ldots, 1$. Starting from $x_T \sim \mathcal{N}(0, I)$, the transitional states can be obtained by

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z \tag{8}$$

where $\sigma_t$ is the standard deviation of the noise added at timestep $t$ and $z \sim \mathcal{N}(0, I)$.

III. METHOD

The proposed LDSeg framework consists of four major components:
1) Label encoder: The label encoder, denoted by $f_{\text{label-enc}}$, is used to learn the low-dimensional latent representation with a standardized distribution of the shape manifolds of the target object.
2) Image encoder: The image encoder, denoted by $f_{\text{image-enc}}$, learns the low-dimensional image embedding $z_i$ from the source image $X$.
3) Conditional label denoiser: The denoiser $f_{\text{denoiser}}$ learns the added noise of the perturbed label embedding for each time step, conditioned on the embedding of the source image and the time step $t$.
4) Label decoder: The label decoder $f_{\text{label-dec}}$ is used to produce the segmentation by mapping the denoised latent space to its corresponding semantic label image in the original image domain.

The model training and inference workflows are shown in Figure 1.

We note that the segmentation labels are discrete, and hence corrupting them by Gaussian noise is unnatural, as the label/mask image has only a few modes (i.e., the number of object classes). We propose to mitigate this inherent problem by learning a low-dimensional standardized representation of the label images. In other words, we want to learn a label encoder $f_{\text{label-enc}}(\cdot)$ that projects the input labels into a latent space with standardized distribution. Essentially, the label encoder learns to produce a low-dimensional latent representation of the object shape manifolds for the label images. This low-dimensional standardized representation (label embedding) has two major advantages over the original label image: (1) it is continuous, thus ensuring smooth transitions among different object classes, and (2) it is computationally more efficient to train a conditional denoiser for a low-dimensional standardized latent space, thus making the algorithm significantly faster in the inference phase.

A standard DPM denoiser has two inputs, a noisy version of the input image and its corresponding timestep. For segmentation, the denoiser needs additional conditioning. The condition
Fig. 2. A sample from the GlaS dataset [26] is used to demonstrate the forward and the reverse diffusion processes. In the forward process (top row), the low-dimensional latent representation $z_{l(0)}$ is first obtained from the label image. Then, Gaussian noise is gradually injected for timesteps $t = 1, \ldots, T$, given the noise variance schedule $\beta$, where $\epsilon \sim \mathcal{N}(0, I)$. At timestep $T$, $z_{l(T)}$ approximately follows $\mathcal{N}(0, I)$. To start the reverse process (bottom row), $\tilde{z}_{l(T)}$ is sampled from $\mathcal{N}(0, I)$. Then the denoiser is used iteratively for timesteps $t = T, \ldots, 1$ with the source image embedding $z_i$ as the condition. At the end of the reverse process, the segmentation mask is obtained from $\tilde{z}_{l(0)}$ using the trained label decoder.
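The forward perturbation illustrated in the top row can be sketched in a few lines. This is a minimal NumPy sketch, assuming a linear β schedule from 1e-4 to 0.02 over T = 1000 steps (a common DDPM default; the schedule actually used by LDSeg is not stated here) and a random array standing in for the label embedding. The names `linear_beta_schedule` and `gaussian_block` are hypothetical.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear noise variance schedule beta_1, ..., beta_T."""
    return np.linspace(beta_start, beta_end, T)

def gaussian_block(z0, t, alpha_bar, rng):
    """Closed-form forward step for t in {1, ..., T}:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t - 1]  # arrays are 0-indexed, timesteps start at 1
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)   # alpha_bar_t = prod_{i<=t} (1 - beta_i)
rng = np.random.default_rng(0)
z0 = rng.standard_normal((32, 32))    # stand-in for a 32 x 32 label embedding
z_mid, eps_mid = gaussian_block(z0, 500, alpha_bar, rng)
z_T, _ = gaussian_block(z0, 1000, alpha_bar, rng)
```

Under this assumed schedule, `alpha_bar[-1]` is close to zero, so `z_T` retains almost none of `z0` and is approximately a standard Gaussian sample, which is what allows the reverse process to start from pure noise.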

can be the source image [9], [11], or a text indicating the target object [21]. As our objective is image semantic segmentation, we propose to use the image embedding as a condition for the denoiser. The image embedding is a low-dimensional latent representation of the source image having the same size as the label embedding, which is learned using an image encoder $f_{\text{image-enc}}(\cdot)$. The image embedding is concatenated with the noisy representation of the label embedding and used as a two-channel input to the denoiser, along with its corresponding timestep as a separate input. The denoiser $f_{\text{denoiser}}(\cdot)$ learns the transitional noisy distributions of the label embedding, conditioned on the image embedding, and predicts noise for a given timestep. Finally, to map the denoised latent space to the semantic segmentation in the original image domain, we learn a label decoder $f_{\text{label-dec}}$.

A. Loss function

Let $X$, $y$, and $\hat{y}$ be the source image, its corresponding label image, and the predicted segmentation, respectively, where $X$ and $y$ are sampled from the dataset. $z_i$ and $z_{l(0)}$ are the corresponding image embedding and label embedding,

$$z_i = f_{\text{image-enc}}(X) \tag{9}$$

$$z_{l(0)} = f_{\text{label-enc}}(y) \tag{10}$$

Given noise variance schedule parameters $\alpha$ and $\beta$, a Gaussian block $G(\cdot)$ is used to produce the noisy $z_{l(t)}$ for timestep $t \in \{1, \ldots, T\}$ [4], [16],

$$z_{l(t)} = G(z_{l(0)}, t) = \sqrt{\bar{\alpha}_t}\, z_{l(0)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{11}$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\epsilon \sim \mathcal{N}(0, I)$. The denoiser predicts the noise of the timestep $t$ conditioned on $z_i$. Let $z_{dn}$ be the denoised latent space,

$$z_{dn} = z_{l(t)} - f_{\text{denoiser}}(z_{l(t)}, z_i, t) \tag{12}$$

$$\hat{y} = f_{\text{label-dec}}(z_{dn}) \tag{13}$$

Our objective is to learn $q(\hat{y} \mid X) = \mathbb{E}_{q_i(z_i \mid X)}[q_s(\hat{y} \mid z)]$, where $q_l(z \mid y, X) \sim \mathcal{N}(z_{dn}, \sigma^2 I)$. The loss function $L$ consists of two terms, the segmentation loss $L_1$ and the denoiser loss $L_2$, where the segmentation loss is a combination of the cross-entropy loss $L_{CE}$ and the dice similarity coefficient (DSC) loss $L_{DSC}$.

$$L_{CE}(\hat{y}, y) = -\sum_{c \in C} y_c \log(\hat{y}_c) \tag{14}$$

$$L_{DSC}(\hat{y}, y) = 1 - \frac{2 \sum_i \hat{y}_i y_i}{\sum_i \hat{y}_i + \sum_i y_i} \tag{15}$$

$$L_1 = \mathbb{E}_{X, y}\left[L_{CE}(\hat{y}, y) + \gamma L_{DSC}(\hat{y}, y)\right] \tag{16}$$

$$L_2 = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\lVert f_{\text{denoiser}}(z_{l(t)}, z_i, t) - \epsilon \rVert^2\right] \tag{17}$$

$$L = L_1 + \lambda L_2 \tag{18}$$

where $c \in C$ is an object class of the set of object classes $C$, $i$ is the pixel index, and $\gamma$ and $\lambda$ are scalar coefficients. This end-to-end training strategy is functionally analogous to a VAE model. Although it does not explicitly parameterize a distribution like VAEs, the denoiser's role in handling noise in latent space creates an analogous structure. Unlike a VAE, the proposed model does not aim to explicitly match a latent distribution to a prior. Instead, the denoiser's regularization ensures that the latent space remains structured, noise-resilient and more representative of the segmentation-related features. An example of a conditional denoising forward process for segmentation generation is shown in Figure 2 (top row). The algorithm for end-to-end training is shown in Algorithm 1.

B. Reverse Process for Segmentation

As the image encoder is independent of the denoiser, we only need to obtain the image embedding $z_i$ at the start of the reverse process. In the reverse process, the main objective is to generate the latent representation $z_{l(0)}$, conditioned on $z_i$.
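A minimal NumPy sketch of this reverse process follows. It assumes the standard DDPM update of Eq. (8) applied to the label latent, with $\sigma_t = \sqrt{\beta_t}$ (one common choice; this section does not specify $\sigma_t$). The trained denoiser is replaced by a hypothetical oracle that knows the target latent and ignores $z_i$, so that the loop can be checked without a trained network; `reverse_process` and `make_oracle_denoiser` are illustrative names.

```python
import numpy as np

def reverse_process(denoiser, z_i, betas, rng):
    """Iterate t = T, ..., 1 from pure Gaussian noise, conditioned on z_i."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.standard_normal(z_i.shape)          # z~_l(T) ~ N(0, I)
    for t in range(len(betas), 0, -1):
        a_t, ab_t, b_t = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        eps_hat = denoiser(z, z_i, t)           # predicted noise
        mean = (z - b_t / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
        # No noise is added at the last step (t = 1).
        n = rng.standard_normal(z.shape) if t > 1 else np.zeros_like(z)
        z = mean + np.sqrt(b_t) * n             # sigma_t = sqrt(beta_t) (assumption)
    return z                                    # z~_l(0), input to the label decoder

def make_oracle_denoiser(z_target, betas):
    """Hypothetical stand-in for a trained f_denoiser: returns the exact noise
    that maps z_target to the current z at step t."""
    alpha_bar = np.cumprod(1.0 - betas)
    def denoiser(z, z_i, t):
        ab = alpha_bar[t - 1]
        return (z - np.sqrt(ab) * z_target) / np.sqrt(1.0 - ab)
    return denoiser

betas = np.linspace(1e-4, 2e-2, 50)             # short assumed schedule for the demo
rng = np.random.default_rng(0)
z_i = rng.standard_normal((32, 32))             # stand-in image embedding
z_target = rng.standard_normal((32, 32))        # stand-in "true" label embedding
z_hat = reverse_process(make_oracle_denoiser(z_target, betas), z_i, betas, rng)
```

With the oracle, the final step returns the target latent up to floating-point error, which is a convenient sanity check of the update rule. In practice a trained $f_{\text{denoiser}}$ conditioned on $z_i$ would take the oracle's place, and the result would be passed to the label decoder.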
Like the other image generation tasks of DPMs, a Gaussian $\mathcal{N}(0, I)$ is used as the noisy latent mask representation $\tilde{z}_{l(T)}$ at timestep $T$. Then the denoiser is iterated for $t = T, \ldots, 1$. At the end of the iteration, we obtain $\tilde{z}_{l(0)}$, which is used as an input to the trained label decoder to get the final segmentation $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$. An example of the reverse process is shown in Figure 2 (bottom row). The sampling algorithm for segmentation is shown in Algorithm 2.

Algorithm 1 Training
1: repeat
2:   $X, y \sim q_{\text{data}}(X, y)$
3:   $z_i = f_{\text{image-enc}}(X)$
4:   $z_{l(0)} = f_{\text{label-enc}}(y)$
5:   $t \sim \text{Uniform}(\{1, \ldots, T\})$
6:   $\epsilon \sim \mathcal{N}(0, I)$
7:   $z_{l(t)} = \sqrt{\bar{\alpha}_t}\, z_{l(0)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
8:   $z_{dn} = z_{l(t)} - f_{\text{denoiser}}(z_{l(t)}, z_i, t)$
9:   $\hat{y} = f_{\text{label-dec}}(z_{dn})$
10:  Take gradient descent step on $\nabla\left[L_{CE}(\hat{y}, y) + \gamma L_{DSC}(\hat{y}, y) + \lambda \lVert \epsilon - f_{\text{denoiser}}(z_{l(t)}, z_i, t) \rVert^2\right]$
11: until converged

Algorithm 2 Inference
1: $X \sim q_{\text{data}}(X)$, $\tilde{z}_{l(T)} \sim \mathcal{N}(0, I)$
2: $z_i = f_{\text{image-enc}}(X)$
3: for $t = T, \ldots, 1$ do
4:   $n \sim \mathcal{N}(0, I)$ if $t > 1$, else $n = 0$
5:   $\tilde{z}_{l(t-1)} = \frac{1}{\sqrt{\alpha_t}}\left(\tilde{z}_{l(t)} - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, f_{\text{denoiser}}(\tilde{z}_{l(t)}, z_i, t)\right) + \sigma_t n$
6: end for
7: $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$
8: return $\hat{y}$

IV. EXPERIMENTS

A. Datasets

We used three datasets to demonstrate the effectiveness of the proposed LDSeg:
1) Echo is a 2D+time echocardiogram (echo) video dataset from University of Iowa Hospitals & Clinics. All the videos are standard apical 4-chamber scans with a left-ventricular focused view. Echos are acquired by transthoracic echocardiography (TTE) using standard 2D echocardiography techniques following the guidelines of the American Society of Echocardiography. In total, the dataset contains 65 echos (2230 still frames). The left ventricles (LV) and the left atria (LA) were fully traced manually by an expert using ITK-SNAP.
2) GlaS [26] is a publicly available 2D histopathology dataset of Hematoxylin and Eosin (H&E) stained slides, acquired by a team of pathologists at the University Hospitals Coventry and Warwickshire, UK. The training set contains 37 benign and 48 malignant cases, whereas the test set contains an additional 37 benign and 43 malignant cases.
3) Knee (https://data-archive.nimh.nih.gov/oai/) is a publicly available 3D MRI dataset. The dataset contains 987 randomly selected 3D MRI scans from 244 patients at different time points. Focused volumetric regions with an image size of 160 × 104 × 256 around the femur cartilage with bone (FC) and tibia cartilage with bone (TC) are used as the region of interest (ROI). The FC and TC are segmented by an automatic segmentation algorithm and validated/edited by an expert. Figure 3 shows a sample of the Knee dataset.

Fig. 3. A sample of knee data. FC and TC are marked in green and red. Three slices from the axial, coronal and sagittal planes are shown along with the 3D surface plot for FC and TC.

B. Model Architecture

The label and image encoders both have architectures similar to a standard ResUnet encoder [27], without any skip connections. Each has several convolution and down-sampling layers that determine the size of the latent space for the low-dimensional projection of the label and source input images. We experimented with different down-sampling scales and chose 4 down-sampling layers, which produced the best results for all three datasets. The Echo, GlaS and Knee images were resized to 512 × 768, 512 × 512 and 128 × 128 × 256, respectively. Hence, the sizes of the low-dimensional $z_{l(0)}$ and $z_i$ for Echo, GlaS, and Knee data are 32 × 48, 32 × 32 and 8 × 8 × 16, respectively. We observed that these were the optimal latent sizes, as further down-sampling reduced the model accuracy, while less down-sampling reduced denoiser accuracy as the search space got enlarged and learning noise distributions became challenging. A normalization layer is added as the final layer of the label encoder to ensure a standardized distribution ($\mu = 0$, $\sigma = 1$) for the label embedding. On the other hand, the final two down-sampling layers of the image encoder are equipped with multi-head attention layers [28] to capture robust imaging features. The denoiser has a standard ResUnet shape with time-embedding blocks and self-attention layers. Specifically, we have adapted the denoiser architecture from [4]. The image embedding is concatenated
[Figure 4 panels, left to right: Image, GT, SwinUNet, MedSegDiff, nnUNet, ResUNet, LSegDiff, LDSeg. DSC values shown on the method panels: top row 0.76, 0.90, 0.91, 0.93, 0.94, 0.98; bottom row 0.91, 0.93, 0.88, 0.95, 0.92, 0.96.]

Fig. 4. Qualitative segmentation results of different methods for the GlaS and Echo datasets, shown in the top and bottom rows, respectively. Dark red marks the false negative and light red marks the false positive error on the segmentation result. GT indicates the ground-truth/label image.
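The DSC values annotated on the panels, and the IoU reported alongside them in the tables, follow the standard overlap definitions for binary masks. A small NumPy sketch is given below, assuming per-class binary masks and the convention (an assumption made here) that two empty masks score 1. The function names are illustrative.

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient: 2|P ∩ G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(pred, gt).sum() / denom)

def iou(pred, gt):
    """Intersection over union: |P ∩ G| / |P ∪ G|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)

# Toy example: a 2x2 ground-truth object and a 2x3 prediction that covers it.
gt = np.zeros((4, 4), dtype=int)
gt[:2, :2] = 1
pred = np.zeros((4, 4), dtype=int)
pred[:2, :3] = 1
dsc_value = dsc(pred, gt)   # 2*4 / (6 + 4)
iou_value = iou(pred, gt)   # 4 / 6
```

The two metrics are monotonically related for a single mask pair (DSC = 2·IoU / (1 + IoU)), which is why the method rankings in the DSC and IoU columns largely agree.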

with the noisy representation of the label embedding and used as a two-channel input to the denoiser, along with its corresponding timestep as a separate input. The decoder has an architecture similar to the decoder of ResUnet, without the skip connections. A softmax activation layer is used as the final layer of the decoder to obtain the probabilistic distribution of different object classes.

C. Experimental Setup

Exponentially decayed learning rates were used to train the models with $1 \times 10^{-2}$ and $1 \times 10^{-3}$ as the initial learning rates for all the model components. For the Echo and Knee datasets, we used an 80% : 20% split for training and testing, and within each training set, 10% was used for validation during model development. The well-separated training and test sets were used for the GlaS dataset [26]. The noise step $t$ is an integer randomly sampled from 1 to 1000 for each batch. For the loss function, choices of $\gamma = 2$ and $\lambda = 1$ gave the best model accuracy. An NVIDIA A100-SXM4 (80GB) GPU was used for model training and inference.

V. RESULTS

A. Segmentation Accuracy

TABLE I
QUANTITATIVE RESULTS FOR ECHO DATA SEGMENTATION.

                         DSC ↑                   IoU ↑
Method               LV     LA     LV+LA     LV     LA     LV+LA
SwinUNet [29]        0.85   0.70   0.81      0.75   0.56   0.69
U-net [30]           0.86   0.75   0.83      0.77   0.62   0.72
nnUNet [31]          0.89   0.76   0.86      0.81   0.64   0.76
MedSegDiff* [9]      0.89   0.81   0.87      0.82   0.70   0.78
V-net [32]           0.93   0.81   0.90      0.87   0.71   0.83
Res-Unet [27]        0.93   0.83   0.91      0.87   0.74   0.84
LSegDiff* [24]       0.93   0.85   0.91      0.87   0.75   0.84
LDSeg* (Ours)        0.93   0.87   0.92      0.87   0.77   0.84
* denotes the DPMs.

TABLE II
QUANTITATIVE RESULTS FOR GLAS DATA SEGMENTATION.

Method               DSC ↑   IoU ↑
SwinUNet [29]        0.76    0.62
U-net [30]           0.78    0.65
U-net++ [33]         0.78    0.66
MedT [34]            0.81    0.70
SDF-DDPM* [11]       0.83    0.72
nnUNet [31]          0.84    0.73
LSegDiff* [24]       0.84    0.74
MedSegDiff* [9]      0.84    0.74
Res-Unet [27]        0.86    0.76
LDSeg* (Ours)        0.91    0.84
* denotes the DPMs.

TABLE III
QUANTITATIVE RESULTS FOR THE KNEE DATA SEGMENTATION.

                         DSC ↑                   IoU ↑
Method               FC     TC     FC+TC     FC     TC     FC+TC
SwinUNet [29]        0.85   0.70   0.81      0.75   0.56   0.69
LSegDiff* [24]       0.96   0.96   0.96      0.93   0.92   0.93
Res-Unet [27]        0.97   0.96   0.96      0.93   0.93   0.93
nnUNet [31]          0.97   0.96   0.96      0.93   0.94   0.93
LDSeg* (Ours)        0.96   0.96   0.96      0.93   0.92   0.93
* denotes the DPMs.

We evaluated the performance of our proposed method LDSeg using two standard metrics: (1) Dice Similarity Coefficient (DSC) and (2) Intersection over Union (IoU). Tables I, II and III show the quantitative results for different methods for the Echo, GlaS, and Knee datasets, respectively. Figure 4 shows qualitative segmentation results by different methods for the GlaS and Echo datasets. LDSeg achieves the best DSC and IoU scores for all the datasets. The SDF-DDPM method uses signed distance functions to represent mask images, which is ambiguous for data with multiple labels. Hence, it is only reported for the GlaS dataset. For the 3D Knee dataset, it was
impossible to implement the full architecture of MedSegDiff due to the large GPU memory consumption. This indicates that diffusion in the latent space is of tremendous help for segmenting 3D medical images with a large image size when the GPU/CPU memory is constrained.

B. Computational Efficiency

Fig. 5. The number of evenly spaced sampling steps vs DSC for different datasets. The DDIM algorithm with only 2 evenly spaced sampling steps between 1 and T = 1000 (inclusive) produced maximum segmentation accuracy for all the datasets. The number of steps is plotted on a logarithmic scale for convenience.
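The evenly spaced sampling steps mentioned in the caption can be generated as follows. This is a sketch of the spacing rule only (K real numbers between 1 and T inclusive, each rounded to the nearest integer); `evenly_spaced_steps` is a hypothetical helper, and the DDIM update itself is not shown.

```python
import numpy as np

def evenly_spaced_steps(K, T=1000):
    """K evenly spaced timesteps in [1, T], rounded to integers and returned
    in descending order, as visited by the reverse process."""
    steps = np.rint(np.linspace(1, T, K)).astype(int)
    return steps[::-1]

steps_2 = evenly_spaced_steps(2)   # the two-step setting from the caption
steps_4 = evenly_spaced_steps(4)
```

With K = 2 the sampler visits only the two endpoints, t = 1000 and t = 1, instead of all 1000 timesteps.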
The major difference of LDSeg compared to other traditional diffusion-based segmentation methods is that the diffusion operations are performed in the low-dimensional latent space. We expect the memory demand and total sampling time of LDSeg for a sampling sequence to be less than those of the other diffusion-based methods. We further experimented on the sampling sequence for the reverse process. Nichol et al. [16] observed that a model trained with the "cosine" noise scheduler and sampled with the DDIM algorithm [17] performed remarkably well in image generation. The DDIM method proposed by Song et al. deterministically maps noise to images without using stochasticity in the transition states. Nichol et al. used fewer sampling steps (< 50) when T >= 1000 and achieved close to the optimal Fréchet inception distance (FID) score for image generation. They used K evenly spaced real numbers between 1 and T (inclusive) as sampling steps, and then rounded each resulting number to the nearest integer value. We adopted the same sampling strategy and observed that with remarkably fewer sampling steps (sampling steps = 2), LDSeg is able to achieve the same segmentation accuracy as using all the sampling steps, for all the test datasets. Figure 5 shows the number of sampling steps versus the DSC scores for all the datasets with the DDIM sampling algorithm. Table IV shows the execution times (in seconds) needed to segment a single image using all the sampling steps for T = 1000, and using the minimum number of sampling steps needed to achieve the same segmentation accuracy. With the optimal number of sampling steps, LDSeg achieved a significant increase in sampling efficiency (a reduction of about 268 times in the execution time).

TABLE IV
THE LDSEG ALGORITHM RUN TIMES FOR SEGMENTING A SINGLE IMAGE FROM EACH DATASET ARE SHOWN FOR SAMPLING STEPS 1000 (USING ALL THE SAMPLING STEPS) AND SAMPLING STEPS 2 (THE MINIMUM NUMBER OF SAMPLING STEPS TO ACHIEVE THE SAME ACCURACY AS USING ALL SAMPLING STEPS).

             Execution time (seconds)
Dataset      Sampling steps = 1000    Sampling steps = 2
Echo         91.23                    0.34
GlaS         80.37                    0.30
Knee         132.36                   0.49

We further investigated the execution times for segmenting a single image with different image sizes for different DPMs. The minimum number of sampling steps needed to achieve the maximum segmentation accuracy (i.e., the accuracy achieved using all sampling steps) can differ between DPMs (Figure 6(a)) because they learn the target noise distributions with different objectives. For a fair comparison, we fixed the total sampling steps to 50 and performed the experiments on the GlaS dataset by down-sampling the images into different sizes. Figure 6(b) shows that with increased image sizes, the execution times for SDF-DDPM and MedSegDiff increased exponentially, mainly due to denoising in the higher-dimensional image domain. For LDSeg, the execution times were close to constant with the increased image sizes, as the denoising in the low-dimensional latent space is efficient and less time/memory consuming.

Fig. 6. (a) The number of sampling steps vs DSCs using LDSeg, LSegDiff and MedSegDiff models for the GlaS dataset. LDSeg was able to achieve the maximum DSC with only 2 sampling steps, outperforming both LSegDiff (10) and MedSegDiff (700). (b) Image sizes vs execution times for segmenting a single image with different DPMs. The execution times of LDSeg and LSegDiff (both latent diffusion models) remained close to constant due to the use of a constrained low-dimensional latent space, while for SDF-DDPM and MedSegDiff, the execution times increased exponentially with increased image sizes.

The improvement of LDSeg over LSegDiff, both in terms of computation time and performance, can be attributed to the end-to-end training strategy. In particular, the use of the MSE
loss in the latent domain does not necessarily capture the segmentation errors, compromising the performance. Figure 7 shows a qualitative comparison of the segmentation results of LDSeg, LSegDiff and MedSegDiff for an image from the GlaS dataset, using different numbers of sampling steps. With only 2 sampling steps, LDSeg is comparable to deterministic models like ResUNet for image segmentation in terms of computation time.

[Figure 7 panels: Source image and Label image; rows LDSeg, LSegDiff, MedSegDiff; columns Ts = 2, Ts = 5, Ts = 10, Ts = 100]

Fig. 7. Qualitative comparison of the segmentation results for various sampling steps (Ts) of an image from the GlaS dataset for different methods. LDSeg produced qualitatively good segmentation even with 2 sampling steps, whereas LSegDiff needed at least 10 (both are latent diffusion models). MedSegDiff, which performs diffusion in the original image domain, could not produce reasonably good segmentation even with 100 sampling steps.

C. Robustness to Noise
One of the key challenges for medical image segmentation is to produce accurate segmentation from noisy image acquisitions. Deterministic segmentation models often fail in the presence of noise in the test dataset. As the denoiser in LDSeg is conditioned on the source image embedding, which is a low-dimensional representation of the source image, intuitively it should be more robust to high-frequency noise. Moreover, the learned shape manifolds of the target object act as a prior for the denoiser during iterative denoising, even with a noisy or slightly inaccurate image embedding, and help produce a clean latent representation z̃l(0). Thus, accurate segmentation can be obtained using the label decoder. To test the robustness of LDSeg to noise, we generated the noisy image Iσ from the input I by

$$I_\sigma = I + \mathcal{N}(0, \sigma) \qquad (19)$$

where I is a sample from the test data and σ is the noise variance. Figure 8(a) shows the DSC scores for LDSeg and a deterministic model, ResUNet, against different variances of added noise on the Echo and the Knee datasets. LDSeg showed strong robustness to the added noise even for σ = 0.2, and maintained reasonably good segmentation accuracy throughout. In contrast, the accuracy of ResUNet dropped drastically with the increasing amount of noise added to the source image. Figure 8(b) shows a sample Echo image with added noise of different variances and the corresponding segmentation results by LDSeg and ResUNet.

[Figure 8 panels: (a) DSC versus added noise variance; (b) Echo example with rows LDSeg and ResUNet plus GT, columns σ = 0.00, σ = 0.05, σ = 0.10, σ = 0.20, σ = 0.30]

Fig. 8. (a) Added noise variance σ versus DSC scores for ResUNet and LDSeg on the Echo and Knee datasets. LDSeg (solid lines) significantly outperformed the ResUNet model (dotted lines), which is a deterministic model, in terms of resilience to noise in the source image. (b) The top and bottom rows show sample segmentation results on an Echo image with different noise variances, using LDSeg and ResUNet, respectively. GT indicates the ground truth of the image.

VI. ABLATION STUDY
Two major components that distinguish the proposed LDSeg from other diffusion-based segmentation models are the label and the image encoders, which learn the latent embeddings of the object shape manifolds and the source image, respectively. We tested the effectiveness of each of these two components by creating several variants of LDSeg:
• LDSeg: The proposed framework that uses both the label and the image encoder.
• LDSeg(ld): The label encoder is replaced with a label down-sampler that down-samples the label image to the same size as zl(0). The image encoder is unchanged.
• LDSeg(id): The image encoder is replaced with an image down-sampler that down-samples the source image to the same size as zi. The label encoder is unchanged.
• LDSeg(ld,id): Both the label and the image encoders are replaced with the label and the image down-samplers.

Table V shows the results of the ablation study on the GlaS dataset. The models with direct down-sampling by nearest-neighbor interpolation of the image and/or label performed poorly. This indicates that the denoiser trained on a low-dimensional latent space may have superior noise prediction capability in general, but without proper learning of the object shape manifolds along with robust imaging features, generating accurate segmentation is extremely challenging. Figure 9 shows example segmentations with different variants of the LDSeg models.

TABLE V
GlaS ablation study.

Method         Label Encoder   Image Encoder   Label Down-sample   Image Down-sample   DSC    IoU
LDSeg(ld,id)   ✘               ✘               ✔                   ✔                   0.47   0.33
LDSeg(id)      ✔               ✘               ✘                   ✔                   0.61   0.47
LDSeg(ld)      ✘               ✔               ✔                   ✘                   0.71   0.56
LDSeg          ✔               ✔               ✘                   ✘                   0.91   0.84
ld → Label Down-sampled, id → Image Down-sampled.

[Figure 9 panels: Source image, LDSeg(ld,id), LDSeg(ld), Label image, LDSeg(id), LDSeg]

Fig. 9. A sample test image along with its ground truth mask and predictions by different variants of LDSeg.

VII. DISCUSSION
In medical imaging, an image often consists of 3D scans and cannot be down-sampled without losing important imaging features due to the complex tissue structures, organ-to-organ interactions, etc. LDSeg can be directly applied to large 3D images for accurate segmentation, while other traditional DPMs may not even be implementable due to the lack of enough GPU/CPU memory. On top of that, fast sampling in the reverse process makes the proposed LDSeg computationally much more efficient. This can be attributed to the much simpler sampling objective of LDSeg compared to a traditional DPM used for image generation. LDSeg samples from a posterior distribution for segmentation generation, which is much simpler and more concentrated than the prior distribution in the image generation case, resulting in remarkably faster sampling. Moreover, the end-to-end training strategy enables robust segmentation-related representation learning in the latent space, further improving the sampling efficiency.

LDSeg is significantly more robust to noise present in the source images than traditional deterministic segmentation models, which mitigates the noisy image acquisition problem in medical imaging. A key challenge for deterministic segmentation models is measuring prediction uncertainty. Being generative in nature, LDSeg can estimate prediction uncertainty by computing the standard deviation of the predictions over multiple runs. Figure 10 shows an example of uncertain regions in object boundary estimation using LDSeg.

Fig. 10. An example of uncertainty estimation of segmentation in the Echo dataset. (a) A sample Echo frame with marked unclear LV and LA wall regions (orange arrows). (b) The mean segmentation map using 100 sampling runs. (c) The obtained standard deviation (SD) map from the 100 sampling runs. The orange arrows show the highly uncertain regions with the three maximum SDs, which correlate with the locations in (a).

A possible limitation of the proposed approach is the use of a low-dimensional image embedding learned for highly complex medical structures. As the data complexity increases in terms of tissue structures with various distributions, it is very challenging to learn a proper image embedding that preserves all the fine details, which may hamper the denoising process of the denoiser. One way to address this problem would be to learn different frequency patterns of the input images with the image encoder to enforce additional conditions on the denoiser.

VIII. CONCLUSION
Adapting DPMs for medical image segmentation poses significant challenges due to large image sizes, complex tissue
structures, and noisy image acquisitions. We present LDSeg, a novel latent-diffusion-based segmentation framework that leverages the learned low-dimensional latent representations of the image and the target object’s shape manifolds in an end-to-end training strategy, which substantially improves the training and inference efficiency for not only 2D, but also 3D and higher-dimensional medical image segmentation. The proposed LDSeg demonstrated much improved robustness to severe noise present in the source image.

REFERENCES

[1] P. Aggarwal, R. Vig, S. Bhadoria, and C. Dethe, “Role of segmentation in medical imaging: A comparative study,” International Journal of Computer Applications, vol. 29, pp. 54–61, 2011.
[2] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning techniques for medical image segmentation: Achievements and challenges,” Journal of Digital Imaging, vol. 32, May 2019.
[3] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, “Medical image segmentation using deep learning: A survey,” IET Image Processing, vol. 16, no. 5, pp. 1243–1267, 2022.
[4] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[5] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022.
[6] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in neural information processing systems, vol. 32, 2019.
[7] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=PxTIG12RRHS
[8] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[9] J. Wu, H. Fang, Y. Zhang, Y. Yang, and Y. Xu, “Medsegdiff: Medical image segmentation with diffusion probabilistic model,” arXiv preprint arXiv:2211.00611, 2022.
[10] J. Wu, R. Fu, H. Fang, Y. Zhang, and Y. Xu, “Medsegdiff-v2: Diffusion based medical image segmentation with transformer,” arXiv preprint arXiv:2301.11798, 2023.
[11] L. Bogensperger, D. Narnhofer, F. Ilic, and T. Pock, “Score-based generative models for medical image segmentation using signed distance functions,” 2023.
[12] A. Rahman, J. M. J. Valanarasu, I. Hacihaliloglu, and V. Patel, “Ambiguous medical image segmentation using diffusion models,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11536–11546, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258048896
[13] W. Ding, S. Geng, H. Wang, J. Huang, and T. Zhou, “Fdiff-fusion: Denoising diffusion fusion network based on fuzzy learning for 3d medical image segmentation,” Information Fusion, vol. 112, p. 102540, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/J.INFFUS.2024.102540
[14] T. Chen, C. Wang, Z. Chen, Y. Lei, and H. Shan, “Hidiff: Hybrid diffusion framework for medical image segmentation,” 2024. [Online]. Available: https://arxiv.org/abs/2407.03548
[15] F. A. Zaman, M. Jacob, A. Chang, K. Liu, M. Sonka, and X. Wu, “Surf-cdm: Score-based surface cold-diffusion model for medical image segmentation,” arXiv preprint arXiv:2312.12649, 2023.
[16] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[17] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[18] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[19] Y. Wang, X. Wang, A.-D. Dinh, B. Du, and C. Xu, “Learning to schedule in diffusion probabilistic models,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2478–2488.
[20] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
[21] K. Pnvr, B. Singh, P. Ghosh, B. Siddiquie, and D. Jacobs, “Ld-znet: A latent diffusion approach for text-based image segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4157–4168.
[22] X. Liu, W. Li, and Y. Yuan, “DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation,” in Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. LNCS 15012. Springer Nature Switzerland, October 2024.
[23] Y. Mao, J. Zhang, M. Xiang, Y. Lv, Y. Zhong, and Y. Dai, “Contrastive conditional latent diffusion for audio-visual segmentation,” arXiv preprint arXiv:2307.16579, 2023.
[24] H. Vu Quoc, T. Tran Le Phuong, M. Trinh Xuan, and S. Dinh Viet, “Lsegdiff: A latent diffusion model for medical image segmentation,” in Proceedings of the 12th International Symposium on Information and Communication Technology, 2023, pp. 456–462.
[25] A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” in Neural Information Processing Systems (NeurIPS), 2021.
[26] K. Sirinukunwattana, J. P. W. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, A. Böhm, O. Ronneberger, B. B. Cheikh, D. Racoceanu, P. Kainz, M. Pfeiffer, M. Urschler, D. R. J. Snead, and N. M. Rajpoot, “Gland segmentation in colon histology images: The glas challenge contest,” 2016.
[27] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[28] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “Multi-head attention: Collaborate instead of concatenate,” arXiv preprint arXiv:2006.16362, 2020.
[29] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision. Springer, 2022, pp. 205–218.
[30] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[31] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
[32] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 fourth international conference on 3D vision (3DV). IEEE, 2016, pp. 565–571.
[33] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, 2018, pp. 3–11.
[34] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, “Medical transformer: Gated axial-attention for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 36–46.
