Abstract—Diffusion Probabilistic Models (DPMs) suffer from inefficient inference due to their slow sampling and high memory consumption, which limits their applicability to various medical imaging applications. In this work, we propose a novel conditional

In computer vision, DPMs have achieved remarkable results for image generation, outperforming other generative models [8]. Standard DPMs have two major steps: a forward process that perturbs the image with added Gaussian noise
arXiv:2407.12952v2 [cs.CV] 18 Jan 2025
The second key challenge of DPMs is the time-consuming iterative sampling process, which makes segmentation DPMs significantly slower than their deterministic counterparts. Various approaches have been developed to improve the sampling efficiency for natural image generation [16]–[19]. Recently, latent diffusion models, which perform the sampling in a low-dimensional latent space, have been used to speed up the diffusion process in natural image generation [20] and segmentation [21]. In both works, the latent representation of the source image was used. Liu et al. proposed a latent diffusion model that uses label rectification in latent space for semi-supervised medical image segmentation [22]. Latent diffusion has also been used in audio-visual segmentation for robust audio recognition [23]. Recently, Vu Quoc et al. proposed a latent diffusion model for image segmentation [24]. They proposed a two-step training strategy in which a variational autoencoder (VAE) is first trained to learn the latent distribution of the label image, followed by training a conditional denoiser of the latent codes; the embedding of the source image is used as the condition. This latent diffusion model is computationally efficient in comparison to traditional DPMs. However, since the score/denoiser model is trained separately by minimizing the mean square error (MSE) between the recovered and true latents, the MSE loss in the latent domain may not capture the segmentation errors well, which may compromise the accuracy of segmentation. Vahdat et al. proposed an end-to-end training strategy that jointly learns latent embeddings and a denoiser of latent codes for image generation [25]. They observed improved accuracy and faster sampling for the end-to-end framework.

We introduce a novel conditional latent diffusion-based generative framework (LDSeg) for medical image segmentation, which capitalizes on the inherent advantages of latent diffusion models. Unlike the two-step training strategy in [24], we jointly train the encoder, decoder, and the score model in an end-to-end fashion. We use a combination of denoiser loss in the latent domain and segmentation loss in the label domain. The proposed LDSeg learns the standardized latent representation of the target object shape manifolds, enabling smooth state transitions between object classes. Moreover, unlike traditional DPMs, LDSeg learns to sample from the posterior distribution, which is significantly simpler and more

mated labels to remove hole-shaped structures. In addition, the standardized distribution of the latents (having a similar distribution of the added Gaussian noise) will ensure faster convergence and improved gradient flow during training.
• The diffusion in the latent space ensures less memory consumption and a much faster training and inference process, enabling efficient application of DPMs to 3D and higher-dimensional medical image segmentation.
• The LDSeg model is significantly more robust to noise in the source images than deterministic segmentation models due to the low-dimensional image embeddings, which mitigates the segmentation challenges in medical images with noisy acquisition.

II. BACKGROUND

A. Denoising Diffusion Probabilistic Models (DDPM)

DDPMs are designed to learn a data distribution by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a Markov chain of fixed length $T$. More precisely, this denoising reverse process can be modeled as $p_\theta(x_{0:T})$, which is a Markov chain with learned Gaussian transitions starting at $p(x_T) = \mathcal{N}(x_T; 0, I)$:

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \tag{1}$$

$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big) \tag{2}$$

where $x_0 \sim q(x_0)$ is a sample from a real data distribution, and $x_1, \dots, x_T$ are transitional states for timesteps $t = 1, \dots, T$.

The forward process in a DDPM is also a Markov chain, which gradually adds noise to the image. Given data $x_0 \sim q(x_0)$ sampled from the real distribution, the forward process at time $t \in [1, T]$ can be defined as $q(x_t \mid x_{t-1})$, where Gaussian noise is gradually added given a noise variance schedule $\beta_t \in [\beta_1, \beta_T]$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \tag{3}$$
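As a minimal sketch of the forward chain in Eq. (3), the snippet below applies the standard DDPM Gaussian transition $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ from [4] step by step to a toy vector. The linear schedule endpoints (1e-4 and 0.02), the number of steps, and all function names are illustrative assumptions, not values taken from this paper.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta_t, rng):
    """One transition q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_chain(x0, betas, rng):
    """Run the whole Markov chain and return [x_0, x_1, ..., x_T]."""
    states = [list(x0)]
    for beta_t in betas:
        states.append(forward_step(states[-1], beta_t, rng))
    return states

def signal_fraction(betas):
    """sqrt(alpha_bar_T): how much of x_0 survives after all T steps."""
    alpha_bar = 1.0
    for beta_t in betas:
        alpha_bar *= 1.0 - beta_t
    return math.sqrt(alpha_bar)

rng = random.Random(0)
betas = linear_beta_schedule(1000)
states = forward_chain([1.0, -1.0, 0.5, 0.0], betas, rng)
```

Under this assumed schedule, `signal_fraction(betas)` falls below 0.01, which is why $x_T$ can be treated as approximately standard normal at the start of the reverse process.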
[Figure 1 diagram: training and inference paths through f_image-enc, f_label-enc, f_denoiser and f_label-dec, with timestep t as an input; legend entries: source image, time embedding, Gaussian diffusion, concatenation.]
Fig. 1. The proposed LDSeg model. The label encoder $f_{\text{label-enc}}$ and image encoder $f_{\text{image-enc}}$ are used to obtain the corresponding low-dimensional latent representations $z_{l(0)}$ and $z_i$ for a given ground-truth label/mask image $y$ and source image $X$, respectively. A denoiser $f_{\text{denoiser}}$, conditioned on the source image embedding $z_i$, is used to learn the noise distributions of $z_{l(t)}$ for timesteps $t = 1, \dots, T$, where $T$ is the total number of diffusion steps. $z_{l(t)}$ is obtained by perturbing $z_{l(0)}$ with a Gaussian block $G(\cdot)$ for a given noise variance scheduler $\alpha$ and $\beta$. The cleaned latent space $z_{dn}$ is obtained by subtracting the predicted noise $z_{n(t)}$ from the perturbed one $z_{l(t)}$. Finally, a label decoder $f_{\text{label-dec}}$ is used to obtain the segmentation $\hat{y}$ of the semantic labels in the original image from $z_{dn}$. The model is trained in an end-to-end fashion, where our objective is to learn $q(\hat{y} \mid X) = \mathbb{E}_{q_i(z_i \mid X)}[q_s(\hat{y} \mid z)]$, where $q_l(z \mid y, X) \sim \mathcal{N}(z_{dn}, \sigma^2 I)$. In the inference phase, starting with a random Gaussian $\tilde{z}_{l(T)} \sim \mathcal{N}(0, I)$, the denoiser is iterated for timesteps $t = T, \dots, 1$ to obtain $\tilde{z}_{l(0)}$ with $z_i$ as the condition. The final segmentation $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$ is obtained using the trained label decoder.
bound that improves the quality of generated samples while being easier to implement,

$$L_{\text{DDPM}} := \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right] \tag{7}$$

where $\epsilon_\theta$ is a function approximator intended to predict $\epsilon$ from $x_t$ by a trained denoiser. With a trained denoiser, the data can be generated with the reverse process by iterating through $t = T, \dots, 1$. Starting from $x_T \sim \mathcal{N}(0, I)$, the transitional states can be obtained by,

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z \tag{8}$$

where $\sigma_t$ is the noise variance of timestep $t$ and $z \sim \mathcal{N}(0, I)$.

III. METHOD

The proposed LDSeg framework consists of four major components:
1) Label encoder: The label encoder, denoted by $f_{\text{label-enc}}$, is used to learn the low-dimensional latent representation with a standardized distribution of the shape manifolds of the target object.
2) Image encoder: The image encoder, denoted by $f_{\text{image-enc}}$, learns the low-dimensional image embedding $z_i$ from the source image $X$.
3) Conditional label denoiser: The denoiser $f_{\text{denoiser}}$ learns the added noise of the perturbed label embedding for each time step, conditioned on the embedding of the source image and the time step $t$.
4) Label decoder: The label decoder $f_{\text{label-dec}}$ is used to produce segmentation by mapping the denoised latent space to its corresponding semantic label image in the original image domain. The model training and inference workflows are shown in Figure 1.

We note that the segmentation labels are discrete, and hence corrupting them by Gaussian noise is unnatural, as the label/mask image has only a few modes (i.e., the number of object classes). We propose to mitigate this inherent problem by learning a low-dimensional standardized representation of the label images. In other words, we want to learn a label encoder $f_{\text{label-enc}}(\cdot)$ that projects the input labels into a latent space with standardized distribution. Essentially, the label encoder learns to produce low-dimensional latent representation of the object shape manifolds for the label images. This low-dimensional standardized representation (label embedding) has two major advantages over the original label image, (1) it is continuous, thus ensuring smooth transition among different object classes, and (2) it is computationally more efficient to train a conditional denoiser for a low-dimensional standardized latent space, thus making the algorithm significantly faster in the inference phase.

A standard DPM denoiser has two inputs, a noisy version of the input image and its corresponding timestep. For segmentation, the denoiser needs additional conditioning. The condition
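[Figure 2 diagram: forward (top row) and reverse (bottom row) diffusion on a GlaS sample. The reverse row is annotated with the start $\tilde{z}_{l(T)} \sim \mathcal{N}(0, I)$, the iterative denoiser update from $\tilde{z}_{l(t)}$ to $\tilde{z}_{l(t-1)}$, and the final segmentation $\hat{y}$ obtained from $\tilde{z}_{l(0)}$ through $f_{\text{label-dec}}$.]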
Fig. 2. A sample from the GlaS dataset [26] is used to demonstrate the forward and the reverse diffusion processes. In the forward process (top row), the low-dimensional latent representation $z_{l(0)}$ is first obtained from the label image. Then, Gaussian noise is gradually injected for timesteps $t = 1, \dots, T$, given the noise variance schedule $\beta$, where $\epsilon \sim \mathcal{N}(0, I)$. At timestep $T$, $z_{l(T)}$ follows $\mathcal{N}(0, I)$. To start the reverse process (bottom row), $\tilde{z}_{l(T)}$ is sampled from $\mathcal{N}(0, I)$. Then the denoiser is used iteratively for timesteps $t = T, \dots, 1$ with the source image embedding $z_i$ as the condition. At the end of the reverse process, the segmentation mask is obtained from $\tilde{z}_{l(0)}$ using the trained label decoder.
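The reverse process described in this caption can be sketched as a plain loop. This is an illustrative sketch only: `toy_denoiser` is a hypothetical stand-in for the trained $f_{\text{denoiser}}$, the linear noise schedule is assumed, and $\sigma_t = \sqrt{\beta_t}$ is one common choice from [4] rather than a setting confirmed by this paper.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alphas, alpha_bars) for an assumed linear schedule."""
    step = (beta_end - beta_start) / max(num_steps - 1, 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def toy_denoiser(z_t, z_i, t):
    """Hypothetical noise predictor; a trained network would go here."""
    return [0.1 * (zt - zi) for zt, zi in zip(z_t, z_i)]

def reverse_sample(z_i, num_steps, denoiser, rng):
    """Iterate t = T..1 starting from z_T ~ N(0, I), conditioned on z_i."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in z_i]
    for t in range(num_steps, 0, -1):
        beta_t, alpha_t, abar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps_hat = denoiser(z, z_i, t)
        coef = beta_t / math.sqrt(1.0 - abar_t)
        sigma_t = math.sqrt(beta_t) if t > 1 else 0.0  # no noise on the last step
        z = [
            (zt - coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
            for zt, e in zip(z, eps_hat)
        ]
    return z  # plays the role of the final latent passed to the label decoder

z_i = [0.2, -0.4, 0.0, 1.0]
z_0 = reverse_sample(z_i, num_steps=50, denoiser=toy_denoiser, rng=random.Random(0))
```

Because the image embedding `z_i` does not change across iterations, it is computed once and reused at every step; only the label latent is updated.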
can be the source image [9], [11], or a text indicating the target object [21]. As our objective is image semantic segmentation, we propose to use image embedding as a condition for the denoiser. The image embedding is a low-dimensional latent representation of the source image having the same size as the label embedding, which is learned using an image encoder $f_{\text{image-enc}}(\cdot)$. The image embedding is concatenated with the noisy representation of the label embedding and used as a two-channel input to the denoiser, along with its corresponding timestep as a separate input. The denoiser $f_{\text{denoiser}}(\cdot)$ learns the transitional noisy distributions of the label embedding, conditioned on the image embedding, and predicts noise for a given timestep. Finally, to map the denoised latent space to the semantic segmentation in the original image domain, we learn a label decoder $f_{\text{label-dec}}$.

A. Loss function

Let $X$, $y$, and $\hat{y}$ be the source image, its corresponding label image, and the predicted segmentation, respectively, sampled from the dataset. $z_i$ and $z_{l(0)}$ are the corresponding image embedding and label embedding,

$$z_i = f_{\text{image-enc}}(X) \tag{9}$$

$$z_{l(0)} = f_{\text{label-enc}}(y) \tag{10}$$

Given noise variance schedule parameters $\alpha$ and $\beta$, a Gaussian block $G(\cdot)$ is used to produce the noisy $z_{l(t)}$ for timestep $t \in [1, T]$ [4], [16],

$$z_{l(t)} = G(z_{l(0)}, t) = \sqrt{\bar{\alpha}_t}\, z_{l(0)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{11}$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\epsilon \sim \mathcal{N}(0, I)$. The denoiser predicts the noise of the timestep $t$ conditioned on $z_i$. Let $z_{dn}$ be the denoised latent space,

$$z_{dn} = z_{l(t)} - f_{\text{denoiser}}(z_{l(t)}, z_i, t) \tag{12}$$

$$\hat{y} = f_{\text{label-dec}}(z_{dn}) \tag{13}$$

Our objective is to learn $q(\hat{y} \mid X) = \mathbb{E}_{q_i(z_i \mid X)}[q_s(\hat{y} \mid z)]$, where $q_l(z \mid y, X) \sim \mathcal{N}(z_{dn}, \sigma^2 I)$. The loss function $L$ consists of two terms, the segmentation loss $L_1$ and the denoiser loss $L_2$, where the segmentation loss is a combination of cross-entropy loss $L_{CE}$ and dice similarity coefficient (DSC) loss $L_{DSC}$.

$$L_{CE}(\hat{y}, y) = -\sum_{c \in C} y_c \log(\hat{y}_c) \tag{14}$$

$$L_{DSC}(\hat{y}, y) = 1 - \frac{2 \sum_i \hat{y}_i y_i}{\sum_i \hat{y}_i + \sum_i y_i} \tag{15}$$

$$L_1 = \mathbb{E}_{X, y}\left[L_{CE}(\hat{y}, y) + \gamma L_{DSC}(\hat{y}, y)\right] \tag{16}$$

$$L_2 = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \lVert f_{\text{denoiser}}(z_{l(t)}, z_i, t) - \epsilon \rVert^2 \tag{17}$$

$$L = L_1 + \lambda L_2 \tag{18}$$

where $c \in C$ is an object class of a set of object classes $C$, $i$ is the corresponding pixel, and $\gamma$ and $\lambda$ are scalar coefficients. This end-to-end training strategy is functionally analogous to a VAE model. Although it does not explicitly parameterize a distribution like VAEs, the denoiser's role in handling noise in latent space creates an analogous structure. Unlike a VAE, the proposed model does not aim to explicitly match a latent distribution to a prior. Instead, the denoiser's regularization ensures that the latent space remains structured, noise-resilient and more representative of the segmentation-related features. An example of a conditional denoising forward process for segmentation generation is shown in Figure 2 (top row). The algorithm for end-to-end training is shown in Algorithm 1.

B. Reverse Process for Segmentation

As the image encoder is independent of the denoiser, we only need to obtain the image embedding $z_i$ at the start of the reverse process. In the reverse process, the main objective is to generate latent representation $z_{l(0)}$, conditioned on $z_i$.
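Before the sampling details, the training-side quantities of Eqs. (11) to (18) can be sketched on flat Python lists. The helper names, the two-class toy example, and the default values of $\gamma$ and $\lambda$ are illustrative assumptions; the paper's encoders, denoiser and decoder are neural networks that are not reproduced here.

```python
import math

def gaussian_block(z0, eps, alpha_bar_t):
    """Eq. (11): z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def cross_entropy(y_hat, y, tiny=1e-12):
    """Eq. (14) averaged over pixels; y_hat and y hold per-pixel class vectors."""
    total = 0.0
    for probs, onehot in zip(y_hat, y):
        total -= sum(t * math.log(p + tiny) for p, t in zip(probs, onehot))
    return total / len(y)

def dice_loss(y_hat, y, tiny=1e-12):
    """Eq. (15) for one foreground channel given as flat lists."""
    inter = sum(p * t for p, t in zip(y_hat, y))
    return 1.0 - 2.0 * inter / (sum(y_hat) + sum(y) + tiny)

def denoiser_loss(eps_hat, eps):
    """Eq. (17) up to a constant factor: mean of squared noise errors."""
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)

def total_loss(y_hat, y, eps_hat, eps, gamma=1.0, lam=1.0):
    """Eqs. (16) and (18): L = (L_CE + gamma * L_DSC) + lambda * L_2."""
    fg_hat = [p[1] for p in y_hat]  # foreground probability per pixel
    fg = [t[1] for t in y]
    l1 = cross_entropy(y_hat, y) + gamma * dice_loss(fg_hat, fg)
    return l1 + lam * denoiser_loss(eps_hat, eps)

# Toy example: 3 pixels, 2 classes (background, foreground).
y = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y_hat = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
eps = [0.5, -1.0, 0.25]
z_t = gaussian_block([1.0, 0.0, -1.0], eps, alpha_bar_t=0.5)
perfect = total_loss(y, y, eps, eps)  # near zero when everything matches
imperfect = total_loss(y_hat, y, [0.0, 0.0, 0.0], eps)
```

The point of combining both terms is that the segmentation loss is measured in the label domain while the denoiser loss is measured in the latent domain, so a small latent error that still changes the decoded mask is penalized.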
Like the other image generation tasks of DPMs, a Gaussian $\mathcal{N}(0, I)$ is used as the noisy latent mask representation $\tilde{z}_{l(T)}$ at timestep $T$. Then the denoiser is iterated for $t = T, \dots, 1$. At the end of the iteration, we obtain $\tilde{z}_{l(0)}$, which is used as an input to the trained label decoder to get the final segmentation $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$. An example of the reverse process is shown in Figure 2 (bottom row). The sampling algorithm for segmentation is shown in Algorithm 2.

Algorithm 2 Inference
1: $X \sim q_{\text{data}}(X)$, $\tilde{z}_{l(T)} \sim \mathcal{N}(0, I)$
2: $z_i = f_{\text{image-enc}}(X)$
3: for $t = T, \dots, 1$ do
4:   $n \sim \mathcal{N}(0, I)$ if $t > 1$, else $n = 0$
5:   $\tilde{z}_{l(t-1)} = \frac{1}{\sqrt{\alpha_t}} \left( \tilde{z}_{l(t)} - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, f_{\text{denoiser}}(\tilde{z}_{l(t)}, z_i, t) \right) + \sigma_t n$
6: end for
7: $\hat{y} = f_{\text{label-dec}}(\tilde{z}_{l(0)})$
8: return $\hat{y}$

IV. EXPERIMENTS

A. Datasets

We have used 3 datasets to demonstrate the effectiveness of the proposed LDSeg:
1) Echo is a 2D+time echocardiogram (echo) video dataset from University of Iowa Hospitals & Clinics. All the videos are standard apical 4-chamber scans with a left-ventricular focused view. Echos are acquired by transthoracic echocardiography (TTE) using standard 2D echocardiography techniques following the guidelines of the American Society of Echocardiography. In total, the dataset contains 65 echos (2230 still frames). The left ventricles (LV) and the left atria (LA) were fully traced manually by an expert using ITK-snap.
2) GlaS [26] is a publicly available 2D histopathology dataset of Hematoxylin and Eosin (H&E) stained slides,
3) Knee (https://data-archive.nimh.nih.gov/oai/) is a publicly available 3D MRI dataset. The dataset contains 987 randomly selected 3D MRI scans from 244 patients at different time points. Focused volumetric regions with an image size of 160 × 104 × 256 around the femur cartilage with bone (FC) and tibia cartilage with bone (TC) are used as the region of interest (ROI). The FC and TC are segmented by an automatic segmentation algorithm and validated/edited by an expert. Figure 3 shows a sample of the Knee dataset.

Fig. 3. A sample of knee data. FC and TC are marked in green and red, respectively. Three slices from the axial, coronal and sagittal planes are shown along with the 3D surface plot for FC and TC.

B. Model Architecture

The label and image encoders both have architectures similar to a standard ResUnet encoder [27], without any skip connections. Each has several convolution and down-sampling layers that determine the size of the latent space for the low-dimensional projection of the label and source input images. We experimented with different down-sampling scales and chose 4 down-sampling layers, which produced the best results for all three datasets. The Echo, GlaS and Knee images were resized to 512 × 768, 512 × 512 and 128 × 128 × 256, respectively. Hence, the sizes of the low-dimensional $z_{l(0)}$ and $z_i$ for Echo, GlaS, and Knee data are 32 × 48, 32 × 32 and 8 × 8 × 16, respectively. We observed that these were the optimal latent sizes, as further down-sampling reduced the model accuracy, while less down-sampling reduced denoiser accuracy as the search space got enlarged and learning noise distributions became challenging. A normalization layer is added as the final layer of the label encoder to ensure a standardized distribution ($\mu = 0$, $\sigma = 1$) for the label embedding. On the other hand, the final two down-sampling layers of the image encoder are equipped with multi-head attention layers [28] to capture robust imaging features. The denoiser has a standard ResUnet shape with time-embedding blocks and self-attention layers. Specifically, we have adapted the denoiser architecture from [4]. The image embedding is concatenated
[Figure 4 panels, DSC per method. Top row: 0.76, 0.90, 0.91, 0.93, 0.94, 0.98. Bottom row: 0.91, 0.93, 0.88, 0.95, 0.92, 0.96.]
Fig. 4. Qualitative segmentation results of different methods for the GlaS and Echo datasets, shown in the top and bottom rows, respectively. Dark red marks the false negative and light red marks the false positive errors on the segmentation result. GT indicates the ground-truth/label image.
B. Computational Efficiency
Fig. 5. The number of evenly spaced sampling steps vs DSC for different datasets. The DDIM algorithm with only 2 evenly spaced sampling steps between 1 and T = 1000 (inclusive) produced maximum segmentation accuracy for all the datasets. The number of steps is plotted on a logarithmic scale for convenience.
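To make "evenly spaced sampling steps between 1 and T (inclusive)" concrete, the helper below picks such a sub-sequence of timesteps. The rounding rule and the function name are assumptions; the paper does not state how its step indices were rounded.

```python
def evenly_spaced_timesteps(total_steps, num_sampling_steps):
    """Return timesteps from total_steps down to 1, evenly spaced and inclusive."""
    if not 1 <= num_sampling_steps <= total_steps:
        raise ValueError("need 1 <= num_sampling_steps <= total_steps")
    if num_sampling_steps == 1:
        return [total_steps]
    stride = (total_steps - 1) / (num_sampling_steps - 1)
    steps = [round(1 + i * stride) for i in range(num_sampling_steps)]
    return steps[::-1]  # sampling runs from the noisiest step to the cleanest

two_steps = evenly_spaced_timesteps(1000, 2)
five_steps = evenly_spaced_timesteps(1000, 5)
```

With T = 1000 and two sampling steps, `two_steps` is `[1000, 1]`, so the sampler visits only the two endpoints of the schedule.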
Fig. 6. (a) The number of sampling steps vs DSCs using LDSeg, LSegDiff and MedSegDiff models for the GlaS dataset. LDSeg was able to achieve the maximum DSC with only 2 sampling steps, outperforming both LSegDiff (10) and MedSegDiff (700). (b) Image sizes vs execution times for segmenting a single image with different DPMs. The execution times of LDSeg and LSegDiff (both latent diffusion models) remained close to constant due to the use of constrained low-dimensional latent space, while for SDF-DDPM and MedSegDiff, the execution times increased exponentially with increased image sizes.

TABLE IV
THE LDSEG ALGORITHM RUN TIMES FOR SEGMENTING A SINGLE IMAGE FROM EACH DATASET ARE SHOWN FOR SAMPLING STEPS 1000 (USING ALL THE SAMPLING STEPS) AND SAMPLING STEPS 2 (THE MINIMUM NUMBER OF SAMPLING STEPS TO ACHIEVE THE SAME ACCURACY AS USING ALL SAMPLING STEPS).
[Figure 7 panels: columns Ts = 2, Ts = 5, Ts = 10, Ts = 100; rows LDSeg, LSegDiff, MedSegDiff; the source image and label image are shown for reference.]
Fig. 7. Qualitative comparison of the segmentation results for various sampling steps (Ts) of an image from the GlaS dataset for different methods. LDSeg produced qualitatively good segmentation even with 2 sampling steps, whereas LSegDiff needed at least 10 (both are latent diffusion models). MedSegDiff, which performs diffusion in the original image domain, could not produce reasonably good segmentation even with 100 sampling steps.
[Figure 8 panels: (a) DSC curves; (b) Echo examples with rows LDSeg and ResUNet and a GT reference.]
Fig. 8. (a) Added noise variance σ versus DSC scores for ResUNet and LDSeg on the Echo and Knee datasets. LDSeg (solid lines) significantly outperformed the ResUNet model (dotted lines), which is a deterministic model, in terms of noise resilience on the source image. (b) The top and bottom rows show sample segmentation results for an Echo image at different noise variances, using LDSeg and ResUNet, respectively. GT indicates the ground truth of the image.
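A sketch of the noise-robustness protocol behind this figure: perturb a test image with zero-mean Gaussian noise as in Eq. (19), then score a prediction with the Dice similarity coefficient. Here σ is read as a variance, following the text, so the standard deviation is its square root; that reading, the flat-list image representation, and the function names are illustrative assumptions.

```python
import math
import random

def add_gaussian_noise(image, variance, rng):
    """Eq. (19): I_sigma = I + N(0, sigma), with sigma read as a variance."""
    std = math.sqrt(variance)
    return [pixel + rng.gauss(0.0, std) for pixel in image]

def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks (flat lists)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * inter / denom

rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy = add_gaussian_noise(image, variance=0.2, rng=rng)
clean_copy = add_gaussian_noise(image, variance=0.0, rng=rng)

target = [0, 1, 1, 1, 0]
pred = [0, 1, 1, 0, 0]
score = dice_score(pred, target)  # 2 * 2 / (2 + 3) = 0.8
```

In an actual evaluation, the noisy image would be passed through each segmentation model and `dice_score` would be averaged over the test set for every noise level.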
loss in the latent domain does not necessarily capture the segmentation errors, compromising the performance. Figure 7 shows the qualitative comparison of the segmentation results for the LDSeg, LSegDiff and MedSegDiff methods for an image of the GlaS dataset, using different sampling steps. With only 2 sampling steps, LDSeg is comparable to deterministic models like ResUNet for image segmentation in terms of computation time.

C. Robustness to Noise

One of the key challenges for medical image segmentation is to produce accurate segmentation from noisy image acquisition. Deterministic segmentation models often fail in the presence of noise in the test dataset. As the denoiser in LDSeg is conditioned on the source image embedding, which is a low-dimensional representation of the source image, intuitively it should be more robust to high-frequency noise. Moreover, the learned shape manifolds of the target object act as a prior for the denoiser during iterative denoising, even with a noisy or slightly inaccurate image embedding, and help produce a clean latent representation $\tilde{z}_{l(0)}$. Thus, accurate segmentation can be obtained using the label decoder. To test the robustness of LDSeg to noise, we have generated the noisy image $I_\sigma$ from the input $I$ by,

$$I_\sigma = I + \mathcal{N}(0, \sigma) \tag{19}$$

where $I$ is a sample from the test data and $\sigma$ is the noise variance. Figure 8(a) shows the DSC scores for LDSeg and a deterministic model, ResUNet, against different variances of added noise on the Echo and the Knee datasets. LDSeg showed strong robustness to the added noise even for $\sigma = 0.2$, and maintained reasonably good segmentation accuracy throughout. In contrast, the accuracy of ResUNet dropped drastically with the increasing amount of noise added to the source image. Figure 8(b) shows a sample Echo image with added noise of different variances and the corresponding segmentation results by LDSeg and ResUNet.

VI. ABLATION STUDY

Two major components that distinguish the proposed LDSeg from other diffusion-based segmentation models are the label and the image encoders, which learn the latent embeddings of the object shape manifolds and the source image, respectively. We tested the effectiveness of each of these two components by creating several variants of LDSeg:
• LDSeg: The proposed framework that uses both the label and the image encoder.
• LDSeg(ld): The label encoder is replaced with a label down-sampler that down-samples the label image to the same size as $z_{l(0)}$. The image encoder is unchanged.
• LDSeg(id): The image encoder is replaced with an image down-sampler that down-samples the source image to the same size as $z_i$. The label encoder is unchanged.
• LDSeg(ld,id): Both the label and the image encoder are replaced with the label and the image down-samplers.

Table V shows the results of the ablation study on the GlaS dataset. The models with direct down-sampling by nearest-neighbor interpolation of the image and/or label performed poorly. This indicates that the denoiser trained on low-dimensional latent space may have superior noise prediction capability in general, but without proper learning of the object shape manifolds along with the robust imaging features, generating accurate segmentation is extremely challenging. Figure 9 shows example segmentations with different variants of the LDSeg models.

TABLE V
GLAS ABLATION STUDY.

[Figure 9 panel labels: Source image, LDSeg(ld,id), LDSeg(ld).]

enough GPU/CPU memory. On top of that, fast sampling in the reverse process makes the proposed LDSeg computationally much more efficient. This can be attributed to the much simpler sampling objective of LDSeg compared to a traditional DPM used for image generation. LDSeg samples from the posterior distribution for segmentation generation, which is much simpler and more concentrated than a prior distribution in the image generation case, resulting in remarkably faster sampling. Moreover, the end-to-end training strategy enables robust segmentation-related representation learning in the latent space, further improving the sampling efficiency.

[Figure 10 panels: a, b, c.]
Fig. 10. An example of uncertainty estimation of segmentation in the Echo dataset. (a) A sample Echo frame with marked unclear LV and LA wall regions (orange arrows). (b) The mean segmentation map using 100 sampling runs. (c) The obtained standard deviation (SD) map from the 100 sampling runs. The orange arrows show the highly uncertain regions with three maximum SDs that correlate to the locations in (a).
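The mean and standard-deviation maps in panels (b) and (c) can be computed pixel-wise from repeated sampling runs. The sketch below assumes each run yields a flat list of foreground values; `toy_sampler` is a hypothetical stand-in for one full LDSeg reverse pass, and the use of the population standard deviation is an assumption.

```python
import random
import statistics

def toy_sampler(rng):
    """Hypothetical stochastic segmentation: pixel 1 is ambiguous, others are not."""
    return [0.0, rng.choice([0.0, 1.0]), 1.0]

def uncertainty_maps(sampler, num_runs, rng):
    """Return per-pixel (mean map, standard-deviation map) over num_runs samples."""
    runs = [sampler(rng) for _ in range(num_runs)]
    per_pixel = list(zip(*runs))  # regroup values by pixel position
    mean_map = [statistics.fmean(values) for values in per_pixel]
    sd_map = [statistics.pstdev(values) for values in per_pixel]
    return mean_map, sd_map

mean_map, sd_map = uncertainty_maps(toy_sampler, num_runs=100, rng=random.Random(0))
```

Pixels on which the runs disagree receive a non-zero SD, which is the quantity visualized as the uncertainty map; pixels with identical values across runs receive an SD of zero.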
structures, and noisy image acquisitions. We present LDSeg, a novel latent diffusion based segmentation framework that leverages the learned low-dimensional latent representations of the image and target object's shape manifolds in an end-to-end training strategy, which substantially improves the training and inference efficiency for not only 2D, but also 3D and higher-dimensional medical image segmentation. The proposed LDSeg demonstrated much improved robustness to severe noise present in the source image.

REFERENCES

[1] P. Aggarwal, R. Vig, S. Bhadoria, and C. Dethe, “Role of segmentation in medical imaging: A comparative study,” International Journal of Computer Applications, vol. 29, pp. 54–61, 2011.
[2] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning techniques for medical image segmentation: Achievements and challenges,” Journal of Digital Imaging, vol. 32, 05 2019.
[3] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, “Medical image segmentation using deep learning: A survey,” IET Image Processing, vol. 16, no. 5, pp. 1243–1267, 2022.
[4] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[5] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022.
[6] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in neural information processing systems, vol. 32, 2019.
[7] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=PxTIG12RRHS
[8] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[9] J. Wu, H. Fang, Y. Zhang, Y. Yang, and Y. Xu, “Medsegdiff: Medical image segmentation with diffusion probabilistic model,” arXiv preprint arXiv:2211.00611, 2022.
[10] J. Wu, R. Fu, H. Fang, Y. Zhang, and Y. Xu, “Medsegdiff-v2: Diffusion based medical image segmentation with transformer,” arXiv preprint arXiv:2301.11798, 2023.
[11] L. Bogensperger, D. Narnhofer, F. Ilic, and T. Pock, “Score-based generative models for medical image segmentation using signed distance functions,” 2023.
[12] A. Rahman, J. M. J. Valanarasu, I. Hacihaliloglu, and V. Patel, “Ambiguous medical image segmentation using diffusion models,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11 536–11 546, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258048896
[13] W. Ding, S. Geng, H. Wang, J. Huang, and T. Zhou, “Fdiff-fusion: Denoising diffusion fusion network based on fuzzy learning for 3d medical image segmentation,” Information Fusion, vol. 112, p. 102540, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/J.INFFUS.2024.102540
[14] T. Chen, C. Wang, Z. Chen, Y. Lei, and H. Shan, “Hidiff: Hybrid diffusion framework for medical image segmentation,” 2024. [Online]. Available: https://arxiv.org/abs/2407.03548
[15] F. A. Zaman, M. Jacob, A. Chang, K. Liu, M. Sonka, and X. Wu, “Surf-cdm: Score-based surface cold-diffusion model for medical image segmentation,” arXiv preprint arXiv:2312.12649, 2023.
[16] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[17] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[18] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[19] Y. Wang, X. Wang, A.-D. Dinh, B. Du, and C. Xu, “Learning to schedule in diffusion probabilistic models,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2478–2488.
[20] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[21] K. Pnvr, B. Singh, P. Ghosh, B. Siddiquie, and D. Jacobs, “Ld-znet: A latent diffusion approach for text-based image segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4157–4168.
[22] X. Liu, W. Li, and Y. Yuan, “DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation,” in proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. LNCS 15012. Springer Nature Switzerland, October 2024.
[23] Y. Mao, J. Zhang, M. Xiang, Y. Lv, Y. Zhong, and Y. Dai, “Contrastive conditional latent diffusion for audio-visual segmentation,” arXiv preprint arXiv:2307.16579, 2023.
[24] H. Vu Quoc, T. Tran Le Phuong, M. Trinh Xuan, and S. Dinh Viet, “Lsegdiff: A latent diffusion model for medical image segmentation,” in Proceedings of the 12th International Symposium on Information and Communication Technology, 2023, pp. 456–462.
[25] A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” in Neural Information Processing Systems (NeurIPS), 2021.
[26] K. Sirinukunwattana, J. P. W. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, A. Böhm, O. Ronneberger, B. B. Cheikh, D. Racoceanu, P. Kainz, M. Pfeiffer, M. Urschler, D. R. J. Snead, and N. M. Rajpoot, “Gland segmentation in colon histology images: The glas challenge contest,” 2016.
[27] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[28] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “Multi-head attention: Collaborate instead of concatenate,” arXiv preprint arXiv:2006.16362, 2020.
[29] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision. Springer, 2022, pp. 205–218.
[30] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[31] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
[32] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 fourth international conference on 3D vision (3DV). Ieee, 2016, pp. 565–571.
[33] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, 2018, pp. 3–11.
[34] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, “Medical transformer: Gated axial-attention for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 36–46.