Learning Semi-Supervised Medical Image Segmentation

Fani Deligianni†                         Hang Dai
University of Glasgow                    University of Glasgow
[email protected]          [email protected]
Figure 1. The overall architecture of our framework for semi-supervised medical image segmentation.
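To make the overall pipeline in Figure 1 concrete, below is a minimal PyTorch-style sketch of one training step of the cross-teaching setup, with the loss terms combined as in Eq. 6. The function name, the use of plain cross-entropy in place of the Dice + CE supervised loss, and the callables for the contrastive and registration terms are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def train_step_model_a(model_a, model_b, x_lab, y_lab, x_unlab,
                       w_cps, w_cl, contrastive_term, registration_term=None):
    """One cross-teaching update for model A (model B is trained symmetrically).
    Combines the terms as in Eq. 6; L_rs of Eq. 7 is added when available."""
    x_all = torch.cat([x_lab, x_unlab])
    logits_a, logits_b = model_a(x_all), model_b(x_all)
    n = x_lab.shape[0]

    # supervised loss on labeled slices (cross-entropy stands in for Dice + CE)
    loss_sup = F.cross_entropy(logits_a[:n], y_lab)

    # cross-pseudo-supervision: B's hard predictions supervise A on unlabeled slices
    pseudo_b = logits_b[n:].argmax(dim=1).detach()
    loss_cps = F.cross_entropy(logits_a[n:], pseudo_b)

    # pixel-wise supervised contrastive regularisation (Eq. 4); in the full
    # method this acts on projected pixel features H_*(E_*(X)), omitted here
    loss_cl = contrastive_term()

    loss = loss_sup + w_cps * loss_cps + w_cl * loss_cl   # Eq. 6
    if registration_term is not None:                     # L_rs (Eq. 7)
        loss = loss + registration_term()
    return loss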
Our contrastive loss follows [40], but with the key difference that it contrasts pixel features instead of whole-image features. We project each pixel to a shared embedding space, then regularize in a supervised manner, encouraging features of anchor pixels to be similar to those of pixels having the same class (positives), and dissimilar to those of different classes (negatives).

Specifically, as shown in Fig. 1, we extract a feature batch F = F_A ∪ F_B, where F_* = H_*(E_*(X)) and H_*(·) is the projector. The choice of anchors, which serve as the comparison target of each class, has a great impact on learning; we therefore try to reduce the number of anchors with incorrect class labels. For every class in the current mini-batch, we sample pixels with a high top-1 probability value as anchors A_c for class c, setting

    A_c = { f_i | (y_i = c) ∧ (p_i > h) },    (3)

where f_i is the i-th pixel feature in F, and the threshold h on the top-1 probability value is set to 0.5 so that only hard samples are excluded.

The supervised contrastive loss L_cl is then computed as

    L_cl = - (1/|C|) Σ_{c∈C} (1/|A_c^n|) Σ_{a_i ∈ A_c^n} log [ exp(a_i · a_p / τ) / (exp(a_i · a_p / τ) + Z) ].    (4)

This sampling balances representations across classes and reduces memory usage, unlike [36], which simply discards background features. Note that in our experiments, N = 1000 and O = 500. The positive key a_p^l is given by the average of all other pixels of the same class, i.e. those in the anchor set A_c:

    a_p^l = (1/|A_c|) Σ_{a_i ∈ A_c} a_i.    (5)

Contrasting only an average positive instead of all positives is computationally cheaper, yet still allows reducing the average distance between the anchor and the other samples of class c [50]. In Section 3.4 we will show how spatial registration information can provide additional positives for contrastive learning.

Training and inference. The two models are trained simultaneously with separate losses. The total training loss L_A for model A is

    L_A = L_sup(A) + w_cps · L_cps(A) + w_cl · L_cl,    (6)

and similarly for model B. Here the w_* are weighting factors used to balance each loss term. Overall, this setup yields comparable performance to the SOTA contrastive cross-teaching method, MCSC [49], while being significantly simpler and easier to adapt to use registration information.
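As an illustration of Eqs. 3 and 5, the following is a minimal PyTorch sketch of confident-anchor selection and the label-guided positive key. The tensor layout, the per-class loop, and the L2 normalisation of features are assumptions made for the example rather than the paper's implementation.

import torch
import torch.nn.functional as F

def class_anchors_and_positive_keys(feats, labels, probs, h=0.5):
    """Pick confident anchors A_c per class (Eq. 3) and form the
    label-guided positive key a_p^l as their mean (Eq. 5).

    feats:  (P, D) projected pixel features (rows of F)
    labels: (P,)   (pseudo-)labels y_i
    probs:  (P,)   top-1 softmax probability p_i of each pixel
    """
    anchors, pos_keys = {}, {}
    for c in labels.unique().tolist():
        # Eq. 3: anchors are pixels of class c whose top-1 probability exceeds h
        mask = (labels == c) & (probs > h)
        if mask.sum() < 2:
            continue
        a_c = F.normalize(feats[mask], dim=1)   # normalisation is an assumption
        anchors[c] = a_c
        pos_keys[c] = a_c.mean(dim=0)           # Eq. 5: average positive key a_p^l
    return anchors, pos_keys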
Figure 2. Supervised contrastive learning guided by labels vs. registration: In the semi-supervised setting, for unlabeled data, the
supervised contrastive loss uses pseudo-label information to select pairs. However, pseudo-labels are unreliable, especially early in training.
For example, in the middle panel, the anchor is wrongly labeled as Myo (green), which leads to an incorrect learning signal, due to
contrasting with positives correctly labeled as Myo. In contrast, registration finds the anatomically-closest point to the anchor in each 3D
volume, without relying on label predictions from models, enabling the contrastive loss to perform correct comparisons between cases.
Although the segmentation model remains 2D, operating on individual slices, each slice is now considered within the 3D space of its original volume. We define the set of registration transforms as T = {T_ij}_{i=1,j=1}^N, where T_ij maps points from volume v_i to volume v_j, and N is the total number of volumes.

Our CCT-R uses T in two ways. First, we go beyond cross-teaching, introducing a new loss that uses registration to transfer labels from labeled to unlabeled data (Sec. 3.3). Furthermore, traditional supervised contrastive learning typically relies on predicted logits, which can introduce errors. Our CCT-R mitigates this by using T to identify anatomically corresponding features across volumes, providing a complementary set of positives (Sec. 3.4).

3.3. Registration supervision loss

We use spatial transforms obtained by registration as an additional source of pseudo-labels to supervise the two models. Specifically, by transforming a point from an unlabeled volume to the corresponding point in a labeled volume, we can assume that these two points correspond to the same anatomical location. Thus, the label from the labeled volume can be used as supervision for the unlabeled slice. This provides much more accurate pseudo-labels early in training, and also helps to reduce the confirmation bias that can arise from cross-teaching.

Formally, we define a new loss L_rs that encourages each pixel to match the label of its corresponding location in the paired labeled volume:

    L_rs = - (1/M) Σ_{i=1}^{M} ( L_dice(p_i^u, r_i^u) + L_ce(p_i^u, r_i^u) ),    (7)

where p_i^u is the class probability map of the i-th unlabeled image x_i^u, and r_i^u is the registered label found by registration. L_rs is then added to the overall loss function (Eq. 6).

Assuming that the slice x_i^u belongs to the unlabeled volume v_j^u, we define the registered label r_i^u by mapping the ground truth y_i^l from the labeled volume v_q^l:

    r_i^u = T_qj(y_i^l),    (8)

where T_qj is the transform from v_q^l to v_j^u. This transform aligns the label y_i^l with the corresponding coordinates in the slice x_i^u, resulting in r_i^u. This greatly improves the model's learning performance (see Sec. 4.4), especially in cases with minimal supervision (e.g. only one labeled volume).
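A minimal sketch of the registration supervision loss of Eqs. 7-8 is given below. The helper names, the soft-Dice formulation, and the assumption that the registered label r_i^u has already been produced by warping y_i^l are all illustrative choices, not the paper's released code.

import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target, num_classes, eps=1e-6):
    """Soft Dice loss between per-class probabilities (B, C, H, W)
    and an integer label map (B, H, W)."""
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def registration_supervision_loss(probs_u, reg_labels):
    """Eq. 7: Dice + cross-entropy between the unlabeled prediction p_i^u
    and the registered label r_i^u obtained via Eq. 8."""
    num_classes = probs_u.shape[1]
    log_probs = torch.log(probs_u.clamp_min(1e-8))
    return soft_dice_loss(probs_u, reg_labels, num_classes) + F.nll_loss(log_probs, reg_labels)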
Best registration selection strategy. In practice, registrations are often imperfect, particularly for complex anatomical regions such as the abdomen. Moreover, the loss described in Eq. 7 does not require every image to be paired with all others. We therefore design a strategy to choose which registered pairs should be used. Importantly, this strategy cannot rely on ground-truth labels, due to our semi-supervised setting. Specifically, we measure the cycle-consistency of the transforms from T (Sec. 3.2) between two volumes, say v_j^u and v_q^l. We apply the forward transform T_jq (j-to-q) and the reverse transform T_qj (q-to-j) on volume v_j^u:

    ṽ_j^u = T_qj(T_jq(v_j^u)).    (9)

Ideally, ṽ_j^u should be equal to the original volume v_j^u, meaning the composition of forward and reverse transformations approximates the identity function. We calculate the global similarity between v_j^u and ṽ_j^u using both mutual information (MI) [59] and root mean square error (RMSE), and use these to derive a composite score

    S = w_rmse · RMSE + w_mi · MI,    (10)

where w_rmse and w_mi weight the importance of RMSE and MI, respectively. We then select the v_q^l that minimizes this composite score to generate the best additional pseudo-label r_i^u for the unlabeled slice x_i^u in v_j^u.
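The following NumPy sketch illustrates the cycle-consistency score of Eqs. 9-10. The histogram-based mutual information and, in particular, the negative sign on w_mi (so that better alignment lowers the score that is then minimized) are our assumptions for the example; the paper does not specify these details.

import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based MI between two intensity volumes."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def brs_score(v_u, v_u_cycled, w_rmse=1.0, w_mi=-1.0):
    """Composite cycle-consistency score (Eq. 10), where v_u_cycled is
    T_qj(T_jq(v_u)) as in Eq. 9. Lower is better under this sign convention."""
    rmse = float(np.sqrt(np.mean((v_u - v_u_cycled) ** 2)))
    mi = mutual_information(v_u, v_u_cycled)
    return w_rmse * rmse + w_mi * mi

# choose the labeled volume q whose transforms are most cycle-consistent, e.g.:
# best_q = min(candidate_qs, key=lambda q: brs_score(v_u, cycle_warp(v_u, q)))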
3.4. Registration-enhanced positive sampling

We next show how to use registration to improve the supervised contrastive learning loss in Eq. 4. Fig. 2 shows the shortcomings of standard positive sampling in comparison to our novel approach integrating registration. Positives a_p derived from (pseudo-)labels are sampled from any location within the same organ or class, as shown in Eq. 4. In contrast, registration-based positives correspond to the exact same anatomical location within the organ, albeit in different volumes or patients. Any noise in registration-based positives stems from registration inaccuracies and is independent of pseudo-label errors. Therefore, we augment the set of positive samples by incorporating registration-based examples. This approach reduces the confirmation bias that can arise when learning only from pseudo-labels.

Assume the xyz coordinate of anchor a_i in an image from volume v_q is denoted by p. We use a registration transform to get the corresponding positive coordinates p_j in v_j:

    p_j = T_qj(p),    (11)

where j ∈ {1, 2, ..., N} and j ≠ q, i.e. we consider all other training volumes in V. Given the p_j, we extract the positive feature a_{p_j}^r from the corresponding feature maps.

Since our mini-batch comprises 2D slices rather than full 3D volumes, there is only a small probability that the feature map containing a given registered point p_j will in fact be available in the current mini-batch. We therefore build a memory bank B to serve as a source of feature maps, which provides more diverse registered positive samples across different 3D volumes. The memory bank B stores feature maps of 2D slices. For every slice in each mini-batch, new feature maps are added to B: if a slice is not yet in B, it is added; otherwise, the existing entry is updated with the new features. Once B reaches its maximum capacity K, the oldest slices are removed in a first-in, first-out (FIFO) order. This provides the model with a more diverse set of features from various 3D volumes.

The positive features a_{p_j}^r are averaged over the available j indices that exist in the memory bank:

    a_p^r = (1/|J|) Σ_{j∈J} a_{p_j},    (12)

where J represents the set of volume indices for which the feature point exists in the memory bank. Note that J is a subset of the total volume indices {1, 2, ..., N}.

Finally, we combine this with the pseudo-label-supervised positive key a_p^l from Eq. 5 to give a single combined positive key a_p for a_i:

    a_p = w_1 a_p^l + w_2 a_p^r.    (13)

We use these positives in the contrastive loss Eq. 4, but otherwise keep it unchanged.
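A compact sketch of the registration-enhanced positive sampling with a FIFO memory bank (Eqs. 11-13) is shown below. The slice keys, the capacity handling, the coordinate lookup, and the default blending weights w1 = w2 = 0.5 are illustrative assumptions rather than details given in the paper.

from collections import OrderedDict
import torch

class SliceFeatureBank:
    """FIFO memory bank of per-slice feature maps used to look up
    registration-derived positives (Sec. 3.4)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.bank = OrderedDict()            # slice_id -> (D, H, W) feature map

    def update(self, slice_id, feat_map):
        if slice_id in self.bank:            # refresh an existing slice
            self.bank.pop(slice_id)
        self.bank[slice_id] = feat_map.detach()
        if len(self.bank) > self.capacity:   # drop the oldest entry (FIFO)
            self.bank.popitem(last=False)

    def positive_at(self, slice_id, yx):
        """Return the feature at registered in-plane coordinates, if cached."""
        if slice_id not in self.bank:
            return None
        y, x = yx
        return self.bank[slice_id][:, y, x]

def combined_positive_key(a_p_label, reg_positives, w1=0.5, w2=0.5):
    """Average the available registered positives (Eq. 12) and blend them
    with the label-guided key a_p^l (Eq. 13)."""
    if len(reg_positives) == 0:
        return a_p_label
    a_p_reg = torch.stack(reg_positives).mean(dim=0)   # Eq. 12
    return w1 * a_p_label + w2 * a_p_reg               # Eq. 13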
4. Experiments

Datasets. We evaluate CCT-R using two challenging benchmark datasets. ACDC [9] comprises 200 short-axis cardiac MR volumes from 100 cases, with segmentation masks provided for the left ventricle (LV), myocardium (Myo), and right ventricle (RV). We allocate 70 cases (1930 slices) for training, 10 for validation, and 20 for testing, as in [57], and match their choice of labeled cases. Synapse [43] consists of abdominal CT volumes from 30 cases, with eight labeled organs: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach. As in [15], we use 18 cases (2212 slices) for training and 12 for testing. We precomputed a composite pairwise registration (affine for ACDC, and affine + B-spline deformable transformation for Synapse) for all training data using ITK [56, 60].

Metrics. For quantitative evaluation, we use two widely-recognized metrics for 2D segmentation: the Dice coefficient (DSC) and the 95% Hausdorff Distance (HD).

Baselines. We first compare with a registration baseline that is not learning-based: we use the transforms to propagate labels from the labeled training cases to the test images, similar to [8, 27, 55], selecting labeled cases with our BRS. We also compare with a joint registration and segmentation model, DeepAtlas [82], which learns registration from scratch simultaneously with segmentation; to stay consistent with our CCT-R, we reimplemented it using a 2D U-Net segmentation model. We evaluate several recent S4 methods with the U-Net [67] backbone: Mean Teacher (MT) [71], Deep Co-Training (DCT) [66], Uncertainty Aware Mean Teacher (UAMT) [85], Interpolation Consistency Training (ICT) [75], Cross Consistency Training (CCT) [64], Cross Pseudo Supervision (CPS) [19], and Cross Teaching Supervision (CTS) [57], which like CCT-R uses Swin-UNet [11] (Transformer) and U-Net backbones. In addition, we include the SOTA S4 method with contrastive learning, MCSC [49]. As a reference, we also train the U-Net backbone from the S4 methods on only the labeled subset of cases (LS), without additional tricks. We also include fully-supervised methods: the same U-Net trained under full supervision (FS), and the SOTA fully-supervised methods BATFormer [47] (on ACDC) and nnFormer [88] (on Synapse). We retrain all baseline models using their recommended hyperparameters, and report the results from [57] or our replication, whichever is better. Furthermore, the results of all baselines are given in the appendix.

Table 1. Segmentation results on ACDC for our method and baselines, according to DSC (%) and HD (mm).
Labeled  Methods  Mean (DSC↑ HD↓)  Myo (DSC↑ HD↓)  LV (DSC↑ HD↓)  RV (DSC↑ HD↓)
70 (100%):
UNet-FS 91.7 4.0 89.0 5.0 94.6 5.9 91.4 1.2
BATFormer [47] 92.8 8.0 90.26 6.8 96.3 5.9 91.97 11.3
7 (10%):
Reg. only (Aff) 30.7 16.4 19.7 13.9 42.0 14.4 30.5 20.8
DeepAtlas [82] 79.4 8.0 79.0 11.7 81.9 3.2 77.3 9.0
UNet-LS 75.9 10.8 78.2 8.6 85.5 13.0 63.9 10.7
MT [71] 80.9 11.5 79.1 7.7 86.1 13.4 77.6 13.3
DCT [66] 80.4 13.8 79.3 10.7 87.0 15.5 75.0 15.3
UAMT [85] 81.1 11.2 80.1 13.7 87.1 18.1 77.6 14.7
ICT [75] 82.4 7.2 81.5 7.8 87.6 10.6 78.2 3.2
CCT [64] 84.0 6.6 82.3 5.4 88.6 9.4 81.0 5.1
CPS [19] 85.0 6.6 82.9 6.6 88.0 10.8 84.2 2.3
CTS [57] 86.4 8.6 84.4 6.9 90.1 11.2 84.8 7.8
MCSC [49] 89.4 2.3 87.6 1.1 93.6 3.5 87.1 2.1
Ours (Affine) 90.3 1.6 87.4 1.4 92.7 2.2 90.9 1.3
3 (5%):
Reg. only (Aff) 32.0 17.8 18.0 15.7 43.9 16.0 34.0 21.7
DeepAtlas [82] 59.0 8.6 62.8 5.4 67.8 7.7 46.4 12.6
UNet-LS 51.2 31.2 54.8 24.4 61.8 24.3 37.0 44.4
MT [71] 56.6 34.5 58.6 23.1 70.9 26.3 40.3 53.9
DCT [66] 58.2 26.4 61.7 20.3 71.7 27.3 41.3 31.7
UAMT [85] 61.0 25.8 61.5 19.3 70.7 22.6 50.8 35.4
ICT [75] 58.1 22.8 62.0 20.4 67.3 24.1 44.8 23.8
CCT [64] 58.6 27.9 64.7 22.4 70.4 27.1 40.8 34.2
CPS [19] 60.3 25.5 65.2 18.3 72.0 22.2 43.8 35.8
CTS [57] 65.6 16.2 62.8 11.5 76.3 15.7 57.7 21.4
MCSC [49] 73.6 10.5 70.0 8.8 79.2 14.9 71.7 7.8
Ours (Affine) 85.7 2.0 83.8 1.4 89.9 2.4 83.5 2.1
1 (1.4%):
Reg. only (Aff) 23.4 19.7 13.6 18.7 31.6 19.0 25.1 21.4
DeepAtlas [82] 40.4 18.5 42.2 11.7 34.7 29.2 44.4 14.6
UNet-LS 26.4 60.1 26.3 51.2 28.3 52.0 24.6 77.0
CTS [57] 46.8 36.3 55.1 5.5 64.8 4.1 20.5 99.4
MCSC [49] 58.6 31.2 64.2 13.3 78.1 12.2 33.5 68.1
Ours (Affine) 80.4 3.5 78.3 3.2 83.6 4.3 79.3 2.9
Best is bold, Second Best is underlined.

Table 2. Segmentation results on Synapse for our method and baselines, according to DSC (%) and HD (mm).
Labeled  Methods  DSC↑ HD↓  Aorta Gallb Kid L Kid R Liver Pancr Spleen Stom
18 (100%):
UNet-FS 75.6 42.3 88.8 56.1 78.9 72.6 91.9 55.8 85.8 74.7
nnFormer 86.6 10.6 92.0 70.2 86.6 86.3 96.8 83.4 90.5 86.8
4 (20%):
Reg. only (Affine) 27.0 39.6 16.0 7.5 36.4 33.0 56.8 13.1 28.5 25.1
Reg. only (Aff+Def) 32.5 36.5 29.7 4.8 36.5 29.4 65.5 14.2 48.0 31.7
DeepAtlas [82] 56.1 85.3 69.2 43.3 50.8 55.2 88.8 30.5 62.7 48.0
UNet-LS 47.2 122.3 67.6 29.7 47.2 50.7 79.1 25.2 56.8 21.5
UAMT [85] 51.9 69.3 75.3 33.4 55.3 40.8 82.6 27.5 55.9 44.7
CPS [19] 57.9 62.6 75.6 41.4 60.1 53.0 88.2 26.2 69.6 48.9
CTS [57] 64.0 56.4 79.9 38.9 66.3 63.5 86.1 41.9 75.3 60.4
MCSC [49] 68.5 24.8 76.3 44.4 73.4 72.3 91.8 46.9 79.9 62.9
Ours (Affine) 70.0 23.2 79.8 34.5 71.0 70.7 92.8 49.6 87.4 74.4
Ours (Affine+Deform) 71.4 21.1 80.4 42.3 73.0 70.0 93.7 49.4 87.9 74.2
2 (10%):
Reg. only (Affine) 25.4 36.8 17.5 3.5 32.7 27.5 53.4 12.6 33.4 22.5
Reg. only (Aff+Def) 29.1 44.0 27.2 11.3 28.6 26.5 66.4 12.7 29.7 30.3
DeepAtlas [82] 44.0 67.1 68.0 24.9 37.9 46.0 82.7 18.4 44.2 30.6
UNet-LS 45.2 55.6 66.4 27.2 46.0 48.0 82.6 18.2 39.9 33.4
UAMT [85] 49.5 62.6 71.3 21.1 62.6 51.4 79.3 22.8 58.2 29.0
CPS [19] 48.8 65.6 70.9 21.3 58.0 45.1 80.7 23.5 58.0 32.7
CTS [57] 55.2 45.4 71.5 25.6 62.6 67.5 78.2 26.3 75.9 34.3
MCSC [49] 61.1 32.6 73.9 26.4 69.9 72.7 90.0 33.2 79.4 43.0
Ours (Affine) 65.1 22.5 75.7 28.4 74.5 75.0 91.8 38.0 82.3 55.1
Ours (Affine+Deform) 66.5 19.7 77.6 34.4 75.1 74.2 92.6 39.5 82.1 56.1
1 (5%):
Reg. only (Affine) 26.4 45.0 16.3 6.6 35.8 32.8 53.5 14.4 28.7 22.7
Reg. only (Aff+Def) 27.4 52.2 26.4 11.3 30.5 27.1 61.6 12.8 26.3 23.6
DeepAtlas [82] 16.1 72.3 18.4 14.9 1.2 10.1 57.1 0.6 14.4 12.2
UNet-LS 13.7 116.5 11.6 17.8 0.8 1.8 56.9 0.1 8.7 11.6
UAMT [85] 10.7 90.2 8.0 9.3 0.3 8.1 31.7 1.1 13.1 14.3
CPS [19] 15.0 123.5 19.6 9.6 5.6 6.9 59.4 2.3 9.4 7.2
CTS [57] 26.3 96.5 44.6 4.0 11.2 5.5 60.3 9.6 54.1 21.2
MCSC [49] 34.0 53.8 50.9 13.0 17.6 54.6 64.3 5.5 43.1 23.5
Ours (Affine) 43.4 40.8 62.5 13.3 17.9 71.0 77.0 11.4 65.4 28.7
Ours (Affine+Deform) 47.6 38.4 65.5 9.3 50.6 70.2 72.7 11.1 73.9 27.8
Best is bold, Second Best is underlined.

Implementation details. For all methods we use random cropping, random flipping and rotations for augmentation. All methods were trained until convergence, or for up to 40,000 iterations. We used the AdamW optimizer with a weight decay of 5 × 10^-4. The learning rate followed a polynomial schedule, starting at 5 × 10^-4 for the U-Net and 1 × 10^-4 for the Swin-UNet. Our training batches consisted of 8 images for ACDC and 24 images for Synapse, evenly split between labeled and unlabeled. In the contrastive learning module, each projector H_* is composed of two linear layers, outputting 256 and 128 channels, respectively. In Eq. 6, w_cps is defined by a Gaussian warm-up function [57]: w_cps(i) = 0.1 · exp(-5(1 - i/t_total)^2), where i is the index of the current training iteration and t_total is the total number of iterations, while w_cl is set to a constant value of 10^-3. In Eq. 4, the temperature is τ = 0.1. In the REPS module, the bank size K = (M + K)/5. We implemented our method in PyTorch. All experiments were run on one RTX 3090 GPU.
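As a small illustration of the schedule above, the following sketch computes the warm-up weight for the cross-teaching term; the function name and the fixed iteration budget in the example are our own choices, not code from the paper.

import math

def cps_weight(iteration, total_iterations, w_max=0.1):
    """Gaussian warm-up for w_cps (cf. [57]):
    w_cps(i) = w_max * exp(-5 * (1 - i / t_total)^2)."""
    t = min(iteration / total_iterations, 1.0)
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)

W_CL = 1e-3   # constant weight of the contrastive term
TAU = 0.1     # temperature in Eq. 4

# example: ramps from ~6.7e-4 at iteration 0 up to 0.1 at 40,000 iterations
weights = [cps_weight(i, 40_000) for i in (0, 20_000, 40_000)]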
4.1. Comparison with Existing Methods

ACDC. Table 1 presents quantitative results for our CCT-R and baselines, under three different levels of supervision (7, 3, and 1 labeled cases). When trained on 7 labeled cases (10%), CCT-R significantly outperforms the baseline CTS, with more than a 4% improvement in DSC and a reduction of 7 mm in HD. With just 5% of labeled data (3 cases), our CCT-R surpasses CTS and the SOTA MCSC by an impressive margin of 20% and 12% in DSC, and a reduction of 14 mm and 8.5 mm in HD, respectively. When the supervision is reduced to one labeled case, our approach outperforms the SOTA by an even larger margin (DSC of 80.4 vs. 58.6 for MCSC), highlighting its robustness in scenarios with extremely limited labeled data. DeepAtlas, a joint registration and segmentation method, underperforms. This may be due to its lack of advanced S4 techniques, and its online learning of registration, which means registrations are inaccurate early in training and provide poor guidance for segmentation. Qualitative results in Fig. 3 (left) further illustrate the superiority of CCT-R, showing more accurate segmentation with fewer under-segmented regions for the RV (bottom) and fewer false positives (top) compared to CTS.
Figure 3. Qualitative segmentation results (legend: RV, Myo, LV for ACDC; aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach for Synapse).

Synapse. We evaluate performance on the Synapse dataset using 4, 2, and 1 labeled cases. Although Synapse is more challenging than ACDC due to greater class imbalance and anatomical variability, CCT-R demonstrates even larger improvements than on ACDC (Table 2). With 4 labeled cases, DSC increases from 64.0% to 71.4%, outperforming CTS by 7.4% and MCSC by 2.9%. Even with just one labeled case, CCT-R still excels at segmenting challenging small organs like the aorta, kidney, and pancreas, where others struggle. It significantly outperforms MCSC, improving the mean DSC by 13.6% and reducing HD by 15.4 mm. This robustness to extreme class imbalance and limited supervision emphasizes the value of registration information. Furthermore, our approach is robust across varying registration qualities. Even with simpler affine registrations, inaccurate for complex abdominal anatomy, it significantly improves segmentation (Ours (Affine) rows) over not using registration, though results are better still with deformable transforms (Ours (Affine+Deform)). Fig. 3 (right) shows CCT-R accurately segments small structures like the gallbladder and pancreas, often missed or over-segmented by LS and CTS. Our approach also correctly identifies the spleen and distinguishes it from the liver, a common error in other methods. It also provides more precise segmentation of the liver and stomach, significantly outperforming MCSC. This figure shows the robustness of our method in handling challenging, imbalanced datasets.

Segmentation via registration only. We also test whether simply propagating labels based on either affine or deformable registration achieves adequate segmentation performance (Reg. only rows in Tables 1 & 2). We see this performs substantially worse than the learning-based methods.

4.2. Benefit of Our Registration-Based Modules Applied on Different Baselines

Our main experiments build on CTS; however, to show the wide applicability of our approach, we measure performance when it is integrated with alternative SSL baselines (Table 3). We include UAMT [85], a classic teacher-student framework with two U-Nets; CPS [19], a student-student framework with two cross-teaching U-Nets; and CTS [57], which improves CPS by replacing one of the U-Nets with Swin-UNet. With each baseline, we measure the benefit of adding RSL only, and RSL in conjunction with contrastive learning and registration-based positive selection (SCL + REPS row). Our registration-derived modules boost all baselines. Enhanced UAMT approaches CTS performance, while improved CPS surpasses CTS by 4% on DSC. CTS with our modules remains the top performer.

Table 3. Benefit of our modules combined with different baselines, on Synapse with 10% labeled data.
Method  UAMT [85] (DSC↑ HD↓)  CPS [19] (DSC↑ HD↓)  CTS [57] (DSC↑ HD↓)
Baselines 49.5 62.6 48.8 65.6 55.2 45.4
+ RSL 52.3 60.3 57.3 42.4 65.4 28.5
+ RSL + SCL + REPS 54.6 55.6 59.1 37.5 66.5 19.7

4.3. Comparison with Alternative Supervised Contrastive Learning Losses

In Table 4, we compare our proposed approach with the state-of-the-art contrastive S4 method MCSC [49], and with incorporating other recent patch-level and slice-level contrastive learning techniques (GLCL [36] and ReCo [51]) into CTS. While all the contrastive losses improve on vanilla CTS, our CCT-R achieves higher segmentation accuracy on nearly all datasets and labelling rates.

Table 4. Comparisons with SOTA contrastive learning methods combined with CTS, on ACDC and Synapse.
Contrastive learning method  ACDC 3 (5%) (DSC↑ HD↓)  ACDC 1 (1.4%) (DSC↑ HD↓)  Synapse 4 (20%) (DSC↑ HD↓)  Synapse 2 (10%) (DSC↑ HD↓)
Patch-level: GLCL [36] (MICCAI'21) 71.7 3.8 47.4 35.8 67.7 42.6 59.7 34.6
Patch-level: MCSC [49] (BMVC'23) 73.6 10.5 58.6 31.2 68.5 24.8 61.1 32.6
Slice-level: ReCo [51] (ICLR'22) 70.2 6.1 48.3 33.5 68.3 25.9 60.4 20.7
None (Vanilla CTS [57]) 65.6 16.2 46.8 36.3 64.0 56.4 57.2 45.7
Ours 85.4 2.6 80.0 4.2 71.4 21.1 66.5 19.7
Best is bold.

Table 5. Ablation study for the primary components of our CCT-R. SCL: typical supervised local contrastive loss. RSL: registration supervision loss. BRS: best registration selection strategy for registered labels r^u. REPS: registration-enhanced positive sampling module (using positives from registration in SCL).
SCL RSL BRS REPS  1 (5%) (DSC↑ HD↓)  2 (10%) (DSC↑ HD↓)
-   -   -   -     26.3 96.5          55.2 45.4
-   ✓   -   -     29.0 46.9          64.2 33.9
-   ✓   ✓   -     –    –             65.4 28.5
✓   -   -   -     27.5 59.8          63.1 29.1
✓   ✓   ✓   -     28.1 53.9          64.8 20.6
✓   -   -   ✓     31.4 55.2          63.9 29.7
✓   ✓   ✓   ✓     47.6 38.4          66.5 19.7

4.4. Ablation Studies and Analysis

We conduct an ablation study on Synapse, measuring the importance of various aspects of our proposed CCT-R (Table 5). CTS, as our baseline, achieves a Dice of 26.3% and 55.2% for one and two labeled cases respectively (top row). Our registration supervision loss (RSL) improves the baseline by +2.7% and +9.0%. The best registration selection strategy (BRS), which is only applicable for two or more labeled cases, further boosts performance by an additional +1.2% in DSC and reduces HD by 5.4 mm.
Adding a standard supervised local contrastive loss (SCL) improves the baseline by +1.2% and +7.9% respectively, even without registration; also incorporating RSL gives further improvements of 0.6% and 1.7%, indicating that contrastive learning and RSL are complementary strategies. The registration-enhanced positive sampling (REPS), which mitigates the bias towards single pseudo-label supervision in SCL, yields significant improvements: +3.9% DSC and -4.6 mm HD for one labeled case, and +0.8% DSC for two labeled cases, versus just SCL. Lastly, when combining all components, our full method achieves a substantial Dice score improvement over the CTS baseline of 21.3% for 1 labeled case (from 26.3% to 47.6%) and 11.3% for 2 labeled cases (from 55.2% to 66.5%).

Figure 4. DSC of pseudo-labels from two models on unlabeled data during the early training stages, for Synapse with (a) 1 labeled case and (b) 2 labeled cases.

Analysing the quality of pseudo-labels. We measured the DSC of pseudo-labels predicted for unlabeled training data and used for cross-teaching, illustrating the noisiness of pseudo-labels and demonstrating how the proposed RSL mitigates this issue. Fig. 4 shows that early in training, cross-teaching models without RSL (dashed lines) yield suboptimal results due to insufficient training. This limitation persists even in later training stages, as the model struggles to generalize and often converges to local optima, especially in the 5% labeled setting. In contrast, the supervision provided by registrations, RSL, offers consistent and reliable guidance throughout the training process (solid lines), significantly mitigating these issues and enabling more effective learning from limited data.

5. Conclusion

We have introduced CCT-R, a registration-guided method for semi-supervised medical image segmentation. It builds on cross-teaching methods, and improves segmentation via two novel modules: the Registration Supervision Loss and the Registration-Enhanced Positive Sampling module. The RSL uses segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of supervision for the models. With the REPS, supervised contrastive learning can sample anatomically-corresponding positives across volumes. Without introducing extra training parameters, CCT-R achieves a new SOTA on popular S4 benchmarks.

References
[1] Julia Andresen, Timo Kepp, Jan Ehrhardt, Claus von der Burchard, Johann Roider, and Heinz Handels. Deep learning-based simultaneous registration and unsupervised non-correspondence segmentation of medical images with pathologies. International Journal of Computer Assisted Radiology and Surgery, 17(4):699-710, 2022.
[2] Rajath C Aralikatti, SJ Pawan, and Jeny Rajan. A dual-stage semi-supervised pre-training approach for medical image segmentation. IEEE Transactions on Artificial Intelligence, 5(2):556-565, 2023.
[3] Brian B Avants, Nick Tustison, Gang Song, et al. Advanced normalization tools (ANTs). Insight J, 2(365):1-35, 2009.
[4] Yunhao Bai, Duowen Chen, Qingli Li, Wei Shen, and Yan Wang. Bidirectional copy-paste for semi-supervised medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11514-11524, 2023.
[5] Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. VoxelMorph: A learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38(8):1788-1800, 2019.
[6] Hritam Basak and Zhaozheng Yin. Pseudo-label guided contrastive learning for semi-supervised medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19786-19797, June 2023.
[7] Laurens Beljaards, Mohamed S Elmahdy, Fons Verbeek, and Marius Staring. A cross-stitch architecture for joint registration and segmentation in adaptive radiotherapy. In Medical Imaging with Deep Learning, pages 62-74. PMLR, 2020.
[8] Noah C Benson, Omar H Butt, David H Brainard, and Geoffrey K Aguirre. Correction of distortion in flattened representations of the cortical surface allows prediction of V1-V3 functional organization from anatomy. PLoS Computational Biology, 10(3):e1003538, 2014.
[9] Olivier Bernard et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE T Med Imaging, 37(11):2514-2525, 2018.
[10] Gerda Bortsova et al. Semi-supervised medical image segmentation via learning consistency under transformations. In MICCAI, pages 810-818. Springer, 2019.
[11] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021.
[12] Krishna Chaitanya et al. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv Neur In, 33:12546-12558, 2020.
[13] Krishna Chaitanya et al. Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation. Med Image Anal, 87:102792, 2023.
[14] Chen Chen et al. Deep learning for cardiac image segmentation: a review. Front Cardiovasc Med, 7:25, 2020.
[15] Jieneng Chen et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[16] Ting Chen et al. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607. PMLR, 2020.
[17] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243-22255, 2020.
[18] Xinlei Chen et al. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[19] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613-2622, 2021.
[20] Neel Dey, Jo Schlemper, Seyed Sadegh Mohseni Salehi, Bo Zhou, Guido Gerig, and Michal Sofka. ContraReg: Contrastive learning of multi-modality unsupervised deformable image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 66-77. Springer, 2022.
[21] Wangbin Ding, Lei Li, Junyi Qiu, Sihan Wang, Liqin Huang, Yinyin Chen, Shan Yang, and Xiahai Zhuang. Aligning multi-sequence CMR towards fully automated myocardial pathology segmentation. IEEE Transactions on Medical Imaging, 42(12):3474-3486, 2023.
[22] Yuzhen Ding, Hongying Feng, Yunze Yang, Jason Holmes, Zhengliang Liu, David Liu, William W Wong, Nathan Y Yu, Terence T Sio, Steven E Schild, et al. Deep-learning based fast and accurate 3D CT deformable image registration in lung cancer. Medical Physics, 50(11):6864-6880, 2023.
[23] Mohamed S. Elmahdy, Laurens Beljaards, Sahar Yousefi, Hessam Sokooti, Fons Verbeek, Uulke A. Van Der Heide, and Marius Staring. Joint registration and segmentation via multi-task learning for adaptive radiotherapy of prostate cancer. IEEE Access, 9:95551-95568, 2021.
[24] Koen AJ Eppenhof and Josien PW Pluim. Pulmonary CT registration through supervised learning with convolutional neural networks. IEEE Transactions on Medical Imaging, 38(5):1097-1105, 2018.
[25] Jiashuo Fan, Bin Gao, Huan Jin, and Lihui Jiang. UCC: Uncertainty guided cross-head co-training for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9947-9956, 2022.
[26] Bei Fang, Xian Li, Guangxin Han, and Juhou He. Rethinking pseudo-labeling for semi-supervised facial expression recognition with contrastive self-supervised learning. IEEE Access, 11:45547-45558, 2023.
[27] Bruce Fischl, David H Salat, Evelina Busa, Marilyn Albert, Megan Dieterich, Christian Haselgrove, Andre Van Der Kouwe, Ron Killiany, David Kennedy, Shuna Klaveness, et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron, 33(3):341-355, 2002.
[28] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. In Proceedings of the IEEE/CVF International Conference on Learning Representations, 2019.
[29] Jean-Bastien Grill et al. Bootstrap your own latent - a new approach to self-supervised learning. NIPS, 33:21271-21284, 2020.
[30] Jean-Bastien Grill et al. Bootstrap your own latent - a new approach to self-supervised learning. NIPS, 33:21271-21284, 2020.
[31] Xiao Gu, Fani Deligianni, Jinpei Han, Xiangyu Liu, Wei Chen, Guang-Zhong Yang, and Benny Lo. Beyond supervised learning for pervasive healthcare. IEEE Reviews in Biomedical Engineering, 17:42-62, 2024.
[32] Gilion Hautvast, Steven Lobregt, Marcel Breeuwer, and Frans Gerritsen. Automatic contour propagation in cine cardiac magnetic resonance images. IEEE Transactions on Medical Imaging, 25(11):1472-1482, 2006.
[33] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729-9738, 2020.
[34] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729-9738, 2020.
[35] Thao Thi Ho, Woo Jin Kim, Chang Hyun Lee, Gong Yong Jin, Kum Ju Chae, and Sanghun Choi. An unsupervised image registration method employing chest computed tomography images and deep neural networks. Computers in Biology and Medicine, 154:106612, 2023.
[36] Xinrong Hu et al. Semi-supervised contrastive learning for label-efficient medical image segmentation. In MICCAI, pages 481-490. Springer, 2021.
[37] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis, 49:1-13, 2018.
[38] Bin Huang, Yufeng Ye, Ziyue Xu, Zongyou Cai, Yan He, Zhangnan Zhong, Lingxiang Liu, Xin Chen, Hanwei Chen, and Bingsheng Huang. 3D lightweight network for simultaneous registration and segmentation of organs-at-risk in CT images of head and neck cancer. IEEE Transactions on Medical Imaging, 41(4):951-964, 2022.
[39] Shirui Huang, Keyan Wang, Huan Liu, Jun Chen, and Yunsong Li. Contrastive semi-supervised learning for underwater image restoration via reliable bank. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18145-18155, 2023.
[40] Prannay Khosla et al. Supervised contrastive learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, NIPS, volume 33, pages 18661-18673. Curran Associates, Inc., 2020.
[41] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661-18673, 2020.
[42] Arno Klein and Joy Hirsch. Mindboggle: a scatterbrained approach to automate brain labeling. NeuroImage, 24(2):261-280, 2005.
[43] Bennett Landman et al. MICCAI multi-atlas labeling beyond the cranial vault - workshop and challenge. In MICCAI, volume 5, page 12, 2015.
[44] Tao Lei et al. Semi-supervised medical image segmentation using adversarial consistency learning and dynamic convolution network. 2022.
[45] Yiwen Li, Yunguan Fu, Iani JMB Gayo, Qianye Yang, Zhe Min, Shaheer U Saeed, Wen Yan, Yipei Wang, J Alison Noble, Mark Emberton, et al. Prototypical few-shot segmentation for cross-institution male pelvic structures with spatial registration. Medical Image Analysis, 90:102935, 2023.
[46] Huibin Lin, Chun-Yang Zhang, Shiping Wang, and Wenzhong Guo. A probabilistic contrastive framework for semi-supervised learning. IEEE Transactions on Multimedia, 25:8767-8779, 2023.
[47] Xian Lin et al. BATFormer: Towards boundary-aware lightweight transformer for efficient medical image segmentation. IEEE JBHI, 2023.
[48] Lihao Liu, Angelica I Aviles-Rivero, and Carola-Bibiane Schönlieb. Contrastive registration for unsupervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[49] Qianying Liu, Xiao Gu, Paul Henderson, and Fani Deligianni. Multi-scale cross contrastive learning for semi-supervised medical image segmentation. In 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023. BMVA, 2023.
[50] Shikun Liu et al. Bootstrapping semantic segmentation with regional contrast. In ICLR, 2022.
[51] Shikun Liu, Shuaifeng Zhi, Edward Johns, and Andrew Davison. Bootstrapping semantic segmentation with regional contrast. In International Conference on Learning Representations (ICLR), 2022.
[52] Yang Liu and Shi Gu. Co-learning semantic-aware unsupervised segmentation for pathological image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 537-547. Springer, 2023.
[53] Nikos K Logothetis. What we can do and what we cannot do with fMRI. Nature, 453(7197):869-878, 2008.
[54] Zijun Long, George Killick, Lipeng Zhuang, Richard McCreadie, Gerardo Aragon Camarasa, and Paul Henderson. Elucidating and overcoming the challenges of label noise in supervised contrastive learning. In European Conference on Computer Vision, 2024.
[55] Maria Lorenzo-Valdés, Gerardo I Sanchez-Ortiz, Raad Mohiaddin, and Daniel Rueckert. Atlas-based segmentation and tracking of 3D cardiac MR images using non-rigid registration. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2002: 5th International Conference, Tokyo, Japan, September 25-28, 2002, Proceedings, Part I 5, pages 642-650. Springer, 2002.
[56] Bradley C Lowekamp, David T Chen, Luis Ibáñez, and Daniel Blezek. The design of SimpleITK. Frontiers in Neuroinformatics, 7:45, 2013.
[57] Xiangde Luo, Minhao Hu, Tao Song, Guotai Wang, and Shaoting Zhang. Semi-supervised medical image segmentation via cross teaching between CNN and transformer. In International Conference on Medical Imaging with Deep Learning, pages 820-833. PMLR, 2022.
[58] J.B.A. Maintz and M.A. Viergever. A survey of medical image registration. Medical Image Analysis, 2(1):1-36, 1998.
[59] David Mattes, David R Haynor, Hubert Vesselle, Thomas K Lewellen, and William Eubank. PET-CT image registration in the chest using free-form deformations. IEEE Transactions on Medical Imaging, 22(1):120-128, 2003.
[60] Matthew McCormick, Xiaoxiao Liu, Julien Jomier, Charles Marion, and Luis Ibanez. ITK: enabling reproducible research and open science. Frontiers in Neuroinformatics, 8:13, 2014.
[61] Seungjong Oh, David Jaffray, and Young-Bin Cho. A novel method to quantify and compare anatomical shape: application in cervix cancer radiotherapy. Physics in Medicine & Biology, 59(11):2687, 2014.
[62] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and Lennart Svensson. ClassMix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1369-1378, 2021.
[63] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[64] Yassine Ouali et al. Semi-supervised semantic segmentation with cross-consistency training. In CVPR, pages 12674-12684, 2020.
[65] Jizong Peng et al. Deep co-training for semi-supervised image segmentation. Lect Notes Comput Sc, 107:107269, 2020.
[66] Siyuan Qiao et al. Deep co-training for semi-supervised image recognition. In ECCV, pages 135-152, 2018.
[67] Olaf Ronneberger et al. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241. Springer, 2015.
[68] László Ruskó, György Bekes, and Márta Fidrich. Automatic segmentation of the liver from multi- and single-phase contrast-enhanced CT images. Medical Image Analysis, 13(6):871-882, 2009.
[69] Hessam Sokooti, Bob De Vos, Floris Berendsen, Boudewijn PF Lelieveldt, Ivana Išgum, and Marius Staring. Nonrigid image registration using multi-scale 3D convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 232-239. Springer, 2017.
[70] Xinrui Song, Hanqing Chao, Xuanang Xu, Hengtao Guo, Sheng Xu, Baris Turkbey, Bradford J Wood, Thomas Sanford, Ge Wang, and Pingkun Yan. Cross-modal attention for multi-modal image registration. Medical Image Analysis, 82:102612, 2022.
[71] Antti Tarvainen et al. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS, 30, 2017.
[72] J.P. Thirion. Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis, 2(3):243-260, 1998.
[73] Maria Thor, Jørgen BB Petersen, Lise Bentzen, Morten Høyer, and Ludvig Paul Muren. Deformable image registration for contour propagation from CT to cone-beam CT scans in radiotherapy of prostate cancer. Acta Oncologica, 50(6):918-925, 2011.
[74] Yuandong Tian et al. Understanding self-supervised learning dynamics without contrastive pairs. In ICML, pages 10268-10278. PMLR, 2021.
[75] Vikas Verma et al. Interpolation consistency training for semi-supervised learning. Neural Networks, 145:90-106, 2022.
[76] Paul Viola and William M. Wells III. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137-154, 1997.
[77] Kaiping Wang et al. Semi-supervised medical image segmentation via a tripled-uncertainty guided mean teacher model with contrastive learning. Med Image Anal, 79:102447, 2022.
[78] Xinlong Wang et al. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024-3033, 2021.
[79] Zhiwei Wang, Xiaoyu Zeng, Chongwei Wu, Xu Zhang, Wei Fang, Qiang Li, et al. StyleSeg v2: Towards robust one-shot segmentation of brain tissue via optimization-free registration error perception. arXiv preprint arXiv:2405.03197, 2024.
[80] Huisi Wu et al. Cross-patch dense contrastive learning for semi-supervised segmentation of cellular nuclei in histopathologic images. In CVPR, pages 11666-11675, 2022.
[81] Zhenda Xie et al. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684-16693, 2021.
[82] Zhenlin Xu and Marc Niethammer. DeepAtlas: Joint semi-supervised learning of image registration and segmentation. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part II 22, pages 420-429. Springer, 2019.
[83] Tokihiro Yamamoto, Sven Kabus, Jens Von Berg, Cristian Lorenz, and Paul J Keall. Impact of four-dimensional computed tomography pulmonary ventilation imaging-based functional avoidance for lung cancer radiotherapy. International Journal of Radiation Oncology* Biology* Physics, 79(1):279-288, 2011.
[84] Fan Yang et al. Class-aware contrastive semi-supervised learning. In CVPR, pages 14421-14430, 2022.
[85] Lequan Yu et al. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In MICCAI, pages 605-613. Springer, 2019.
[86] Xiangyu Zhao et al. RCPS: Rectified contrastive pseudo supervision for semi-supervised medical image segmentation. arXiv preprint arXiv:2301.05500, 2023.
[87] Yuanyi Zhong et al. Pixel contrastive-consistent semi-supervised semantic segmentation. In CVPR, pages 7273-7282, 2021.
[88] Hong-Yu Zhou et al. nnFormer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201, 2021.
A. Additional Results
Here we show extended versions of Table 1 and Table 2
in the main paper as Table 6 and Table 7. In these extended
tables, we provide additional comparisons by separately
evaluating the performance of the two branches (CNN and
Transformer) of our CCT-R (whereas in the main paper we
use the mean of their logits); we also give results for all
baselines under three different settings on both datasets. It
can be seen that on the ACDC dataset, the performance
of CCT-R’s CNN and Transformer branches is quite simi-
lar. However, on the more challenging Synapse dataset, the
Transformer outperforms the CNN, likely due to its superior
ability to capture long-range dependencies, which allows it
to better handle the relationships between large and small
organs.
Table 6. Segmentation results on ACDC for our method CCT-R and baselines, according to DSC(%) and HD(mm) for organs.
Labeled  Methods  Mean (DSC↑ HD↓)  Myo (DSC↑ HD↓)  LV (DSC↑ HD↓)  RV (DSC↑ HD↓)
UNet-FS 91.7 4.0 89.0 5.0 94.6 5.9 91.4 1.2
70 (100%)
BATFormer [47] 92.8 8.0 90.26 6.8 96.3 5.9 91.97 11.3
Reg. only (Aff) 30.7 16.4 19.7 13.9 42.0 14.4 30.5 20.8
DeepAtlas [82] 79.4 8.0 79.0 11.7 81.9 3.2 77.3 9.0
UNet-LS 75.9 10.8 78.2 8.6 85.5 13.0 63.9 10.7
MT [71] 80.9 11.5 79.1 7.7 86.1 13.4 77.6 13.3
DCT [66] 80.4 13.8 79.3 10.7 87.0 15.5 75.0 15.3
UAMT [85] 81.1 11.2 80.1 13.7 87.1 18.1 77.6 14.7
7 (10%) ICT [75] 82.4 7.2 81.5 7.8 87.6 10.6 78.2 3.2
CCT [64] 84.0 6.6 82.3 5.4 88.6 9.4 81.0 5.1
CPS [19] 85.0 6.6 82.9 6.6 88.0 10.8 84.2 2.3
CTS [57] 86.4 8.6 84.4 6.9 90.1 11.2 84.8 7.8
MCSC [49] 89.4 2.3 87.6 1.1 93.6 3.5 87.1 2.1
Ours (CNN, Affine) 89.5 1.8 87.2 2.0 92.9 1.8 88.4 1.7
Ours (Trans, Affine) 89.1 1.8 85.7 1.2 91.7 2.8 89.9 1.3
Ours (mean, Affine) 90.3 1.6 87.4 1.4 92.7 2.2 90.9 1.3
Reg. only (Aff) 32.0 17.8 18.0 15.7 43.9 16.0 34.0 21.7
DeepAtlas [82] 59.0 8.6 62.8 5.4 67.8 7.7 46.4 12.6
UNet-LS 51.2 31.2 54.8 24.4 61.8 24.3 37.0 44.4
MT [71] 56.6 34.5 58.6 23.1 70.9 26.3 40.3 53.9
DCT [66] 58.2 26.4 61.7 20.3 71.7 27.3 41.3 31.7
UAMT [85] 61.0 25.8 61.5 19.3 70.7 22.6 50.8 35.4
3 (5%) ICT [75] 58.1 22.8 62.0 20.4 67.3 24.1 44.8 23.8
CCT [64] 58.6 27.9 64.7 22.4 70.4 27.1 40.8 34.2
CPS [19] 60.3 25.5 65.2 18.3 72.0 22.2 43.8 35.8
CTS [57] 65.6 16.2 62.8 11.5 76.3 15.7 57.7 21.4
MCSC [49] 73.6 10.5 70.0 8.8 79.2 14.9 71.7 7.8
Ours (CNN, Affine) 85.2 1.9 83.3 1.5 89.9 2.9 82.4 2.2
Ours (Trans, Affine) 85.4 2.6 83.2 1.8 89.3 3.8 83.5 2.1
Ours (mean, Affine) 85.7 2.0 83.8 1.4 89.9 2.4 83.5 2.1
Reg. only (Aff) 23.4 19.7 13.6 18.7 31.6 19.0 25.1 21.4
DeepAtlas [82] 40.4 18.5 42.2 11.7 34.7 29.2 44.4 14.6
UNet-LS 26.4 60.1 26.3 51.2 28.3 52.0 24.6 77.0
1 (1.4%) CTS [57] 46.8 36.3 55.1 5.5 64.8 4.1 20.5 99.4
MCSC [49] 58.6 31.2 64.2 13.3 78.1 12.2 33.5 68.1
Ours (CNN, Affine) 79.6 5.2 77.6 5.3 83.2 5.1 78.0 5.1
Ours (Trans, Affine) 80.0 4.2 77.7 4.0 83.0 4.2 79.4 3.6
Ours (mean, Affine) 80.4 3.5 78.3 3.2 83.6 4.3 79.3 2.9
Best is bold, Second Best is underlined.
Table 7. Segmentation results on Synapse for our method CCT-R and baselines, according to DSC(%) and HD(mm).
Labeled Methods DSC↑ HD↓ Aorta Gallb Kid L Kid R Liver Pancr Spleen Stom
UNet-FS 75.6 42.3 88.8 56.1 78.9 72.6 91.9 55.8 85.8 74.7
18(100%)
nnFormer 86.6 10.6 92.0 70.2 86.6 86.3 96.8 83.4 90.5 86.8
Reg. only (Affine) 27.0 39.6 16.0 7.5 36.4 33.0 56.8 13.1 28.5 25.1
Reg. only (Aff+Def) 32.5 36.5 29.7 4.8 36.5 29.4 65.5 14.2 48.0 31.7
DeepAtlas [82] 56.1 85.3 69.2 43.3 50.8 55.2 88.8 30.5 62.7 48.0
UNet-LS 47.2 122.3 67.6 29.7 47.2 50.7 79.1 25.2 56.8 21.5
UAMT [85] 51.9 69.3 75.3 33.4 55.3 40.8 82.6 27.5 55.9 44.7
ICT [75] 57.5 79.3 74.2 36.6 58.3 51.7 86.7 34.7 66.2 51.6
CCT [64] 51.4 102.9 71.8 31.2 52.0 50.1 83.0 32.5 65.5 25.2
4(20%) CPS [19] 57.9 62.6 75.6 41.4 60.1 53.0 88.2 26.2 69.6 48.9
CTS [57] 64.0 56.4 79.9 38.9 66.3 63.5 86.1 41.9 75.3 60.4
MCSC [49] 68.5 24.8 76.3 44.4 73.4 72.3 91.8 46.9 79.9 62.9
Ours (CNN, Affine) 67.3 37.9 79.0 36.5 72.7 70.4 87.9 47.3 77.8 67.0
Ours (Trans, Affine) 70.5 22.7 81.0 34.1 71.1 71.9 93.2 49.9 87.9 75.2
Ours (mean, Affine) 70.0 23.2 79.8 34.5 71.0 70.7 92.8 49.6 87.4 74.4
Ours (CNN, Affine+Deform) 69.5 36.2 80.0 49.2 73.0 69.9 89.3 48.5 79.5 66.7
Ours (Trans, Affine+Deform) 72.5 20.5 80.9 43.4 75.6 75.1 93.5 51.3 87.4 72.2
Ours (mean, Affine+Deform) 71.4 21.1 80.4 42.3 73.0 70.0 93.7 49.4 87.9 74.2
Reg. only (Affine) 25.4 36.8 17.5 3.5 32.7 27.5 53.4 12.6 33.4 22.5
Reg. only (Aff+Def) 29.1 44.0 27.2 11.3 28.6 26.5 66.4 12.7 29.7 30.3
DeepAtlas [82] 44.0 67.1 68.0 24.9 37.9 46.0 82.7 18.4 44.2 30.6
UNet-LS 45.2 55.6 66.4 27.2 46.0 48.0 82.6 18.2 39.9 33.4
UAMT [85] 49.5 62.6 71.3 21.1 62.6 51.4 79.3 22.8 58.2 29.0
ICT [75] 49.0 59.9 68.9 19.9 52.5 52.2 83.7 25.4 53.2 36.0
CCT [64] 46.9 58.2 66.0 26.6 53.4 41.0 82.9 21.2 48.7 35.6
2(10%) CPS [19] 48.8 65.6 70.9 21.3 58.0 45.1 80.7 23.5 58.0 32.7
CTS [57] 55.2 45.4 71.5 25.6 62.6 67.5 78.2 26.3 75.9 34.3
MCSC [49] 61.1 32.6 73.9 26.4 69.9 72.7 90.0 33.2 79.4 43.0
Ours (CNN, Affine) 60.4 37.1 77.0 27.8 70.8 69.0 88.4 35.4 67.0 47.7
Ours (Trans, Affine) 64.2 22.1 77.4 22.1 75.0 74.2 92.2 39.6 78.2 54.8
Ours (mean, Affine) 65.1 22.5 75.7 28.4 74.5 75.0 91.8 38.0 82.3 55.1
Ours (CNN, Affine+Deform) 62.6 44.3 76.5 37.7 73.0 68.0 87.0 32.3 76.5 49.9
Ours (Trans, Affine+Deform) 68.3 23.1 74.8 49.1 75.2 74.7 92.8 39.7 84.1 56.2
Ours (mean, Affine+Deform) 66.5 19.7 77.6 34.4 75.1 74.2 92.6 39.5 82.1 56.1
Reg. only (Affine) 26.4 45.0 16.3 6.6 35.8 32.8 53.5 14.4 28.7 22.7
Reg. only (Aff+Def) 27.4 52.2 26.4 11.3 30.5 27.1 61.6 12.8 26.3 23.6
DeepAtlas [82] 16.1 72.3 18.4 14.9 1.2 10.1 57.1 0.6 14.4 12.2
UNet-LS 13.7 116.5 11.6 17.8 0.8 1.8 56.9 0.1 8.7 11.6
UAMT [85] 10.7 90.2 8.0 9.3 0.3 8.1 31.7 1.1 13.1 14.3
ICT [75] 15.9 82.3 13.8 11.9 0.3 2.7 70.5 0.8 16.4 10.6
CCT [64] 11.7 107.5 10.0 13.0 0.1 1.9 47.5 3.7 8.0 9.3
1(5%) CPS [19] 15.0 123.5 19.6 9.6 5.6 6.9 59.4 2.3 9.4 7.2
CTS [57] 26.3 96.5 44.6 4.0 11.2 5.5 60.3 9.6 54.1 21.2
MCSC [49] 34.0 53.8 50.9 13.0 17.6 54.6 64.3 5.5 43.1 23.5
Ours (CNN, Affine) 39.5 66.5 61.7 17.0 9.2 65.2 71.1 12.3 54.3 25.3
Ours (Trans, Affine) 43.2 67.5 58.5 12.5 20.2 66.6 78.9 10.3 72.9 26.5
Ours (mean, Affine) 43.4 40.8 62.5 13.3 17.9 71.0 77.0 11.4 65.4 28.7
Ours (CNN, Affine+Deform) 44.2 54.2 63.8 10.8 48.7 61.6 74.6 5.4 61.8 26.6
Ours (Trans, Affine+Deform) 45.3 46.9 62.9 9.9 56.5 65.6 70.9 0.1 72.8 24.2
Ours (mean, Affine+Deform) 47.6 38.4 65.5 9.3 61.6 70.2 72.7 0.1 73.9 27.8
Best is bold, Second Best is underlined.