Learning Semi-Supervised Medical Image Segmentation

Fani Deligianni†                         Hang Dai
University of Glasgow                    University of Glasgow
[email protected]          [email protected]
Figure 1. The overall architecture of our framework for semi-supervised medical image segmentation.
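To make the overall pipeline in Figure 1 concrete, below is a minimal PyTorch-style sketch of one training step of the cross-teaching setup, with the loss terms combined as in Eq. 6. The function name, the use of plain cross-entropy in place of the Dice + CE supervised loss, and the callables for the contrastive and registration terms are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def train_step_model_a(model_a, model_b, x_lab, y_lab, x_unlab,
                       w_cps, w_cl, contrastive_term, registration_term=None):
    """One cross-teaching update for model A (model B is trained symmetrically).
    Combines the terms as in Eq. 6; L_rs of Eq. 7 is added when available."""
    x_all = torch.cat([x_lab, x_unlab])
    logits_a, logits_b = model_a(x_all), model_b(x_all)
    n = x_lab.shape[0]

    # supervised loss on labeled slices (cross-entropy stands in for Dice + CE)
    loss_sup = F.cross_entropy(logits_a[:n], y_lab)

    # cross-pseudo-supervision: B's hard predictions supervise A on unlabeled slices
    pseudo_b = logits_b[n:].argmax(dim=1).detach()
    loss_cps = F.cross_entropy(logits_a[n:], pseudo_b)

    # pixel-wise supervised contrastive regularisation (Eq. 4); in the full
    # method this acts on projected pixel features H_*(E_*(X)), omitted here
    loss_cl = contrastive_term()

    loss = loss_sup + w_cps * loss_cps + w_cl * loss_cl   # Eq. 6
    if registration_term is not None:                     # L_rs (Eq. 7)
        loss = loss + registration_term()
    return loss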
Our contrastive loss follows [40], but with the key difference that it contrasts pixel features instead of whole-image features. We project each pixel to a shared embedding space, then regularize in a supervised manner, encouraging features of anchor pixels to be similar to those of pixels having the same class (positives), and dissimilar to those of different classes (negatives).

Specifically, as shown in Fig. 1, we extract a feature batch F = F_A ∪ F_B, where F_* = H_*(E_*(X)) and H_*(·) is the projector. The choice of anchors, which serve as the comparison target of each class, has a great impact on learning; we therefore try to reduce the number of anchors with incorrect class labels. For every class in the current mini-batch, we sample pixels with a high top-1 probability value as anchors A_c for class c, setting

    A_c = { f_i | (y_i = c) ∧ (p_i > h) },    (3)

where f_i is the i-th pixel feature in F, and the threshold h on the top-1 probability value is set to 0.5 so that only hard samples are excluded.

The supervised contrastive loss L_cl is then computed as

    L_cl = - (1/|C|) Σ_{c∈C} (1/|A_c^n|) Σ_{a_i ∈ A_c^n} log [ exp(a_i · a_p / τ) / (exp(a_i · a_p / τ) + Z) ].    (4)

This sampling balances representations across classes and reduces memory usage, unlike [36], which simply discards background features. Note that in our experiments, N = 1000 and O = 500. The positive key a_p^l is given by the average of all other pixels of the same class, i.e. those in the anchor set A_c:

    a_p^l = (1/|A_c|) Σ_{a_i ∈ A_c} a_i.    (5)

Contrasting only an average positive instead of all positives is computationally cheaper, yet still allows reducing the average distance between the anchor and the other samples of class c [50]. In Section 3.4 we will show how spatial registration information can provide additional positives for contrastive learning.

Training and inference. The two models are trained simultaneously with separate losses. The total training loss L_A for model A is

    L_A = L_sup(A) + w_cps · L_cps(A) + w_cl · L_cl,    (6)

and similarly for model B. Here the w_* are weighting factors used to balance each loss term. Overall, this setup yields comparable performance to the SOTA contrastive cross-teaching method, MCSC [49], while being significantly simpler and easier to adapt to use registration information.
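As an illustration of Eqs. 3 and 5, the following is a minimal PyTorch sketch of confident-anchor selection and the label-guided positive key. The tensor layout, the per-class loop, and the L2 normalisation of features are assumptions made for the example rather than the paper's implementation.

import torch
import torch.nn.functional as F

def class_anchors_and_positive_keys(feats, labels, probs, h=0.5):
    """Pick confident anchors A_c per class (Eq. 3) and form the
    label-guided positive key a_p^l as their mean (Eq. 5).

    feats:  (P, D) projected pixel features (rows of F)
    labels: (P,)   (pseudo-)labels y_i
    probs:  (P,)   top-1 softmax probability p_i of each pixel
    """
    anchors, pos_keys = {}, {}
    for c in labels.unique().tolist():
        # Eq. 3: anchors are pixels of class c whose top-1 probability exceeds h
        mask = (labels == c) & (probs > h)
        if mask.sum() < 2:
            continue
        a_c = F.normalize(feats[mask], dim=1)   # normalisation is an assumption
        anchors[c] = a_c
        pos_keys[c] = a_c.mean(dim=0)           # Eq. 5: average positive key a_p^l
    return anchors, pos_keys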
Figure 2. Supervised contrastive learning guided by labels vs. registration: In the semi-supervised setting, for unlabeled data, the
supervised contrastive loss uses pseudo-label information to select pairs. However, pseudo-labels are unreliable, especially early in training.
For example, in the middle panel, the anchor is wrongly labeled as Myo (green), which leads to an incorrect learning signal, due to
contrasting with positives correctly labeled as Myo. In contrast, registration finds the anatomically-closest point to the anchor in each 3D
volume, without relying on label predictions from models, enabling the contrastive loss to perform correct comparisons between cases.
Although the segmentation model remains 2D, operating on individual slices, each slice is now considered within the 3D space of its original volume. We define the set of registration transforms as T = {T_ij}_{i=1,j=1}^N, where T_ij maps points from volume v_i to volume v_j, and N is the total number of volumes.

Our CCT-R uses T in two ways. First, we go beyond cross-teaching, introducing a new loss that uses registration to transfer labels from labeled to unlabeled data (Sec. 3.3). Furthermore, traditional supervised contrastive learning typically relies on predicted logits, which can introduce errors. Our CCT-R mitigates this by using T to identify anatomically corresponding features across volumes, providing a complementary set of positives (Sec. 3.4).

3.3. Registration supervision loss

We use spatial transforms obtained by registration as an additional source of pseudo-labels to supervise the two models. Specifically, by transforming a point from an unlabeled volume to the corresponding point in a labeled volume, we can assume that these two points correspond to the same anatomical location. Thus, the label from the labeled volume can be used as supervision for the unlabeled slice. This provides much more accurate pseudo-labels early in training, and also helps to reduce the confirmation bias that can arise from cross-teaching.

Formally, we define a new loss L_rs that encourages each pixel to match the label of its corresponding location in the paired labeled volume:

    L_rs = - (1/M) Σ_{i=1}^{M} ( L_dice(p_i^u, r_i^u) + L_ce(p_i^u, r_i^u) ),    (7)

where p_i^u is the class probability map of the i-th unlabeled image x_i^u, and r_i^u is the registered label found by registration. L_rs is then added to the overall loss function (Eq. 6).

Assuming that the slice x_i^u belongs to the unlabeled volume v_j^u, we define the registered label r_i^u by mapping the ground truth y_i^l from the labeled volume v_q^l:

    r_i^u = T_qj(y_i^l),    (8)

where T_qj is the transform from v_q^l to v_j^u. This transform aligns the label y_i^l with the corresponding coordinates in the slice x_i^u, resulting in r_i^u. This greatly improves the model's learning performance (see Sec. 4.4), especially in cases with minimal supervision (e.g. only one labeled volume).
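A minimal sketch of the registration supervision loss of Eqs. 7-8 is given below. The helper names, the soft-Dice formulation, and the assumption that the registered label r_i^u has already been produced by warping y_i^l are all illustrative choices, not the paper's released code.

import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target, num_classes, eps=1e-6):
    """Soft Dice loss between per-class probabilities (B, C, H, W)
    and an integer label map (B, H, W)."""
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def registration_supervision_loss(probs_u, reg_labels):
    """Eq. 7: Dice + cross-entropy between the unlabeled prediction p_i^u
    and the registered label r_i^u obtained via Eq. 8."""
    num_classes = probs_u.shape[1]
    log_probs = torch.log(probs_u.clamp_min(1e-8))
    return soft_dice_loss(probs_u, reg_labels, num_classes) + F.nll_loss(log_probs, reg_labels)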
Best registration selection strategy. In practice, registrations are often imperfect, particularly for complex anatomical regions such as the abdomen. Moreover, the loss described in Eq. 7 does not require every image to be paired with all others. We therefore design a strategy to choose which registered pairs should be used. Importantly, this strategy cannot rely on ground-truth labels, due to our semi-supervised setting. Specifically, we measure the cycle-consistency of the transforms from T (Sec. 3.2) between two volumes, say v_j^u and v_q^l. We apply the forward transform T_jq (j-to-q) and the reverse transform T_qj (q-to-j) on volume v_j^u:

    ṽ_j^u = T_qj(T_jq(v_j^u)).    (9)

Ideally, ṽ_j^u should be equal to the original volume v_j^u, meaning the composition of forward and reverse transformations approximates the identity function. We calculate the global similarity between v_j^u and ṽ_j^u using both mutual information (MI) [59] and root mean square error (RMSE), and use these to derive a composite score

    S = w_rmse · RMSE + w_mi · MI,    (10)

where w_rmse and w_mi weight the importance of RMSE and MI, respectively. We then select the v_q^l that minimizes this composite score to generate the best additional pseudo-label r_i^u for the unlabeled slice x_i^u in v_j^u.
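The following NumPy sketch illustrates the cycle-consistency score of Eqs. 9-10. The histogram-based mutual information and, in particular, the negative sign on w_mi (so that better alignment lowers the score that is then minimized) are our assumptions for the example; the paper does not specify these details.

import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based MI between two intensity volumes."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def brs_score(v_u, v_u_cycled, w_rmse=1.0, w_mi=-1.0):
    """Composite cycle-consistency score (Eq. 10), where v_u_cycled is
    T_qj(T_jq(v_u)) as in Eq. 9. Lower is better under this sign convention."""
    rmse = float(np.sqrt(np.mean((v_u - v_u_cycled) ** 2)))
    mi = mutual_information(v_u, v_u_cycled)
    return w_rmse * rmse + w_mi * mi

# choose the labeled volume q whose transforms are most cycle-consistent, e.g.:
# best_q = min(candidate_qs, key=lambda q: brs_score(v_u, cycle_warp(v_u, q)))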
3.4. Registration-enhanced positive sampling

We next show how to use registration to improve the supervised contrastive learning loss in Eq. 4. Fig. 2 shows the shortcomings of standard positive sampling in comparison to our novel approach integrating registration. Positives a_p derived from (pseudo-)labels are sampled from any location within the same organ or class, as shown in Eq. 4. In contrast, registration-based positives correspond to the exact same anatomical location within the organ, albeit in different volumes or patients. Any noise in registration-based positives stems from registration inaccuracies and is independent of pseudo-label errors. Therefore, we augment the set of positive samples by incorporating registration-based examples. This approach reduces the confirmation bias that can arise when learning only from pseudo-labels.

Assume the xyz coordinate of anchor a_i in an image from volume v_q is denoted by p. We use a registration transform to get the corresponding positive coordinates p_j in v_j:

    p_j = T_qj(p),    (11)

where j ∈ {1, 2, ..., N} and j ≠ q, i.e. we consider all other training volumes in V. Given the p_j, we extract the positive feature a_{p_j}^r from the corresponding feature maps.

Since our mini-batch comprises 2D slices rather than full 3D volumes, there is only a small probability that the feature map containing a given registered point p_j will in fact be available in the current mini-batch. We therefore build a memory bank B to serve as a source of feature maps, which provides more diverse registered positive samples across different 3D volumes. The memory bank B stores feature maps of 2D slices. For every slice in each mini-batch, new feature maps are added to B: if a slice is not yet in B, it is added; otherwise, the existing entry is updated with the new features. Once B reaches its maximum capacity K, the oldest slices are removed in a first-in, first-out (FIFO) order. This provides the model with a more diverse set of features from various 3D volumes.

The positive features a_{p_j}^r are averaged over the available j indices that exist in the memory bank:

    a_p^r = (1/|J|) Σ_{j∈J} a_{p_j},    (12)

where J represents the set of volume indices for which the feature point exists in the memory bank. Note that J is a subset of the total volume indices {1, 2, ..., N}.

Finally, we combine this with the pseudo-label-supervised positive key a_p^l from Eq. 5 to give a single combined positive key a_p for a_i:

    a_p = w_1 a_p^l + w_2 a_p^r.    (13)

We use these positives in the contrastive loss Eq. 4, but otherwise keep it unchanged.
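A compact sketch of the registration-enhanced positive sampling with a FIFO memory bank (Eqs. 11-13) is shown below. The slice keys, the capacity handling, the coordinate lookup, and the default blending weights w1 = w2 = 0.5 are illustrative assumptions rather than details given in the paper.

from collections import OrderedDict
import torch

class SliceFeatureBank:
    """FIFO memory bank of per-slice feature maps used to look up
    registration-derived positives (Sec. 3.4)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.bank = OrderedDict()            # slice_id -> (D, H, W) feature map

    def update(self, slice_id, feat_map):
        if slice_id in self.bank:            # refresh an existing slice
            self.bank.pop(slice_id)
        self.bank[slice_id] = feat_map.detach()
        if len(self.bank) > self.capacity:   # drop the oldest entry (FIFO)
            self.bank.popitem(last=False)

    def positive_at(self, slice_id, yx):
        """Return the feature at registered in-plane coordinates, if cached."""
        if slice_id not in self.bank:
            return None
        y, x = yx
        return self.bank[slice_id][:, y, x]

def combined_positive_key(a_p_label, reg_positives, w1=0.5, w2=0.5):
    """Average the available registered positives (Eq. 12) and blend them
    with the label-guided key a_p^l (Eq. 13)."""
    if len(reg_positives) == 0:
        return a_p_label
    a_p_reg = torch.stack(reg_positives).mean(dim=0)   # Eq. 12
    return w1 * a_p_label + w2 * a_p_reg               # Eq. 13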
4. Experiments

Datasets. We evaluate CCT-R using two challenging benchmark datasets. ACDC [9] comprises 200 short-axis cardiac MR volumes from 100 cases, with segmentation masks provided for the left ventricle (LV), myocardium (Myo), and right ventricle (RV). We allocate 70 cases (1930 slices) for training, 10 for validation, and 20 for testing, as in [57], and match their choice of labeled cases. Synapse [43] consists of abdominal CT volumes from 30 cases, with eight labeled organs: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach. As in [15], we use 18 cases (2212 slices) for training and 12 for testing. We precomputed a composite pairwise registration (affine for ACDC, and affine + B-spline deformable transformation for Synapse) for all training data using ITK [56, 60].

Metrics. For quantitative evaluation, we use two widely-recognized metrics for 2D segmentation: the Dice coefficient (DSC) and the 95% Hausdorff Distance (HD).

Baselines. We first compare with a registration baseline that is not learning-based: we use the transforms to propagate labels from the labeled training cases to the test images, similar to [8, 27, 55], selecting labeled cases with our BRS. We also compare with a joint registration and segmentation model, DeepAtlas [82], which learns registration from scratch simultaneously with segmentation; to stay consistent with our CCT-R, we reimplemented it using a 2D U-Net segmentation model. We evaluate several recent S4 methods with the U-Net [67] backbone: Mean Teacher (MT) [71], Deep Co-Training (DCT) [66], Uncertainty Aware Mean Teacher (UAMT) [85], Interpolation Consistency Training (ICT) [75], Cross Consistency Training (CCT) [64], Cross Pseudo Supervision (CPS) [19], and Cross Teaching Supervision (CTS) [57], which like CCT-R uses Swin-UNet [11] (Transformer) and U-Net backbones. In addition, we include the SOTA S4 method with contrastive learning, MCSC [49]. As a reference, we also train the U-Net backbone from the S4 methods on only the labeled subset of cases (LS), without additional tricks. We also include fully-supervised methods: the same U-Net trained under full supervision (FS), and the SOTA fully-supervised methods BATFormer [47] (on ACDC) and nnFormer [88] (on Synapse). We retrain all baseline models using their recommended hyperparameters, and report the results from [57] or our replication, whichever is better. Furthermore, the results of all baselines are given in the appendix.

Table 1. Segmentation results on ACDC for our method and baselines, according to DSC (%) and HD (mm).
Labeled  Methods  Mean (DSC↑ HD↓)  Myo (DSC↑ HD↓)  LV (DSC↑ HD↓)  RV (DSC↑ HD↓)
70 (100%):
UNet-FS 91.7 4.0 89.0 5.0 94.6 5.9 91.4 1.2
BATFormer [47] 92.8 8.0 90.26 6.8 96.3 5.9 91.97 11.3
7 (10%):
Reg. only (Aff) 30.7 16.4 19.7 13.9 42.0 14.4 30.5 20.8
DeepAtlas [82] 79.4 8.0 79.0 11.7 81.9 3.2 77.3 9.0
UNet-LS 75.9 10.8 78.2 8.6 85.5 13.0 63.9 10.7
MT [71] 80.9 11.5 79.1 7.7 86.1 13.4 77.6 13.3
DCT [66] 80.4 13.8 79.3 10.7 87.0 15.5 75.0 15.3
UAMT [85] 81.1 11.2 80.1 13.7 87.1 18.1 77.6 14.7
ICT [75] 82.4 7.2 81.5 7.8 87.6 10.6 78.2 3.2
CCT [64] 84.0 6.6 82.3 5.4 88.6 9.4 81.0 5.1
CPS [19] 85.0 6.6 82.9 6.6 88.0 10.8 84.2 2.3
CTS [57] 86.4 8.6 84.4 6.9 90.1 11.2 84.8 7.8
MCSC [49] 89.4 2.3 87.6 1.1 93.6 3.5 87.1 2.1
Ours (Affine) 90.3 1.6 87.4 1.4 92.7 2.2 90.9 1.3
3 (5%):
Reg. only (Aff) 32.0 17.8 18.0 15.7 43.9 16.0 34.0 21.7
DeepAtlas [82] 59.0 8.6 62.8 5.4 67.8 7.7 46.4 12.6
UNet-LS 51.2 31.2 54.8 24.4 61.8 24.3 37.0 44.4
MT [71] 56.6 34.5 58.6 23.1 70.9 26.3 40.3 53.9
DCT [66] 58.2 26.4 61.7 20.3 71.7 27.3 41.3 31.7
UAMT [85] 61.0 25.8 61.5 19.3 70.7 22.6 50.8 35.4
ICT [75] 58.1 22.8 62.0 20.4 67.3 24.1 44.8 23.8
CCT [64] 58.6 27.9 64.7 22.4 70.4 27.1 40.8 34.2
CPS [19] 60.3 25.5 65.2 18.3 72.0 22.2 43.8 35.8
CTS [57] 65.6 16.2 62.8 11.5 76.3 15.7 57.7 21.4
MCSC [49] 73.6 10.5 70.0 8.8 79.2 14.9 71.7 7.8
Ours (Affine) 85.7 2.0 83.8 1.4 89.9 2.4 83.5 2.1
1 (1.4%):
Reg. only (Aff) 23.4 19.7 13.6 18.7 31.6 19.0 25.1 21.4
DeepAtlas [82] 40.4 18.5 42.2 11.7 34.7 29.2 44.4 14.6
UNet-LS 26.4 60.1 26.3 51.2 28.3 52.0 24.6 77.0
CTS [57] 46.8 36.3 55.1 5.5 64.8 4.1 20.5 99.4
MCSC [49] 58.6 31.2 64.2 13.3 78.1 12.2 33.5 68.1
Ours (Affine) 80.4 3.5 78.3 3.2 83.6 4.3 79.3 2.9
Best is bold, Second Best is underlined.

Table 2. Segmentation results on Synapse for our method and baselines, according to DSC (%) and HD (mm).
Labeled  Methods  DSC↑ HD↓  Aorta Gallb Kid L Kid R Liver Pancr Spleen Stom
18 (100%):
UNet-FS 75.6 42.3 88.8 56.1 78.9 72.6 91.9 55.8 85.8 74.7
nnFormer 86.6 10.6 92.0 70.2 86.6 86.3 96.8 83.4 90.5 86.8
4 (20%):
Reg. only (Affine) 27.0 39.6 16.0 7.5 36.4 33.0 56.8 13.1 28.5 25.1
Reg. only (Aff+Def) 32.5 36.5 29.7 4.8 36.5 29.4 65.5 14.2 48.0 31.7
DeepAtlas [82] 56.1 85.3 69.2 43.3 50.8 55.2 88.8 30.5 62.7 48.0
UNet-LS 47.2 122.3 67.6 29.7 47.2 50.7 79.1 25.2 56.8 21.5
UAMT [85] 51.9 69.3 75.3 33.4 55.3 40.8 82.6 27.5 55.9 44.7
CPS [19] 57.9 62.6 75.6 41.4 60.1 53.0 88.2 26.2 69.6 48.9
CTS [57] 64.0 56.4 79.9 38.9 66.3 63.5 86.1 41.9 75.3 60.4
MCSC [49] 68.5 24.8 76.3 44.4 73.4 72.3 91.8 46.9 79.9 62.9
Ours (Affine) 70.0 23.2 79.8 34.5 71.0 70.7 92.8 49.6 87.4 74.4
Ours (Affine+Deform) 71.4 21.1 80.4 42.3 73.0 70.0 93.7 49.4 87.9 74.2
2 (10%):
Reg. only (Affine) 25.4 36.8 17.5 3.5 32.7 27.5 53.4 12.6 33.4 22.5
Reg. only (Aff+Def) 29.1 44.0 27.2 11.3 28.6 26.5 66.4 12.7 29.7 30.3
DeepAtlas [82] 44.0 67.1 68.0 24.9 37.9 46.0 82.7 18.4 44.2 30.6
UNet-LS 45.2 55.6 66.4 27.2 46.0 48.0 82.6 18.2 39.9 33.4
UAMT [85] 49.5 62.6 71.3 21.1 62.6 51.4 79.3 22.8 58.2 29.0
CPS [19] 48.8 65.6 70.9 21.3 58.0 45.1 80.7 23.5 58.0 32.7
CTS [57] 55.2 45.4 71.5 25.6 62.6 67.5 78.2 26.3 75.9 34.3
MCSC [49] 61.1 32.6 73.9 26.4 69.9 72.7 90.0 33.2 79.4 43.0
Ours (Affine) 65.1 22.5 75.7 28.4 74.5 75.0 91.8 38.0 82.3 55.1
Ours (Affine+Deform) 66.5 19.7 77.6 34.4 75.1 74.2 92.6 39.5 82.1 56.1
1 (5%):
Reg. only (Affine) 26.4 45.0 16.3 6.6 35.8 32.8 53.5 14.4 28.7 22.7
Reg. only (Aff+Def) 27.4 52.2 26.4 11.3 30.5 27.1 61.6 12.8 26.3 23.6
DeepAtlas [82] 16.1 72.3 18.4 14.9 1.2 10.1 57.1 0.6 14.4 12.2
UNet-LS 13.7 116.5 11.6 17.8 0.8 1.8 56.9 0.1 8.7 11.6
UAMT [85] 10.7 90.2 8.0 9.3 0.3 8.1 31.7 1.1 13.1 14.3
CPS [19] 15.0 123.5 19.6 9.6 5.6 6.9 59.4 2.3 9.4 7.2
CTS [57] 26.3 96.5 44.6 4.0 11.2 5.5 60.3 9.6 54.1 21.2
MCSC [49] 34.0 53.8 50.9 13.0 17.6 54.6 64.3 5.5 43.1 23.5
Ours (Affine) 43.4 40.8 62.5 13.3 17.9 71.0 77.0 11.4 65.4 28.7
Ours (Affine+Deform) 47.6 38.4 65.5 9.3 50.6 70.2 72.7 11.1 73.9 27.8
Best is bold, Second Best is underlined.

Implementation details. For all methods we use random cropping, random flipping and rotations for augmentation. All methods were trained until convergence, or for up to 40,000 iterations. We used the AdamW optimizer with a weight decay of 5 × 10^-4. The learning rate followed a polynomial schedule, starting at 5 × 10^-4 for the U-Net and 1 × 10^-4 for the Swin-UNet. Our training batches consisted of 8 images for ACDC and 24 images for Synapse, evenly split between labeled and unlabeled. In the contrastive learning module, each projector H_* is composed of two linear layers, outputting 256 and 128 channels, respectively. In Eq. 6, w_cps is defined by a Gaussian warm-up function [57]: w_cps(i) = 0.1 · exp(-5(1 - i/t_total)^2), where i is the index of the current training iteration and t_total is the total number of iterations, while w_cl is set to a constant value of 10^-3. In Eq. 4, the temperature is τ = 0.1. In the REPS module, the bank size K = (M + K)/5. We implemented our method in PyTorch. All experiments were run on one RTX 3090 GPU.
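As a small illustration of the schedule above, the following sketch computes the warm-up weight for the cross-teaching term; the function name and the fixed iteration budget in the example are our own choices, not code from the paper.

import math

def cps_weight(iteration, total_iterations, w_max=0.1):
    """Gaussian warm-up for w_cps (cf. [57]):
    w_cps(i) = w_max * exp(-5 * (1 - i / t_total)^2)."""
    t = min(iteration / total_iterations, 1.0)
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)

W_CL = 1e-3   # constant weight of the contrastive term
TAU = 0.1     # temperature in Eq. 4

# example: ramps from ~6.7e-4 at iteration 0 up to 0.1 at 40,000 iterations
weights = [cps_weight(i, 40_000) for i in (0, 20_000, 40_000)]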
4.1. Comparison with Existing Methods

ACDC. Table 1 presents quantitative results for our CCT-R and baselines, under three different levels of supervision (7, 3, and 1 labeled cases). When trained on 7 labeled cases (10%), CCT-R significantly outperforms the baseline CTS, with more than a 4% improvement in DSC and a reduction of 7 mm in HD. With just 5% of labeled data (3 cases), our CCT-R surpasses CTS and the SOTA MCSC by an impressive margin of 20% and 12% in DSC, and a reduction of 14 mm and 8.5 mm in HD, respectively. When the supervision is reduced to one labeled case, our approach outperforms the SOTA by an even larger margin (DSC of 80.4 vs. 58.6 for MCSC), highlighting its robustness in scenarios with extremely limited labeled data. DeepAtlas, a joint registration and segmentation method, underperforms. This may be due to its lack of advanced S4 techniques, and its online learning of registration, which means registrations are inaccurate early in training and provide poor guidance for segmentation. Qualitative results in Fig. 3 (left) further illustrate the superiority of CCT-R, showing more accurate segmentation with fewer under-segmented regions for the RV (bottom) and fewer false positives (top) compared to CTS.
Figure 3. Qualitative segmentation results (legend: RV, Myo, LV for ACDC; aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach for Synapse).

Synapse. We evaluate performance on the Synapse dataset using 4, 2, and 1 labeled cases. Although Synapse is more challenging than ACDC due to greater class imbalance and anatomical variability, CCT-R demonstrates even larger improvements than on ACDC (Table 2). With 4 labeled cases, DSC increases from 64.0% to 71.4%, outperforming CTS by 7.4% and MCSC by 2.9%. Even with just one labeled case, CCT-R still excels at segmenting challenging small organs like the aorta, kidney, and pancreas, where others struggle. It significantly outperforms MCSC, improving the mean DSC by 13.6% and reducing HD by 15.4 mm. This robustness to extreme class imbalance and limited supervision emphasizes the value of registration information. Furthermore, our approach is robust across varying registration qualities. Even with simpler affine registrations, inaccurate for complex abdominal anatomy, it significantly improves segmentation (Ours (Affine) rows) over not using registration, though results are better still with deformable transforms (Ours (Affine+Deform)). Fig. 3 (right) shows CCT-R accurately segments small structures like the gallbladder and pancreas, often missed or over-segmented by LS and CTS. Our approach also correctly identifies the spleen and distinguishes it from the liver, a common error in other methods. It also provides more precise segmentation of the liver and stomach, significantly outperforming MCSC. This figure shows the robustness of our method in handling challenging, imbalanced datasets.

Segmentation via registration only. We also test whether simply propagating labels based on either affine or deformable registration achieves adequate segmentation performance (Reg. only rows in Tables 1 & 2). We see this performs substantially worse than the learning-based methods.

4.2. Benefit of Our Registration-Based Modules Applied on Different Baselines

Our main experiments build on CTS; however, to show the wide applicability of our approach, we measure performance when it is integrated with alternative SSL baselines (Table 3). We include UAMT [85], a classic teacher-student framework with two U-Nets; CPS [19], a student-student framework with two cross-teaching U-Nets; and CTS [57], which improves CPS by replacing one of the U-Nets with Swin-UNet. With each baseline, we measure the benefit of adding RSL only, and RSL in conjunction with contrastive learning and registration-based positive selection (SCL + REPS row). Our registration-derived modules boost all baselines. Enhanced UAMT approaches CTS performance, while improved CPS surpasses CTS by 4% on DSC. CTS with our modules remains the top performer.

Table 3. Benefit of our modules combined with different baselines, on Synapse with 10% labeled data.
Method  UAMT [85] (DSC↑ HD↓)  CPS [19] (DSC↑ HD↓)  CTS [57] (DSC↑ HD↓)
Baselines 49.5 62.6 48.8 65.6 55.2 45.4
+ RSL 52.3 60.3 57.3 42.4 65.4 28.5
+ RSL + SCL + REPS 54.6 55.6 59.1 37.5 66.5 19.7

4.3. Comparison with Alternative Supervised Contrastive Learning Losses

In Table 4, we compare our proposed approach with the state-of-the-art contrastive S4 method MCSC [49], and with incorporating other recent patch-level and slice-level contrastive learning techniques (GLCL [36] and ReCo [51]) into CTS. While all the contrastive losses improve on vanilla CTS, our CCT-R achieves higher segmentation accuracy on nearly all datasets and labelling rates.

Table 4. Comparisons with SOTA contrastive learning methods combined with CTS, on ACDC and Synapse.
Contrastive learning method  ACDC 3 (5%) (DSC↑ HD↓)  ACDC 1 (1.4%) (DSC↑ HD↓)  Synapse 4 (20%) (DSC↑ HD↓)  Synapse 2 (10%) (DSC↑ HD↓)
Patch-level: GLCL [36] (MICCAI'21) 71.7 3.8 47.4 35.8 67.7 42.6 59.7 34.6
Patch-level: MCSC [49] (BMVC'23) 73.6 10.5 58.6 31.2 68.5 24.8 61.1 32.6
Slice-level: ReCo [51] (ICLR'22) 70.2 6.1 48.3 33.5 68.3 25.9 60.4 20.7
None (Vanilla CTS [57]) 65.6 16.2 46.8 36.3 64.0 56.4 57.2 45.7
Ours 85.4 2.6 80.0 4.2 71.4 21.1 66.5 19.7
Best is bold.

Table 5. Ablation study for the primary components of our CCT-R. SCL: typical supervised local contrastive loss. RSL: registration supervision loss. BRS: best registration selection strategy for registered labels r^u. REPS: registration-enhanced positive sampling module (using positives from registration in SCL).
SCL RSL BRS REPS  1 (5%) (DSC↑ HD↓)  2 (10%) (DSC↑ HD↓)
-   -   -   -     26.3 96.5          55.2 45.4
-   ✓   -   -     29.0 46.9          64.2 33.9
-   ✓   ✓   -     –    –             65.4 28.5
✓   -   -   -     27.5 59.8          63.1 29.1
✓   ✓   ✓   -     28.1 53.9          64.8 20.6
✓   -   -   ✓     31.4 55.2          63.9 29.7
✓   ✓   ✓   ✓     47.6 38.4          66.5 19.7

4.4. Ablation Studies and Analysis

We conduct an ablation study on Synapse, measuring the importance of various aspects of our proposed CCT-R (Table 5). CTS, as our baseline, achieves a Dice of 26.3% and 55.2% for one and two labeled cases respectively (top row). Our registration supervision loss (RSL) improves the baseline by +2.7% and +9.0%. The best registration selection strategy (BRS), which is only applicable for two or more labeled cases, further boosts performance by an additional +1.2% in DSC and reduces HD by 5.4 mm.
Adding a standard supervised local contrastive loss (SCL) improves the baseline by +1.2% and +7.9% respectively, even without registration; also incorporating RSL gives further improvements of 0.6% and 1.7%, indicating that contrastive learning and RSL are complementary strategies. The registration-enhanced positive sampling (REPS), which mitigates the bias towards single pseudo-label supervision in SCL, yields significant improvements: +3.9% DSC and -4.6 mm HD for one labeled case, and +0.8% DSC for two labeled cases, versus just SCL. Lastly, when combining all components, our full method achieves a substantial Dice score improvement over the CTS baseline of 21.3% for 1 labeled case (from 26.3% to 47.6%) and 11.3% for 2 labeled cases (from 55.2% to 66.5%).

Figure 4. DSC of pseudo-labels from two models on unlabeled data during the early training stages, for Synapse with (a) 1 labeled case and (b) 2 labeled cases.

Analysing the quality of pseudo-labels. We measured the DSC of pseudo-labels predicted for unlabeled training data and used for cross-teaching, illustrating the noisiness of pseudo-labels and demonstrating how the proposed RSL mitigates this issue. Fig. 4 shows that early in training, cross-teaching models without RSL (dashed lines) yield suboptimal results due to insufficient training. This limitation persists even in later training stages, as the model struggles to generalize and often converges to local optima, especially in the 5% labeled setting. In contrast, the supervision provided by registrations, RSL, offers consistent and reliable guidance throughout the training process (solid lines), significantly mitigating these issues and enabling more effective learning from limited data.

5. Conclusion

We have introduced CCT-R, a registration-guided method for semi-supervised medical image segmentation. It builds on cross-teaching methods, and improves segmentation via two novel modules: the Registration Supervision Loss and the Registration-Enhanced Positive Sampling module. The RSL uses segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of supervision for the models. With the REPS, supervised contrastive learning can sample anatomically-corresponding positives across volumes. Without introducing extra training parameters, CCT-R achieves a new SOTA on popular S4 benchmarks.

References
[1] Julia Andresen, Timo Kepp, Jan Ehrhardt, Claus von der Burchard, Johann Roider, and Heinz Handels. Deep learning-based simultaneous registration and unsupervised non-correspondence segmentation of medical images with pathologies. International Journal of Computer Assisted Radiology and Surgery, 17(4):699-710, 2022.
[2] Rajath C Aralikatti, SJ Pawan, and Jeny Rajan. A dual-stage semi-supervised pre-training approach for medical image segmentation. IEEE Transactions on Artificial Intelligence, 5(2):556-565, 2023.
[3] Brian B Avants, Nick Tustison, Gang Song, et al. Advanced normalization tools (ANTs). Insight J, 2(365):1-35, 2009.
[4] Yunhao Bai, Duowen Chen, Qingli Li, Wei Shen, and Yan Wang. Bidirectional copy-paste for semi-supervised medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11514-11524, 2023.
[5] Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. VoxelMorph: A learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38(8):1788-1800, 2019.
[6] Hritam Basak and Zhaozheng Yin. Pseudo-label guided contrastive learning for semi-supervised medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19786-19797, June 2023.
[7] Laurens Beljaards, Mohamed S Elmahdy, Fons Verbeek, and Marius Staring. A cross-stitch architecture for joint registration and segmentation in adaptive radiotherapy. In Medical Imaging with Deep Learning, pages 62-74. PMLR, 2020.
[8] Noah C Benson, Omar H Butt, David H Brainard, and Geoffrey K Aguirre. Correction of distortion in flattened representations of the cortical surface allows prediction of V1-V3 functional organization from anatomy. PLoS Computational Biology, 10(3):e1003538, 2014.
[9] Olivier Bernard et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE T Med Imaging, 37(11):2514-2525, 2018.
[10] Gerda Bortsova et al. Semi-supervised medical image segmentation via learning consistency under transformations. In MICCAI, pages 810-818. Springer, 2019.
[11] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021.
[12] Krishna Chaitanya et al. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv Neur In, 33:12546-12558, 2020.
[13] Krishna Chaitanya et al. Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation. Med Image Anal, 87:102792, 2023.
[14] Chen Chen et al. Deep learning for cardiac image segmentation: a review. Front Cardiovasc Med, 7:25, 2020.
[15] Jieneng Chen et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[16] Ting Chen et al. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607. PMLR, 2020.
[17] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243-22255, 2020.
[18] Xinlei Chen et al. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[19] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613-2622, 2021.
[20] Neel Dey, Jo Schlemper, Seyed Sadegh Mohseni Salehi, Bo Zhou, Guido Gerig, and Michal Sofka. ContraReg: Contrastive learning of multi-modality unsupervised deformable image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 66-77. Springer, 2022.
[21] Wangbin Ding, Lei Li, Junyi Qiu, Sihan Wang, Liqin Huang, Yinyin Chen, Shan Yang, and Xiahai Zhuang. Aligning multi-sequence CMR towards fully automated myocardial pathology segmentation. IEEE Transactions on Medical Imaging, 42(12):3474-3486, 2023.
[22] Yuzhen Ding, Hongying Feng, Yunze Yang, Jason Holmes, Zhengliang Liu, David Liu, William W Wong, Nathan Y Yu, Terence T Sio, Steven E Schild, et al. Deep-learning based fast and accurate 3D CT deformable image registration in lung cancer. Medical Physics, 50(11):6864-6880, 2023.
[23] Mohamed S. Elmahdy, Laurens Beljaards, Sahar Yousefi, Hessam Sokooti, Fons Verbeek, Uulke A. Van Der Heide, and Marius Staring. Joint registration and segmentation via multi-task learning for adaptive radiotherapy of prostate cancer. IEEE Access, 9:95551-95568, 2021.
[24] Koen AJ Eppenhof and Josien PW Pluim. Pulmonary CT registration through supervised learning with convolutional neural networks. IEEE Transactions on Medical Imaging, 38(5):1097-1105, 2018.
[25] Jiashuo Fan, Bin Gao, Huan Jin, and Lihui Jiang. UCC: Uncertainty guided cross-head co-training for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9947-9956, 2022.
[26] Bei Fang, Xian Li, Guangxin Han, and Juhou He. Rethinking pseudo-labeling for semi-supervised facial expression recognition with contrastive self-supervised learning. IEEE Access, 11:45547-45558, 2023.
[27] Bruce Fischl, David H Salat, Evelina Busa, Marilyn Albert, Megan Dieterich, Christian Haselgrove, Andre Van Der Kouwe, Ron Killiany, David Kennedy, Shuna Klaveness, et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron, 33(3):341-355, 2002.
[28] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. In Proceedings of the IEEE/CVF International Conference on Learning Representations, 2019.
[29] Jean-Bastien Grill et al. Bootstrap your own latent - a new approach to self-supervised learning. NIPS, 33:21271-21284, 2020.
[30] Jean-Bastien Grill et al. Bootstrap your own latent - a new approach to self-supervised learning. NIPS, 33:21271-21284, 2020.
[31] Xiao Gu, Fani Deligianni, Jinpei Han, Xiangyu Liu, Wei Chen, Guang-Zhong Yang, and Benny Lo. Beyond supervised learning for pervasive healthcare. IEEE Reviews in Biomedical Engineering, 17:42-62, 2024.
[32] Gilion Hautvast, Steven Lobregt, Marcel Breeuwer, and Frans Gerritsen. Automatic contour propagation in cine cardiac magnetic resonance images. IEEE Transactions on Medical Imaging, 25(11):1472-1482, 2006.
[33] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729-9738, 2020.
[34] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729-9738, 2020.
[35] Thao Thi Ho, Woo Jin Kim, Chang Hyun Lee, Gong Yong Jin, Kum Ju Chae, and Sanghun Choi. An unsupervised image registration method employing chest computed tomography images and deep neural networks. Computers in Biology and Medicine, 154:106612, 2023.
[36] Xinrong Hu et al. Semi-supervised contrastive learning for label-efficient medical image segmentation. In MICCAI, pages 481-490. Springer, 2021.
[37] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis, 49:1-13, 2018.
[38] Bin Huang, Yufeng Ye, Ziyue Xu, Zongyou Cai, Yan He, Zhangnan Zhong, Lingxiang Liu, Xin Chen, Hanwei Chen, and Bingsheng Huang. 3D lightweight network for simultaneous registration and segmentation of organs-at-risk in CT images of head and neck cancer. IEEE Transactions on Medical Imaging, 41(4):951-964, 2022.
[39] Shirui Huang, Keyan Wang, Huan Liu, Jun Chen, and Yunsong Li. Contrastive semi-supervised learning for underwater image restoration via reliable bank. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18145-18155, 2023.
[40] Prannay Khosla et al. Supervised contrastive learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, NIPS, volume 33, pages 18661-18673. Curran Associates, Inc., 2020.
[41] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661-18673, 2020.
[42] Arno Klein and Joy Hirsch. Mindboggle: a scatterbrained approach to automate brain labeling. NeuroImage, 24(2):261-280, 2005.
[43] Bennett Landman et al. MICCAI multi-atlas labeling beyond the cranial vault - workshop and challenge. In MICCAI, volume 5, page 12, 2015.
[44] Tao Lei et al. Semi-supervised medical image segmentation using adversarial consistency learning and dynamic convolution network. 2022.
[45] Yiwen Li, Yunguan Fu, Iani JMB Gayo, Qianye Yang, Zhe Min, Shaheer U Saeed, Wen Yan, Yipei Wang, J Alison Noble, Mark Emberton, et al. Prototypical few-shot segmentation for cross-institution male pelvic structures with spatial registration. Medical Image Analysis, 90:102935, 2023.
[46] Huibin Lin, Chun-Yang Zhang, Shiping Wang, and Wenzhong Guo. A probabilistic contrastive framework for semi-supervised learning. IEEE Transactions on Multimedia, 25:8767-8779, 2023.
[47] Xian Lin et al. BATFormer: Towards boundary-aware lightweight transformer for efficient medical image segmentation. IEEE JBHI, 2023.
[48] Lihao Liu, Angelica I Aviles-Rivero, and Carola-Bibiane Schönlieb. Contrastive registration for unsupervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[49] Qianying Liu, Xiao Gu, Paul Henderson, and Fani Deligianni. Multi-scale cross contrastive learning for semi-supervised medical image segmentation. In 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023. BMVA, 2023.
[50] Shikun Liu et al. Bootstrapping semantic segmentation with regional contrast. In ICLR, 2022.
[51] Shikun Liu, Shuaifeng Zhi, Edward Johns, and Andrew Davison. Bootstrapping semantic segmentation with regional contrast. In International Conference on Learning Representations (ICLR), 2022.
[52] Yang Liu and Shi Gu. Co-learning semantic-aware unsupervised segmentation for pathological image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 537-547. Springer, 2023.
[53] Nikos K Logothetis. What we can do and what we cannot do with fMRI. Nature, 453(7197):869-878, 2008.
[54] Zijun Long, George Killick, Lipeng Zhuang, Richard McCreadie, Gerardo Aragon Camarasa, and Paul Henderson. Elucidating and overcoming the challenges of label noise in supervised contrastive learning. In European Conference on Computer Vision, 2024.
[55] Maria Lorenzo-Valdés, Gerardo I Sanchez-Ortiz, Raad Mohiaddin, and Daniel Rueckert. Atlas-based segmentation and tracking of 3D cardiac MR images using non-rigid registration. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2002: 5th International Conference, Tokyo, Japan, September 25-28, 2002, Proceedings, Part I 5, pages 642-650. Springer, 2002.
[56] Bradley C Lowekamp, David T Chen, Luis Ibáñez, and Daniel Blezek. The design of SimpleITK. Frontiers in Neuroinformatics, 7:45, 2013.
[57] Xiangde Luo, Minhao Hu, Tao Song, Guotai Wang, and Shaoting Zhang. Semi-supervised medical image segmentation via cross teaching between CNN and transformer. In International Conference on Medical Imaging with Deep Learning, pages 820-833. PMLR, 2022.
[58] J.B.A. Maintz and M.A. Viergever. A survey of medical image registration. Medical Image Analysis, 2(1):1-36, 1998.
[59] David Mattes, David R Haynor, Hubert Vesselle, Thomas K Lewellen, and William Eubank. PET-CT image registration in the chest using free-form deformations. IEEE Transactions on Medical Imaging, 22(1):120-128, 2003.
[60] Matthew McCormick, Xiaoxiao Liu, Julien Jomier, Charles Marion, and Luis Ibanez. ITK: enabling reproducible research and open science. Frontiers in Neuroinformatics, 8:13, 2014.
[61] Seungjong Oh, David Jaffray, and Young-Bin Cho. A novel method to quantify and compare anatomical shape: application in cervix cancer radiotherapy. Physics in Medicine & Biology, 59(11):2687, 2014.
[62] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and Lennart Svensson. ClassMix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1369-1378, 2021.
[63] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[64] Yassine Ouali et al. Semi-supervised semantic segmentation with cross-consistency training. In CVPR, pages 12674-12684, 2020.
[65] Jizong Peng et al. Deep co-training for semi-supervised image segmentation. Lect Notes Comput Sc, 107:107269, 2020.
[66] Siyuan Qiao et al. Deep co-training for semi-supervised image recognition. In ECCV, pages 135-152, 2018.
[67] Olaf Ronneberger et al. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241. Springer, 2015.
[68] László Ruskó, György Bekes, and Márta Fidrich. Automatic segmentation of the liver from multi- and single-phase contrast-enhanced CT images. Medical Image Analysis, 13(6):871-882, 2009.
[69] Hessam Sokooti, Bob De Vos, Floris Berendsen, Boudewijn PF Lelieveldt, Ivana Išgum, and Marius Staring. Nonrigid image registration using multi-scale 3D convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 232-239. Springer, 2017.
[70] Xinrui Song, Hanqing Chao, Xuanang Xu, Hengtao Guo, Sheng Xu, Baris Turkbey, Bradford J Wood, Thomas Sanford, Ge Wang, and Pingkun Yan. Cross-modal attention for multi-modal image registration. Medical Image Analysis, 82:102612, 2022.
[71] Antti Tarvainen et al. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS, 30, 2017.
[72] J.P. Thirion. Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis, 2(3):243-260, 1998.
[73] Maria Thor, Jørgen BB Petersen, Lise Bentzen, Morten Høyer, and Ludvig Paul Muren. Deformable image registration for contour propagation from CT to cone-beam CT scans in radiotherapy of prostate cancer. Acta Oncologica, 50(6):918-925, 2011.
[74] Yuandong Tian et al. Understanding self-supervised learning dynamics without contrastive pairs. In ICML, pages 10268-10278. PMLR, 2021.
[75] Vikas Verma et al. Interpolation consistency training for semi-supervised learning. Neural Networks, 145:90-106, 2022.
[76] Paul Viola and William M. Wells III. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137-154, 1997.
[77] Kaiping Wang et al. Semi-supervised medical image segmentation via a tripled-uncertainty guided mean teacher model with contrastive learning. Med Image Anal, 79:102447, 2022.
[78] Xinlong Wang et al. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024-3033, 2021.
[79] Zhiwei Wang, Xiaoyu Zeng, Chongwei Wu, Xu Zhang, Wei Fang, Qiang Li, et al. StyleSeg v2: Towards robust one-shot segmentation of brain tissue via optimization-free registration error perception. arXiv preprint arXiv:2405.03197, 2024.
[80] Huisi Wu et al. Cross-patch dense contrastive learning for semi-supervised segmentation of cellular nuclei in histopathologic images. In CVPR, pages 11666-11675, 2022.
[81] Zhenda Xie et al. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684-16693, 2021.
[82] Zhenlin Xu and Marc Niethammer. DeepAtlas: Joint semi-supervised learning of image registration and segmentation. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part II 22, pages 420-429. Springer, 2019.
[83] Tokihiro Yamamoto, Sven Kabus, Jens Von Berg, Cristian Lorenz, and Paul J Keall. Impact of four-dimensional computed tomography pulmonary ventilation imaging-based functional avoidance for lung cancer radiotherapy. International Journal of Radiation Oncology* Biology* Physics, 79(1):279-288, 2011.
[84] Fan Yang et al. Class-aware contrastive semi-supervised learning. In CVPR, pages 14421-14430, 2022.
[85] Lequan Yu et al. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In MICCAI, pages 605-613. Springer, 2019.
[86] Xiangyu Zhao et al. RCPS: Rectified contrastive pseudo supervision for semi-supervised medical image segmentation. arXiv preprint arXiv:2301.05500, 2023.
[87] Yuanyi Zhong et al. Pixel contrastive-consistent semi-supervised semantic segmentation. In CVPR, pages 7273-7282, 2021.
[88] Hong-Yu Zhou et al. nnFormer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201, 2021.
A. Additional Results
Here we show extended versions of Table 1 and Table 2
in the main paper as Table 6 and Table 7. In these extended
tables, we provide additional comparisons by separately
evaluating the performance of the two branches (CNN and
Transformer) of our CCT-R (whereas in the main paper we
use the mean of their logits); we also give results for all
baselines under three different settings on both datasets. It
can be seen that on the ACDC dataset, the performance
of CCT-R’s CNN and Transformer branches is quite simi-
lar. However, on the more challenging Synapse dataset, the
Transformer outperforms the CNN, likely due to its superior
ability to capture long-range dependencies, which allows it
to better handle the relationships between large and small
organs.
Table 6. Segmentation results on ACDC for our method CCT-R and baselines, according to DSC(%) and HD(mm) for organs.
Labeled  Methods  Mean (DSC↑ HD↓)  Myo (DSC↑ HD↓)  LV (DSC↑ HD↓)  RV (DSC↑ HD↓)
UNet-FS 91.7 4.0 89.0 5.0 94.6 5.9 91.4 1.2
70 (100%)
BATFormer [47] 92.8 8.0 90.26 6.8 96.3 5.9 91.97 11.3
Reg. only (Aff) 30.7 16.4 19.7 13.9 42.0 14.4 30.5 20.8
DeepAtlas [82] 79.4 8.0 79.0 11.7 81.9 3.2 77.3 9.0
UNet-LS 75.9 10.8 78.2 8.6 85.5 13.0 63.9 10.7
MT [71] 80.9 11.5 79.1 7.7 86.1 13.4 77.6 13.3
DCT [66] 80.4 13.8 79.3 10.7 87.0 15.5 75.0 15.3
UAMT [85] 81.1 11.2 80.1 13.7 87.1 18.1 77.6 14.7
7 (10%) ICT [75] 82.4 7.2 81.5 7.8 87.6 10.6 78.2 3.2
CCT [64] 84.0 6.6 82.3 5.4 88.6 9.4 81.0 5.1
CPS [19] 85.0 6.6 82.9 6.6 88.0 10.8 84.2 2.3
CTS [57] 86.4 8.6 84.4 6.9 90.1 11.2 84.8 7.8
MCSC [49] 89.4 2.3 87.6 1.1 93.6 3.5 87.1 2.1
Ours (CNN, Affine) 89.5 1.8 87.2 2.0 92.9 1.8 88.4 1.7
Ours (Trans, Affine) 89.1 1.8 85.7 1.2 91.7 2.8 89.9 1.3
Ours (mean, Affine) 90.3 1.6 87.4 1.4 92.7 2.2 90.9 1.3
Reg. only (Aff) 32.0 17.8 18.0 15.7 43.9 16.0 34.0 21.7
DeepAtlas [82] 59.0 8.6 62.8 5.4 67.8 7.7 46.4 12.6
UNet-LS 51.2 31.2 54.8 24.4 61.8 24.3 37.0 44.4
MT [71] 56.6 34.5 58.6 23.1 70.9 26.3 40.3 53.9
DCT [66] 58.2 26.4 61.7 20.3 71.7 27.3 41.3 31.7
UAMT [85] 61.0 25.8 61.5 19.3 70.7 22.6 50.8 35.4
3 (5%) ICT [75] 58.1 22.8 62.0 20.4 67.3 24.1 44.8 23.8
CCT [64] 58.6 27.9 64.7 22.4 70.4 27.1 40.8 34.2
CPS [19] 60.3 25.5 65.2 18.3 72.0 22.2 43.8 35.8
CTS [57] 65.6 16.2 62.8 11.5 76.3 15.7 57.7 21.4
MCSC [49] 73.6 10.5 70.0 8.8 79.2 14.9 71.7 7.8
Ours (CNN, Affine) 85.2 1.9 83.3 1.5 89.9 2.9 82.4 2.2
Ours (Trans, Affine) 85.4 2.6 83.2 1.8 89.3 3.8 83.5 2.1
Ours (mean, Affine) 85.7 2.0 83.8 1.4 89.9 2.4 83.5 2.1
Reg. only (Aff) 23.4 19.7 13.6 18.7 31.6 19.0 25.1 21.4
DeepAtlas [82] 40.4 18.5 42.2 11.7 34.7 29.2 44.4 14.6
UNet-LS 26.4 60.1 26.3 51.2 28.3 52.0 24.6 77.0
1 (1.4%) CTS [57] 46.8 36.3 55.1 5.5 64.8 4.1 20.5 99.4
MCSC [49] 58.6 31.2 64.2 13.3 78.1 12.2 33.5 68.1
Ours (CNN, Affine) 79.6 5.2 77.6 5.3 83.2 5.1 78.0 5.1
Ours (Trans, Affine) 80.0 4.2 77.7 4.0 83.0 4.2 79.4 3.6
Ours (mean, Affine) 80.4 3.5 78.3 3.2 83.6 4.3 79.3 2.9
Best is bold, Second Best is underlined.
Table 7. Segmentation results on Synapse for our method CCT-R and baselines, according to DSC(%) and HD(mm).
Labeled Methods DSC↑ HD↓ Aorta Gallb Kid L Kid R Liver Pancr Spleen Stom
UNet-FS 75.6 42.3 88.8 56.1 78.9 72.6 91.9 55.8 85.8 74.7
18(100%)
nnFormer 86.6 10.6 92.0 70.2 86.6 86.3 96.8 83.4 90.5 86.8
Reg. only (Affine) 27.0 39.6 16.0 7.5 36.4 33.0 56.8 13.1 28.5 25.1
Reg. only (Aff+Def) 32.5 36.5 29.7 4.8 36.5 29.4 65.5 14.2 48.0 31.7
DeepAtlas [82] 56.1 85.3 69.2 43.3 50.8 55.2 88.8 30.5 62.7 48.0
UNet-LS 47.2 122.3 67.6 29.7 47.2 50.7 79.1 25.2 56.8 21.5
UAMT [85] 51.9 69.3 75.3 33.4 55.3 40.8 82.6 27.5 55.9 44.7
ICT [75] 57.5 79.3 74.2 36.6 58.3 51.7 86.7 34.7 66.2 51.6
CCT [64] 51.4 102.9 71.8 31.2 52.0 50.1 83.0 32.5 65.5 25.2
4(20%) CPS [19] 57.9 62.6 75.6 41.4 60.1 53.0 88.2 26.2 69.6 48.9
CTS [57] 64.0 56.4 79.9 38.9 66.3 63.5 86.1 41.9 75.3 60.4
MCSC [49] 68.5 24.8 76.3 44.4 73.4 72.3 91.8 46.9 79.9 62.9
Ours (CNN, Affine) 67.3 37.9 79.0 36.5 72.7 70.4 87.9 47.3 77.8 67.0
Ours (Trans, Affine) 70.5 22.7 81.0 34.1 71.1 71.9 93.2 49.9 87.9 75.2
Ours (mean, Affine) 70.0 23.2 79.8 34.5 71.0 70.7 92.8 49.6 87.4 74.4
Ours (CNN, Affine+Deform) 69.5 36.2 80.0 49.2 73.0 69.9 89.3 48.5 79.5 66.7
Ours (Trans, Affine+Deform) 72.5 20.5 80.9 43.4 75.6 75.1 93.5 51.3 87.4 72.2
Ours (mean, Affine+Deform) 71.4 21.1 80.4 42.3 73.0 70.0 93.7 49.4 87.9 74.2
Reg. only (Affine) 25.4 36.8 17.5 3.5 32.7 27.5 53.4 12.6 33.4 22.5
Reg. only (Aff+Def) 29.1 44.0 27.2 11.3 28.6 26.5 66.4 12.7 29.7 30.3
DeepAtlas [82] 44.0 67.1 68.0 24.9 37.9 46.0 82.7 18.4 44.2 30.6
UNet-LS 45.2 55.6 66.4 27.2 46.0 48.0 82.6 18.2 39.9 33.4
UAMT [85] 49.5 62.6 71.3 21.1 62.6 51.4 79.3 22.8 58.2 29.0
ICT [75] 49.0 59.9 68.9 19.9 52.5 52.2 83.7 25.4 53.2 36.0
CCT [64] 46.9 58.2 66.0 26.6 53.4 41.0 82.9 21.2 48.7 35.6
2(10%) CPS [19] 48.8 65.6 70.9 21.3 58.0 45.1 80.7 23.5 58.0 32.7
CTS [57] 55.2 45.4 71.5 25.6 62.6 67.5 78.2 26.3 75.9 34.3
MCSC [49] 61.1 32.6 73.9 26.4 69.9 72.7 90.0 33.2 79.4 43.0
Ours (CNN, Affine) 60.4 37.1 77.0 27.8 70.8 69.0 88.4 35.4 67.0 47.7
Ours (Trans, Affine) 64.2 22.1 77.4 22.1 75.0 74.2 92.2 39.6 78.2 54.8
Ours (mean, Affine) 65.1 22.5 75.7 28.4 74.5 75.0 91.8 38.0 82.3 55.1
Ours (CNN, Affine+Deform) 62.6 44.3 76.5 37.7 73.0 68.0 87.0 32.3 76.5 49.9
Ours (Trans, Affine+Deform) 68.3 23.1 74.8 49.1 75.2 74.7 92.8 39.7 84.1 56.2
Ours (mean, Affine+Deform) 66.5 19.7 77.6 34.4 75.1 74.2 92.6 39.5 82.1 56.1
Reg. only (Affine) 26.4 45.0 16.3 6.6 35.8 32.8 53.5 14.4 28.7 22.7
Reg. only (Aff+Def) 27.4 52.2 26.4 11.3 30.5 27.1 61.6 12.8 26.3 23.6
DeepAtlas [82] 16.1 72.3 18.4 14.9 1.2 10.1 57.1 0.6 14.4 12.2
UNet-LS 13.7 116.5 11.6 17.8 0.8 1.8 56.9 0.1 8.7 11.6
UAMT [85] 10.7 90.2 8.0 9.3 0.3 8.1 31.7 1.1 13.1 14.3
ICT [75] 15.9 82.3 13.8 11.9 0.3 2.7 70.5 0.8 16.4 10.6
CCT [64] 11.7 107.5 10.0 13.0 0.1 1.9 47.5 3.7 8.0 9.3
1(5%) CPS [19] 15.0 123.5 19.6 9.6 5.6 6.9 59.4 2.3 9.4 7.2
CTS [57] 26.3 96.5 44.6 4.0 11.2 5.5 60.3 9.6 54.1 21.2
MCSC [49] 34.0 53.8 50.9 13.0 17.6 54.6 64.3 5.5 43.1 23.5
Ours (CNN, Affine) 39.5 66.5 61.7 17.0 9.2 65.2 71.1 12.3 54.3 25.3
Ours (Trans, Affine) 43.2 67.5 58.5 12.5 20.2 66.6 78.9 10.3 72.9 26.5
Ours (mean, Affine) 43.4 40.8 62.5 13.3 17.9 71.0 77.0 11.4 65.4 28.7
Ours (CNN, Affine+Deform) 44.2 54.2 63.8 10.8 48.7 61.6 74.6 5.4 61.8 26.6
Ours (Trans, Affine+Deform) 45.3 46.9 62.9 9.9 56.5 65.6 70.9 0.1 72.8 24.2
Ours (mean, Affine+Deform) 47.6 38.4 65.5 9.3 61.6 70.2 72.7 0.1 73.9 27.8
Best is bold, Second Best is underlined.