
Medical Image Segmentation via Cascaded Attention Decoding

Md Mostafijur Rahman, Radu Marculescu


The University of Texas at Austin
{mostafijur.rahman, radum}@utexas.edu

Abstract

Transformers have shown great promise in medical image segmentation due to their ability to capture long-range dependencies through self-attention. However, they lack the ability to learn the local (contextual) relations among pixels. Previous works try to overcome this problem by embedding convolutional layers either in the encoder or decoder modules of transformers, thus sometimes ending up with inconsistent features. To address this issue, we propose a novel attention-based decoder, namely CASCaded Attention DEcoder (CASCADE), which leverages the multi-scale features of hierarchical vision transformers. CASCADE consists of i) an attention gate which fuses features with skip connections and ii) a convolutional attention module that enhances the long-range and local context by suppressing background information. We use a multi-stage feature and loss aggregation framework due to its faster convergence and better performance. Our experiments demonstrate that transformers with CASCADE significantly outperform state-of-the-art CNN- and transformer-based approaches, obtaining up to 5.07% and 6.16% improvements in DICE and mIoU scores, respectively. CASCADE opens new ways of designing better attention-based decoders.

1. Introduction

Medical image segmentation is one of the critical steps in pre-treatment diagnoses, treatment planning, and post-treatment assessments of various diseases. Medical image segmentation can be formulated as a dense prediction problem which performs pixel-wise classification and creates segmentation maps of lesions or organs. Convolutional neural networks (CNNs) have been widely used for medical image segmentation tasks [24, 37, 15, 22, 23, 10]. Specifically, UNet [24] has shown remarkable performance in medical image segmentation due to producing high-resolution segmentation maps by aggregating multi-stage features using skip connections. Due to the sophisticated encoder-decoder architecture of UNet, a few variants of UNet, such as UNet++ [37], UNet 3+ [15], and DC-UNet [22], have demonstrated impressive performance in medical image segmentation. Despite the satisfactory performance of CNN-based methods, they have limitations in learning the long-range dependencies among pixels due to the spatial context of the convolution operation [2]. To overcome this limitation, some works [23, 6, 10] incorporate attention modules in their architectures to enhance the feature map for better pixel-level classification of medical images. Although these attention-based methods achieve improved performance (due to capturing salient features), they still suffer from capturing insufficient long-range dependencies.

The recent progress in vision transformers [9] overcomes the above limitation in capturing long-range dependencies, particularly for medical image segmentation [3, 2, 8, 30]. Transformers rely on an attention-based network architecture; they were first introduced for sequence-to-sequence prediction in natural language processing (NLP) [28]. Transformers use self-attention to learn correlations among all the input tokens, which enables them to capture long-range dependencies. Following the success of transformers in NLP, the vision transformer [9] divides an image into non-overlapping patches which are fed into the transformer module with positional embeddings. More recently, hierarchical vision transformers, such as the Swin transformer [20] with window-based attention and the pyramid vision transformer (PVT) [31] with spatial reduction attention, have been introduced to reduce the computational costs. These hierarchical vision transformers are effective for medical image segmentation tasks [2, 8, 30]. However, the self-attention used in transformers limits their ability to learn local (contextual) relations among pixels [7, 16].

Recently, SegFormer [35], UFormer [33] and PVTv2 [32] try to overcome this limitation by embedding convolution layers in transformers. Although these architectures can partly learn the local (contextual) relations among pixels, they i) have limited discrimination ability due to embedding convolution layers directly between the fully-connected layers of the feed-forward network, and ii) do not properly aggregate the multi-stage features generated by the hierarchical encoder. Considering these issues, we introduce a novel CASCaded Attention DEcoder (CASCADE) which leverages the hierarchical representation of vision transformers.

CASCADE fuses (with skip connections) and refines features using attention gates (AGs) and convolutional attention modules (CAMs), respectively. Due to using hierarchical transformers as a backbone network and aggregating multi-stage features using attention-based convolutional modules, CASCADE captures both global and local (contextual) relationships among pixels. Our contributions are summarized as follows:

• Novel Network Architecture: We introduce a novel hierarchical cascaded attention-based decoder (CASCADE) for 2D medical image segmentation which takes advantage of the multi-stage feature representation of vision transformers while learning multi-scale and multi-resolution spatial representations. We build our decoder using a novel convolutional attention module which suppresses unnecessary information. Additionally, we incorporate skip connections with attention-gated fusion which also suppresses irrelevant regions and highlights salient features. To the best of our knowledge, we are the first to propose this type of decoder for medical image segmentation.

• Multi-stage Loss Optimization and Feature Aggregation: We aggregate and optimize multiple losses from different stages of the hierarchical decoder. Our empirical analysis shows that the multi-stage loss enables faster convergence of model accuracy and improves decoder performance. We also produce the final segmentation map by incorporating multi-resolution features, which puts more confidence on salient features.

• Versatile and Improved Performance: We empirically show that CASCADE can be used with any hierarchical vision encoder (e.g., PVT [32], TransUNet [3]) while significantly improving the performance of 2D medical image segmentation. When compared against multiple baselines, CASCADE produces new state-of-the-art (SOTA) results on ACDC, Synapse multi-organ, and Polyp segmentation benchmarks.

2. Related Work

We divide the related work into three parts, i.e., vision transformers, attention mechanisms, and medical image segmentation; these are described next.

2.1. Vision transformers

Dosovitskiy et al. [9] first introduce the vision transformer (ViT), which achieves outstanding performance due to capturing long-range dependencies among the pixels. While early vision transformers were computationally expensive, recent works have tried to further enhance ViT in several ways. Touvron et al. [27] introduce DeiT, which tries to minimize the computational cost of ViT using data-efficient training strategies. Liu et al. [20] develop the Swin transformer using a sliding window attention mechanism. In SegFormer, Xie et al. [35] introduce a Mix-FFN module for encoding better positional information and an efficient self-attention mechanism for reducing the computational costs. SegFormer is also a hierarchical transformer where image patches are merged to preserve the local continuity among patches. Wang et al. [31] propose a pyramid vision transformer (PVT) where the computational cost is reduced using a spatial reduction attention mechanism. In PVTv2, Wang et al. [32] improve the performance of PVT by incorporating a linear complexity attention layer, an overlapping patch embedding, and a convolutional feed-forward network.

Although vision transformers have shown excellent promise, their performance is limited when trained on small datasets. This limitation makes transformers difficult to train for applications like medical image segmentation, which have small amounts of data. We try to overcome this limitation by using transformer backbones pretrained on large datasets (like ImageNet); indeed, previous studies [8, 30] have found that transformer weights pretrained on other non-medical large datasets boost the performance of medical image segmentation tasks.

2.2. Attention mechanisms

Oktay et al. [23] introduce a low-cost attention gate module for U-shaped architectures to fuse features with skip connections; this helps the model focus on the relevant information in the image. Chen et al. [6] propose a reverse attention module to explore the missing detail information, which results in high-resolution and accurate outputs. Hu et al. [14] introduce a squeeze-and-excitation block using global average-pooled features to compute channel attention; this identifies the important feature maps for learning and then enhances them. Although channel attention can identify which feature map to focus on, it lacks the ability to identify where to focus. To supplement the channel attention block, Chen et al. [4] propose a spatial attention block to better focus on a feature map. Woo et al. [34] introduce a convolutional block attention module (CBAM) utilizing both channel and spatial attention to capture where and on which feature to focus in a feature map. Their experiments show that channel attention followed by spatial attention produces the best results.

Due to the additive advantage of CBAM with negligible overhead, we incorporate channel attention followed by spatial attention in our CAM. The CAM differs from CBAM in the design of the block itself and in how the blocks are used. Firstly, our CAM consists of channel attention, spatial attention, and a convolutional block, while CBAM consists of only channel attention and spatial attention.

Figure 1. PVT-CASCADE network architecture. (a) PVTv2-b2 Encoder backbone with four stages, (b) CASCADE decoder, (c) Attention
gate (AG), (d) Convolutional attention module (CAM), (e) Channel attention (CA), (f) Spatial attention (SA), (g) ConvBlock, (h) UpConv.
X1, X2, X3, and X4 are the output features of the four stages of hierarchical encoder backbones. p1, p2, p3, and p4 are output feature maps
from four stages of our decoder.

Secondly, CBAM is placed in each convolutional block of both the encoder and decoder, while the CAM module appears only in the decoder.

2.3. Medical image segmentation

Medical image segmentation is a dense prediction task that classifies the pixels of organs or lesions in a given medical image (e.g., CT, MRI, endoscopy, OCT, etc.) [3, 8]. UNet [24] and its variants [37, 15, 22, 23] are widely used in medical image segmentation tasks because of their better performance and sophisticated architecture. UNet [24] is an encoder-decoder architecture where features from the encoder are aggregated with upsampled features of the decoder using skip connections to produce high-resolution segmentation maps. Zhou et al. [37] introduce UNet++ where the encoder-decoder sub-networks are linked using nested and dense skip connections. Huang et al. [15] propose UNet 3+ utilizing full-scale skip connections including intra-connections among the decoder blocks. Lou et al. [22] introduce a dual channel UNet (DC-UNet) architecture that utilizes a multi-resolution convolution block and residual paths in the skip connections. Following the progress of computer vision, the ResNet architecture [13] has been generally adopted as the backbone for medical image segmentation. Pyramid pooling and dilated convolution [5] are also used for lesion and organ segmentation [12, 11].

Nowadays, transformer-based methods have also shown great success in medical image segmentation [3, 2, 19, 8, 30]. Chen et al. [3] propose TransUNet, which uses a hybrid CNN-transformer encoder to capture long-range dependencies and a cascaded CNN upsampler as a decoder to capture local contextual relations among pixels. In contrast, we propose a new attention-based cascaded decoder which shows a significant performance boost when used on top of the encoder. Li et al. [19] introduce TFCNs by combining a transformer and a fully convolutional DenseNet to propagate semantic features and filter out non-semantic features. Cao et al. [2] propose Swin-Unet, which is a pure transformer architecture based on the Swin transformer [20]. Swin-Unet uses transformers in both the encoder and decoder, which does not lead to a performance improvement.

Recent studies incorporate different attention mechanisms with CNN- [23, 10, 36] and transformer-based architectures [8, 30] for medical image segmentation. Fan et al. [10] adopt reverse attention [6] for polyp segmentation. Zhang et al. [36] utilize squeeze-and-excitation attention [14] for segmenting vessels in retina images.

Dong et al. [8] adopt a CBAM [34] attention block in their decoder; they use the CBAM block only with the low-level features from the first layer of the PVTv2, which limits the ability to refine all multi-stage features. In contrast, we incorporate the AG to fuse features with skip connections and use a CAM module in all of our decoder blocks.

3. Method

We first introduce the transformer backbones and our proposed CASCADE decoder. We then describe two different transformer-based architectures (TransCASCADE and PVT-CASCADE) incorporating our proposed decoder.

3.1. Transformer backbones

To ensure enough generalization and multi-scale feature processing abilities for medical image segmentation, we use the pyramid transformer, as well as the hybrid CNN-transformer (instead of only a CNN), as the encoder. Specifically, we adopt the encoder design of PVTv2 [32] (Figure 1(a)) and TransUNet [3]. PVTv2 uses the convolution operation instead of the patch embedding module of the traditional transformer to consistently capture the spatial information. TransUNet utilizes a transformer on top of a CNN to capture both global and spatial relationships among features. Our proposed decoder is flexible and easy to adopt with other hierarchical backbone networks.

3.2. CASCaded Attention DEcoder (CASCADE)

Existing transformer-based models have limited (local) contextual information processing ability among pixels. As a result, transformer-based models face difficulties in locating the more discriminative local features. To address this issue, we propose a novel attention-based cascaded multi-stage feature aggregation decoder, CASCADE, for pyramid features.

As shown in Figure 1(b), CASCADE consists of the UpConv block to upsample the features, the AG for cascaded feature fusion, and the CAM to robustly enhance the feature maps. We have four CAM blocks for the four stages of pyramid features from the encoder backbone and three AGs for the three skip connections. To aggregate the multi-scale features, we first combine the upsampled features from the previous decoder block with the features from the skip connections using the AG. Then, we concatenate the fused features with the upsampled features from the previous layer. Afterward, we process the concatenated features using our CAM module for pixel grouping and suppressing background information using both channel and spatial attention. Finally, we send the output from each CAM layer to a prediction head and aggregate the four different predictions to produce the final segmentation map.

3.2.1 Attention gate (AG)

AGs are used to progressively suppress features in irrelevant background regions by adopting a grid-attention technique where the gating signal is based on the spatial information of the image [23]. More specifically, the gating signal used to aggregate each skip connection fuses the multi-stage features, which increases the spatial resolution of the query signal. Like Attention UNet [23], we use additive attention to obtain the gating coefficient because of its better performance compared to multiplicative attention. The additive attention gate AG(·) is given in Equations 1 and 2:

q_{att}(g, x) = \sigma_1(BN(C_g(g)) + BN(C_x(x)))    (1)

AG(g, x) = x * \sigma_2(BN(C(q_{att}(g, x))))    (2)

where σ1(·) and σ2(·) correspond to the ReLU and Sigmoid activation functions, respectively. Cg(·), Cx(·), and C(·) represent channel-wise 1×1 convolution operations. BN(·) is the batch normalization operation. g and x are the upsampled and skip connection features, respectively.
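To make Equations 1 and 2 concrete, the following is a minimal PyTorch sketch of an additive attention gate. The class name AttentionGate, its constructor arguments, and the single-channel gating coefficient are illustrative assumptions rather than the authors' released code, and g is assumed to be already upsampled to the spatial size of x.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate (Eqs. 1-2): gate the skip features x with the upsampled features g."""
    def __init__(self, in_g, in_x, inter_channels):
        super().__init__()
        # C_g and C_x: channel-wise 1x1 convolutions, each followed by batch normalization
        self.W_g = nn.Sequential(nn.Conv2d(in_g, inter_channels, kernel_size=1),
                                 nn.BatchNorm2d(inter_channels))
        self.W_x = nn.Sequential(nn.Conv2d(in_x, inter_channels, kernel_size=1),
                                 nn.BatchNorm2d(inter_channels))
        # C: 1x1 convolution producing a single-channel gating coefficient (an assumption here)
        self.psi = nn.Sequential(nn.Conv2d(inter_channels, 1, kernel_size=1),
                                 nn.BatchNorm2d(1),
                                 nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g, x):
        # q_att(g, x) = ReLU(BN(C_g(g)) + BN(C_x(x)))   (Eq. 1)
        q_att = self.relu(self.W_g(g) + self.W_x(x))
        # AG(g, x) = x * Sigmoid(BN(C(q_att(g, x))))     (Eq. 2)
        return x * self.psi(q_att)
```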

3.2.2 Convolutional attention module (CAM)

We use convolutional attention modules to refine the feature maps. CAM consists of a channel attention [14] (CA(·)), a spatial attention [4] (SA(·)), and a convolutional block (ConvBlock), as in Equation 3:

CAM(x) = ConvBlock(SA(CA(x)))    (3)

where x is the input tensor and CAM(·) represents the convolutional attention module.

Channel Attention (CA): Channel attention identifies which feature maps to focus on (and then refines them). The channel attention CA(·) is defined in Equation 4:

CA(x) = \sigma_2(C_2(\sigma_1(C_1(P_m(x)))) + C_2(\sigma_1(C_1(P_a(x))))) \circledast x    (4)

where σ2(·) is the Sigmoid activation. Pm(·) and Pa(·) denote adaptive maximum pooling and adaptive average pooling, respectively. C1(·) is a convolutional layer with a 1 × 1 kernel size that reduces the channel dimension 16 times. σ1 is a ReLU activation layer, and C2(·) is another convolutional layer that recovers the original channel dimension. ⊛ is the Hadamard product.

Spatial Attention (SA): Spatial attention determines where to focus in a feature map and then enhances those features. The spatial attention SA(·) is given in Equation 5:

SA(x) = \sigma(C(C_m(x) + C_a(x))) \circledast x    (5)

where σ(·) is a Sigmoid activation function. Cm(·) and Ca(·) represent the maximum and average values obtained along the channel dimension, respectively. C(·) is a 7 × 7 convolutional layer with padding 3 to enhance spatial contextual information (as in [8]).

ConvBlock: The ConvBlock is used to further enhance the features generated using our CA and SA operations. ConvBlock consists of two 3 × 3 convolution layers, each followed by a batch normalization layer and a ReLU activation layer. ConvBlock(·) is formulated in Equation 6:

ConvBlock(x) = \sigma(BN(C(\sigma(BN(C(x))))))    (6)

where σ is the ReLU activation layer, BN(·) represents batch normalization, and C(·) is a 3 × 3 convolution layer.

3.2.3 UpConv

UpConv progressively upsamples the features of the current layer to match the dimension of the next skip connection. Each UpConv layer consists of an upsampling operation UP(·) with scale factor 2, a 3 × 3 convolution Conv(·), a batch normalization BN(·), and a ReLU activation layer. UpConv(·) can be formulated as Equation 7:

UpConv(x) = ReLU(BN(Conv(UP(x))))    (7)
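Below is a minimal PyTorch sketch of the building blocks in Equations 3-7 (channel attention, spatial attention, ConvBlock, UpConv, and their composition into CAM). The class and function names, the shared bottleneck in the channel attention, and the bilinear upsampling mode are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA (Eq. 4): squeeze with adaptive max/avg pooling, excite with a shared 1x1 conv bottleneck."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # P_m
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # P_a
        self.mlp = nn.Sequential(                  # C_1 -> ReLU -> C_2, shared by both pooled paths
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1))

    def forward(self, x):
        att = torch.sigmoid(self.mlp(self.max_pool(x)) + self.mlp(self.avg_pool(x)))
        return att * x                             # Hadamard product with the input

class SpatialAttention(nn.Module):
    """SA (Eq. 5): combine per-pixel channel max and mean, then a 7x7 conv and Sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        c_max = x.max(dim=1, keepdim=True).values  # C_m: max along the channel dimension
        c_avg = x.mean(dim=1, keepdim=True)        # C_a: mean along the channel dimension
        att = torch.sigmoid(self.conv(c_max + c_avg))
        return att * x

def conv_block(channels):
    """ConvBlock (Eq. 6): two 3x3 conv -> BN -> ReLU stages."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

def up_conv(in_ch, out_ch):
    """UpConv (Eq. 7): 2x upsampling followed by 3x3 conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class CAM(nn.Module):
    """CAM (Eq. 3): channel attention -> spatial attention -> ConvBlock."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.conv = conv_block(channels)

    def forward(self, x):
        return self.conv(self.sa(self.ca(x)))
```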
3.3. Multi-stage loss and feature aggregation

We use four prediction heads for the four stages of the hierarchical encoders. We compute the final prediction map using additive aggregation as in Equation 8:

output = w \times p1 + x \times p2 + y \times p3 + z \times p4    (8)

where p1, p2, p3, and p4 are the feature maps of the four prediction heads, and w, x, y, and z are the weights for the individual prediction heads. In our experiments, we set all of w, x, y, and z to 1.0. We get the final prediction output by applying the Sigmoid activation for binary segmentation and the Softmax activation for multi-class segmentation.

However, we compute the loss for each prediction head separately and then aggregate them using Equation 9:

loss = \alpha \times loss_{p1} + \beta \times loss_{p2} + \gamma \times loss_{p3} + \zeta \times loss_{p4}    (9)

where loss_{p1}, loss_{p2}, loss_{p3}, and loss_{p4} are the losses for the four different prediction heads, and α, β, γ, and ζ are the weights for the losses of the individual prediction heads. In our experiments, we set all of α, β, γ, and ζ to 1.0.
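A minimal sketch of the aggregation in Equations 8 and 9, assuming the four prediction maps have already been resized to a common resolution and that any per-head segmentation loss is supplied as criterion; the helper names are illustrative, not the authors' code.

```python
def aggregate_outputs(p1, p2, p3, p4, w=1.0, x=1.0, y=1.0, z=1.0):
    """Eq. 8: weighted additive fusion of the four prediction maps (all weights are 1.0 in the paper)."""
    return w * p1 + x * p2 + y * p3 + z * p4

def aggregate_losses(criterion, preds, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. 9: compute the loss per prediction head, then sum the losses with per-head weights."""
    return sum(wt * criterion(p, target) for wt, p in zip(weights, preds))

# Example usage (shapes are illustrative):
#   p1..p4: (B, num_classes, H, W) logits from the four decoder stages, upsampled to H x W
#   final_logits = aggregate_outputs(p1, p2, p3, p4)
#   loss = aggregate_losses(loss_fn, [p1, p2, p3, p4], ground_truth)
```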
3.4. Overall architecture

We utilize two different hierarchical backbone encoder networks, namely PVTv2 [32] and TransUNet [3], for our experiments. In the case of TransUNet, we only use their hybrid CNN-transformer backbone encoder network. By utilizing the PVTv2-b2 (Standard) encoder, we create the PVT-CASCADE architecture. To adopt PVTv2-b2, we first extract the features (X1, X2, X3, and X4) from its four layers and feed them (i.e., X4 in the upsample path and X3, X2, X1 in the skip connections) into our CASCADE decoder, as shown in Figure 1(a-b). Our CASCADE decoder then processes them and produces four prediction feature maps for the four stages of the encoder network. Afterward, we aggregate the prediction feature maps using Equation 8 to produce the final prediction feature map. Finally, we apply the Sigmoid activation for binary segmentation and Softmax for multi-class segmentation tasks. Besides, we introduce the TransCASCADE architecture by adopting the backbone encoder network of TransUNet; we follow similar steps in our TransCASCADE architecture. These two architectures achieve SOTA performance on the Synapse multi-organ, ACDC, and several polyp segmentation benchmarks. Details are given in the experimental section.
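The following sketch shows one way the encoder features X1-X4 could be wired through the decoder components sketched in Section 3.2 (it reuses the AttentionGate, CAM, and up_conv examples above). The channel sizes, module attributes, and prediction heads are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeDecoder(nn.Module):
    """Illustrative wiring of UpConv, AG, and CAM over four encoder stages (X4 is the deepest)."""
    def __init__(self, channels, num_classes):
        # channels = [c1, c2, c3, c4] of encoder stages X1..X4 (e.g. [64, 128, 320, 512] for PVTv2-b2)
        super().__init__()
        c1, c2, c3, c4 = channels
        self.cam4 = CAM(c4)
        self.up3, self.ag3, self.cam3 = up_conv(c4, c3), AttentionGate(c3, c3, c3), CAM(2 * c3)
        self.up2, self.ag2, self.cam2 = up_conv(2 * c3, c2), AttentionGate(c2, c2, c2), CAM(2 * c2)
        self.up1, self.ag1, self.cam1 = up_conv(2 * c2, c1), AttentionGate(c1, c1, c1), CAM(2 * c1)
        # one 1x1 prediction head per decoder stage
        self.heads = nn.ModuleList([nn.Conv2d(c, num_classes, 1) for c in (c4, 2 * c3, 2 * c2, 2 * c1)])

    def forward(self, x1, x2, x3, x4):
        d4 = self.cam4(x4)                                          # deepest stage: refine with CAM only
        u3 = self.up3(d4)                                           # upsample to the resolution of X3
        d3 = self.cam3(torch.cat([self.ag3(u3, x3), u3], dim=1))    # AG-fused skip + upsampled, then CAM
        u2 = self.up2(d3)
        d2 = self.cam2(torch.cat([self.ag2(u2, x2), u2], dim=1))
        u1 = self.up1(d2)
        d1 = self.cam1(torch.cat([self.ag1(u1, x1), u1], dim=1))
        p4, p3, p2, p1 = [head(d) for head, d in zip(self.heads, (d4, d3, d2, d1))]
        # bring every prediction to the resolution of p1 before additive aggregation (Eq. 8)
        p4 = F.interpolate(p4, size=p1.shape[2:], mode='bilinear', align_corners=False)
        p3 = F.interpolate(p3, size=p1.shape[2:], mode='bilinear', align_corners=False)
        p2 = F.interpolate(p2, size=p1.shape[2:], mode='bilinear', align_corners=False)
        return p1, p2, p3, p4
```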
4. Experiments

In this section, we first compare the results of our proposed CASCADE decoder with SOTA methods to demonstrate the superiority of our proposed method. Then, we carry out ablation studies to evaluate the effectiveness of our CASCADE decoder.

4.1. Datasets and evaluation metrics

Synapse multi-organ dataset. The Synapse multi-organ dataset (https://www.synapse.org/#!Synapse:syn3193805/wiki/217789) has 30 abdominal CT scans with 3779 axial contrast-enhanced abdominal CT images. Each CT scan consists of 85-198 slices of 512 × 512 pixels, with a voxel spatial resolution of ([0.54-0.54] × [0.98-0.98] × [2.5-5.0]) mm³. Following TransUNet [3], we divide the dataset randomly into 18 scans for training (2212 axial slices) and 12 for validation. We segment 8 anatomical structures: aorta, gallbladder (GB), left kidney (KL), right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM).

ACDC dataset. The ACDC dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/) consists of 100 cardiac MRI scans collected from different patients. Each scan contains three organs: right ventricle (RV), left ventricle (LV), and myocardium (Myo). Following TransUNet [3], we use 70 cases (1930 axial slices) for training, 10 for validation, and 20 for testing.

Polyp datasets. CVC-ClinicDB [1] contains 612 images, which are extracted from 31 colonoscopy videos. Kvasir includes 1,000 polyp images, which are collected from the polyp class in the Kvasir-SEG dataset [17]. Following the settings in PraNet [10], we adopt the same 900 and 548 images from the CVC-ClinicDB and Kvasir datasets as the training set, and the remaining 64 and 100 images are employed as the respective test sets. To evaluate the generalization performance, we test the model on three unseen datasets, namely EndoScene [29], ColonDB [26], and ETIS-LaribDB [25]. These three test sets are collected from different medical centers; in other words, the data from these three sources are not used to train our model. EndoScene, ColonDB, and ETIS-LaribDB contain 60, 380, and 196 images, respectively.

Evaluation metrics. We use DICE, mean intersection over union (mIoU), 95% Hausdorff Distance (HD95), and Average Surface Distance (ASD) as the evaluation metrics in our experiments on the Synapse multi-organ dataset. Following existing methods, we use only DICE scores for the ACDC dataset. For the experiments on polyp segmentation, we use DICE and mIoU as the evaluation metrics.
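For reference, a generic sketch of the DICE and IoU computations for a pair of binary masks; this is a standard formulation under simple assumptions (binary numpy masks, no thresholding or per-class aggregation logic), not the exact evaluation scripts used for the benchmarks above.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """DICE = 2|A ∩ B| / (|A| + |B|) for a pair of binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred, target, eps=1e-7):
    """IoU = |A ∩ B| / |A ∪ B|; mIoU averages this over classes or images."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)
```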

Architectures | Average: DICE↑ HD95↓ mIoU↑ ASD↓ | DICE per organ: Aorta GB KL KR Liver PC SP SM
UNet [24] 70.11 44.69 59.39 14.41 84.00 56.70 72.41 62.64 86.98 48.73 81.48 67.96
AttnUNet [23] 71.70 34.47 61.38 10.00 82.61 61.94 76.07 70.42 87.54 46.70 80.67 67.66
R50+UNet [3] 74.68 36.87 − − 84.18 62.84 79.19 71.29 93.35 48.23 84.41 73.92
R50+AttnUNet [3] 75.57 36.97 − − 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
SSFormerPVT [30] 78.01 25.72 67.23 4.56 82.78 63.74 80.72 78.11 93.53 61.53 87.07 76.61
PolypPVT [8] 78.08 25.61 67.43 4.89 82.34 66.14 81.21 73.78 94.37 59.34 88.05 79.4
TFCNs [19] 75.63 30.63 64.69 5.29 88.23 59.18 80.99 73.12 92.02 54.24 88.36 68.9
TransUNet [3] 77.61 26.9 67.32 4.66 86.56 60.43 80.54 78.53 94.33 58.47 87.06 75
SwinUNet [2] 77.58 27.32 66.88 4.7 81.76 65.95 82.32 79.22 93.73 53.81 88.04 75.79
PVT-CASCADE (Ours) 81.06 20.23 70.88 3.61 83.01 70.59 82.23 80.37 94.08 64.43 90.1 83.69
TransCASCADE (Ours) 82.68 17.34 73.48 2.83 86.63 68.48 87.66 84.56 94.43 65.33 90.79 83.52
Improve TransUNet 5.07 9.56 6.16 1.83 0.07 8.05 7.12 6.03 0.1 6.86 3.73 8.52

Table 1. Results of Synapse multi-organ segmentation. Only DICE scores are reported for individual organs. R50+UNet and
R50+AttnUNet adopt a pre-trained ResNet50 backbone network. We reproduce the results of UNet, AttnUNet, SSFormerPVT, Polyp-
PVT, TFCNs, TransUNet, and SwinUNet with the default experimental settings of TransUNet. ↑ denotes higher the better, ↓ denotes lower
the better. All CASCADE results are averaged over five runs. The best results are in bold.

Architectures | Avg DICE | RV | Myo | LV
R50+UNet [3] 87.55 87.10 80.63 94.92
R50+AttnUNet [3] 86.75 87.58 79.20 93.47
ViT+CUP [3] 81.45 81.46 70.71 92.18
R50+ViT+CUP [3] 87.57 86.07 81.88 94.75
TransUNet [3] 89.71 86.67 87.27 95.18
SwinUNet [2] 88.07 85.77 84.42 94.03
PVT-CASCADE (Ours) 91.46 88.9 89.97 95.50
TransCASCADE (Ours) 91.63 89.14 90.25 95.50

Table 2. Results on the ACDC dataset. DICE scores are reported for individual organs. We reproduce the results of SwinUNet. All CASCADE results are averaged over five runs.

4.2. Implementation details

All our experiments are implemented in PyTorch 1.11.0. We train all models on a single NVIDIA RTX A6000 GPU with 48GB of memory. We utilize weights pre-trained on ImageNet for the backbone networks. We use the AdamW optimizer [21] with a learning rate and weight decay of 1e-4.

Synapse multi-organ dataset. Following TransUNet [3], we use a batch size of 24 and train each model for a maximum of 150 epochs. We use an input resolution of 224×224 and a patch size P of 16. We employ random flipping and rotation for data augmentation. The combined cross-entropy and DICE loss is used as the loss function.

ACDC dataset. For the ACDC dataset, we train each model for a maximum of 150 epochs with a batch size of 12. We set the input resolution and patch size P to 224×224 and 16, respectively. Random flipping and rotation are applied for data augmentation. We use the combined cross-entropy and DICE loss function.

Polyp datasets. Following Polyp-PVT [8], we use a batch size of 16 and train each model for a maximum of 100 epochs. We resize the images to 352 × 352 and use the same multi-scale {0.75, 1.0, 1.25} training strategy with a gradient clip limit of 0.5 as Polyp-PVT. We use the combined weighted IoU and weighted BCE loss function.
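A hedged sketch of a training setup consistent with the hyperparameters listed above (AdamW with a learning rate and weight decay of 1e-4, and a combined cross-entropy and DICE loss for the multi-class experiments). The CombinedCEDiceLoss class, its equal 0.5/0.5 weighting, and the model placeholder are illustrative assumptions, not the authors' released training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedCEDiceLoss(nn.Module):
    """Cross-entropy plus soft DICE; the 0.5/0.5 weighting is an assumption."""
    def __init__(self, weight_ce=0.5, weight_dice=0.5, eps=1e-7):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.weight_ce, self.weight_dice, self.eps = weight_ce, weight_dice, eps

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels
        ce = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2 * inter + self.eps) / (denom + self.eps)).mean()
        return self.weight_ce * ce + self.weight_dice * dice

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# criterion = CombinedCEDiceLoss()
```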
4.3. Results

We compare our architectures (i.e., PVT-CASCADE and TransCASCADE) with SOTA CNN- and transformer-based segmentation methods on the Synapse multi-organ, ACDC, and Polyp (i.e., EndoScene [29], CVC-ClinicDB [1], Kvasir [17], ColonDB [26], ETIS-LaribDB [25]) datasets. More results are available in the supplementary materials.

4.3.1 Experimental results on Synapse dataset

We demonstrate the performance of different CNN- and transformer-based methods in Table 1. As shown in Table 1, transformer-based models have superior performance compared to CNN-based models.

Architectures | EndoScene: DICE mIoU | CVC-ClinicDB: DICE mIoU | Kvasir: DICE mIoU | ColonDB: DICE mIoU | ETIS-LaribDB: DICE mIoU
UNet [24] 71.0 62.7 82.3 75.5 81.8 74.6 51.2 44.4 39.8 33.5
UNet++ [37] 70.7 62.4 79.4 72.9 82.1 74.3 48.3 41.0 40.1 34.4
PraNet [10] 87.1 79.7 89.9 84.9 89.8 84.0 71.2 64.0 62.8 56.7
UACANet-L [18] 88.21 80.84 91.07 86.7 90.83 85.95 72.57 65.41 63.89 56.87
SSFormerPVT [30] 89.46 82.68 92.88 88.27 91.11 86.01 79.34 70.63 78.03 70.1
PolypPVT [8] 88.71 81.89 93.08 88.28 91.23 86.3 80.75 71.85 78.67 70.97
PVT-CASCADE (Ours) 90.47 83.79 94.34 89.98 92.58 87.76 82.54 74.53 80.07 72.58
Improve SSFormerPVT 1.01 1.11 1.46 1.71 1.47 1.75 3.2 3.9 2.04 2.48
Improve PolypPVT 1.76 1.9 1.26 1.7 1.35 1.46 1.79 2.68 1.4 1.61

Table 3. Results on polyp segmentation datasets. Training on combined Kvasir [17] and CVC-ClinicDB [1] trainset. The results of UNet,
UNet++ and PraNet are taken from [10]. We reproduce the results of PolypPVT, SSFormerPVT, and UACANet using their public source
code with default settings. All PVT-CASCADE results are averaged over five runs. The best results are in bold.

Components (Cascaded, AG, CAM) | EndoScene: DICE mIoU | CVC-ClinicDB: DICE mIoU | Kvasir: DICE mIoU | ColonDB: DICE mIoU | ETIS-LaribDB: DICE mIoU
No No No 88.41 81.47 91.82 87.12 91.09 86.13 77.86 69.43 77.04 68.47
Yes No No 89.11 82.32 93.54 88.95 91.98 87.05 81.30 73.21 78.16 69.97
Yes Yes No 89.25 82.57 93.61 89.04 92.45 87.57 81.72 73.67 79.27 71.38
Yes No Yes 89.39 82.79 93.88 89.31 92.20 87.28 82.11 74.09 79.57 71.73
Yes Yes Yes 90.47 83.79 94.34 89.98 92.58 87.76 82.54 74.53 80.07 72.58

Table 4. Quantitative results of different components of CASCADE with PVTv2-b2 backbone. Training on combined Kvasir and CVC-
ClinicDB trainset and testing on five testsets (i.e., Endoscene, CVC-ClinicDB, Kvasir, ColonDB, ETIS-LaribDB). All results are averaged
over five runs. The best results are in bold.

Our proposed CASCADE decoder improves the average DICE, mIoU, and HD95 scores of TransUNet by 5.07%, 6.16%, and 9.56, respectively. TransCASCADE achieves the best average DICE (82.68%), mIoU (73.48%), HD95 (17.34), and ASD (2.83) scores among all other methods. Moreover, TransCASCADE demonstrates significant performance improvements in both small and large organ segmentation: for small organs, 8.05%, 7.12%, and 6.03% improvements in gallbladder, left kidney, and right kidney, respectively; for large organs, 8.52%, 6.86%, and 3.73% improvements in stomach, pancreas, and spleen, respectively. This is because CASCADE captures both long-range dependencies and local contextual relations among pixels. Due to using attention, CASCADE better refines the feature maps and produces stronger feature representations than other decoders. The lower HD95 scores indicate that our CASCADE decoder can better locate the boundaries of organs.

4.3.2 Experimental results on ACDC dataset

We evaluate the performance of our method on the MRI images of the ACDC dataset. Table 2 presents the average DICE scores of our PVT-CASCADE and TransCASCADE along with other SOTA methods. Our TransCASCADE achieves the highest average DICE score of 91.63%, improving about 2% over TransUNet even though we share the same encoder. Our PVT-CASCADE gains a 91.46% DICE score, which is also better than all other methods. Besides, our TransCASCADE has an improvement of 2.5 - 3% DICE score on the challenging RV and Myo organ segmentation.

4.3.3 Experimental results on Polyp datasets

We evaluate the performance and generalizability of our CASCADE decoder on five different polyp segmentation test sets, among which three are completely unseen datasets collected from different labs. Table 3 displays the DICE and mIoU scores of SOTA methods along with our CASCADE decoder. From Table 3, we can see that CASCADE significantly outperforms all other methods, achieving 2.04 - 3.2% and 2.5 - 3.9% improvements in DICE and mIoU scores on unseen test sets over the previous best model using the same pre-trained transformer backbone. It is noteworthy that CASCADE outperforms the best CNN-based model, UACANet, by a large margin on unseen datasets (i.e., 16.2% and 10% DICE score improvements on ETIS-LaribDB and ColonDB, respectively). Therefore, we can conclude that, due to using transformers as a backbone network and our attention-based CASCADE decoder, PVT-CASCADE inherits the merits of transformers, CNNs, and attention, which makes it highly generalizable to unseen datasets.

Figure 2. CASCADE vs. Cascaded Upsampler (CUP) features. The first and second rows present CASCADE and CUP features, respectively. For a fair comparison, we show only the CASCADE features from the corresponding layers. Layers are numbered based on their corresponding transformer layer number. In both cases, we use the ImageNet-pretrained PVTv2-b2 backbone as the encoder.

[Figure 3: plot of average DICE score (%) vs. number of epochs for the six loss and output aggregation settings]

Figure 3. Multi-stage loss and output aggregation vs. single loss and output. We plot the average DICE scores of five test sets (i.e., EndoScene, CVC-ClinicDB, Kvasir, ColonDB, ETIS-LaribDB) vs. # epochs in six different loss and output aggregation settings.
4.4. Ablation study
Effective enhancement/refinement of features. We visualize the features of our CASCADE, as well as the Cascaded Upsampler (CUP) [3], in Figure 2. We compute the average of all channels in the feature map and then produce the heatmap using OpenCV-Python. It is evident from Figure 2 that the attention mechanism used in our CASCADE helps identify, enhance, and group the pixels better than CUP.
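The channel-averaged heatmaps described above can be produced with a few lines of OpenCV-Python; this is a generic sketch of that procedure (the function name and min-max normalization are assumptions), not the authors' exact visualization script.

```python
import cv2
import numpy as np

def feature_heatmap(feature_map, out_path="heatmap.png"):
    """Average a (C, H, W) feature map over channels and save it as a color heatmap."""
    fmap = feature_map.mean(axis=0)                                   # average all channels
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-7)     # normalize to [0, 1]
    heat = cv2.applyColorMap((fmap * 255).astype(np.uint8), cv2.COLORMAP_JET)
    cv2.imwrite(out_path, heat)
    return heat
```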
Effectiveness of different parts of CASCADE. We carry out ablation studies on the Polyp datasets to evaluate the effectiveness of the different components of our proposed CASCADE decoder. We use the same PVTv2-b2 backbone pre-trained on ImageNet and the same experimental settings for the polyp datasets in all experiments. We remove different modules, such as the AGs and CAM, from the CASCADE decoder and compare the results. It is evident from Table 4 that the cascaded structure of the decoder improves performance over the non-cascaded decoder. The AG and CAM modules also help improve performance. However, the use of both AG and CAM modules produces the best performance on all test datasets.

Faster learning of multi-stage loss and output fusions. We add the loss and output from the four stages of our CASCADE decoder to get the overall loss and final segmentation map. Figure 3 plots the average DICE score across the five datasets for each epoch. The graph contains six different loss and output aggregation settings: "1-loss, 1-output", "1-loss, 4-output", "4-loss, 1-output", "4-loss (add), 4-output", "4-loss (avg), 4-output", and "1-loss, 1-output with learning rate 4e-4". It is evident from the graph that "4-loss (add), 4-output" and "4-loss (avg), 4-output" achieve 74 - 75% DICE scores in the first epoch, and these settings gain more than 82% DICE score within 5 epochs. On the other hand, the other loss and output aggregations have a DICE score of around 35 - 53% in the first epoch, and these settings achieve a 71% DICE score within 5 epochs. We can also see from the graph that "4-loss (add), 4-output" shows the best performance, achieving an 84.67% average DICE score. Therefore, we can conclude that the aggregation of multi-stage loss and output leverages multi-scale features that help produce accurate and high-resolution segmentation outputs.

5. Conclusion

In this paper, we have proposed a novel attention-based decoder for hierarchical feature aggregation, which has robust generalization and learning ability; these are crucial for medical image segmentation. We believe that CASCADE has great potential to improve deep learning performance in other medical image segmentation tasks. Moreover, experiments demonstrate that CASCADE effectively enhances transformer features and incorporates spatial relationships among pixels (e.g., it improves the baseline TransUNet by 5.07% DICE and 6.16% mIoU in Synapse multi-organ segmentation). Experimental results also demonstrate that CASCADE can locate organs or lesions well (e.g., it improves the HD95 score by 9.56) due to using attention in the decoding process. Therefore, our decoder can be further used to enhance transformer features for general computer vision and highly generalizable medical applications.

Acknowledgment

This work is supported, in part, by NSF grant CNS 2007284.
References

[1] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics, 43:99–111, 2015.
[2] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021.
[3] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[4] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5659–5667, 2017.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[6] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018.
[7] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
[8] Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, and Ling Shao. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 263–273. Springer, 2020.
[11] Shuanglang Feng, Heming Zhao, Fei Shi, Xuena Cheng, Meng Wang, Yuhui Ma, Dehui Xiang, Weifang Zhu, and Xinjian Chen. Cpfnet: Context pyramid fusion network for medical image segmentation. IEEE Transactions on Medical Imaging, 39(10):3008–3018, 2020.
[12] Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Transactions on Medical Imaging, 38(10):2281–2292, 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[15] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, and Jian Wu. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1055–1059. IEEE, 2020.
[16] Md Amirul Islam, Sen Jia, and Neil DB Bruce. How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248, 2020.
[17] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. Kvasir-seg: A segmented polyp dataset. In International Conference on Multimedia Modeling, pages 451–462. Springer, 2020.
[18] Taehun Kim, Hyemin Lee, and Daijin Kim. Uacanet: Uncertainty augmented context attention for polyp segmentation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2167–2175, 2021.
[19] Zihan Li, Dihan Li, Cangbai Xu, Weice Wang, Qingqi Hong, Qingde Li, and Jie Tian. Tfcns: A cnn-transformer hybrid network for medical image segmentation. arXiv preprint arXiv:2207.03450, 2022.
[20] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[22] Ange Lou, Shuyue Guan, and Murray Loew. Dc-unet: Rethinking the u-net architecture with dual channel efficient cnn for medical image segmentation. In Medical Imaging 2021: Image Processing, volume 11596, pages 758–768. SPIE, 2021.
[23] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[25] Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery, 9(2):283–293, 2014.
[26] Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang. Automated polyp detection in colonoscopy videos using shape and context information. IEEE Transactions on Medical Imaging, 35(2):630–644, 2015.
[27] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[29] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of Healthcare Engineering, 2017, 2017.
[30] Jinfeng Wang, Qiming Huang, Feilong Tang, Jia Meng, Jionglong Su, and Sifan Song. Stepwise feature fusion: Local guides global. arXiv preprint arXiv:2203.03635, 2022.
[31] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
[32] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
[33] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17683–17693, 2022.
[34] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[35] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
[36] Zhijie Zhang, Huazhu Fu, Hang Dai, Jianbing Shen, Yanwei Pang, and Ling Shao. Et-net: A generic edge-attention guidance network for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 442–450. Springer, 2019.
[37] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. arXiv preprint arXiv:1807.10165, 2018.

