EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

Abstract

An efficient and effective decoding mechanism is crucial in medical image segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, we introduce EMCAD, a new efficient multi-scale convolutional attention decoder, designed to optimize both performance and computational efficiency. EMCAD leverages a unique multi-scale depth-wise convolution block, significantly enhancing feature maps through multi-scale convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) gated attention mechanisms, which are highly effective at capturing intricate spatial relationships while focusing on salient regions. By employing group and depth-wise convolutions, EMCAD is very efficient and scales well (e.g., only 1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). Our rigorous evaluations across 12 datasets that belong to six medical image segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) performance with 79.4% and 80.3% reductions in #Params and #FLOPs, respectively. Moreover, EMCAD's adaptability to different encoders and versatility across segmentation tasks further establish EMCAD as a promising tool, advancing the field towards more efficient and accurate medical image analysis. Our implementation is available at https://github.com/SLDGroup/EMCAD.

1. Introduction

In the realm of medical diagnostics and therapeutic strategies, automated segmentation of medical images is vital, as it classifies pixels to identify critical regions such as lesions, tumors, or entire organs. A variety of U-shaped convolutional neural network (CNN) architectures [20, 24, 37, 41, 44, 62], notably UNet [44], UNet++ [62], UNet 3+ [24], and nnU-Net [19], have become standard techniques for this purpose, achieving high-quality, high-resolution segmentation output. Attention mechanisms [12, 17, 20, 41, 57] have also been integrated into these models to enhance feature maps and improve pixel-level classification. Although attention-based models have shown improved performance, they still face significant challenges due to the computationally expensive convolutional blocks that are typically used in conjunction with attention mechanisms.

Recently, vision transformers [18] have shown promise in medical image segmentation tasks [5, 8, 17, 42, 43, 52, 54, 61] by capturing long-range dependencies among pixels through self-attention (SA) mechanisms. Hierarchical vision transformers like Swin [34], PVT [55, 56], MaxViT [49], MERIT [43], ConvFormer [33], and MetaFormer [59] have been introduced to further improve performance in this field. While SA excels at capturing global information, it is less adept at understanding the local spatial context [13, 28]. To address this limitation, some approaches have integrated local convolutional attention within the decoders to better grasp spatial details. Nevertheless, these methods can still be computationally demanding because they frequently employ costly convolutional blocks. This limits their applicability to real-world scenarios where computational resources are restricted.

To address the aforementioned limitations, we introduce EMCAD, an efficient multi-scale convolutional attention decoder built around a new multi-scale depth-wise convolution block. More precisely, EMCAD enhances the feature maps via efficient multi-scale convolutions, while incorporating complex spatial relationships and local attention through the use of channel, spatial, and grouped (large-kernel) gated attention mechanisms. Our contributions are as follows:
• New Efficient Multi-scale Convolutional Decoder: We introduce an efficient multi-scale cascaded fully-convolutional attention decoder (EMCAD) for 2D medical image segmentation; it takes the multi-stage features of vision encoders and progressively enhances their multi-scale and multi-resolution spatial representations. EMCAD has only 0.506M parameters and 0.11G FLOPs for a tiny encoder with #channels = [32, 64, 160, 256],
while it has 1.91M parameters and 0.381G FLOPs for a standard encoder with #channels = [64, 128, 320, 512].
• Efficient Multi-scale Convolutional Attention Module: We introduce MSCAM, a new efficient multi-scale convolutional attention module that performs depth-wise convolutions at multiple scales; this refines the feature maps produced by vision encoders and enables capturing multi-scale salient features by suppressing irrelevant regions. The use of depth-wise convolutions makes MSCAM very efficient.
• Large-kernel Grouped Attention Gate: We introduce a new grouped attention gate to fuse refined features with the features from skip connections. By using larger-kernel (3 × 3) group convolutions instead of point-wise convolutions in the design, we capture salient features in a larger local context with less computation.
• Improved Performance: We empirically show that EMCAD can be used with any hierarchical vision encoder (e.g., PVTv2-B0, PVTv2-B2 [56]), while significantly improving the performance of 2D medical image segmentation. EMCAD produces better results than SOTA methods with a significantly lower computational cost (as shown in Figure 1) on 12 medical image segmentation benchmarks that belong to six different tasks.

Figure 1. Average DICE scores vs. #FLOPs for different methods over 10 binary medical image segmentation datasets. As shown, our approaches (PVT-EMCAD-B0 and PVT-EMCAD-B2) have the lowest #FLOPs, yet the highest DICE scores.

The remainder of this paper is organized as follows: Section 2 summarizes related work. Section 3 describes the proposed method. Section 4 explains our experimental setup and results on 12 medical image segmentation benchmarks. Section 5 covers different ablation experiments. Lastly, Section 6 concludes the paper.

2. Related Work

2.1. Vision encoders

Convolutional Neural Networks (CNNs) [21–23, 32, 35, 45–48] have been foundational as encoders due to their proficiency in handling spatial relationships in images. More precisely, AlexNet [32] and VGG [46] pave the way, leveraging deep layers of convolutions to extract features progressively. GoogleNet [47] introduces the inception module, allowing more efficient computation of representations across various scales. ResNet [21] introduces residual connections, enabling the training of networks with substantially more layers by addressing the vanishing gradients problem. MobileNets [22, 45] bring CNNs to mobile devices through lightweight, depth-wise separable convolutions. EfficientNet [48] introduces a scalable architectural design to CNNs with compound scaling. Although CNNs are pivotal for many vision applications, they generally lack the ability to capture long-range dependencies within images due to their inherent local receptive fields.

Recently, Vision Transformers (ViTs), pioneered by Dosovitskiy et al. [18], enabled the learning of long-range relationships among pixels using self-attention (SA). Since then, ViTs have been enhanced by integrating CNN features [49, 56], developing novel SA blocks [34, 49], and introducing new architectural designs [55, 58]. The Swin Transformer [34] incorporates a sliding window attention mechanism, while SegFormer [58] leverages Mix-FFN blocks for hierarchical structures. PVT [55] uses spatial reduction attention, refined in PVTv2 [56] with overlapping patch embedding and a linear-complexity attention layer. MaxViT [49] introduces a multi-axis self-attention to form a hierarchical CNN-transformer encoder. Although ViTs address the CNNs' limitation in capturing long-range pixel dependencies [21–23, 32, 35, 45–48], they face challenges in capturing the local spatial relationships among pixels. In this paper, we aim to overcome these limitations by introducing a new multi-scale cascaded attention decoder that refines feature maps and incorporates local attention using a multi-scale convolutional attention module.

2.2. Medical image segmentation

Medical image segmentation involves pixel-wise classification to identify various anatomical structures like lesions, tumors, or organs within different imaging modalities such as endoscopy, MRI, or CT scans [8]. U-shaped networks [7, 19, 24, 26, 37, 41, 44, 62] are particularly favored due to their simple but effective encoder-decoder design. UNet [44] pioneered this approach with its use of skip connections to fuse features at different resolution stages. UNet++ [62] evolves this design by incorporating nested encoder-decoder pathways with dense skip connections. Expanding on these ideas, UNet 3+ [24] introduces comprehensive skip pathways that facilitate full-scale feature integration. Further advancement comes with DC-UNet [37], which integrates a multi-resolution convolution scheme and residual paths into its skip connections.
The DeepLab series, including DeepLabv3 [10] and DeepLabv3+ [11], introduce atrous convolutions and spatial pyramid pooling to handle multi-scale information. SegNet [2] uses pooling indices to upsample feature maps, preserving the boundary details. nnU-Net [19] automatically configures hyperparameters based on the specific dataset characteristics, using standard 2D and 3D UNets. Collectively, these U-shaped models have become a benchmark for success in the domain of medical image segmentation.

Recently, vision transformers have emerged as a formidable force in medical image segmentation, harnessing the ability to capture pixel relationships at global scales [5, 8, 17, 42, 43, 52, 58, 61]. TransUNet [8] presents a novel blend of CNNs for local feature extraction and transformers for global context, enhancing both local and global feature capture. Swin-Unet [5] extends this by incorporating Swin Transformer blocks [34] into a U-shaped model for both the encoding and decoding processes. Building on these concepts, MERIT [43] introduces a multi-scale hierarchical transformer, which employs SA across different window sizes, thus enhancing the model's capacity to capture multi-scale features critical for medical image segmentation.

The integration of attention mechanisms has been investigated within CNNs [20, 41] and transformer-based systems [17] for enhancing medical image segmentation. PraNet [20] employs a reverse attention strategy for feature refinement. PolypPVT [17] leverages PVTv2 [56] as its backbone encoder and incorporates CBAM [57] within its decoding stages. CASCADE [42] presents a novel cascaded decoder, combining channel [23] and spatial [9] attention to refine features at multiple stages extracted from a transformer encoder, culminating in high-resolution segmentation outputs. While CASCADE achieves notable performance in segmenting medical images by integrating local and global insights from transformers, it is computationally inefficient due to the use of triple 3 × 3 convolution layers at each decoder stage. In addition, it uses single-scale convolutions during decoding. Our new proposal adopts multi-scale depth-wise convolutions to mitigate these constraints.

3. Methodology

In this section, we first introduce our new EMCAD decoder and then explain two transformer-based architectures (i.e., PVT-EMCAD-B0 and PVT-EMCAD-B2) incorporating our proposed decoder.

3.1. Efficient multi-scale convolutional attention decoding (EMCAD)

In this section, we introduce our efficient multi-scale convolutional attention decoder (EMCAD) to process the multi-stage features extracted from pretrained hierarchical vision encoders for high-resolution semantic segmentation. As shown in Figure 2(b), EMCAD consists of efficient multi-scale convolutional attention modules (MSCAMs) to robustly enhance the feature maps, large-kernel grouped attention gates (LGAGs) to refine feature maps by fusing them with the skip connections via a gated attention mechanism, efficient up-convolution blocks (EUCBs) for up-sampling followed by enhancement of the feature maps, and segmentation heads (SHs) to produce the segmentation outputs.

More specifically, we use four MSCAMs to refine the pyramid features (i.e., X1, X2, X3, X4 in Figure 2) extracted from the four stages of the encoder. After each MSCAM, we use an SH to produce a segmentation map for that stage. Subsequently, we upscale the refined feature maps using EUCBs and add them to the outputs from the corresponding LGAGs. Finally, we add the four segmentation maps to produce the final segmentation output. The different modules of our decoder are described next.

Figure 2. Hierarchical encoder with the newly proposed EMCAD decoder architecture. (a) CNN or transformer encoder with four hierarchical stages, (b) EMCAD decoder, (c) Efficient up-convolution block (EUCB), (d) Multi-scale convolutional attention module (MSCAM), (e) Multi-scale convolution block (MSCB), (f) Multi-scale (parallel) depth-wise convolution (MSDC), (g) Large-kernel grouped attention gate (LGAG), (h) Channel attention block (CAB), and (i) Spatial attention block (SAB). X1, X2, X3, and X4 are the features from the four stages of the hierarchical encoder. p1, p2, p3, and p4 are the output segmentation maps from the four stages of our decoder.

3.1.1 Large-kernel grouped attention gate (LGAG)

We introduce a new large-kernel grouped attention gate (LGAG) to progressively combine feature maps with attention coefficients, which are learned by the network to allow higher activation of relevant features and suppression of irrelevant ones. This process employs a gating signal derived from higher-level features to control the flow of information across different stages of the network, thus enhancing its precision for medical image segmentation. Unlike Attention UNet [41], which uses 1 × 1 convolutions to process the gating signal g (features from skip connections) and the input feature map x (upsampled features), in our q_att(·) function we process g and x by applying separate 3 × 3 group convolutions GC_g(·) and GC_x(·), respectively. These convolved features are then normalized using batch normalization (BN(·)) [27] and merged through element-wise addition. The resultant feature map is activated through a ReLU (R(·)) layer [39]. Afterward, we apply a 1 × 1 convolution (C(·)) followed by a BN(·) layer to get a single-channel feature map. We then pass this single-channel feature map through a Sigmoid (σ(·)) activation function to yield the attention coefficients. The output of this transformation is used to scale the input feature x through element-wise multiplication, producing the attention-gated feature LGAG(g, x). The LGAG(·) (Figure 2(g)) can be formulated as in Equations 1 and 2:

q_att(g, x) = R(BN(GC_g(g)) + BN(GC_x(x)))    (1)

LGAG(g, x) = x ⊛ σ(BN(C(q_att(g, x))))    (2)

Due to using 3 × 3 kernel group convolutions in q_att(·), our LGAG captures comparatively larger spatial contexts with less computational cost.
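For concreteness, a minimal PyTorch sketch of the LGAG computation in Equations 1 and 2 is given below. The class and argument names, the intermediate channel width, and the group count are illustrative assumptions on our part rather than the exact released implementation (see the repository linked in the abstract).

import torch
import torch.nn as nn

class LGAG(nn.Module):
    # Large-kernel grouped attention gate (sketch of Eqs. 1-2).
    # g: gating features from the skip connection; x: upsampled decoder features.
    def __init__(self, ch_g, ch_x, ch_int, groups=4):
        super().__init__()
        # 3x3 group convolutions GC_g and GC_x, each followed by batch normalization
        self.gc_g = nn.Sequential(
            nn.Conv2d(ch_g, ch_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(ch_int))
        self.gc_x = nn.Sequential(
            nn.Conv2d(ch_x, ch_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(ch_int))
        # 1x1 convolution C + BN + Sigmoid producing a single-channel attention map
        self.psi = nn.Sequential(
            nn.Conv2d(ch_int, 1, kernel_size=1, bias=False),
            nn.BatchNorm2d(1),
            nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g, x):
        q_att = self.relu(self.gc_g(g) + self.gc_x(x))  # Eq. 1
        return x * self.psi(q_att)                      # Eq. 2

# Example usage: gate = LGAG(ch_g=64, ch_x=64, ch_int=32); y = gate(g, x) with g, x of shape (B, 64, H, W).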
3.1.2 Multi-scale convolutional attention module (MSCAM)

We introduce an efficient multi-scale convolutional attention module to refine the feature maps. MSCAM consists of a channel attention block (CAB(·)) to put emphasis on the pertinent channels, a spatial attention block [9] (SAB(·)) to capture the local contextual information, and an efficient multi-scale convolution block (MSCB(·)) to enhance the feature maps while preserving contextual relationships. The MSCAM(·) (Figure 2(d)) is given in Equation 3:

MSCAM(x) = MSCB(SAB(CAB(x)))    (3)

where x is the input tensor. Due to using depth-wise convolutions at multiple scales, our MSCAM is more effective, with significantly lower computational cost, than the convolutional attention module (CAM) proposed in [42].

Multi-scale Convolution Block (MSCB): We introduce an efficient multi-scale convolution block to enhance the features generated by our cascaded expanding path. In our MSCB, we follow the design of the inverted residual block (IRB) of MobileNetV2 [45]. However, unlike the IRB, our MSCB performs depth-wise convolution at multiple scales and uses channel shuffle [60] to shuffle channels across groups. More specifically, in our MSCB, we first expand the number of channels (i.e., expansion factor = 2) using a point-wise (1×1) convolution layer PWC_1(·) followed by a batch normalization layer BN(·) and a ReLU6 [31] activation layer R6(·). We then use a multi-scale depth-wise convolution MSDC(·) to capture both multi-scale and multi-resolution contexts. As depth-wise convolution overlooks the relationships among channels, we use a channel shuffle operation (CS(·)) to incorporate them. Afterward, we use another point-wise convolution PWC_2(·) followed by a BN(·) to transform back to the original #channels, which also encodes dependencies among channels. The MSCB(·) (Figure 2(e)) is formulated as in Equation 4:

MSCB(x) = BN(PWC_2(CS(MSDC(R6(BN(PWC_1(x)))))))    (4)

where the parallel MSDC(·) (Figure 2(f)) for different kernel sizes (KS) can be formulated using Equation 5:

MSDC(x) = Σ_{ks ∈ KS} DWCB_ks(x)    (5)

where DWCB_ks(x) = R6(BN(DWC_ks(x))). Here, DWC_ks(·) is a depth-wise convolution with kernel size ks; BN(·) and R6(·) are batch normalization and ReLU6 activation, respectively. Additionally, our sequential MSDC(·) uses a recursively updated input x, where the input x is residually connected to the previous DWCB_ks(·) for better regularization, as in Equation 6:

x = x + DWCB_ks(x)    (6)
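To make the block concrete, the following PyTorch sketch implements Equations 4–6 under the stated design choices (expansion factor 2, kernel sizes [1, 3, 5], ReLU6, channel shuffle); the module names, the number of shuffle groups, and any other unlisted details are assumptions on our part, not the authors' exact code.

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # CS(.): shuffle channels across groups, as in ShuffleNet [60].
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class MSDC(nn.Module):
    # Multi-scale depth-wise convolution; the parallel variant sums DWCB_ks(x) over kernel sizes (Eq. 5).
    def __init__(self, channels, kernel_sizes=(1, 3, 5), parallel=True):
        super().__init__()
        self.parallel = parallel
        self.dwcbs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, ks, padding=ks // 2, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU6(inplace=True))
            for ks in kernel_sizes])

    def forward(self, x):
        if self.parallel:
            return sum(dwcb(x) for dwcb in self.dwcbs)   # Eq. 5
        for dwcb in self.dwcbs:                          # sequential variant, Eq. 6
            x = x + dwcb(x)
        return x

class MSCB(nn.Module):
    # Multi-scale convolution block (Eq. 4): expand -> MSDC -> channel shuffle -> project.
    def __init__(self, in_ch, out_ch, expansion=2, kernel_sizes=(1, 3, 5), shuffle_groups=2):
        super().__init__()
        mid = in_ch * expansion
        self.shuffle_groups = shuffle_groups
        self.pwc1 = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                  nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.msdc = MSDC(mid, kernel_sizes)
        self.pwc2 = nn.Sequential(nn.Conv2d(mid, out_ch, 1, bias=False),
                                  nn.BatchNorm2d(out_ch))

    def forward(self, x):
        x = self.msdc(self.pwc1(x))
        x = channel_shuffle(x, self.shuffle_groups)
        return self.pwc2(x)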
Channel Attention Block (CAB): We use a channel attention block to assign different levels of importance to each channel, thus emphasizing more relevant features while suppressing less useful ones. Basically, the CAB identifies which feature maps to focus on (and then refines them). Following [57], in CAB, we first apply adaptive maximum pooling (P_m(·)) and adaptive average pooling (P_a(·)) over the spatial dimensions (i.e., height and width) to extract the most significant feature of the entire feature map per channel. Then, for each pooled feature map, we reduce the number of channels by a ratio r = 1/16 using a point-wise convolution (C_1(·)) followed by a ReLU activation (R). Afterward, we recover the original channels using another point-wise convolution (C_2(·)). We then add both recovered feature maps and apply a Sigmoid (σ) activation to estimate the attention weights. Finally, we apply these weights to the input x using the Hadamard product (⊛). The CAB(·) (Figure 2(h)) is defined using Equation 7:

CAB(x) = σ(C_2(R(C_1(P_m(x)))) + C_2(R(C_1(P_a(x))))) ⊛ x    (7)

Spatial Attention Block (SAB): We use spatial attention to mimic the attentional processes of the human brain by focusing on specific parts of an input image. Basically, the SAB determines where to focus in a feature map and then enhances those features. This process enhances the model's ability to recognize and respond to relevant spatial features, which is crucial for image segmentation, where the context and location of objects significantly influence the output. In SAB, we first pool the maximum (Ch_max(·)) and average (Ch_avg(·)) values along the channel dimension to pay attention to local features. Then, we use a large-kernel (i.e., 7 × 7, as in [17]) convolution layer LKC(·) to enhance local contextual relationships among features. Afterward, we apply the Sigmoid activation (σ) to calculate the attention weights. Finally, we apply these weights to the input x using the Hadamard product (⊛) to attend to information in a more targeted way. The SAB(·) (Figure 2(i)) is defined using Equation 8:

SAB(x) = σ(LKC([Ch_max(x), Ch_avg(x)])) ⊛ x    (8)

3.1.3 Efficient up-convolution block (EUCB)

We use an efficient up-convolution block to progressively upsample the feature maps of the current stage to match the dimension and resolution of the feature maps from the next skip connection. The EUCB first uses an upsampling operation Up(·) with a scale factor of 2 to upscale the feature maps. Then, it enhances the upscaled feature maps by applying a 3 × 3 depth-wise convolution DWC(·) followed by a BN(·) and a ReLU(·) activation. Finally, a 1 × 1 convolution C_1×1(·) is used to reduce the #channels to match the next stage. The EUCB(·) (Figure 2(c)) is formulated as in Equation 9:

EUCB(x) = C_1×1(ReLU(BN(DWC(Up(x)))))    (9)

Due to using a depth-wise convolution instead of a standard 3 × 3 convolution, our EUCB is very efficient.

3.1.4 Segmentation head (SH)

We use segmentation heads to produce the segmentation outputs from the refined feature maps of the four stages of the decoder. The SH layer applies a 1 × 1 convolution Conv_1×1(·) to the refined feature maps having ch_i channels (ch_i is the #channels in the feature map of stage i) and produces an output with #channels equal to the #classes in the target dataset for multi-class segmentation, or 1 channel for binary segmentation. The SH(·) is formulated as in Equation 10:

SH(x) = Conv_1×1(x)    (10)
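A compact PyTorch sketch of the remaining building blocks follows: CAB (Eq. 7), SAB (Eq. 8), EUCB (Eq. 9), and SH (Eq. 10). It mirrors the CBAM-style formulation cited above with r = 1/16 and a 7 × 7 kernel; the shared two-layer bottleneck for both pooled maps and the nearest-neighbor upsampling mode are assumptions about details the text leaves open.

import torch
import torch.nn as nn

class CAB(nn.Module):
    # Channel attention block (Eq. 7) with reduction ratio r = 1/16.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool_max = nn.AdaptiveMaxPool2d(1)   # P_m
        self.pool_avg = nn.AdaptiveAvgPool2d(1)   # P_a
        self.mlp = nn.Sequential(                 # C_1 -> R -> C_2 (assumed shared for both pooled maps)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        w = torch.sigmoid(self.mlp(self.pool_max(x)) + self.mlp(self.pool_avg(x)))
        return w * x

class SAB(nn.Module):
    # Spatial attention block (Eq. 8) with a 7x7 large-kernel convolution.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.lkc = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        ch_max = torch.max(x, dim=1, keepdim=True).values   # Ch_max
        ch_avg = torch.mean(x, dim=1, keepdim=True)          # Ch_avg
        w = torch.sigmoid(self.lkc(torch.cat([ch_max, ch_avg], dim=1)))
        return w * x

class EUCB(nn.Module):
    # Efficient up-convolution block (Eq. 9): Up -> depth-wise conv -> BN -> ReLU -> 1x1 conv.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return self.block(x)

# Segmentation head (Eq. 10): a single 1x1 convolution to #classes (1 for binary segmentation).
def segmentation_head(in_ch, num_classes):
    return nn.Conv2d(in_ch, num_classes, 1)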
3.2. Overall architecture

To show the generalization, effectiveness, and ability to process multi-scale features for medical image segmentation, we integrate our EMCAD decoder with the tiny (PVTv2-B0) and standard (PVTv2-B2) networks of PVTv2 [56]. However, our decoder is adaptable and seamlessly compatible with other hierarchical backbone networks.

PVTv2 differs from conventional transformer patch embedding modules by applying convolutional operations for consistent spatial information capture. Using the PVTv2-b0 (Tiny) and PVTv2-b2 (Standard) encoders [56], we develop the PVT-EMCAD-B0 and PVT-EMCAD-B2 architectures. To adopt PVTv2, we first extract the features (X1, X2, X3, and X4) from its four layers and feed them (i.e., X4 in the upsample path and X3, X2, X1 in the skip connections) into our EMCAD decoder, as shown in Figure 2(a-b). EMCAD then processes them and produces four segmentation maps that correspond to the four stages of the encoder network.

3.3. Multi-stage loss and outputs aggregation

Our EMCAD decoder's four segmentation heads produce four prediction maps p1, p2, p3, and p4 across its stages.

Loss aggregation: We adopt a combinatorial approach to loss combination called MUTATION, inspired by the work of MERIT [43], for multi-class segmentation. This involves calculating the loss for all possible combinations of predictions derived from the 4 heads, totaling 2^4 − 1 = 15 unique predictions, and then summing these losses. We focus on minimizing this cumulative combinatorial loss during the training process. For binary segmentation, we optimize the additive loss of [42] with an additional term L_{p1+p2+p3+p4}, as in Equation 11:

L_total = α·L_{p1} + β·L_{p2} + γ·L_{p3} + ζ·L_{p4} + δ·L_{p1+p2+p3+p4}    (11)

where L_{p1}, L_{p2}, L_{p3}, and L_{p4} are the losses of the individual prediction maps, and α = β = γ = ζ = δ = 1.0 are the weights assigned to each loss.

Output segmentation maps aggregation: We consider the prediction map p4, from the last stage of our decoder, as the final segmentation map. Then, we obtain the final segmentation output by employing a Sigmoid function for binary or a Softmax function for multi-class segmentation.
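As an illustration of the MUTATION-style aggregation described above, the sketch below sums a base loss over all 2^4 − 1 = 15 non-empty combinations of the four prediction maps, and also shows the additive binary-segmentation variant of Equation 11 with all weights set to 1.0; base_loss is a placeholder for the task loss and is an assumption, not part of the original formulation.

from itertools import combinations
import torch

def mutation_loss(preds, target, base_loss):
    # Sum the loss over every non-empty combination of the 4 stage predictions (15 in total).
    total = 0.0
    for r in range(1, len(preds) + 1):
        for combo in combinations(preds, r):
            combined = torch.stack(combo, dim=0).sum(dim=0)  # add the selected prediction maps
            total = total + base_loss(combined, target)
    return total

# Binary case (Eq. 11) with alpha = beta = gamma = zeta = delta = 1.0:
def binary_multistage_loss(p1, p2, p3, p4, target, base_loss):
    return (base_loss(p1, target) + base_loss(p2, target) + base_loss(p3, target)
            + base_loss(p4, target) + base_loss(p1 + p2 + p3 + p4, target))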
4. Experiments

In this section, we present the details of our implementation, followed by a comparative analysis of our PVT-EMCAD-B0 and PVT-EMCAD-B2 against SOTA methods. Datasets and evaluation metrics are in Supplementary Section 7.

Methods #Params #FLOPs Clinic Colon ETIS Kvasir BKAI ISIC17 ISIC18 DSB18 EM BUSI Avg.
(Dataset groups — Polyp: Clinic, Colon, ETIS, Kvasir, BKAI; Skin Lesion: ISIC17, ISIC18; Cell: DSB18, EM; Breast Cancer: BUSI)
UNet [44] 34.53M 65.53G 92.11 83.95 76.85 82.87 85.05 83.07 86.67 92.23 95.46 74.04 85.23
UNet++ [62] 9.16M 34.65G 92.17 87.88 77.40 83.36 84.07 82.98 87.46 91.97 95.48 74.76 85.75
AttnUNet [41] 34.88M 66.64G 92.20 86.46 76.84 83.49 84.07 83.66 87.05 92.22 95.55 74.48 85.60
DeepLabv3+ [10] 39.76M 14.92G 93.24 91.92 90.73 89.06 89.74 83.84 88.64 92.14 94.96 76.81 89.11
PraNet [20] 32.55M 6.93G 91.71 89.16 83.84 84.82 85.56 83.03 88.56 89.89 92.37 75.14 86.41
CaraNet [38] 46.64M 11.48G 94.08 91.19 90.25 89.74 89.71 85.02 90.18 89.15 92.78 77.34 88.94
UACANet-L [30] 69.16M 31.51G 94.16 91.02 89.77 90.17 90.35 83.72 89.76 88.86 89.28 76.96 88.41
SSFormer-L [54] 66.22M 17.28G 94.18 92.11 90.16 91.47 91.14 85.28 90.25 92.03 94.95 78.76 90.03
PolypPVT [17] 25.11M 5.30G 94.13 91.53 89.93 91.56 91.17 85.56 90.36 90.69 94.40 79.35 89.87
TransUNet [8] 105.32M 38.52G 93.90 91.63 87.79 91.08 89.17 85.00 89.16 92.04 95.27 78.30 89.33
SwinUNet [5] 27.17M 6.2G 92.42 89.27 85.10 89.59 87.61 83.97 89.26 91.03 94.47 77.38 88.01
TransFuse [61] 143.74M 82.71G 93.62 90.35 86.91 90.24 87.47 84.89 89.62 90.85 94.35 79.36 88.77
UNeXt [50] 1.47M 0.57G 90.20 83.84 74.03 77.88 77.93 82.74 87.78 86.01 93.81 74.71 82.89
PVT-CASCADE [42] 34.12M 7.62G 94.53 91.60 91.03 92.05 92.14 85.50 90.41 92.35 95.42 79.21 90.42
PVT-EMCAD-B0 (Ours) 3.92M 0.84G 94.60 91.71 91.65 91.95 91.30 85.67 90.70 92.46 95.35 79.80 90.52
PVT-EMCAD-B2 (Ours) 26.76M 5.6G 95.21 92.31 92.29 92.75 92.96 85.95 90.96 92.74 95.53 80.25 91.10
Table 1. Results of binary medical image segmentation (i.e., polyp, skin lesion, cell, and breast cancer). We reproduce the results of SOTA
methods using their publicly available implementation with our train-val-test splits of 80:10:10. #FLOPs of all the methods are reported
for 256 × 256 inputs, except Swin-UNet (224 × 224). All results are averaged over five runs. Best results are shown in bold.
Architectures Avg. DICE↑ Avg. HD95↓ Avg. mIoU↑ Aorta GB KL KR Liver PC SP SM
UNet [44] 70.11 44.69 59.39 84.00 56.70 72.41 62.64 86.98 48.73 81.48 67.96
AttnUNet [41] 71.70 34.47 61.38 82.61 61.94 76.07 70.42 87.54 46.70 80.67 67.66
R50+UNet [8] 74.68 36.87 − 84.18 62.84 79.19 71.29 93.35 48.23 84.41 73.92
R50+AttnUNet [8] 75.57 36.97 − 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
SSFormer [54] 78.01 25.72 67.23 82.78 63.74 80.72 78.11 93.53 61.53 87.07 76.61
PolypPVT [17] 78.08 25.61 67.43 82.34 66.14 81.21 73.78 94.37 59.34 88.05 79.4
TransUNet [8] 77.61 26.9 67.32 86.56 60.43 80.54 78.53 94.33 58.47 87.06 75.00
SwinUNet [5] 77.58 27.32 66.88 81.76 65.95 82.32 79.22 93.73 53.81 88.04 75.79
MT-UNet [53] 78.59 26.59 − 87.92 64.99 81.47 77.29 93.06 59.46 87.75 76.81
MISSFormer [25] 81.96 18.20 − 86.99 68.65 85.21 82.00 94.41 65.67 91.92 80.81
PVT-CASCADE [42] 81.06 20.23 70.88 83.01 70.59 82.23 80.37 94.08 64.43 90.1 83.69
TransCASCADE [42] 82.68 17.34 73.48 86.63 68.48 87.66 84.56 94.43 65.33 90.79 83.52
PVT-EMCAD-B0 (Ours) 81.97 17.39 72.64 87.21 66.62 87.48 83.96 94.57 62.00 92.66 81.22
PVT-EMCAD-B2 (Ours) 83.63 15.68 74.65 88.14 68.87 88.08 84.10 95.26 68.51 92.17 83.92
Table 2. Results of abdomen organ segmentation on Synapse Multi-organ dataset. DICE scores are reported for individual organs. Results
of UNet, AttnUNet, PolypPVT, SSFormerPVT, TransUNet, and SwinUNet are taken from [42]. ↑ (↓) denotes the higher (lower) the better.
‘−’ means missing data from the source. EMCAD results are averaged over five runs. Best results are shown in bold.
4.1. Implementation details

We implement our network and conduct experiments using PyTorch 1.11.0 on a single NVIDIA RTX A6000 GPU with 48GB of memory. We utilize ImageNet [16] pre-trained PVTv2-b0 and PVTv2-b2 [56] as encoders. In the MSDC of our decoder, we set the multi-scale kernels to [1, 3, 5] through an ablation study. We use the parallel arrangement of depth-wise convolutions in all experiments. Our models are trained using the AdamW optimizer [36] with a learning rate and weight decay of 1e−4. We generally train for 200 epochs with a batch size of 16, except for Synapse multi-organ (300 epochs, batch size 6) and ACDC cardiac organ (400 epochs, batch size 12), saving the best model based on the DICE score. We resize images to 352×352 and use a multi-scale {0.75, 1.0, 1.25} training strategy with a gradient clip limit of 0.5 for ClinicDB [3], Kvasir [29], ColonDB [51], ETIS [51], BKAI [40], ISIC17 [15], and ISIC18 [15], while we resize images to 256 × 256 for BUSI [1], EM [6], and DSB18 [4]. For the Synapse and ACDC datasets, images are resized to 224 × 224, with random rotation and flipping augmentations, optimizing a combined Cross-entropy (0.3) and DICE (0.7) loss. For binary segmentation, we utilize the combined weighted BinaryCrossEntropy (BCE) and weighted IoU loss function.
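For reference, a minimal sketch of the combined loss used for Synapse and ACDC (0.3 × cross-entropy + 0.7 × DICE) is given below; the soft-DICE formulation and the smoothing constant are implementation assumptions not spelled out above.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, num_classes, eps=1e-5):
    # Soft DICE averaged over classes; target is a LongTensor of shape (B, H, W).
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = torch.sum(probs * one_hot, dims)
    card = torch.sum(probs + one_hot, dims)
    return 1.0 - (2.0 * inter / (card + eps)).mean()

def ce_dice_loss(logits, target, num_classes, w_ce=0.3, w_dice=0.7):
    # Combined 0.3 * cross-entropy + 0.7 * DICE loss used for multi-class segmentation.
    return w_ce * F.cross_entropy(logits, target) + w_dice * soft_dice_loss(logits, target, num_classes)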
4.2. Results

We compare our architectures (i.e., PVT-EMCAD-B0 and PVT-EMCAD-B2) with SOTA CNN- and transformer-based methods.
Cascaded LGAG MSCAM #FLOPs(G) @224 #FLOPs(G) @256 #Params(M) Avg. DICE
No No No 0 0 0 80.10±0.2
Yes No No 0.100 0.131 0.224 81.08±0.2
Yes Yes No 0.108 0.141 0.235 81.92±0.2
Yes No Yes 0.373 0.487 1.898 82.86±0.3
Yes Yes Yes 0.381 0.498 1.91 83.63±0.3
Table 4. Effect of the different components of EMCAD on the Synapse multi-organ dataset. Decoder #FLOPs are reported for 224 × 224 and 256 × 256 inputs.
Conv. kernels [1] [3] [5] [1, 3] [3, 3] [1, 3, 5] [3, 3, 3] [3, 5, 7] [1, 3, 5, 7] [1, 3, 5, 7, 9]
Synapse 82.43 82.79 82.74 82.98 82.81 83.63 82.92 83.11 83.57 83.34
ClinicDB 94.81 94.90 94.98 95.13 95.06 95.21 95.15 95.03 95.18 95.07
Table 5. Effect of multi-scale kernels in the depth-wise convolution of MSDC on ClinicDB and Synapse multi-organ datasets. We use the
PVTv2-b2 encoder for these experiments. All results are averaged over five runs. Best results are highlighted in bold.
Encoders Decoders #FLOPs(G) #Params(M) DICE (%)
PVTv2-B0 CASCADE 0.439 2.32 80.54
PVTv2-B0 EMCAD (Ours) 0.110 0.507 81.97
PVTv2-B2 CASCADE 1.93 9.27 82.78
PVTv2-B2 EMCAD (Ours) 0.381 1.91 83.63
Table 6. Comparison with the baseline decoder on the Synapse multi-organ dataset. We only report the #FLOPs (with an input resolution of 224 × 224) and the #parameters of the decoders. All the results are averaged over five runs. Best results are shown in bold.

Our PVT-EMCAD-B2 also leads in individual organ segmentation, significantly outperforming SOTA methods on six of eight organs.

4.2.3 Results of cardiac organ segmentation

Table 3 shows the DICE scores of our PVT-EMCAD-B2 and PVT-EMCAD-B0, along with other SOTA methods, on the MRI images of the ACDC dataset for cardiac organ segmentation. Our PVT-EMCAD-B2 achieves the highest average DICE score of 92.12%, improving by about 0.27% over Cascaded MERIT, even though our network has a significantly lower computational cost. Besides, PVT-EMCAD-B2 has better DICE scores in all three organ segmentations.

5. Ablation Studies

In this section, we conduct ablation studies to explore different aspects of our architectures and the experimental framework. More ablations are in Supplementary Section 8.

5.1. Effect of different components of EMCAD

We conduct a set of experiments on the Synapse multi-organ dataset to understand the effect of the different components of our EMCAD decoder. We start with only the encoder and add different modules, such as the cascaded structure, LGAG, and MSCAM, to understand their effect. Table 4 shows that the cascaded structure of the decoder helps to improve performance over the non-cascaded one. The incorporation of LGAG and MSCAM improves performance; however, MSCAM proves to be more effective. When both the LGAG and MSCAM modules are used together, the decoder produces the best DICE score of 83.63%. It is also evident that there is about a 3.53% improvement in the DICE score with an additional 0.381G FLOPs and 1.91M parameters.

5.2. Effect of multi-scale kernels in MSCAM

We have conducted another set of experiments on the Synapse multi-organ and ClinicDB datasets to understand the effect of the different multi-scale kernels used for the depth-wise convolutions in MSDC. Table 5 reports these results, which show that performance improves from the 1×1 to the 3×3 kernel. When the 1 × 1 kernel is used together with 3 × 3, it improves more than when using either alone. However, when two 3 × 3 kernels are used together, performance drops. The incorporation of a 5 × 5 kernel with the 1 × 1 and 3 × 3 kernels further improves the performance and achieves the best results on both the Synapse multi-organ and ClinicDB datasets. If we add additional larger kernels (e.g., 7×7, 9×9), the performance on both datasets drops. Based on these empirical observations, we choose the [1, 3, 5] kernels in all our experiments.

5.3. Comparison with the baseline decoder

In Table 6, we report the experimental results along with the computational complexity of our EMCAD decoder and a baseline decoder, namely CASCADE. From Table 6, we can see that our EMCAD decoder with PVTv2-B2 requires 80.3% fewer FLOPs and 79.4% fewer parameters to outperform (by 0.85%) the respective CASCADE decoder. Similarly, our EMCAD decoder with PVTv2-B0 achieves a 1.43% better DICE score than the CASCADE decoder with 78.1% fewer parameters and 74.9% fewer FLOPs.

6. Conclusions

In this paper, we have presented EMCAD, a new and efficient multi-scale convolutional attention decoder designed for multi-stage feature aggregation and refinement in medical image segmentation. EMCAD employs a multi-scale depth-wise convolution block, which is key for capturing diverse scale information within feature maps, a critical factor for precision in medical image segmentation. This design choice, using depth-wise convolutions instead of standard 3 × 3 convolution blocks, makes EMCAD notably efficient. Our experiments reveal that EMCAD surpasses the recent CASCADE decoder in DICE scores with 79.4% fewer parameters and 80.3% fewer FLOPs. Our extensive experiments also confirm EMCAD's superior performance compared to SOTA methods across 12 public datasets covering six different 2D medical image segmentation tasks. EMCAD's compatibility with smaller encoders makes it an excellent fit for point-of-care applications while maintaining high performance. We anticipate that our EMCAD decoder will be a valuable asset in enhancing a variety of medical image segmentation and semantic segmentation tasks.

Acknowledgements: This work is supported in part by the NSF grant CNS 2007284, and in part by the iMAGiNE Consortium (https://imagine.utexas.edu/).
References

[1] Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in Brief, 28:104863, 2020. 6, 1
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(12):2481–2495, 2017. 3
[3] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph., 43:99–111, 2015. 6, 1
[4] Juan C Caicedo, Allen Goodman, Kyle W Karhohs, Beth A Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, et al. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods, 16(12):1247–1253, 2019. 6, 7, 1
[5] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021. 1, 3, 6, 7
[6] Albert Cardona, Stephan Saalfeld, Stephan Preibisch, Benjamin Schmid, Anchi Cheng, Jim Pulokas, Pavel Tomancak, and Volker Hartenstein. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLoS Biology, 8(10):e1000502, 2010. 6, 7, 1
[7] Gongping Chen, Lei Li, Yu Dai, Jianxun Zhang, and Moi Hoon Yap. Aau-net: an adaptive attention u-net for breast lesions segmentation in ultrasound images. IEEE Trans. Med. Imaging, 2022. 2
[8] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. 1, 2, 3, 6, 7
[9] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5659–5667, 2017. 3, 4
[10] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2017. 2, 6
[11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Eur. Conf. Comput. Vis., pages 801–818, 2018. 2
[12] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Eur. Conf. Comput. Vis., pages 234–250, 2018. 1
[13] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021. 1
[14] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019. 1
[15] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In IEEE Int. Symp. Biomed. Imaging, pages 168–172. IEEE, 2018. 6, 1
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pages 248–255. IEEE, 2009. 6
[17] Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, and Ling Shao. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021. 1, 3, 5, 6
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 2
[19] Isensee et al. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021. 1, 2, 3
[20] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 263–273. Springer, 2020. 1, 3, 6
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 2
[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2
[23] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7132–7141, 2018. 2, 3
[24] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, and Jian Wu. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP, pages 1055–1059. IEEE, 2020. 1, 2
[25] Xiaohong Huang, Zhifang Deng, Dandan Li, and Xueguang Yuan. Missformer: An effective medical image segmentation transformer. arXiv preprint arXiv:2109.07162, 2021. 6, 7
[26] Nabil Ibtehaz and Daisuke Kihara. Acc-unet: A completely convolutional unet model for the 2020s. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 692–702. Springer, 2023. 2
[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Int. Conf. Mach. Learn., pages 448–456. PMLR, 2015. 3
[28] Md Amirul Islam, Sen Jia, and Neil DB Bruce. How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248, 2020. 1
[29] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. Kvasir-seg: A segmented polyp dataset. In Int. Conf. Multimedia Model., pages 451–462. Springer, 2020. 6, 1
[30] Taehun Kim, Hyemin Lee, and Daijin Kim. Uacanet: Uncertainty augmented context attention for polyp segmentation. In ACM Int. Conf. Multimedia, pages 2167–2175, 2021. 6
[31] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010. 4
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst., 25, 2012. 2
[33] Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, and Li Yu. Convformer: Plug-and-play cnn-style transformers for improving medical image segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 642–651. Springer, 2023. 1
[34] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. Comput. Vis., pages 10012–10022, 2021. 1, 2, 3
[35] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11976–11986, 2022. 2
[36] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6
[37] Ange Lou, Shuyue Guan, and Murray Loew. Dc-unet: rethinking the u-net architecture with dual channel efficient cnn for medical image segmentation. In Med. Imaging 2021: Image Process., pages 758–768. SPIE, 2021. 1, 2
[38] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray H Loew. Caranet: context axial reverse attention network for segmentation of small medical objects. In Med. Imaging 2022: Image Process., pages 81–92. SPIE, 2022. 6
[39] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Int. Conf. Mach. Learn., pages 807–814, 2010. 3
[40] Phan Ngoc Lan, Nguyen Sy An, Dao Viet Hang, Dao Van Long, Tran Quang Trung, Nguyen Thi Thuy, and Dinh Viet Sang. Neounet: Towards accurate colon polyp segmentation and neoplasm detection. In Adv. Vis. Comput. – Int. Symp., pages 15–28. Springer, 2021. 6, 1
[41] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018. 1, 2, 3, 6
[42] Md Mostafijur Rahman and Radu Marculescu. Medical image segmentation via cascaded attention decoding. In IEEE/CVF Winter Conf. Appl. Comput. Vis., pages 6222–6231, 2023. 1, 3, 4, 5, 6, 7
[43] Md Mostafijur Rahman and Radu Marculescu. Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation. In Med. Imaging Deep Learn., 2023. 1, 3, 5, 7
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 234–241. Springer, 2015. 1, 2, 6
[45] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4510–4520, 2018. 2, 4
[46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2
[47] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015. 2
[48] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Int. Conf. Mach. Learn., pages 6105–6114. PMLR, 2019. 2
[49] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In Eur. Conf. Comput. Vis., pages 459–479. Springer, 2022. 1, 2
[50] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 23–33. Springer, 2022. 6, 1
[51] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng., 2017, 2017. 6, 1
[52] Haonan Wang, Peng Cao, Jiaqi Wang, and Osmar R Zaiane. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. In AAAI, pages 2441–2449, 2022. 1, 3
[53] Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, and Ruofeng Tong. Mixed transformer u-net for medical image segmentation. In ICASSP, pages 2390–2394. IEEE, 2022. 6, 7
[54] Jinfeng Wang, Qiming Huang, Feilong Tang, Jia Meng, Jionglong Su, and Sifan Song. Stepwise feature fusion: Local guides global. arXiv preprint arXiv:2203.03635, 2022. 1, 6
[55] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyra-
mid vision transformer: A versatile backbone for dense pre-
diction without convolutions. In Int. Conf. Comput. Vis.,
pages 568–578, 2021. 1, 2
[56] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt
v2: Improved baselines with pyramid vision transformer.
Comput. Vis. Media, 8(3):415–424, 2022. 1, 2, 3, 5, 6
[57] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So
Kweon. Cbam: Convolutional block attention module. In
Eur. Conf. Comput. Vis., pages 3–19, 2018. 1, 3, 4
[58] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
ficient design for semantic segmentation with transformers.
Adv. Neural Inform. Process. Syst., 34:12077–12090, 2021.
2, 3
[59] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou,
Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer
is actually what you need for vision. In IEEE Conf. Comput.
Vis. Pattern Recog., pages 10819–10829, 2022. 1
[60] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.
Shufflenet: An extremely efficient convolutional neural net-
work for mobile devices. In IEEE Conf. Comput. Vis. Pattern
Recog., pages 6848–6856, 2018. 4
[61] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fus-
ing transformers and cnns for medical image segmentation.
In Int. Conf. Med. Image Comput. Comput. Assist. Interv.,
pages 14–24. Springer, 2021. 1, 3, 6
[62] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima
Tajbakhsh, and Jianming Liang. Unet++: A nested u-net ar-
chitecture for medical image segmentation. In Deep Learn.
Med. Image Anal. Multimodal Learn. Clin. Decis. Support,
pages 3–11. Springer, 2018. 1, 2, 6
EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
Supplementary Material
7. Experimental Details

This section extends Section 4 of the main paper by describing the datasets and evaluation metrics, followed by additional experimental results.

7.1. Datasets

To evaluate the performance of our EMCAD decoder, we carry out experiments across 12 datasets that belong to six medical image segmentation tasks, as described next.

Polyp segmentation: We use five polyp segmentation datasets: Kvasir [29] (1,000 images), ClinicDB [3] (612 images), ColonDB [51] (379 images), ETIS [51] (196 images), and BKAI [40] (1,000 images). These datasets contain images from different imaging centers/clinics, having greater diversity in image nature as well as in the size and shape of polyps.

Abdomen organ segmentation: We use the Synapse multi-organ dataset (https://www.synapse.org/#!Synapse:syn3193805/wiki/217789) for abdomen organ segmentation. This dataset contains 30 abdominal CT scans with 3,779 axial contrast-enhanced slices. Each CT scan has 85–198 slices of 512 × 512 pixels. Following TransUNet [8], we use the same 18 scans for training (2,212 axial slices) and 12 scans for validation. We segment only eight abdominal organs, namely the aorta, gallbladder (GB), left kidney (KL), right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM).

Cardiac organ segmentation: We use the ACDC dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/) for cardiac organ segmentation. It contains 100 cardiac MRI scans having three sub-organs, namely the right ventricle (RV), myocardium (Myo), and left ventricle (LV). Following TransUNet [8], we use 70 cases (1,930 axial slices) for training, 10 for validation, and 20 for testing.

Skin lesion segmentation: We use ISIC17 [15] (2,000 training, 150 validation, and 600 testing images) and ISIC18 [14] (2,594 images) for skin lesion segmentation.

Breast cancer segmentation: We use the BUSI [1] dataset for breast cancer segmentation. Following [50], we use 647 (437 benign and 210 malignant) images from this dataset.

Cell nuclei/structure segmentation: We use the DSB18 [4] (670 images) and EM [6] (30 images) datasets of biological imaging for cell nuclei/structure segmentation.

We use a train-val-test split of 80:10:10 in the ClinicDB, Kvasir, ColonDB, ETIS, BKAI, ISIC18, DSB18, EM, and BUSI datasets. For ISIC17, we use the official train-val-test sets provided by the competition organizer.

7.2. Evaluation metrics

We use the DICE score to evaluate performance on all the datasets. However, we also use the 95% Hausdorff Distance (HD95) and mIoU as additional evaluation metrics for Synapse multi-organ segmentation. The DICE score DSC(Y, P), IoU(Y, P), and HD95 distance D_H(Y, P) are calculated using Equations 12, 13, and 14, respectively:

DSC(Y, P) = (2 × |Y ∩ P|) / (|Y| + |P|) × 100    (12)

IoU(Y, P) = |Y ∩ P| / |Y ∪ P| × 100    (13)

D_H(Y, P) = max{ max_{y ∈ Y} min_{p ∈ P} d(y, p), max_{p ∈ P} min_{y ∈ Y} d(y, p) }    (14)

where Y and P are the ground truth and predicted segmentation maps, respectively.
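The DICE and IoU metrics of Equations 12 and 13 can be computed on binary masks as sketched below (HD95 is typically computed with an external library and is therefore omitted); this is an illustrative helper, not the exact evaluation script.

import torch

def dice_score(y: torch.Tensor, p: torch.Tensor, eps: float = 1e-7) -> float:
    # Eq. 12: DSC(Y, P) = 2|Y ∩ P| / (|Y| + |P|) * 100, for binary masks.
    y, p = y.bool(), p.bool()
    inter = (y & p).sum().item()
    return 2.0 * inter / (y.sum().item() + p.sum().item() + eps) * 100.0

def iou_score(y: torch.Tensor, p: torch.Tensor, eps: float = 1e-7) -> float:
    # Eq. 13: IoU(Y, P) = |Y ∩ P| / |Y ∪ P| * 100.
    y, p = y.bool(), p.bool()
    inter = (y & p).sum().item()
    union = (y | p).sum().item()
    return inter / (union + eps) * 100.0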
7.3. Qualitative results

This subsection describes the qualitative results of different methods, including our EMCAD. From the qualitative results on the Synapse multi-organ dataset in Figure 4, we can see that most of the methods face challenges segmenting the left kidney (orange) and part of the pancreas (pink). However, our PVT-EMCAD-B0 (Figure 4g) and PVT-EMCAD-B2 (Figure 4h) can segment those organs more accurately (see the red rectangular box) with significantly lower computational costs. Similarly, the qualitative results of polyp segmentation on a representative image from the ClinicDB dataset in Figure 5 show that the predicted segmentation outputs of our PVT-EMCAD-B0 (Figure 5p) and PVT-EMCAD-B2 (Figure 5q) have strong overlaps with the ground-truth mask (Figure 5r), while existing SOTA methods exhibit false segmentation of the polyp (see the red rectangular box).

8. Additional Ablation Study

This section further elaborates on Section 5 by detailing five additional ablation studies related to our architectural design and experimental setup.

8.1. Parallel vs. sequential depth-wise convolution

We have conducted another set of experiments to decide whether to use the multi-scale depth-wise convolutions in parallel or sequentially. Table 7 presents the results of these experiments, which show that there is no significant impact of using the parallel or the sequential arrangement.
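A self-contained toy comparison of the two arrangements studied here is sketched below; the channel count, input size, and kernel sizes are arbitrary illustrative choices rather than values tied to a specific experiment.

import torch
import torch.nn as nn

def dwcb(channels, ks):
    # Depth-wise conv + BN + ReLU6, as in the DWCB definition of the MSDC.
    return nn.Sequential(
        nn.Conv2d(channels, channels, ks, padding=ks // 2, groups=channels, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU6(inplace=True))

blocks = nn.ModuleList([dwcb(32, ks) for ks in (1, 3, 5)])
x = torch.randn(1, 32, 56, 56)

parallel_out = sum(blk(x) for blk in blocks)  # parallel MSDC (Eq. 5)
seq = x
for blk in blocks:                            # sequential MSDC with residual updates (Eq. 6)
    seq = seq + blk(seq)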
Figure 4. Qualitative results of multi-organ segmentation on Synapse Multi-organ dataset. The red rectangular box highlights incorrectly
segmented organs by SOTA methods.
Figure 5. Qualitative results of polyp segmentation. The red rectangular box highlights incorrectly segmented polyps by SOTA methods.
Architectures Depth-wise convolutions Synapse ClinicDB
PVT-EMCAD-B0 Sequential 81.82±0.3 94.57±0.2
PVT-EMCAD-B0 Parallel 81.97±0.2 94.60±0.2
PVT-EMCAD-B2 Sequential 83.54±0.3 95.15±0.3
PVT-EMCAD-B2 Parallel 83.63±0.2 95.21±0.2
Table 7. Results of parallel and sequential depth-wise convolution in MSDC on the Synapse multi-organ and ClinicDB datasets. All results are averaged over five runs. Best results are in bold.

Architectures Module Params(K) FLOPs(M) Synapse
PVT-EMCAD-B0 AG 31.62 15.91 81.74
PVT-EMCAD-B0 LGAG 5.51 5.24 81.97
PVT-EMCAD-B2 AG 124.68 61.68 83.51
PVT-EMCAD-B2 LGAG 11.01 10.47 83.63
Table 8. LGAG vs. AG (Attention gate) [41] on the Synapse multi-organ dataset. The total #Params and #FLOPs of the three AG/LGAGs in our decoder are reported for an input resolution of 256 × 256. All results are averaged over five runs. Best results are in bold.
Architectures Pretrain Avg. DICE↑ Avg. HD95↓ Avg. mIoU↑ Aorta GB KL KR Liver PC SP SM
PVT-EMCAD-B0 No 77.47 19.93 66.72 81.96 69.41 83.88 74.82 93.45 54.41 88.97 72.85
PVT-EMCAD-B0 Yes 81.97 17.39 72.64 87.21 66.62 87.48 83.96 94.57 62.00 92.66 81.22
PVT-EMCAD-B2 No 80.18 18.83 70.21 85.98 68.10 84.62 79.93 93.96 61.61 90.99 76.23
PVT-EMCAD-B2 Yes 83.63 15.68 74.65 88.14 68.87 88.08 84.10 95.26 68.51 92.17 83.92
Table 9. Effect of transfer learning from ImageNet pre-trained weights on Synapse multi-organ dataset. ↑ (↓) denotes the higher (lower)
the better. All results are averaged over five runs. Best results are in bold.
DS EM BUSI Clinic Kvasir ISIC18 Synapse ACDC
No 95.74 79.64 94.96 92.51 90.74 82.03 92.08
Yes 95.53 80.25 95.21 92.75 90.96 83.63 92.12
Table 10. Effect of deep supervision (DS). PVT-EMCAD-B2 with DS achieves slightly better DICE scores in 6 out of 7 datasets.

Architectures Resolutions FLOPs(G) DICE
PVT-EMCAD-B0 224 × 224 0.64 81.97
PVT-EMCAD-B0 256 × 256 0.84 82.63
PVT-EMCAD-B0 384 × 384 1.89 84.81
PVT-EMCAD-B0 512 × 512 3.36 85.52
PVT-EMCAD-B2 224 × 224 4.29 83.63
PVT-EMCAD-B2 256 × 256 5.60 84.47
PVT-EMCAD-B2 384 × 384 12.59 85.78
PVT-EMCAD-B2 512 × 512 22.39 86.53
Table 11. Effect of input resolutions on the Synapse multi-organ dataset. All results are averaged over five runs.

8.2. LGAG vs. attention gate (AG)

As Table 8 shows, our LGAG achieves large reductions in #Params (82.57% for PVT-EMCAD-B0 and 91.17% for PVT-EMCAD-B2) and #FLOPs (67.06% for PVT-EMCAD-B0 and 83.03% for PVT-EMCAD-B2) compared to AG. The reduction in #Params and #FLOPs is bigger for the larger models. Therefore, our LGAG demonstrates improved scalability with models that have a greater number of channels, yielding enhanced DICE scores.

8.3. Effect of transfer learning from ImageNet pre-trained weights

We conduct experiments on the Synapse multi-organ dataset to show the effect of transfer learning from an ImageNet pre-trained encoder. Table 9 reports the results of these experiments, which show that transfer learning from ImageNet pre-trained PVTv2 encoders significantly boosts the performance. Specifically, for PVT-EMCAD-B0, the DICE, mIoU, and HD95 scores are improved by 4.5%, 5.92%, and 2.54, respectively. Likewise, for PVT-EMCAD-B2, the DICE, mIoU, and HD95 scores are improved by 3.45%, 4.44%, and 3.15, respectively. We can also conclude that transfer learning has a comparatively greater impact on the smaller PVT-EMCAD-B0 model than on the larger PVT-EMCAD-B2 model. For individual organs, transfer learning significantly boosts the performance of all organ segmentations, except the gallbladder (GB).

8.4. Effect of deep supervision

We have conducted an ablation study that drops the deep supervision (DS). The results of our PVT-EMCAD-B2 on seven datasets are given in Table 10. Our PVT-EMCAD-B2 with DS achieves slightly better DICE scores in six out of seven datasets. Among all the datasets, DS has the largest impact on the Synapse multi-organ dataset.

8.5. Effect of input resolutions

Table 11 presents the results of our PVT-EMCAD-B0 and PVT-EMCAD-B2 architectures with different input resolutions. From this table, it is evident that the DICE scores improve with increasing input resolution. However, these improvements in DICE score come with an increase in #FLOPs. Our PVT-EMCAD-B0 achieves an 85.52% DICE score with only 3.36G FLOPs when using 512 × 512 inputs. On the other hand, our PVT-EMCAD-B2 achieves the best DICE score (86.53%) with 22.39G FLOPs when using 512 × 512 inputs. We also observe that our PVT-EMCAD-B2 with 5.60G FLOPs when using 256 × 256 inputs shows a 1.05% lower DICE score than PVT-EMCAD-B0 with 3.36G FLOPs. Therefore, we can conclude that PVT-EMCAD-B0 is more suitable for larger input resolutions than PVT-EMCAD-B2.