EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

Abstract

An efficient and effective decoding mechanism is crucial in medical image segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, we introduce EMCAD, a new efficient multi-scale convolutional attention decoder, designed to optimize both performance and computational efficiency. EMCAD leverages a unique multi-scale depth-wise convolution block, significantly enhancing feature maps through multi-scale convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) gated attention mechanisms, which are highly effective at capturing intricate spatial relationships while focusing on salient regions. By employing group and depth-wise convolutions, EMCAD is very efficient and scales well (e.g., only 1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). Our rigorous evaluations across 12 datasets that belong to six medical image segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) performance with 79.4% and 80.3% reductions in #Params and #FLOPs, respectively. Moreover, EMCAD's adaptability to different encoders and versatility across segmentation tasks further establish EMCAD as a promising tool, advancing the field towards more efficient and accurate medical image analysis. Our implementation is available at https://github.com/SLDGroup/EMCAD.

1. Introduction

In the realm of medical diagnostics and therapeutic strategies, automated segmentation of medical images is vital, as it classifies pixels to identify critical regions such as lesions, tumors, or entire organs. A variety of U-shaped convolutional neural network (CNN) architectures [20, 24, 37, 41, 44, 62], notably UNet [44], UNet++ [62], UNet 3+ [24], and nnU-Net [19], have become standard techniques for this purpose, achieving high-quality, high-resolution segmentation output. Attention mechanisms [12, 17, 20, 41, 57] have also been integrated into these models to enhance feature maps and improve pixel-level classification. Although attention-based models have shown improved performance, they still face significant challenges due to the computationally expensive convolutional blocks that are typically used in conjunction with attention mechanisms.

Recently, vision transformers [18] have shown promise in medical image segmentation tasks [5, 8, 17, 42, 43, 52, 54, 61] by capturing long-range dependencies among pixels through self-attention (SA) mechanisms. Hierarchical vision transformers like Swin [34], PVT [55, 56], MaxViT [49], MERIT [43], ConvFormer [33], and MetaFormer [59] have been introduced to further improve performance in this field. While SA excels at capturing global information, it is less adept at understanding the local spatial context [13, 28]. To address this limitation, some approaches have integrated local convolutional attention within the decoders to better grasp spatial details. Nevertheless, these methods can still be computationally demanding because they frequently employ costly convolutional blocks. This limits their applicability to real-world scenarios where computational resources are restricted.

To address the aforementioned limitations, we introduce EMCAD, an efficient multi-scale convolutional attention decoder built around a new multi-scale depth-wise convolution block. More precisely, EMCAD enhances the feature maps via efficient multi-scale convolutions, while incorporating complex spatial relationships and local attention through the use of channel, spatial, and grouped (large-kernel) gated attention mechanisms. Our contributions are as follows:
• New Efficient Multi-scale Convolutional Decoder: We introduce an efficient multi-scale cascaded fully-convolutional attention decoder (EMCAD) for 2D medical image segmentation; it takes the multi-stage features of vision encoders and progressively enhances their multi-scale and multi-resolution spatial representations. EMCAD has only 0.506M parameters and 0.11G FLOPs for a tiny encoder with #channels = [32, 64, 160, 256],
while it has 1.91M parameters and 0.381G FLOPs for a standard encoder with #channels = [64, 128, 320, 512].
• Efficient Multi-scale Convolutional Attention Module: We introduce MSCAM, a new efficient multi-scale convolutional attention module that performs depth-wise convolutions at multiple scales; this refines the feature maps produced by vision encoders and enables capturing multi-scale salient features by suppressing irrelevant regions. The use of depth-wise convolutions makes MSCAM very efficient.
• Large-kernel Grouped Attention Gate: We introduce a new grouped attention gate to fuse refined features with the features from skip connections. By using larger-kernel (3 × 3) group convolutions instead of point-wise convolutions in the design, we capture salient features in a larger local context with less computation.
• Improved Performance: We empirically show that EMCAD can be used with any hierarchical vision encoder (e.g., PVTv2-B0, PVTv2-B2 [56]), while significantly improving the performance of 2D medical image segmentation. EMCAD produces better results than SOTA methods with a significantly lower computational cost (as shown in Figure 1) on 12 medical image segmentation benchmarks that belong to six different tasks.

Figure 1. Average DICE scores vs. #FLOPs for different methods over 10 binary medical image segmentation datasets. As shown, our approaches (PVT-EMCAD-B0 and PVT-EMCAD-B2) have the lowest #FLOPs, yet the highest DICE scores.

The remainder of this paper is organized as follows: Section 2 summarizes related work. Section 3 describes the proposed method. Section 4 explains our experimental setup and results on 12 medical image segmentation benchmarks. Section 5 covers different ablation experiments. Lastly, Section 6 concludes the paper.

2. Related Work

2.1. Vision encoders

Convolutional Neural Networks (CNNs) [21–23, 32, 35, 45–48] have been foundational as encoders due to their proficiency in handling spatial relationships in images. More precisely, AlexNet [32] and VGG [46] pave the way, leveraging deep layers of convolutions to extract features progressively. GoogleNet [47] introduces the inception module, allowing more efficient computation of representations across various scales. ResNet [21] introduces residual connections, enabling the training of networks with substantially more layers by addressing the vanishing gradients problem. MobileNets [22, 45] bring CNNs to mobile devices through lightweight, depth-wise separable convolutions. EfficientNet [48] introduces a scalable architectural design to CNNs with compound scaling. Although CNNs are pivotal for many vision applications, they generally lack the ability to capture long-range dependencies within images due to their inherent local receptive fields.

Recently, Vision Transformers (ViTs), pioneered by Dosovitskiy et al. [18], enabled the learning of long-range relationships among pixels using self-attention (SA). Since then, ViTs have been enhanced by integrating CNN features [49, 56], developing novel SA blocks [34, 49], and introducing new architectural designs [55, 58]. The Swin Transformer [34] incorporates a sliding window attention mechanism, while SegFormer [58] leverages Mix-FFN blocks for hierarchical structures. PVT [55] uses spatial reduction attention, refined in PVTv2 [56] with overlapping patch embedding and a linear-complexity attention layer. MaxViT [49] introduces a multi-axis self-attention to form a hierarchical CNN-transformer encoder. Although ViTs address the CNNs' limitation in capturing long-range pixel dependencies [21–23, 32, 35, 45–48], they face challenges in capturing the local spatial relationships among pixels. In this paper, we aim to overcome these limitations by introducing a new multi-scale cascaded attention decoder that refines feature maps and incorporates local attention using a multi-scale convolutional attention module.

2.2. Medical image segmentation

Medical image segmentation involves pixel-wise classification to identify various anatomical structures like lesions, tumors, or organs within different imaging modalities such as endoscopy, MRI, or CT scans [8]. U-shaped networks [7, 19, 24, 26, 37, 41, 44, 62] are particularly favored due to their simple but effective encoder-decoder design. UNet [44] pioneered this approach with its use of skip connections to fuse features at different resolution stages. UNet++ [62] evolves this design by incorporating nested encoder-decoder pathways with dense skip connections. Expanding on these ideas, UNet 3+ [24] introduces comprehensive skip pathways that facilitate full-scale feature integration. Further advancement comes with DC-UNet [37], which integrates a multi-resolution convolution scheme and residual paths into its skip connections.
The DeepLab series, including DeepLabv3 [10] and DeepLabv3+ [11], introduce atrous convolutions and spatial pyramid pooling to handle multi-scale information. SegNet [2] uses pooling indices to upsample feature maps, preserving the boundary details. nnU-Net [19] automatically configures hyperparameters based on the specific dataset characteristics, using standard 2D and 3D UNets. Collectively, these U-shaped models have become a benchmark for success in the domain of medical image segmentation.

Recently, vision transformers have emerged as a formidable force in medical image segmentation, harnessing the ability to capture pixel relationships at global scales [5, 8, 17, 42, 43, 52, 58, 61]. TransUNet [8] presents a novel blend of CNNs for local feature extraction and transformers for global context, enhancing both local and global feature capture. Swin-Unet [5] extends this by incorporating Swin Transformer blocks [34] into a U-shaped model for both the encoding and decoding processes. Building on these concepts, MERIT [43] introduces a multi-scale hierarchical transformer, which employs SA across different window sizes, thus enhancing the model's capacity to capture multi-scale features critical for medical image segmentation.

The integration of attention mechanisms has been investigated within CNNs [20, 41] and transformer-based systems [17] for enhancing medical image segmentation. PraNet [20] employs a reverse attention strategy for feature refinement. PolypPVT [17] leverages PVTv2 [56] as its backbone encoder and incorporates CBAM [57] within its decoding stages. CASCADE [42] presents a novel cascaded decoder, combining channel [23] and spatial [9] attention to refine features at multiple stages extracted from a transformer encoder, culminating in high-resolution segmentation outputs. While CASCADE achieves notable performance in segmenting medical images by integrating local and global insights from transformers, it is computationally inefficient due to the use of triple 3 × 3 convolution layers at each decoder stage. In addition, it uses single-scale convolutions during decoding. Our new proposal adopts multi-scale depth-wise convolutions to mitigate these constraints.

3. Methodology

In this section, we first introduce our new EMCAD decoder and then explain two transformer-based architectures (i.e., PVT-EMCAD-B0 and PVT-EMCAD-B2) incorporating our proposed decoder.

3.1. Efficient multi-scale convolutional attention decoding (EMCAD)

In this section, we introduce our efficient multi-scale convolutional attention decoder (EMCAD) to process the multi-stage features extracted from pretrained hierarchical vision encoders for high-resolution semantic segmentation. As shown in Figure 2(b), EMCAD consists of efficient multi-scale convolutional attention modules (MSCAMs) to robustly enhance the feature maps, large-kernel grouped attention gates (LGAGs) to refine feature maps by fusing them with the skip connections via a gated attention mechanism, efficient up-convolution blocks (EUCBs) for up-sampling followed by enhancement of the feature maps, and segmentation heads (SHs) to produce the segmentation outputs.

More specifically, we use four MSCAMs to refine the pyramid features (i.e., X1, X2, X3, X4 in Figure 2) extracted from the four stages of the encoder. After each MSCAM, we use an SH to produce a segmentation map for that stage. Subsequently, we upscale the refined feature maps using EUCBs and add them to the outputs from the corresponding LGAGs. Finally, we add the four segmentation maps to produce the final segmentation output. The different modules of our decoder are described next.

Figure 2. Hierarchical encoder with the newly proposed EMCAD decoder architecture. (a) CNN or transformer encoder with four hierarchical stages, (b) EMCAD decoder, (c) Efficient up-convolution block (EUCB), (d) Multi-scale convolutional attention module (MSCAM), (e) Multi-scale convolution block (MSCB), (f) Multi-scale (parallel) depth-wise convolution (MSDC), (g) Large-kernel grouped attention gate (LGAG), (h) Channel attention block (CAB), and (i) Spatial attention block (SAB). X1, X2, X3, and X4 are the features from the four stages of the hierarchical encoder. p1, p2, p3, and p4 are the output segmentation maps from the four stages of our decoder.

3.1.1 Large-kernel grouped attention gate (LGAG)

We introduce a new large-kernel grouped attention gate (LGAG) to progressively combine feature maps with attention coefficients, which are learned by the network to allow higher activation of relevant features and suppression of irrelevant ones. This process employs a gating signal derived from higher-level features to control the flow of information across different stages of the network, thus enhancing its precision for medical image segmentation. Unlike Attention UNet [41], which uses 1 × 1 convolutions to process the gating signal g (features from skip connections) and the input feature map x (upsampled features), in our q_att(·) function we process g and x by applying separate 3 × 3 group convolutions GC_g(·) and GC_x(·), respectively. These convolved features are then normalized using batch normalization (BN(·)) [27] and merged through element-wise addition. The resultant feature map is activated through a ReLU (R(·)) layer [39]. Afterward, we apply a 1 × 1 convolution (C(·)) followed by a BN(·) layer to get a single-channel feature map. We then pass this single-channel feature map through a Sigmoid (σ(·)) activation function to yield the attention coefficients. The output of this transformation is used to scale the input feature x through element-wise multiplication, producing the attention-gated feature LGAG(g, x). The LGAG(·) (Figure 2(g)) can be formulated as in Equations 1 and 2:

q_att(g, x) = R(BN(GC_g(g)) + BN(GC_x(x)))    (1)

LGAG(g, x) = x ⊛ σ(BN(C(q_att(g, x))))    (2)

Due to using 3 × 3 kernel group convolutions in q_att(·), our LGAG captures comparatively larger spatial contexts with less computational cost.
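For concreteness, a minimal PyTorch sketch of the LGAG computation in Equations 1 and 2 is given below. The class and argument names, the intermediate channel width, and the group count are illustrative assumptions on our part rather than the exact released implementation (see the repository linked in the abstract).

import torch
import torch.nn as nn

class LGAG(nn.Module):
    # Large-kernel grouped attention gate (sketch of Eqs. 1-2).
    # g: gating features from the skip connection; x: upsampled decoder features.
    def __init__(self, ch_g, ch_x, ch_int, groups=4):
        super().__init__()
        # 3x3 group convolutions GC_g and GC_x, each followed by batch normalization
        self.gc_g = nn.Sequential(
            nn.Conv2d(ch_g, ch_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(ch_int))
        self.gc_x = nn.Sequential(
            nn.Conv2d(ch_x, ch_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(ch_int))
        # 1x1 convolution C + BN + Sigmoid producing a single-channel attention map
        self.psi = nn.Sequential(
            nn.Conv2d(ch_int, 1, kernel_size=1, bias=False),
            nn.BatchNorm2d(1),
            nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g, x):
        q_att = self.relu(self.gc_g(g) + self.gc_x(x))  # Eq. 1
        return x * self.psi(q_att)                      # Eq. 2

# Example usage: gate = LGAG(ch_g=64, ch_x=64, ch_int=32); y = gate(g, x) with g, x of shape (B, 64, H, W).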
3.1.2 Multi-scale convolutional attention module (MSCAM)

We introduce an efficient multi-scale convolutional attention module to refine the feature maps. MSCAM consists of a channel attention block (CAB(·)) to put emphasis on the pertinent channels, a spatial attention block [9] (SAB(·)) to capture the local contextual information, and an efficient multi-scale convolution block (MSCB(·)) to enhance the feature maps while preserving contextual relationships. The MSCAM(·) (Figure 2(d)) is given in Equation 3:

MSCAM(x) = MSCB(SAB(CAB(x)))    (3)

where x is the input tensor. Due to using depth-wise convolutions at multiple scales, our MSCAM is more effective, with significantly lower computational cost, than the convolutional attention module (CAM) proposed in [42].

Multi-scale Convolution Block (MSCB): We introduce an efficient multi-scale convolution block to enhance the features generated by our cascaded expanding path. In our MSCB, we follow the design of the inverted residual block (IRB) of MobileNetV2 [45]. However, unlike the IRB, our MSCB performs depth-wise convolution at multiple scales and uses channel shuffle [60] to shuffle channels across groups. More specifically, in our MSCB, we first expand the number of channels (i.e., expansion factor = 2) using a point-wise (1×1) convolution layer PWC_1(·) followed by a batch normalization layer BN(·) and a ReLU6 [31] activation layer R6(·). We then use a multi-scale depth-wise convolution MSDC(·) to capture both multi-scale and multi-resolution contexts. As depth-wise convolution overlooks the relationships among channels, we use a channel shuffle operation (CS(·)) to incorporate them. Afterward, we use another point-wise convolution PWC_2(·) followed by a BN(·) to transform back to the original #channels, which also encodes dependencies among channels. The MSCB(·) (Figure 2(e)) is formulated as in Equation 4:

MSCB(x) = BN(PWC_2(CS(MSDC(R6(BN(PWC_1(x)))))))    (4)

where the parallel MSDC(·) (Figure 2(f)) for different kernel sizes (KS) can be formulated using Equation 5:

MSDC(x) = Σ_{ks ∈ KS} DWCB_ks(x)    (5)

where DWCB_ks(x) = R6(BN(DWC_ks(x))). Here, DWC_ks(·) is a depth-wise convolution with kernel size ks; BN(·) and R6(·) are batch normalization and ReLU6 activation, respectively. Additionally, our sequential MSDC(·) uses a recursively updated input x, where the input x is residually connected to the previous DWCB_ks(·) for better regularization, as in Equation 6:

x = x + DWCB_ks(x)    (6)
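To make the block concrete, the following PyTorch sketch implements Equations 4–6 under the stated design choices (expansion factor 2, kernel sizes [1, 3, 5], ReLU6, channel shuffle); the module names, the number of shuffle groups, and any other unlisted details are assumptions on our part, not the authors' exact code.

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # CS(.): shuffle channels across groups, as in ShuffleNet [60].
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class MSDC(nn.Module):
    # Multi-scale depth-wise convolution; the parallel variant sums DWCB_ks(x) over kernel sizes (Eq. 5).
    def __init__(self, channels, kernel_sizes=(1, 3, 5), parallel=True):
        super().__init__()
        self.parallel = parallel
        self.dwcbs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, ks, padding=ks // 2, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU6(inplace=True))
            for ks in kernel_sizes])

    def forward(self, x):
        if self.parallel:
            return sum(dwcb(x) for dwcb in self.dwcbs)   # Eq. 5
        for dwcb in self.dwcbs:                          # sequential variant, Eq. 6
            x = x + dwcb(x)
        return x

class MSCB(nn.Module):
    # Multi-scale convolution block (Eq. 4): expand -> MSDC -> channel shuffle -> project.
    def __init__(self, in_ch, out_ch, expansion=2, kernel_sizes=(1, 3, 5), shuffle_groups=2):
        super().__init__()
        mid = in_ch * expansion
        self.shuffle_groups = shuffle_groups
        self.pwc1 = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                  nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.msdc = MSDC(mid, kernel_sizes)
        self.pwc2 = nn.Sequential(nn.Conv2d(mid, out_ch, 1, bias=False),
                                  nn.BatchNorm2d(out_ch))

    def forward(self, x):
        x = self.msdc(self.pwc1(x))
        x = channel_shuffle(x, self.shuffle_groups)
        return self.pwc2(x)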
Channel Attention Block (CAB): We use a channel attention block to assign different levels of importance to each channel, thus emphasizing more relevant features while suppressing less useful ones. Basically, the CAB identifies which feature maps to focus on (and then refines them). Following [57], in CAB, we first apply adaptive maximum pooling (P_m(·)) and adaptive average pooling (P_a(·)) over the spatial dimensions (i.e., height and width) to extract the most significant feature of the entire feature map per channel. Then, for each pooled feature map, we reduce the number of channels by a ratio r = 1/16 using a point-wise convolution (C_1(·)) followed by a ReLU activation (R). Afterward, we recover the original channels using another point-wise convolution (C_2(·)). We then add both recovered feature maps and apply a Sigmoid (σ) activation to estimate the attention weights. Finally, we apply these weights to the input x using the Hadamard product (⊛). The CAB(·) (Figure 2(h)) is defined using Equation 7:

CAB(x) = σ(C_2(R(C_1(P_m(x)))) + C_2(R(C_1(P_a(x))))) ⊛ x    (7)

Spatial Attention Block (SAB): We use spatial attention to mimic the attentional processes of the human brain by focusing on specific parts of an input image. Basically, the SAB determines where to focus in a feature map and then enhances those features. This process enhances the model's ability to recognize and respond to relevant spatial features, which is crucial for image segmentation, where the context and location of objects significantly influence the output. In SAB, we first pool the maximum (Ch_max(·)) and average (Ch_avg(·)) values along the channel dimension to pay attention to local features. Then, we use a large-kernel (i.e., 7 × 7, as in [17]) convolution layer LKC(·) to enhance local contextual relationships among features. Afterward, we apply the Sigmoid activation (σ) to calculate the attention weights. Finally, we apply these weights to the input x using the Hadamard product (⊛) to attend to information in a more targeted way. The SAB(·) (Figure 2(i)) is defined using Equation 8:

SAB(x) = σ(LKC([Ch_max(x), Ch_avg(x)])) ⊛ x    (8)

3.1.3 Efficient up-convolution block (EUCB)

We use an efficient up-convolution block to progressively upsample the feature maps of the current stage to match the dimension and resolution of the feature maps from the next skip connection. The EUCB first uses an upsampling operation Up(·) with a scale factor of 2 to upscale the feature maps. Then, it enhances the upscaled feature maps by applying a 3 × 3 depth-wise convolution DWC(·) followed by a BN(·) and a ReLU(·) activation. Finally, a 1 × 1 convolution C_1×1(·) is used to reduce the #channels to match the next stage. The EUCB(·) (Figure 2(c)) is formulated as in Equation 9:

EUCB(x) = C_1×1(ReLU(BN(DWC(Up(x)))))    (9)

Due to using a depth-wise convolution instead of a standard 3 × 3 convolution, our EUCB is very efficient.

3.1.4 Segmentation head (SH)

We use segmentation heads to produce the segmentation outputs from the refined feature maps of the four stages of the decoder. The SH layer applies a 1 × 1 convolution Conv_1×1(·) to the refined feature maps having ch_i channels (ch_i is the #channels in the feature map of stage i) and produces an output with #channels equal to the #classes in the target dataset for multi-class segmentation, or 1 channel for binary segmentation. The SH(·) is formulated as in Equation 10:

SH(x) = Conv_1×1(x)    (10)
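A compact PyTorch sketch of the remaining building blocks follows: CAB (Eq. 7), SAB (Eq. 8), EUCB (Eq. 9), and SH (Eq. 10). It mirrors the CBAM-style formulation cited above with r = 1/16 and a 7 × 7 kernel; the shared two-layer bottleneck for both pooled maps and the nearest-neighbor upsampling mode are assumptions about details the text leaves open.

import torch
import torch.nn as nn

class CAB(nn.Module):
    # Channel attention block (Eq. 7) with reduction ratio r = 1/16.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool_max = nn.AdaptiveMaxPool2d(1)   # P_m
        self.pool_avg = nn.AdaptiveAvgPool2d(1)   # P_a
        self.mlp = nn.Sequential(                 # C_1 -> R -> C_2 (assumed shared for both pooled maps)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        w = torch.sigmoid(self.mlp(self.pool_max(x)) + self.mlp(self.pool_avg(x)))
        return w * x

class SAB(nn.Module):
    # Spatial attention block (Eq. 8) with a 7x7 large-kernel convolution.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.lkc = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        ch_max = torch.max(x, dim=1, keepdim=True).values   # Ch_max
        ch_avg = torch.mean(x, dim=1, keepdim=True)          # Ch_avg
        w = torch.sigmoid(self.lkc(torch.cat([ch_max, ch_avg], dim=1)))
        return w * x

class EUCB(nn.Module):
    # Efficient up-convolution block (Eq. 9): Up -> depth-wise conv -> BN -> ReLU -> 1x1 conv.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return self.block(x)

# Segmentation head (Eq. 10): a single 1x1 convolution to #classes (1 for binary segmentation).
def segmentation_head(in_ch, num_classes):
    return nn.Conv2d(in_ch, num_classes, 1)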
3.2. Overall architecture

To show the generalization, effectiveness, and ability to process multi-scale features for medical image segmentation, we integrate our EMCAD decoder with the tiny (PVTv2-B0) and standard (PVTv2-B2) networks of PVTv2 [56]. However, our decoder is adaptable and seamlessly compatible with other hierarchical backbone networks.

PVTv2 differs from conventional transformer patch embedding modules by applying convolutional operations for consistent spatial information capture. Using the PVTv2-b0 (Tiny) and PVTv2-b2 (Standard) encoders [56], we develop the PVT-EMCAD-B0 and PVT-EMCAD-B2 architectures. To adopt PVTv2, we first extract the features (X1, X2, X3, and X4) from its four layers and feed them (i.e., X4 in the upsample path and X3, X2, X1 in the skip connections) into our EMCAD decoder, as shown in Figure 2(a-b). EMCAD then processes them and produces four segmentation maps that correspond to the four stages of the encoder network.

3.3. Multi-stage loss and outputs aggregation

Our EMCAD decoder's four segmentation heads produce four prediction maps p1, p2, p3, and p4 across its stages.

Loss aggregation: We adopt a combinatorial approach to loss combination called MUTATION, inspired by the work of MERIT [43], for multi-class segmentation. This involves calculating the loss for all possible combinations of predictions derived from the 4 heads, totaling 2^4 − 1 = 15 unique predictions, and then summing these losses. We focus on minimizing this cumulative combinatorial loss during the training process. For binary segmentation, we optimize the additive loss of [42] with an additional term L_{p1+p2+p3+p4}, as in Equation 11:

L_total = α·L_{p1} + β·L_{p2} + γ·L_{p3} + ζ·L_{p4} + δ·L_{p1+p2+p3+p4}    (11)

where L_{p1}, L_{p2}, L_{p3}, and L_{p4} are the losses of the individual prediction maps, and α = β = γ = ζ = δ = 1.0 are the weights assigned to each loss.

Output segmentation maps aggregation: We consider the prediction map p4, from the last stage of our decoder, as the final segmentation map. Then, we obtain the final segmentation output by employing a Sigmoid function for binary or a Softmax function for multi-class segmentation.
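As an illustration of the MUTATION-style aggregation described above, the sketch below sums a base loss over all 2^4 − 1 = 15 non-empty combinations of the four prediction maps, and also shows the additive binary-segmentation variant of Equation 11 with all weights set to 1.0; base_loss is a placeholder for the task loss and is an assumption, not part of the original formulation.

from itertools import combinations
import torch

def mutation_loss(preds, target, base_loss):
    # Sum the loss over every non-empty combination of the 4 stage predictions (15 in total).
    total = 0.0
    for r in range(1, len(preds) + 1):
        for combo in combinations(preds, r):
            combined = torch.stack(combo, dim=0).sum(dim=0)  # add the selected prediction maps
            total = total + base_loss(combined, target)
    return total

# Binary case (Eq. 11) with alpha = beta = gamma = zeta = delta = 1.0:
def binary_multistage_loss(p1, p2, p3, p4, target, base_loss):
    return (base_loss(p1, target) + base_loss(p2, target) + base_loss(p3, target)
            + base_loss(p4, target) + base_loss(p1 + p2 + p3 + p4, target))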
4. Experiments

In this section, we present the details of our implementation, followed by a comparative analysis of our PVT-EMCAD-B0 and PVT-EMCAD-B2 against SOTA methods. Datasets and evaluation metrics are in Supplementary Section 7.

Methods #Params #FLOPs Clinic Colon ETIS Kvasir BKAI ISIC17 ISIC18 DSB18 EM BUSI Avg.
(Dataset groups — Polyp: Clinic, Colon, ETIS, Kvasir, BKAI; Skin Lesion: ISIC17, ISIC18; Cell: DSB18, EM; Breast Cancer: BUSI)
UNet [44] 34.53M 65.53G 92.11 83.95 76.85 82.87 85.05 83.07 86.67 92.23 95.46 74.04 85.23
UNet++ [62] 9.16M 34.65G 92.17 87.88 77.40 83.36 84.07 82.98 87.46 91.97 95.48 74.76 85.75
AttnUNet [41] 34.88M 66.64G 92.20 86.46 76.84 83.49 84.07 83.66 87.05 92.22 95.55 74.48 85.60
DeepLabv3+ [10] 39.76M 14.92G 93.24 91.92 90.73 89.06 89.74 83.84 88.64 92.14 94.96 76.81 89.11
PraNet [20] 32.55M 6.93G 91.71 89.16 83.84 84.82 85.56 83.03 88.56 89.89 92.37 75.14 86.41
CaraNet [38] 46.64M 11.48G 94.08 91.19 90.25 89.74 89.71 85.02 90.18 89.15 92.78 77.34 88.94
UACANet-L [30] 69.16M 31.51G 94.16 91.02 89.77 90.17 90.35 83.72 89.76 88.86 89.28 76.96 88.41
SSFormer-L [54] 66.22M 17.28G 94.18 92.11 90.16 91.47 91.14 85.28 90.25 92.03 94.95 78.76 90.03
PolypPVT [17] 25.11M 5.30G 94.13 91.53 89.93 91.56 91.17 85.56 90.36 90.69 94.40 79.35 89.87
TransUNet [8] 105.32M 38.52G 93.90 91.63 87.79 91.08 89.17 85.00 89.16 92.04 95.27 78.30 89.33
SwinUNet [5] 27.17M 6.2G 92.42 89.27 85.10 89.59 87.61 83.97 89.26 91.03 94.47 77.38 88.01
TransFuse [61] 143.74M 82.71G 93.62 90.35 86.91 90.24 87.47 84.89 89.62 90.85 94.35 79.36 88.77
UNeXt [50] 1.47M 0.57G 90.20 83.84 74.03 77.88 77.93 82.74 87.78 86.01 93.81 74.71 82.89
PVT-CASCADE [42] 34.12M 7.62G 94.53 91.60 91.03 92.05 92.14 85.50 90.41 92.35 95.42 79.21 90.42
PVT-EMCAD-B0 (Ours) 3.92M 0.84G 94.60 91.71 91.65 91.95 91.30 85.67 90.70 92.46 95.35 79.80 90.52
PVT-EMCAD-B2 (Ours) 26.76M 5.6G 95.21 92.31 92.29 92.75 92.96 85.95 90.96 92.74 95.53 80.25 91.10
Table 1. Results of binary medical image segmentation (i.e., polyp, skin lesion, cell, and breast cancer). We reproduce the results of SOTA
methods using their publicly available implementation with our train-val-test splits of 80:10:10. #FLOPs of all the methods are reported
for 256 × 256 inputs, except Swin-UNet (224 × 224). All results are averaged over five runs. Best results are shown in bold.
Architectures Avg. DICE↑ Avg. HD95↓ Avg. mIoU↑ Aorta GB KL KR Liver PC SP SM
UNet [44] 70.11 44.69 59.39 84.00 56.70 72.41 62.64 86.98 48.73 81.48 67.96
AttnUNet [41] 71.70 34.47 61.38 82.61 61.94 76.07 70.42 87.54 46.70 80.67 67.66
R50+UNet [8] 74.68 36.87 − 84.18 62.84 79.19 71.29 93.35 48.23 84.41 73.92
R50+AttnUNet [8] 75.57 36.97 − 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
SSFormer [54] 78.01 25.72 67.23 82.78 63.74 80.72 78.11 93.53 61.53 87.07 76.61
PolypPVT [17] 78.08 25.61 67.43 82.34 66.14 81.21 73.78 94.37 59.34 88.05 79.4
TransUNet [8] 77.61 26.9 67.32 86.56 60.43 80.54 78.53 94.33 58.47 87.06 75.00
SwinUNet [5] 77.58 27.32 66.88 81.76 65.95 82.32 79.22 93.73 53.81 88.04 75.79
MT-UNet [53] 78.59 26.59 − 87.92 64.99 81.47 77.29 93.06 59.46 87.75 76.81
MISSFormer [25] 81.96 18.20 − 86.99 68.65 85.21 82.00 94.41 65.67 91.92 80.81
PVT-CASCADE [42] 81.06 20.23 70.88 83.01 70.59 82.23 80.37 94.08 64.43 90.1 83.69
TransCASCADE [42] 82.68 17.34 73.48 86.63 68.48 87.66 84.56 94.43 65.33 90.79 83.52
PVT-EMCAD-B0 (Ours) 81.97 17.39 72.64 87.21 66.62 87.48 83.96 94.57 62.00 92.66 81.22
PVT-EMCAD-B2 (Ours) 83.63 15.68 74.65 88.14 68.87 88.08 84.10 95.26 68.51 92.17 83.92
Table 2. Results of abdomen organ segmentation on Synapse Multi-organ dataset. DICE scores are reported for individual organs. Results
of UNet, AttnUNet, PolypPVT, SSFormerPVT, TransUNet, and SwinUNet are taken from [42]. ↑ (↓) denotes the higher (lower) the better.
‘−’ means missing data from the source. EMCAD results are averaged over five runs. Best results are shown in bold.
4.1. Implementation details

We implement our network and conduct experiments using PyTorch 1.11.0 on a single NVIDIA RTX A6000 GPU with 48GB of memory. We utilize ImageNet [16] pre-trained PVTv2-b0 and PVTv2-b2 [56] as encoders. In the MSDC of our decoder, we set the multi-scale kernels to [1, 3, 5] through an ablation study. We use the parallel arrangement of depth-wise convolutions in all experiments. Our models are trained using the AdamW optimizer [36] with a learning rate and weight decay of 1e−4. We generally train for 200 epochs with a batch size of 16, except for Synapse multi-organ (300 epochs, batch size 6) and ACDC cardiac organ (400 epochs, batch size 12), saving the best model based on the DICE score. We resize images to 352×352 and use a multi-scale {0.75, 1.0, 1.25} training strategy with a gradient clip limit of 0.5 for ClinicDB [3], Kvasir [29], ColonDB [51], ETIS [51], BKAI [40], ISIC17 [15], and ISIC18 [15], while we resize images to 256 × 256 for BUSI [1], EM [6], and DSB18 [4]. For the Synapse and ACDC datasets, images are resized to 224 × 224, with random rotation and flipping augmentations, optimizing a combined Cross-entropy (0.3) and DICE (0.7) loss. For binary segmentation, we utilize the combined weighted BinaryCrossEntropy (BCE) and weighted IoU loss function.
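For reference, a minimal sketch of the combined loss used for Synapse and ACDC (0.3 × cross-entropy + 0.7 × DICE) is given below; the soft-DICE formulation and the smoothing constant are implementation assumptions not spelled out above.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, num_classes, eps=1e-5):
    # Soft DICE averaged over classes; target is a LongTensor of shape (B, H, W).
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = torch.sum(probs * one_hot, dims)
    card = torch.sum(probs + one_hot, dims)
    return 1.0 - (2.0 * inter / (card + eps)).mean()

def ce_dice_loss(logits, target, num_classes, w_ce=0.3, w_dice=0.7):
    # Combined 0.3 * cross-entropy + 0.7 * DICE loss used for multi-class segmentation.
    return w_ce * F.cross_entropy(logits, target) + w_dice * soft_dice_loss(logits, target, num_classes)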
4.2. Results

We compare our architectures (i.e., PVT-EMCAD-B0 and PVT-EMCAD-B2) with SOTA CNN- and transformer-based methods.
Cascaded LGAG MSCAM #FLOPs(G) @224 #FLOPs(G) @256 #Params(M) Avg. DICE
No No No 0 0 0 80.10±0.2
Yes No No 0.100 0.131 0.224 81.08±0.2
Yes Yes No 0.108 0.141 0.235 81.92±0.2
Yes No Yes 0.373 0.487 1.898 82.86±0.3
Yes Yes Yes 0.381 0.498 1.91 83.63±0.3
Table 4. Effect of the different components of EMCAD on the Synapse multi-organ dataset. Decoder #FLOPs are reported for 224 × 224 and 256 × 256 inputs.
Conv. kernels [1] [3] [5] [1, 3] [3, 3] [1, 3, 5] [3, 3, 3] [3, 5, 7] [1, 3, 5, 7] [1, 3, 5, 7, 9]
Synapse 82.43 82.79 82.74 82.98 82.81 83.63 82.92 83.11 83.57 83.34
ClinicDB 94.81 94.90 94.98 95.13 95.06 95.21 95.15 95.03 95.18 95.07
Table 5. Effect of multi-scale kernels in the depth-wise convolution of MSDC on ClinicDB and Synapse multi-organ datasets. We use the
PVTv2-b2 encoder for these experiments. All results are averaged over five runs. Best results are highlighted in bold.
Encoders Decoders #FLOPs(G) #Params(M) DICE (%)
PVTv2-B0 CASCADE 0.439 2.32 80.54
PVTv2-B0 EMCAD (Ours) 0.110 0.507 81.97
PVTv2-B2 CASCADE 1.93 9.27 82.78
PVTv2-B2 EMCAD (Ours) 0.381 1.91 83.63
Table 6. Comparison with the baseline decoder on the Synapse multi-organ dataset. We only report the #FLOPs (with an input resolution of 224 × 224) and the #parameters of the decoders. All the results are averaged over five runs. Best results are shown in bold.

Our PVT-EMCAD-B2 also leads in individual organ segmentation, significantly outperforming SOTA methods on six of eight organs.

4.2.3 Results of cardiac organ segmentation

Table 3 shows the DICE scores of our PVT-EMCAD-B2 and PVT-EMCAD-B0, along with other SOTA methods, on the MRI images of the ACDC dataset for cardiac organ segmentation. Our PVT-EMCAD-B2 achieves the highest average DICE score of 92.12%, improving by about 0.27% over Cascaded MERIT, even though our network has a significantly lower computational cost. Besides, PVT-EMCAD-B2 has better DICE scores in all three organ segmentations.

5. Ablation Studies

In this section, we conduct ablation studies to explore different aspects of our architectures and the experimental framework. More ablations are in Supplementary Section 8.

5.1. Effect of different components of EMCAD

We conduct a set of experiments on the Synapse multi-organ dataset to understand the effect of the different components of our EMCAD decoder. We start with only the encoder and add different modules, such as the cascaded structure, LGAG, and MSCAM, to understand their effect. Table 4 shows that the cascaded structure of the decoder helps to improve performance over the non-cascaded one. The incorporation of LGAG and MSCAM improves performance; however, MSCAM proves to be more effective. When both the LGAG and MSCAM modules are used together, the decoder produces the best DICE score of 83.63%. It is also evident that there is about a 3.53% improvement in the DICE score with an additional 0.381G FLOPs and 1.91M parameters.

5.2. Effect of multi-scale kernels in MSCAM

We have conducted another set of experiments on the Synapse multi-organ and ClinicDB datasets to understand the effect of the different multi-scale kernels used for the depth-wise convolutions in MSDC. Table 5 reports these results, which show that performance improves from the 1×1 to the 3×3 kernel. When the 1 × 1 kernel is used together with 3 × 3, it improves more than when using either alone. However, when two 3 × 3 kernels are used together, performance drops. The incorporation of a 5 × 5 kernel with the 1 × 1 and 3 × 3 kernels further improves the performance and achieves the best results on both the Synapse multi-organ and ClinicDB datasets. If we add additional larger kernels (e.g., 7×7, 9×9), the performance on both datasets drops. Based on these empirical observations, we choose the [1, 3, 5] kernels in all our experiments.

5.3. Comparison with the baseline decoder

In Table 6, we report the experimental results along with the computational complexity of our EMCAD decoder and a baseline decoder, namely CASCADE. From Table 6, we can see that our EMCAD decoder with PVTv2-B2 requires 80.3% fewer FLOPs and 79.4% fewer parameters to outperform (by 0.85%) the respective CASCADE decoder. Similarly, our EMCAD decoder with PVTv2-B0 achieves a 1.43% better DICE score than the CASCADE decoder with 78.1% fewer parameters and 74.9% fewer FLOPs.

6. Conclusions

In this paper, we have presented EMCAD, a new and efficient multi-scale convolutional attention decoder designed for multi-stage feature aggregation and refinement in medical image segmentation. EMCAD employs a multi-scale depth-wise convolution block, which is key for capturing diverse scale information within feature maps, a critical factor for precision in medical image segmentation. This design choice, using depth-wise convolutions instead of standard 3 × 3 convolution blocks, makes EMCAD notably efficient. Our experiments reveal that EMCAD surpasses the recent CASCADE decoder in DICE scores with 79.4% fewer parameters and 80.3% fewer FLOPs. Our extensive experiments also confirm EMCAD's superior performance compared to SOTA methods across 12 public datasets covering six different 2D medical image segmentation tasks. EMCAD's compatibility with smaller encoders makes it an excellent fit for point-of-care applications while maintaining high performance. We anticipate that our EMCAD decoder will be a valuable asset in enhancing a variety of medical image segmentation and semantic segmentation tasks.

Acknowledgements: This work is supported in part by the NSF grant CNS 2007284, and in part by the iMAGiNE Consortium (https://imagine.utexas.edu/).
References

[1] Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in Brief, 28:104863, 2020. 6, 1
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(12):2481–2495, 2017. 3
[3] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph., 43:99–111, 2015. 6, 1
[4] Juan C Caicedo, Allen Goodman, Kyle W Karhohs, Beth A Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, et al. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods, 16(12):1247–1253, 2019. 6, 7, 1
[5] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021. 1, 3, 6, 7
[6] Albert Cardona, Stephan Saalfeld, Stephan Preibisch, Benjamin Schmid, Anchi Cheng, Jim Pulokas, Pavel Tomancak, and Volker Hartenstein. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLoS Biology, 8(10):e1000502, 2010. 6, 7, 1
[7] Gongping Chen, Lei Li, Yu Dai, Jianxun Zhang, and Moi Hoon Yap. Aau-net: an adaptive attention u-net for breast lesions segmentation in ultrasound images. IEEE Trans. Med. Imaging, 2022. 2
[8] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. 1, 2, 3, 6, 7
[9] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5659–5667, 2017. 3, 4
[10] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2017. 2, 6
[11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Eur. Conf. Comput. Vis., pages 801–818, 2018. 2
[12] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Eur. Conf. Comput. Vis., pages 234–250, 2018. 1
[13] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021. 1
[14] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019. 1
[15] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In IEEE Int. Symp. Biomed. Imaging, pages 168–172. IEEE, 2018. 6, 1
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pages 248–255. IEEE, 2009. 6
[17] Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, and Ling Shao. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021. 1, 3, 5, 6
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 2
[19] Isensee et al. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021. 1, 2, 3
[20] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 263–273. Springer, 2020. 1, 3, 6
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 2
[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2
[23] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7132–7141, 2018. 2, 3
[24] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, and Jian Wu. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP, pages 1055–1059. IEEE, 2020. 1, 2
[25] Xiaohong Huang, Zhifang Deng, Dandan Li, and Xueguang Yuan. Missformer: An effective medical image segmentation transformer. arXiv preprint arXiv:2109.07162, 2021. 6, 7
[26] Nabil Ibtehaz and Daisuke Kihara. Acc-unet: A completely convolutional unet model for the 2020s. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 692–702. Springer, 2023. 2
[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Int. Conf. Mach. Learn., pages 448–456. PMLR, 2015. 3
[28] Md Amirul Islam, Sen Jia, and Neil DB Bruce. How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248, 2020. 1
[29] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. Kvasir-seg: A segmented polyp dataset. In Int. Conf. Multimedia Model., pages 451–462. Springer, 2020. 6, 1
[30] Taehun Kim, Hyemin Lee, and Daijin Kim. Uacanet: Uncertainty augmented context attention for polyp segmentation. In ACM Int. Conf. Multimedia, pages 2167–2175, 2021. 6
[31] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010. 4
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst., 25, 2012. 2
[33] Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, and Li Yu. Convformer: Plug-and-play cnn-style transformers for improving medical image segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 642–651. Springer, 2023. 1
[34] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. Comput. Vis., pages 10012–10022, 2021. 1, 2, 3
[35] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11976–11986, 2022. 2
[36] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6
[37] Ange Lou, Shuyue Guan, and Murray Loew. Dc-unet: rethinking the u-net architecture with dual channel efficient cnn for medical image segmentation. In Med. Imaging 2021: Image Process., pages 758–768. SPIE, 2021. 1, 2
[38] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray H Loew. Caranet: context axial reverse attention network for segmentation of small medical objects. In Med. Imaging 2022: Image Process., pages 81–92. SPIE, 2022. 6
[39] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Int. Conf. Mach. Learn., pages 807–814, 2010. 3
[40] Phan Ngoc Lan, Nguyen Sy An, Dao Viet Hang, Dao Van Long, Tran Quang Trung, Nguyen Thi Thuy, and Dinh Viet Sang. Neounet: Towards accurate colon polyp segmentation and neoplasm detection. In Adv. Vis. Comput. – Int. Symp., pages 15–28. Springer, 2021. 6, 1
[41] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018. 1, 2, 3, 6
[42] Md Mostafijur Rahman and Radu Marculescu. Medical image segmentation via cascaded attention decoding. In IEEE/CVF Winter Conf. Appl. Comput. Vis., pages 6222–6231, 2023. 1, 3, 4, 5, 6, 7
[43] Md Mostafijur Rahman and Radu Marculescu. Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation. In Med. Imaging Deep Learn., 2023. 1, 3, 5, 7
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 234–241. Springer, 2015. 1, 2, 6
[45] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4510–4520, 2018. 2, 4
[46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2
[47] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015. 2
[48] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Int. Conf. Mach. Learn., pages 6105–6114. PMLR, 2019. 2
[49] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In Eur. Conf. Comput. Vis., pages 459–479. Springer, 2022. 1, 2
[50] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 23–33. Springer, 2022. 6, 1
[51] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng., 2017, 2017. 6, 1
[52] Haonan Wang, Peng Cao, Jiaqi Wang, and Osmar R Zaiane. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. In AAAI, pages 2441–2449, 2022. 1, 3
[53] Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, and Ruofeng Tong. Mixed transformer u-net for medical image segmentation. In ICASSP, pages 2390–2394. IEEE, 2022. 6, 7
[54] Jinfeng Wang, Qiming Huang, Feilong Tang, Jia Meng, Jionglong Su, and Sifan Song. Stepwise feature fusion: Local guides global. arXiv preprint arXiv:2203.03635, 2022. 1, 6
[55] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyra-
mid vision transformer: A versatile backbone for dense pre-
diction without convolutions. In Int. Conf. Comput. Vis.,
pages 568–578, 2021. 1, 2
[56] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt
v2: Improved baselines with pyramid vision transformer.
Comput. Vis. Media, 8(3):415–424, 2022. 1, 2, 3, 5, 6
[57] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So
Kweon. Cbam: Convolutional block attention module. In
Eur. Conf. Comput. Vis., pages 3–19, 2018. 1, 3, 4
[58] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
ficient design for semantic segmentation with transformers.
Adv. Neural Inform. Process. Syst., 34:12077–12090, 2021.
2, 3
[59] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou,
Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer
is actually what you need for vision. In IEEE Conf. Comput.
Vis. Pattern Recog., pages 10819–10829, 2022. 1
[60] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.
Shufflenet: An extremely efficient convolutional neural net-
work for mobile devices. In IEEE Conf. Comput. Vis. Pattern
Recog., pages 6848–6856, 2018. 4
[61] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fus-
ing transformers and cnns for medical image segmentation.
In Int. Conf. Med. Image Comput. Comput. Assist. Interv.,
pages 14–24. Springer, 2021. 1, 3, 6
[62] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima
Tajbakhsh, and Jianming Liang. Unet++: A nested u-net ar-
chitecture for medical image segmentation. In Deep Learn.
Med. Image Anal. Multimodal Learn. Clin. Decis. Support,
pages 3–11. Springer, 2018. 1, 2, 6
EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
Supplementary Material
7. Experimental Details

This section extends Section 4 of the main paper by describing the datasets and evaluation metrics, followed by additional experimental results.

7.1. Datasets

To evaluate the performance of our EMCAD decoder, we carry out experiments across 12 datasets that belong to six medical image segmentation tasks, as described next.

Polyp segmentation: We use five polyp segmentation datasets: Kvasir [29] (1,000 images), ClinicDB [3] (612 images), ColonDB [51] (379 images), ETIS [51] (196 images), and BKAI [40] (1,000 images). These datasets contain images from different imaging centers/clinics, having greater diversity in image nature as well as in the size and shape of polyps.

Abdomen organ segmentation: We use the Synapse multi-organ dataset (https://www.synapse.org/#!Synapse:syn3193805/wiki/217789) for abdomen organ segmentation. This dataset contains 30 abdominal CT scans with 3,779 axial contrast-enhanced slices. Each CT scan has 85–198 slices of 512 × 512 pixels. Following TransUNet [8], we use the same 18 scans for training (2,212 axial slices) and 12 scans for validation. We segment only eight abdominal organs, namely the aorta, gallbladder (GB), left kidney (KL), right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM).

Cardiac organ segmentation: We use the ACDC dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/) for cardiac organ segmentation. It contains 100 cardiac MRI scans having three sub-organs, namely the right ventricle (RV), myocardium (Myo), and left ventricle (LV). Following TransUNet [8], we use 70 cases (1,930 axial slices) for training, 10 for validation, and 20 for testing.

Skin lesion segmentation: We use ISIC17 [15] (2,000 training, 150 validation, and 600 testing images) and ISIC18 [14] (2,594 images) for skin lesion segmentation.

Breast cancer segmentation: We use the BUSI [1] dataset for breast cancer segmentation. Following [50], we use 647 (437 benign and 210 malignant) images from this dataset.

Cell nuclei/structure segmentation: We use the DSB18 [4] (670 images) and EM [6] (30 images) datasets of biological imaging for cell nuclei/structure segmentation.

We use a train-val-test split of 80:10:10 in the ClinicDB, Kvasir, ColonDB, ETIS, BKAI, ISIC18, DSB18, EM, and BUSI datasets. For ISIC17, we use the official train-val-test sets provided by the competition organizer.

7.2. Evaluation metrics

We use the DICE score to evaluate performance on all the datasets. However, we also use the 95% Hausdorff Distance (HD95) and mIoU as additional evaluation metrics for Synapse multi-organ segmentation. The DICE score DSC(Y, P), IoU(Y, P), and HD95 distance D_H(Y, P) are calculated using Equations 12, 13, and 14, respectively:

DSC(Y, P) = (2 × |Y ∩ P|) / (|Y| + |P|) × 100    (12)

IoU(Y, P) = |Y ∩ P| / |Y ∪ P| × 100    (13)

D_H(Y, P) = max{ max_{y ∈ Y} min_{p ∈ P} d(y, p), max_{p ∈ P} min_{y ∈ Y} d(y, p) }    (14)

where Y and P are the ground truth and predicted segmentation maps, respectively.
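The DICE and IoU metrics of Equations 12 and 13 can be computed on binary masks as sketched below (HD95 is typically computed with an external library and is therefore omitted); this is an illustrative helper, not the exact evaluation script.

import torch

def dice_score(y: torch.Tensor, p: torch.Tensor, eps: float = 1e-7) -> float:
    # Eq. 12: DSC(Y, P) = 2|Y ∩ P| / (|Y| + |P|) * 100, for binary masks.
    y, p = y.bool(), p.bool()
    inter = (y & p).sum().item()
    return 2.0 * inter / (y.sum().item() + p.sum().item() + eps) * 100.0

def iou_score(y: torch.Tensor, p: torch.Tensor, eps: float = 1e-7) -> float:
    # Eq. 13: IoU(Y, P) = |Y ∩ P| / |Y ∪ P| * 100.
    y, p = y.bool(), p.bool()
    inter = (y & p).sum().item()
    union = (y | p).sum().item()
    return inter / (union + eps) * 100.0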
7.3. Qualitative results

This subsection describes the qualitative results of different methods, including our EMCAD. From the qualitative results on the Synapse multi-organ dataset in Figure 4, we can see that most of the methods face challenges segmenting the left kidney (orange) and part of the pancreas (pink). However, our PVT-EMCAD-B0 (Figure 4g) and PVT-EMCAD-B2 (Figure 4h) can segment those organs more accurately (see the red rectangular box) with significantly lower computational costs. Similarly, the qualitative results of polyp segmentation on a representative image from the ClinicDB dataset in Figure 5 show that the predicted segmentation outputs of our PVT-EMCAD-B0 (Figure 5p) and PVT-EMCAD-B2 (Figure 5q) have strong overlaps with the ground-truth mask (Figure 5r), while existing SOTA methods exhibit false segmentation of the polyp (see the red rectangular box).

8. Additional Ablation Study

This section further elaborates on Section 5 by detailing five additional ablation studies related to our architectural design and experimental setup.

8.1. Parallel vs. sequential depth-wise convolution

We have conducted another set of experiments to decide whether to use the multi-scale depth-wise convolutions in parallel or sequentially. Table 7 presents the results of these experiments, which show that there is no significant impact of using the parallel or the sequential arrangement.
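A self-contained toy comparison of the two arrangements studied here is sketched below; the channel count, input size, and kernel sizes are arbitrary illustrative choices rather than values tied to a specific experiment.

import torch
import torch.nn as nn

def dwcb(channels, ks):
    # Depth-wise conv + BN + ReLU6, as in the DWCB definition of the MSDC.
    return nn.Sequential(
        nn.Conv2d(channels, channels, ks, padding=ks // 2, groups=channels, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU6(inplace=True))

blocks = nn.ModuleList([dwcb(32, ks) for ks in (1, 3, 5)])
x = torch.randn(1, 32, 56, 56)

parallel_out = sum(blk(x) for blk in blocks)  # parallel MSDC (Eq. 5)
seq = x
for blk in blocks:                            # sequential MSDC with residual updates (Eq. 6)
    seq = seq + blk(seq)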
Figure 4. Qualitative results of multi-organ segmentation on Synapse Multi-organ dataset. The red rectangular box highlights incorrectly
segmented organs by SOTA methods.
Figure 5. Qualitative results of polyp segmentation. The red rectangular box highlights incorrectly segmented polyps by SOTA methods.
Architectures Depth-wise convolutions Synapse ClinicDB
PVT-EMCAD-B0 Sequential 81.82±0.3 94.57±0.2
PVT-EMCAD-B0 Parallel 81.97±0.2 94.60±0.2
PVT-EMCAD-B2 Sequential 83.54±0.3 95.15±0.3
PVT-EMCAD-B2 Parallel 83.63±0.2 95.21±0.2
Table 7. Results of parallel and sequential depth-wise convolution in MSDC on the Synapse multi-organ and ClinicDB datasets. All results are averaged over five runs. Best results are in bold.

Architectures Module Params(K) FLOPs(M) Synapse
PVT-EMCAD-B0 AG 31.62 15.91 81.74
PVT-EMCAD-B0 LGAG 5.51 5.24 81.97
PVT-EMCAD-B2 AG 124.68 61.68 83.51
PVT-EMCAD-B2 LGAG 11.01 10.47 83.63
Table 8. LGAG vs. AG (Attention gate) [41] on the Synapse multi-organ dataset. The total #Params and #FLOPs of the three AG/LGAGs in our decoder are reported for an input resolution of 256 × 256. All results are averaged over five runs. Best results are in bold.
Architectures Pretrain Avg. DICE↑ Avg. HD95↓ Avg. mIoU↑ Aorta GB KL KR Liver PC SP SM
PVT-EMCAD-B0 No 77.47 19.93 66.72 81.96 69.41 83.88 74.82 93.45 54.41 88.97 72.85
PVT-EMCAD-B0 Yes 81.97 17.39 72.64 87.21 66.62 87.48 83.96 94.57 62.00 92.66 81.22
PVT-EMCAD-B2 No 80.18 18.83 70.21 85.98 68.10 84.62 79.93 93.96 61.61 90.99 76.23
PVT-EMCAD-B2 Yes 83.63 15.68 74.65 88.14 68.87 88.08 84.10 95.26 68.51 92.17 83.92
Table 9. Effect of transfer learning from ImageNet pre-trained weights on Synapse multi-organ dataset. ↑ (↓) denotes the higher (lower)
the better. All results are averaged over five runs. Best results are in bold.
DS EM BUSI Clinic Kvasir ISIC18 Synapse ACDC
No 95.74 79.64 94.96 92.51 90.74 82.03 92.08
Yes 95.53 80.25 95.21 92.75 90.96 83.63 92.12
Table 10. Effect of deep supervision (DS). PVT-EMCAD-B2 with DS achieves slightly better DICE scores in 6 out of 7 datasets.

Architectures Resolutions FLOPs(G) DICE
PVT-EMCAD-B0 224 × 224 0.64 81.97
PVT-EMCAD-B0 256 × 256 0.84 82.63
PVT-EMCAD-B0 384 × 384 1.89 84.81
PVT-EMCAD-B0 512 × 512 3.36 85.52
PVT-EMCAD-B2 224 × 224 4.29 83.63
PVT-EMCAD-B2 256 × 256 5.60 84.47
PVT-EMCAD-B2 384 × 384 12.59 85.78
PVT-EMCAD-B2 512 × 512 22.39 86.53
Table 11. Effect of input resolutions on the Synapse multi-organ dataset. All results are averaged over five runs.

8.2. LGAG vs. attention gate (AG)

As Table 8 shows, our LGAG achieves large reductions in #Params (82.57% for PVT-EMCAD-B0 and 91.17% for PVT-EMCAD-B2) and #FLOPs (67.06% for PVT-EMCAD-B0 and 83.03% for PVT-EMCAD-B2) compared to AG. The reduction in #Params and #FLOPs is bigger for the larger models. Therefore, our LGAG demonstrates improved scalability with models that have a greater number of channels, yielding enhanced DICE scores.

8.3. Effect of transfer learning from ImageNet pre-trained weights

We conduct experiments on the Synapse multi-organ dataset to show the effect of transfer learning from an ImageNet pre-trained encoder. Table 9 reports the results of these experiments, which show that transfer learning from ImageNet pre-trained PVTv2 encoders significantly boosts the performance. Specifically, for PVT-EMCAD-B0, the DICE, mIoU, and HD95 scores are improved by 4.5%, 5.92%, and 2.54, respectively. Likewise, for PVT-EMCAD-B2, the DICE, mIoU, and HD95 scores are improved by 3.45%, 4.44%, and 3.15, respectively. We can also conclude that transfer learning has a comparatively greater impact on the smaller PVT-EMCAD-B0 model than on the larger PVT-EMCAD-B2 model. For individual organs, transfer learning significantly boosts the performance of all organ segmentations, except the gallbladder (GB).

8.4. Effect of deep supervision

We have conducted an ablation study that drops the deep supervision (DS). The results of our PVT-EMCAD-B2 on seven datasets are given in Table 10. Our PVT-EMCAD-B2 with DS achieves slightly better DICE scores in six out of seven datasets. Among all the datasets, DS has the largest impact on the Synapse multi-organ dataset.

8.5. Effect of input resolutions

Table 11 presents the results of our PVT-EMCAD-B0 and PVT-EMCAD-B2 architectures with different input resolutions. From this table, it is evident that the DICE scores improve with increasing input resolution. However, these improvements in DICE score come with an increase in #FLOPs. Our PVT-EMCAD-B0 achieves an 85.52% DICE score with only 3.36G FLOPs when using 512 × 512 inputs. On the other hand, our PVT-EMCAD-B2 achieves the best DICE score (86.53%) with 22.39G FLOPs when using 512 × 512 inputs. We also observe that our PVT-EMCAD-B2 with 5.60G FLOPs when using 256 × 256 inputs shows a 1.05% lower DICE score than PVT-EMCAD-B0 with 3.36G FLOPs. Therefore, we can conclude that PVT-EMCAD-B0 is more suitable for larger input resolutions than PVT-EMCAD-B2.