LET-Net: locally enhanced transformer network for medical image segmentation
https://doi.org/10.1007/s00530-023-01165-z
Received: 13 April 2023 / Accepted: 12 August 2023 / Published online: 5 September 2023
© The Author(s) 2023
Abstract
Medical image segmentation has attracted increasing attention due to its practical clinical requirements. However, the
prevalence of small targets still poses great challenges for accurate segmentation. In this paper, we propose a novel locally
enhanced transformer network (LET-Net) that combines the strengths of transformer and convolution to address this issue.
LET-Net utilizes a pyramid vision transformer as its encoder and is further equipped with two novel modules to learn more
powerful feature representation. Specifically, we design a feature-aligned local enhancement module, which encourages
discriminative local feature learning on the condition of adjacent-level feature alignment. Moreover, to effectively recover
high-resolution spatial information, we apply a newly designed progressive local-induced decoder. This decoder contains
three cascaded local reconstruction and refinement modules that dynamically guide the upsampling of high-level features
by their adaptive reconstruction kernels and further enhance feature representation through a split-attention mechanism.
Additionally, to address the severe pixel imbalance for small targets, we design a mutual information loss that maximizes
task-relevant information while eliminating task-irrelevant noises. Experimental results demonstrate that our LET-Net pro-
vides more effective support for small target segmentation and achieves state-of-the-art performance in polyp and breast
lesion segmentation tasks.
Keywords Medical image segmentation · Feature alignment · Local-induced decoder · Mutual information · Transformer
Fig. 1 An illustration of small lesion samples and size distributions for different medical image datasets, including polyp coloscopy images and breast ultrasound images. Ground truth for each image is represented by a green line. In a histogram, the horizontal axis represents the proportion of the entire image occupied by the lesion area, while the vertical axis indicates the proportion of samples with a particular lesion size relative to the total sample
downsampling operations and are hard to recover. Second, there is a significant class imbalance problem in the number of pixels between the foreground and background, leading to a biased network and suboptimal performance. Nevertheless, the ability of computer-aided diagnosis to identify small objects is highly desired, as early detection and diagnosis of small lesions are crucial for successful cancer prevention and treatment.

Nowadays, the development of medical image segmentation has greatly advanced due to the efficient feature extraction ability of convolutional neural networks (CNNs) [5–7]. Modern CNN-based methods typically utilize a U-shaped encoder–decoder structure, where the encoder extracts semantic information and the decoder restores resolution to facilitate segmentation. Additionally, skip connections are employed to compensate for detailed information. Advanced U-shaped works focus on designing novel encoding blocks [8–10] to enhance feature representation ability, adopting attention mechanisms to further recalibrate features [11, 12], extracting and fusing reasonable multi-scale context information to improve accuracy [13–15], and so on. Despite their promising performance, these methods share a common flaw, i.e., lacking the global contexts essential for better recognition of target objects.

Due to their superior ability to model global contexts, Transformer-based architectures have become popular in segmentation tasks while achieving promising performance. Recent works [16–18] utilize vision transformers (ViT) as a backbone to incorporate global information. Despite their good performance, ViT produces single-scale low-resolution features and has a very high computational cost, which hampers its performance in dense prediction. In contrast to ViT, the pyramid vision transformer (PVT) [19] inherits the advantages of both CNN and Transformer and produces hierarchical multi-scale features that are more favorable for segmentation. Unfortunately, Transformer-based methods destroy part of the local features when modeling global contexts, which may result in imprecise predictions for small objects.

In the field of small target segmentation, a couple of approaches have been devised to improve the sensitivity to small objects. They overcome the segmentation difficulties brought by small objects from multiple aspects, such as exploiting the complementarity between low-level spatial details and high-level semantics [20], multi-scale feature learning [21, 22], and augmenting spatial dimension strategies [23–25]. Although their skip connections can compensate for detail loss to some extent and even eliminate some irrelevant noise when additionally equipped with attention mechanisms, these methods are still insufficient, as some local contexts may be overwhelmed by dominant semantics due to feature misalignment issues. In addition, another important factor that has been overlooked is how to effectively restore the spatial information of downsampled features. Most methods adopt common upsampling operations, such as nearest-neighbor interpolation and bilinear interpolation, which may still lack the local spatial awareness needed to handle small object positions. As a result, they are not compatible with the recovery of target objects and produce suboptimal segmentation performance.

In this paper, we propose a novel locally enhanced transformer network (LET-Net) for medical image segmentation. By leveraging the merits of both Transformer and CNN, our LET-Net can accurately segment small objects and precisely sharpen local details. First, the PVT-based encoder produces hierarchical multi-scale features where low-level features tend to retain local details, while high-level features provide strong global representations. Second, to further emphasize detailed local contexts, we propose a feature-aligned local enhancement (FLE) module, which can learn discriminative local cues from adjacent-level features on the condition of feature alignment and then utilize a local enhancement block equipped with local receptive fields to further recalibrate features. Third, we design a progressive
local-induced decoder that contains cascaded local reconstruction and refinement (LRR) modules to achieve effective spatial recovery of high-level features under the adaptive guidance of reconstruction kernels and the optimization of a split-attention mechanism. Moreover, to alleviate the class imbalance between foreground and background, we design a mutual information loss based on an information-theoretic objective, which can impose task-relevant restrictions while reducing task-irrelevant noises.

The contributions of this paper mainly include:

(1) We put forward a novel LET-Net, which combines the strengths of Transformer and CNN for accurate medical image segmentation.
(2) We propose two novel modules, FLE and LRR, to enhance the sensitivity to small objects. FLE can extract discriminative local cues under the alignment of adjacent-level features, while LRR enables effective spatial recovery by guiding the upsampling of high-level features via its adaptive reconstruction kernels and recalibrating features through a split-attention mechanism.
(3) To mitigate the class imbalance caused by small targets, we design a mutual information loss, which enables our model to extract task-relevant information while reducing task-irrelevant noises.
(4) By evaluating our LET-Net on challenging colorectal polyp segmentation and ultrasound breast segmentation, we demonstrate its state-of-the-art segmentation ability and strong generalization capability.

2 Related work

2.1 Medical image segmentation

With the great development of deep learning, especially convolutional neural networks (CNNs), various CNN-based methods, such as U-Net [7], have significantly improved the performance of medical image segmentation. These approaches possess the popular U-shaped encoder–decoder structure. To further assist precise segmentation, a battery of innovative improvements based on the encoder–decoder architecture has emerged [26–30]. One direction is to design a new module for enhancing the encoder or decoder ability. For instance, Dai et al. [26] designed the Ms RED network, which, respectively, employs a multi-scale residual encoding fusion module (MsR-EFM) and a multi-scale residual decoding module (MsR-DFM) in the encoder and decoder stages to improve skin lesion segmentation. In the work [27], a selective receptive field module (SRFM) was designed to obtain suitable sizes of receptive fields, thereby boosting breast mass segmentation. Another direction is optimizing the skip connection to facilitate the recovery of spatial information. UNeXt [28] proposed an encoder–decoder structure involving convolutional stages and tokenized MLP stages, achieving better segmentation performance while also improving the inference speed. However, these methods directly fuse unaligned features from different levels, which may hamper accuracy, especially for small objects. In this paper, we propose a powerful feature-aligned local enhancement module, which ensures that feature maps at adjacent levels can be well aligned and then explores substantial local cues to optimally enhance the discriminative details.

2.2 Feature alignment

Feature alignment has drawn much attention and is now an active research topic in computer vision. Numerous researchers have devoted considerable effort to addressing this challenge [6, 31–37]. For instance, SegNet [6] utilized max-pooling indices computed in the encoder to perform an upsampling operation in the corresponding decoder stage. Mazzini et al. [32] proposed a guided upsampling module (GUM) that generates learnable guided offsets to enhance the upsampling operation. IndexNet [33] built a novel index-guided encoder–decoder structure in which pooling and upsampling operators are guided by self-learned indices. AlignSeg [34] learned 2D transformation offsets by a simple learnable interpolation strategy to alleviate feature misalignment. Huang et al. [35] designed the FaPN framework consisting of feature alignment and feature selection modules, achieving substantial and consistent performance improvements on dense prediction tasks. SFNet [31] presented a flow alignment module that effectively broadcasts high-level semantic features to high-resolution detail features by its semantic flow. Our method shares a similar aspect with the work [31], in which efficient spatial alignment is achieved by learning offsets. However, unlike these methods, we further enhance discriminative representations by subtraction under the premise of aligning low-resolution and high-resolution features, which facilitates excavating imperceptible local cues related to small objects.

2.3 Attention mechanism

Attention-based algorithms have been developed to assist in segmentation. In general, attention mechanisms can be categorized into channel attention, spatial attention, and self-attention according to different focus perspectives. Inspired by the success of SENet [38], various networks [39–41] have incorporated the squeeze-and-excitation (SE) module to recalibrate features by modeling channel relationships, thereby improving segmentation performance. Fu et al. [42] proposed a dual attention network (DANet), which
[Fig. 2: Overall architecture of LET-Net. A Transformer encoder (spatial-reduction multi-head attention, feed-forward, and norm layers repeated ×N, with positional embedding) feeds three FLE (feature-aligned local enhancement) modules producing F1–F3; the progressive local-induced decoder applies three cascaded LRR (local reconstruction and refinement) modules producing D1–D3 and the final prediction, supervised by LPPA and LVSD.]
alignment, should be fully considered in this procedure to prevent local contexts from being overshadowed by global contexts. To this end, we propose a feature-aligned local enhancement (FLE) module, in which informative detailed features are effectively captured under the premise of feature alignment, producing discriminative representation. The internal structure of FLE is illustrated in Fig. 3, and it consists of two steps: feature-aligned discriminative learning and local enhancement.

Feature-aligned discriminative learning Due to the information gap between semantics and resolution, feature representation is still suboptimal when directly upsampling high-level feature maps to guide low-level features. To obtain strong feature representations, more attention and effort should be given to the position offset between low-level and high-level features. Inspired by previous work [31], we propose a feature-aligned discriminative learning (FDL) block that aligns adjacent-level features and further excavates discriminative features, leading to high sensitivity to small objects. Within FDL, two 1 × 1 convolution layers are first employed to compress adjacent-level features (i.e., $E_i$ and $E_{i-1}$) into the same channel depth. Then, a semantic flow field is calculated by a 3 × 3 convolution operation, as described in Eq. 1:

$$\Delta_{i-1} = f_{3\times 3}\big(f_{1\times 1}(E_{i-1}) \,©\, U(f_{1\times 1}(E_i))\big), \tag{1}$$

where $f_{s\times s}(\cdot)$ indicates an s × s convolution layer followed by batch normalization and a ReLU activation function, while © and $U(\cdot)$, respectively, represent concatenation and upsampling operations. Next, according to the learned semantic flow $\Delta_{i-1}$, we obtain a feature-aligned high-resolution feature $\tilde{E}_i$ with semantic cues. Mathematically,

$$\tilde{E}_i = \mathrm{Warp}\big(f_{1\times 1}(E_i), \Delta_{i-1}\big), \tag{2}$$

where $\mathrm{Warp}(\cdot)$ indicates the mapping function, and $E_i$ is a $C_i$-dimensional feature map defined on the spatial grid $\Omega_i$ of the specific size $(H/2^{i+1}, W/2^{i+1})$. Schematically, as shown in Fig. 4, the warp procedure consists of two steps.
Fig. 3 The architecture of the feature-aligned local enhancement module, which performs two steps: First, feature-aligned discriminative learning initially produces a flow field to align adjacent features and then constructs a discriminative representation using subtraction and a residual connection. Second, local enhancement with a dense connection structure is adopted to highlight local details
In the first step, each point $p_{i-1}$ on the spatial grid $\Omega_{i-1}$ is mapped to $p_i$ on the low-resolution feature, which is formulated by Eq. 3:

$$p_i = \frac{p_{i-1} + \Delta_{i-1}(p_{i-1})}{2}. \tag{3}$$

It is worth mentioning that, due to the resolution gap between the flow field and the features (see Fig. 4), Eq. 3 contains a halving operation to reduce the resolution. In the second step, we adopt the differentiable bilinear sampling mechanism [49] to approximate the final feature $\tilde{E}_i$ by linearly interpolating the scores of the four neighboring points (top-right, top-left, bottom-right, and bottom-left) of $p_i$.

After that, to enhance the discriminative local context representation, we further utilize subtraction, absolute value, and residual learning procedures. Conclusively, the final optimized feature $\hat{E}_{i-1}$ can be expressed as follows:

$$\hat{E}_{i-1} = \big| E_{i-1} - \tilde{E}_i \big| + E_{i-1}. \tag{4}$$
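To make the FDL block more concrete, the following PyTorch sketch shows one way Eqs. (1)–(4) could be wired together, with F.grid_sample standing in for the bilinear warp of Eqs. (2)–(3). It is a minimal illustration rather than the authors' implementation: the module name, channel widths, the plain (non-normalized) offset head, and the offset normalization inside warp are assumptions.

```python
# Minimal sketch of feature-aligned discriminative learning (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FDL(nn.Module):
    """Align E_i to E_{i-1} with a predicted flow, then take |E_{i-1} - aligned E_i| + E_{i-1}."""
    def __init__(self, low_ch, high_ch, mid_ch=64):
        super().__init__()
        self.compress_low = conv_bn_relu(low_ch, mid_ch, 1)      # f_1x1 on E_{i-1}
        self.compress_high = conv_bn_relu(high_ch, mid_ch, 1)    # f_1x1 on E_i
        # Offset head kept as a plain conv so negative offsets are possible (sketch choice).
        self.flow = nn.Conv2d(2 * mid_ch, 2, 3, padding=1)       # Eq. 1 -> two-channel flow

    @staticmethod
    def warp(feat_low_res, flow):
        # Sample the low-resolution feature at the offset positions (Eqs. 2-3); the halving in
        # Eq. 3 is implicit in grid_sample's normalized [-1, 1] coordinates.
        n, _, h, w = flow.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=flow.device),
            torch.linspace(-1, 1, w, device=flow.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) / torch.tensor(
            [(w - 1) / 2.0, (h - 1) / 2.0], device=flow.device)  # assumes channel 0 = x, 1 = y
        return F.grid_sample(feat_low_res, grid + offset,
                             mode="bilinear", align_corners=True)

    def forward(self, e_low, e_high):
        low = self.compress_low(e_low)                            # E_{i-1} (high resolution)
        high = self.compress_high(e_high)                         # E_i (low resolution)
        up = F.interpolate(high, size=low.shape[2:],
                           mode="bilinear", align_corners=True)
        delta = self.flow(torch.cat([low, up], dim=1))            # Eq. 1
        aligned = self.warp(high, delta)                          # Eq. 2: feature-aligned E_i
        return torch.abs(low - aligned) + low                     # Eq. 4
```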
Local enhancement In the PVT-based encoder, attention is established between each patch, allowing information to be blended from all other patches, even if their correlation is not high. Meanwhile, since small targets only occupy a portion of the entire image, the global interaction in the transformer architecture cannot fully meet the requirements of small target segmentation, where more detailed local contexts are needed. Considering that the convolution operation with a fixed receptive field can blend the features of each patch's neighboring patches, we construct a local enhancement (LE) block to increase the weights associated with adjacent patches to the center patch using convolution, thereby emphasizing the local features of each patch.

As shown in Fig. 3, LE has a convolution-based structure and consists of four stages. Each stage includes a 3 × 3 convolutional layer followed by batch normalization and a ReLU activation layer (denoted as $f_{3\times 3}(\cdot)$). Additionally, dense connections are added to encourage feature reuse and strengthen local feature propagation. As a result, the feature map obtained by LE contains rich local contexts. Let $x_0$ denote the initial input; the output of the ith stage within LE can be formulated as follows:

$$x_i = \begin{cases} f_{3\times 3}(x_0), & i = 1,\\ f_{3\times 3}\big([x_0, \cdots, x_{i-1}]\big), & 2 \le i \le 4, \end{cases} \tag{5}$$

where $[\cdot]$ represents the concatenation operation. In summary, LE utilizes the local receptive field of the convolution operation and dense connections to achieve local enhancement.
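Assuming the LE block keeps a constant channel width across its four stages, a minimal PyTorch sketch of the densely connected structure in Eq. (5) could look as follows (returning the last stage's output is an assumption, not the authors' stated design):

```python
# Minimal sketch of the local enhancement (LE) block (not the authors' code).
import torch
import torch.nn as nn

class LocalEnhancement(nn.Module):
    """Four 3x3 conv-BN-ReLU stages with dense connections (Eq. 5)."""
    def __init__(self, channels, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for i in range(num_stages)   # stage i sees [x_0, ..., x_{i-1}] concatenated
        )

    def forward(self, x0):
        feats = [x0]
        for stage in self.stages:
            feats.append(stage(torch.cat(feats, dim=1)))   # dense connections
        return feats[-1]
```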
3.3 Progressive local-induced decoder

Efficient recovery of spatial information is critical in medical image segmentation, especially for small objects. Inspired by previous works [50, 51], we propose a progressive local-induced decoder to adaptively restore feature resolution and detailed information. As shown in Fig. 2, the decoder consists of three cascaded local reconstruction and refinement (LRR) modules. The internal structure of LRR is illustrated in Fig. 5, where two steps are performed: local-induced reconstruction (LR) and split-attention-based refinement (SAR).

Fig. 5 The structure of the local reconstruction and refinement module. It contains two blocks: local-induced reconstruction and split-attention-based refinement

Local-induced reconstruction LR aims to transfer the spatial detail information from low-level features into high-level features, thereby facilitating accurate spatial recovery of high-level features. As shown in Fig. 5, LR first produces a reconstruction kernel $\kappa \in \mathbb{R}^{k^2 \times H_{i-1} \times W_{i-1}}$ based on the low-level feature $F_{i-1}$ and the high-level feature $D_i$, in which k indicates the neighborhood size for reconstructing local features. The procedure of generating the reconstruction kernel $\kappa$ can be expressed as follows:
$$\kappa = \mathrm{Soft}\big(f_{3\times 3}\big(U(f_{1\times 1}(D_i)) \,©\, f_{1\times 1}(F_{i-1})\big)\big), \tag{6}$$

where $f_{s\times s}(\cdot)$ represents an s × s convolution layer followed by batch normalization and a ReLU activation function. $U(\cdot)$, ©, and $\mathrm{Soft}(\cdot)$, respectively, indicate upsampling, concatenation, and Softmax activation operations. Meanwhile, another 3 × 3 convolution and upsampling operation are applied on $D_i$ to obtain $\tilde{D}_i$ with the same resolution size as $F_{i-1}$; mathematically, $\tilde{D}_i = U(f_{3\times 3}(D_i))$. Note that $D_4 = E_4$ here. Next, we optimize pixel $\tilde{D}_i[u, v]$ under the guidance of the reconstruction kernel $\kappa_{[u,v]} \in \mathbb{R}^{k\times k}$, producing the refined local feature $\hat{D}_i[u, v]$. This can be written as Eq. 7, where $r = \lfloor k/2 \rfloor$:

$$\hat{D}_i[u, v] = \sum_{m=-r}^{r}\sum_{n=-r}^{r} \kappa_{[u,v]}[m, n] \times \tilde{D}_i[u + m, v + n]. \tag{7}$$

Subsequently, $\hat{D}_i$ and $F_{i-1}$ are concatenated together and then passed through two convolutional layers to produce an optimized feature. Conclusively, LR overcomes the limitations of traditional upsampling operations in precisely recovering pixel-wise prediction, since it takes full advantage of low-level features to adaptively predict the reconstruction kernel and then effectively combines semantic contexts with spatial information toward accurate spatial recovery. This can strengthen the recognition of small objects.
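The kernel-prediction and reassembly steps of Eqs. (6)–(7) can be sketched with unfold, similar in spirit to content-aware reassembly operators. Channel sizes, the placement of the 3 × 3 refinement, and the omission of the final concatenation with F_{i−1} are assumptions made to keep the sketch short; it is not the authors' implementation.

```python
# Minimal sketch of local-induced reconstruction (LR) (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReconstruction(nn.Module):
    """Predict a k x k kernel per pixel (Eq. 6) and reassemble the upsampled feature (Eq. 7)."""
    def __init__(self, low_ch, high_ch, mid_ch=64, k=3):
        super().__init__()
        self.k = k
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, 1)
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, 1)
        self.kernel_pred = nn.Conv2d(2 * mid_ch, k * k, 3, padding=1)   # k^2 weights per pixel
        self.smooth_high = nn.Conv2d(high_ch, low_ch, 3, padding=1)     # f_3x3 before upsampling

    def forward(self, f_low, d_high):
        h, w = f_low.shape[2:]
        up = F.interpolate(self.reduce_high(d_high), size=(h, w),
                           mode="bilinear", align_corners=True)
        kernel = torch.softmax(
            self.kernel_pred(torch.cat([up, self.reduce_low(f_low)], dim=1)), dim=1)   # Eq. 6
        d_tilde = F.interpolate(self.smooth_high(d_high), size=(h, w),
                                mode="bilinear", align_corners=True)                   # U(f_3x3(D_i))
        # Eq. 7: weighted sum over each k x k neighborhood of the upsampled feature.
        n, c, _, _ = d_tilde.shape
        patches = F.unfold(d_tilde, self.k, padding=self.k // 2).view(
            n, c, self.k * self.k, h, w)
        d_hat = (patches * kernel.unsqueeze(1)).sum(dim=2)
        # The module would then concatenate d_hat with f_low and apply two further convolutions.
        return d_hat
```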
Split-attention-based refinement To enhance feature representation, we implement an SAR block in which grouped sub-features are further split and fed into two parallel branches to capture channel dependencies and pixel-level pairwise relationships through two types of attention mechanisms. As shown in Fig. 5, SAR is composed of two basic components: a spatial attention block and a channel attention block. Given an input feature map M, SAR first divides it along the channel dimension to produce $M = \{M_1, M_2, \cdots, M_G\}$. For each $M_i$, valuable responses are specified by attention mechanisms. Specifically, $M_i$ is split into two features, denoted as $M_i^1$ and $M_i^2$, which are separately fed into the channel attention block and the spatial attention block to reconstruct features. This allows our model to focus on “what” and “where” are valuable through these two blocks.

In the channel attention block, global average pooling (denoted as $\mathrm{GAP}(\cdot)$) is performed to produce channel-wise statistics, which can be formulated as

$$S = \mathrm{GAP}(M_i^1) = \frac{1}{H\times W}\sum_{m=1}^{H}\sum_{n=1}^{W} M_i^1(m, n). \tag{8}$$

Then, channel-wise dependencies are captured according to the guidance of a compact feature, which is generated by a Sigmoid function (i.e., $\mathrm{Sig}(\cdot)$). Mathematically,

$$\tilde{M}_i^1 = \mathrm{Sig}\big(W_1 \times S + b_1\big) \times M_i^1, \tag{9}$$

in which the parameters $W_1$ and $b_1$ are used for scaling and shifting S.

In the spatial attention block, spatial-wise statistics are calculated using Group Norm (GN) [52] on $M_i^2$. The pixel-wise representation is then strengthened by another compact feature calculated by two parameters $W_2$ and $b_2$ and a Sigmoid function. This process can be expressed as

$$\tilde{M}_i^2 = \mathrm{Sig}\big(W_2 \times \mathrm{GN}(M_i^2) + b_2\big) \times M_i^2. \tag{10}$$

Next, $\tilde{M}_i^1$ and $\tilde{M}_i^2$ are optimized by an additional consistency embedding path and then concatenated. This procedure is represented as

$$\tilde{M}_i = \big(\tilde{M}_i^1 + M_i^1 \times M_i^2\big) \,©\, \big(\tilde{M}_i^2 + M_i^1 \times M_i^2\big). \tag{11}$$

After aggregating all sub-features, a channel shuffle [53] is performed to facilitate cross-group information exchange along the channel dimension.
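A compact sketch of an SAR-style block along the lines of Eqs. (8)–(11), with grouping and a channel shuffle, is given below. The group count, the sharing of affine parameters across groups, and the exact form of the consistency-embedding term follow one reading of Eq. (11) and are assumptions, not the authors' released code.

```python
# Minimal sketch of split-attention-based refinement (SAR) (not the authors' code).
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class SAR(nn.Module):
    """Grouped split attention (Eqs. 8-11) followed by a channel shuffle."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        half = channels // (2 * groups)            # channels must be divisible by 2 * groups
        self.w1 = nn.Parameter(torch.ones(1, half, 1, 1))   # W_1, b_1 in Eq. 9
        self.b1 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, half, 1, 1))   # W_2, b_2 in Eq. 10
        self.b2 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.gn = nn.GroupNorm(half, half)

    def forward(self, m):
        n, c, h, w = m.shape
        m = m.view(n * self.groups, c // self.groups, h, w)
        m1, m2 = m.chunk(2, dim=1)                                    # split each group M_i
        s = m1.mean(dim=(2, 3), keepdim=True)                         # Eq. 8: GAP
        m1_att = torch.sigmoid(self.w1 * s + self.b1) * m1            # Eq. 9: channel attention
        m2_att = torch.sigmoid(self.w2 * self.gn(m2) + self.b2) * m2  # Eq. 10: spatial attention
        cross = m1 * m2                                               # consistency embedding (Eq. 11)
        out = torch.cat([m1_att + cross, m2_att + cross], dim=1).reshape(n, c, h, w)
        return channel_shuffle(out, self.groups)                      # cross-group exchange
```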
3.4 Mutual information loss

As stated in the previous study [54], training models with only pixel-wise loss may limit segmentation performance, especially resulting in prediction errors for small objects. This is due to the class imbalance between foreground and background, such that task-relevant information is overwhelmed by irrelevant noise. Therefore, to facilitate preserving task-relevant information, we explore novel supervision at the feature level to further assist accurate segmentation. Let X and Y denote the input medical image and its corresponding ground truth, respectively. Z represents the deep feature extracted from input X.

Mutual information (MI) Mutual information is a fundamental quantity that measures the amount of information shared between two random variables. Mathematically, the statistical dependency of Y and Z can be quantified by MI, which is expressed as

$$I(Y;Z) = \mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y,Z)}{p(Y)\,p(Z)}\right], \tag{12}$$

where p(Y, Z) is the joint probability distribution of Y and Z, while p(Z) and p(Y) are their marginals.
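As a quick numerical illustration of Eq. (12), mutual information can be computed directly for a small discrete joint distribution (a toy example, separate from the continuous feature-level estimation used for training):

```python
# Toy illustration of Eq. (12) on a discrete joint distribution table.
import numpy as np

def mutual_information(p_yz):
    """I(Y;Z) in nats for a discrete joint distribution p(Y, Z)."""
    p_y = p_yz.sum(axis=1, keepdims=True)
    p_z = p_yz.sum(axis=0, keepdims=True)
    mask = p_yz > 0
    return float(np.sum(p_yz[mask] * np.log(p_yz[mask] / (p_y @ p_z)[mask])))

# Perfectly correlated binary Y and Z -> I = log 2; independent -> I = 0.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # ~0.693
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```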
Mutual information loss Our primary objective is to maximize the amount of task-relevant information about Y in the latent feature Z while reducing irrelevant information. This is achieved by two mutual information terms [55, 56]. Formally,

$$\mathrm{IB}(Y, Z) = \operatorname{Max}\; I(Z;Y) - I(Z;X). \tag{13}$$

Owing to the notorious difficulty of the conditional MI computations, these terms are estimated by existing MI estimators [56, 57]. In detail, the first term is accomplished through
the use of the Pixel Position Aware (PPA) loss [57] ($L_{PPA}$). Since the PPA loss assigns different weights to different positions, it can better explore task-relevant structure information and give more attention to important details. The second term is estimated by Variational Self-Distillation (VSD) [56] ($L_{VSD}$), which uses KL-divergence to compress Z and remove irrelevant noises, thereby addressing the effect of imbalances in the number of foreground and background pixels caused by small targets. Thus, our total loss can be expressed as

$$L_{total} = L_{PPA} + L_{VSD}. \tag{14}$$
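A rough sketch of how the total objective in Eq. (14) might be assembled is given below. The $L_{PPA}$ term follows the weighted BCE plus weighted IoU formulation used in [57]; the pooling window and weighting factor are assumptions, and the VSD term is treated as an externally computed scalar, since its estimator depends on the feature-distillation setup.

```python
# Hedged sketch of a PPA-style term and the total loss of Eq. (14) (not the authors' code).
import torch
import torch.nn.functional as F

def ppa_loss(logits, mask):
    """Position-aware weighted BCE + weighted IoU; weights stress hard/boundary pixels."""
    weight = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))
    prob = torch.sigmoid(logits)
    inter = (prob * mask * weight).sum(dim=(2, 3))
    union = ((prob + mask) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(logits, mask, l_vsd, vsd_weight=1.0):
    # Eq. 14: L_total = L_PPA + L_VSD (the weighting factor is an assumption).
    return ppa_loss(logits, mask) + vsd_weight * l_vsd
```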
other state-of-the-art methods on polyp benchmarks. For breast lesion segmentation, we adopt four widely used metrics, including Accuracy, Jaccard index, Precision, and Dice, to validate the segmentation performance in our study. Theoretically, high scores for all metrics indicate better results.

4.2 Experimental results

To investigate the effectiveness of our proposed method, we validate LET-Net in two applications: polyp segmentation from coloscopy images and breast lesion segmentation from ultrasound images.
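For reference, the four reported metrics reduce to confusion-matrix ratios on binary masks; a minimal sketch follows (any thresholding policy and the averaging over the test set are left out):

```python
# Minimal sketch of the evaluation metrics on binary masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """Accuracy, Jaccard (IoU), Precision, and Dice for binary numpy masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "Jaccard": tp / (tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```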
Table 1 Quantitative comparison of different methods on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
FCN [5] 0.825 0.747 0.775 0.686 0.578 0.481 0.379 0.313 0.660 0.558
U-Net [7] 0.842 0.775 0.818 0.746 0.512 0.444 0.398 0.335 0.710 0.627
UNet++ [30] 0.846 0.774 0.821 0.743 0.599 0.499 0.456 0.375 0.707 0.624
AttentionU-Net [11] 0.809 0.744 0.782 0.694 0.614 0.524 0.440 0.360 0.686 0.580
DCRNet [58] 0.896 0.844 0.886 0.825 0.704 0.631 0.556 0.496 0.856 0.788
SegNet [8] 0.915 0.857 0.878 0.814 0.647 0.570 0.612 0.529 0.841 0.773
SFA [1] 0.700 0.607 0.723 0.611 0.469 0.347 0.297 0.217 0.467 0.329
PraNet [59] 0.899 0.849 0.898 0.840 0.709 0.640 0.628 0.567 0.871 0.797
ACSNet [39] 0.912 0.858 0.907 0.850 0.709 0.643 0.609 0.537 0.862 0.784
EU-Net [60] 0.902 0.846 0.908 0.854 0.756 0.681 0.687 0.609 0.837 0.765
SANet [20] 0.916 0.859 0.904 0.847 0.753 0.670 0.750 0.654 0.888 0.815
BLE-Net [61] 0.926 0.878 0.905 0.854 0.731 0.658 0.673 0.594 0.879 0.805
CaraNet [22] 0.936 0.887 0.918 0.865 0.773 0.689 0.747 0.672 0.903 0.838
SETR-PUP [18] 0.934 0.885 0.911 0.854 0.773 0.690 0.726 0.646 0.889 0.814
TransUnet [43] 0.935 0.887 0.913 0.857 0.781 0.699 0.731 0.660 0.893 0.824
LET-Net(Ours) 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839
contexts. In addition, we introduce mutual information loss as an assistant to learning task-relevant representation. Furthermore, we find that our LET-Net successfully deals with other challenging cases, including cluttered backgrounds (Fig. 6(b), (c), (g), (i)) and low contrast (Fig. 6(a), (h)). For example, as illustrated in Fig. 6(b), (i), ACSNet [39] and PraNet [59] misidentify background tissues as polyps, but our LET-Net overcomes this drawback. Due to combining the strengths of CNN and Transformer, our LET-Net produces good segmentation performance in these scenarios. Overall, our model achieves leading performance.

4.2.2 Breast lesion segmentation

Quantitative comparison To further evaluate the effectiveness of our method, we conduct extensive experiments in breast lesion segmentation and perform a comparative analysis with ten segmentation approaches. Table 2 presents the detailed quantitative comparison among different methods on the BUSIS dataset. Obviously, our LET-Net exhibits excellent performance in both benign and malignant lesion segmentation. In benign lesion segmentation, LET-Net achieves 97.7% Accuracy, 74% Jaccard, 83.5% Precision, and 81.5% Dice. Compared with other competitors, LET-Net significantly outperforms them by a large margin. In detail, it exceeds C-Net [2], CPF-Net [29], and PraNet [59] by 1.6%, 4.1%, and 4.9%, respectively, in terms of Jaccard. Meanwhile, in malignant lesion segmentation, we obtain an Accuracy score of 93% and a Dice score of 72.7%, demonstrating the superiority of our LET-Net over other methods. In particular, LET-Net presents a significant improvement of 1.8% in Jaccard and 2.8% in Dice compared with C-Net [2]. The reason behind this is that although C-Net constructs a bidirectional attention guidance network to capture both global and local features, long-range dependencies are not fully modeled due to the limitations of convolution.

Visual comparison To intuitively demonstrate the performance of our model, we present segmentation results of different methods in Fig. 7. We observe that other methods often produce segmentation maps with incomplete lesion structures or false positives, while our prediction maps are superior to others. This is mainly due to our FLE's ability to facilitate discriminative local feature learning and the effectiveness of our proposed LRR module for spatial reconstruction. In addition, it is worth noting that our LET-Net performs well in handling various shapes [Fig. 7(a)–(h)] and low-contrast images [Fig. 7(d), (h)], which can be attributed to the powerful and robust feature learning ability of LET-Net.
Fig. 6 Visualization results of our LET-Net and several other methods on five polyp datasets. From top to bottom, the images are from CVC-ClinicDB, Kvasir, CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300, which are separated by red dashed lines
Table 2 Quantitative comparison of different methods on the BUSIS dataset

Method  Benign (Accuracy, Jaccard, Precision, Dice)  Malignant (Accuracy, Jaccard, Precision, Dice)
U-Net [7] 0.966 0.615 0.750 0.705 0.901 0.511 0.650 0.635
STAN [21] 0.969 0.643 0.744 0.723 0.910 0.511 0.647 0.626
AttentionU-Net [11] 0.969 0.650 0.752 0.733 0.912 0.511 0.616 0.630
Abraham et al. [68] 0.969 0.667 0.767 0.748 0.915 0.541 0.675 0.658
UNet++ [30] 0.971 0.683 0.759 0.756 0.915 0.540 0.655 0.655
UNet3+ [69] 0.971 0.676 0.756 0.751 0.916 0.548 0.658 0.662
SegNet [8] 0.972 0.679 0.770 0.755 0.922 0.549 0.638 0.659
PraNet [59] 0.972 0.691 0.799 0.763 0.925 0.582 0.763 0.698
CPF-Net [29] 0.973 0.699 0.801 0.766 0.927 0.605 0.755 0.716
C-Net [2] 0.975 0.724 0.827 0.794 0.926 0.597 0.757 0.699
LET-Net(Ours) 0.977 0.740 0.835 0.815 0.930 0.615 0.772 0.727
4.3 Ablation study

In this section, we conduct a series of ablation studies to verify the effectiveness of each critical component in our proposed LET-Net, including FLE, LRR, and mutual information loss.
Fig. 7 Visual comparison among different methods in breast lesion segmentation, where the segmentation results of benign and malignant
lesions are separated by a red dashed line
4.3.1 Impact of FLE and LRR modules

To validate the effectiveness of the FLE and LRR modules, we remove them individually from our full network, resulting in two variants, namely w/o FLE and w/o LRR. As shown in Table 3, the variant without FLE (w/o FLE) achieves a 93.6% mDice score on the CVC-ClinicDB dataset. When we apply the FLE module, the mDice score increases to 94.5%. Moreover, it boosts mDice by 1.6%, 2.2%, and 2.3% on the CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300 datasets, respectively. These results indicate that our FLE module effectively supports accurate segmentation due to its ability to learn discriminative local features under the feature alignment condition. Furthermore, when comparing the second and third lines of Table 3, it can be seen that the LRR module is also conducive to segmentation, with performance gains of 1.6% and 1.7% in terms of mDice and mIoU on the Kvasir dataset.
Table 3 Ablation study of the FLE and LRR modules on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
w/o FLE 0.936 0.887 0.918 0.871 0.779 0.698 0.751 0.674 0.884 0.816
w/o LRR 0.940 0.894 0.910 0.859 0.790 0.711 0.759 0.681 0.890 0.821
LET-Net 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839

Table 4 Ablation study of the mutual information loss on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
w/o LPPA 0.937 0.888 0.917 0.864 0.782 0.697 0.737 0.663 0.885 0.812
w/o LVSD 0.940 0.892 0.923 0.872 0.785 0.702 0.762 0.688 0.895 0.826
w/o LPPA & LVSD 0.934 0.882 0.914 0.861 0.772 0.692 0.716 0.648 0.879 0.807
LET-Net 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839
The main reason is that the LRR module is capable of effective spatial recovery via its dynamic reconstruction kernels and split-attention mechanism, thereby facilitating segmentation.

information loss on the CVC-ColonDB dataset. In summary, our experimental results fully demonstrate that the mutual information loss is beneficial for LET-Net.
References

1. Fang, Y., Chen, C., Yuan, Y., Tong, R.K.: Selective feature aggregation network with area-boundary constraints for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 302–310 (2019). https://doi.org/10.1007/978-3-030-32239-7_34
2. Chen, G., Dai, Y., Zhang, J.: C-net: Cascaded convolutional neural network with global guidance and refinement residuals for breast ultrasound images segmentation. Comput. Methods Programs Biomed. 225, 107086 (2022)
3. Thomas, E., Pawan, S., Kumar, S., Horo, A., Niyas, S., Vinayagamani, S., Kesavadas, C., Rajan, J.: Multi-res-attention unet: a cnn model for the segmentation of focal cortical dysplasia lesions from magnetic resonance images. IEEE J. Biomed. Health Informat. 25(5), 1724–1734 (2020)
4. Wang, R., Lei, T., Cui, R., Zhang, B., Meng, H., Nandi, A.K.: Medical image segmentation using deep learning: A survey. IET Image Process. 16(5), 1243–1267 (2022). https://doi.org/10.1049/ipr2.12419
5. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015). https://doi.org/10.1109/CVPR.2015.7298965
6. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). https://doi.org/10.1109/TPAMI.2016.2644615
7. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28
8. Lou, A., Guan, S., Loew, M.: Cfpnet-m: A light-weight encoder-decoder based network for multimodal biomedical image real-time segmentation. Comput. Biol. Med. 154, 106579 (2023)
9. Xie, X., Pan, X., Zhang, W., An, J.: A context hierarchical integrated network for medical image segmentation. Comput. Elect. Eng. 101, 108029 (2022). https://doi.org/10.1016/j.compeleceng.2022.108029
10. Wang, R., Ji, C., Zhang, Y., Li, Y.: Focus, fusion, and rectify: Context-aware learning for covid-19 lung infection segmentation. IEEE Trans. Neural Netw. Learn. Syst. 33(1), 12–24 (2021)
11. Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
12. Cheng, J., Tian, S., Yu, L., Lu, H., Lv, X.: Fully convolutional attention network for biomedical image segmentation. Artif. Intell. Med. 107, 101899 (2020)
13. Wang, X., Jiang, X., Ding, H., Liu, J.: Bi-directional dermoscopic feature learning and multi-scale consistent decision fusion for skin lesion segmentation. IEEE Trans. Image Process. 29, 3039–3051 (2019)
14. Wang, X., Li, Z., Huang, Y., Jiao, Y.: Multimodal medical image segmentation using multi-scale context-aware network. Neurocomputing 486, 135–146 (2022). https://doi.org/10.1016/j.neucom.2021.11.017
15. Liang, X., Li, N., Zhang, Z., Xiong, J., Zhou, S., Xie, Y.: Incorporating the hybrid deformable model for improving the performance of abdominal ct segmentation via multi-scale feature fusion network. Med. Image Anal. 73, 102156 (2021)
16. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
17. Li, Y., Wang, Z., Yin, L., Zhu, Z., Qi, G., Liu, Y.: X-net: a dual encoding–decoding method in medical image segmentation. The Visual Computer, pp. 1–11 (2021)
18. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)
19. Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
20. Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 699–708 (2021). Springer
21. Shareef, B., Xian, M., Vakanski, A.: Stan: Small tumor-aware network for breast ultrasound image segmentation. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1–5 (2020)
22. Lou, A., Guan, S., Ko, H., Loew, M.H.: Caranet: context axial reverse attention network for segmentation of small medical objects. In: Medical Imaging 2022: Image Processing, vol. 12032, pp. 81–92 (2022)
23. Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: Towards accurate segmentation of biomedical images using over-complete representations. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 363–373 (2020). Springer
24. Pang, Y., Zhao, X., Xiang, T.-Z., Zhang, L., Lu, H.: Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2160–2170 (2022)
25. Jia, Q., Yao, S., Liu, Y., Fan, X., Liu, R., Luo, Z.: Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4713–4722 (2022)
26. Dai, D., Dong, C., Xu, S., Yan, Q., Li, Z., Zhang, C., Luo, N.: Ms red: A novel multi-scale residual encoding and decoding network for skin lesion segmentation. Med. Image Anal. 75, 102293 (2022)
27. Xu, C., Qi, Y., Wang, Y., Lou, M., Pi, J., Ma, Y.: Arf-net: An adaptive receptive field network for breast mass segmentation in whole mammograms and ultrasound images. Biomed. Signal Process. Control 71, 103178 (2022)
28. Valanarasu, J.M.J., Patel, V.M.: Unext: Mlp-based rapid medical image segmentation network. arXiv preprint arXiv:2203.04967 (2022)
29. Feng, S., Zhao, H., Shi, F., Cheng, X., Wang, M., Ma, Y., Xiang, D., Zhu, W., Chen, X.: Cpfnet: Context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging 39(10), 3008–3018 (2020). https://doi.org/10.1109/TMI.2020.2983721
30. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 39(6), 1856–1867 (2019)
31. Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., Tong, Y.: Semantic flow for fast and accurate scene parsing. In: European Conference on Computer Vision, pp. 775–793 (2020)
32. Mazzini, D.: Guided upsampling network for real-time semantic segmentation. arXiv preprint arXiv:1807.07466 (2018)
33. Lu, H., Dai, Y., Shen, C., Xu, S.: Indices matter: Learning to index for deep image matting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3266–3275 (2019)
34. Huang, Z., Wei, Y., Wang, X., Liu, W., Huang, T.S., Shi, H.: Alignseg: Feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 550–557 (2021)
35. Huang, S., Lu, Z., Cheng, R., He, C.: Fapn: Feature-aligned pyramid network for dense image prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 864–873 (2021)
36. Wu, J., Pan, Z., Lei, B., Hu, Y.: Fsanet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–17 (2022)
37. Hu, H., Chen, Y., Xu, J., Borse, S., Cai, H., Porikli, F., Wang, X.: Learning implicit feature alignment function for semantic segmentation. In: European Conference on Computer Vision, pp. 487–505 (2022)
38. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
39. Zhang, R., Li, G., Li, Z., Cui, S., Qian, D., Yu, Y.: Adaptive context selection for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–262 (2020). https://doi.org/10.1007/978-3-030-59725-2_25
40. Tomar, N.K., Jha, D., Riegler, M.A., Johansen, H.D., Johansen, D., Rittscher, J., Halvorsen, P., Ali, S.: Fanet: A feedback attention network for improved biomedical image segmentation. IEEE Trans. Neural Netw. Learn. Syst. (2022)
41. Shen, Y., Jia, X., Meng, M.Q.-H.: Hrenet: A hard region enhancement network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–568 (2021)
42. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154 (2019)
43. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
44. Zhang, Y., Liu, H., Hu, Q.: Transfuse: Fusing transformers and cnns for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 14–24 (2021)
45. He, X., Tan, E.-L., Bi, H., Zhang, X., Zhao, S., Lei, B.: Fully transformer network for skin lesion analysis. Med. Image Anal. 77, 102357 (2022)
46. Yuan, F., Zhang, Z., Fang, Z.: An effective cnn and transformer complementary network for medical image segmentation. Pattern Recogn. 136, 109228 (2023)
47. Heidari, M., Kazerouni, A., Soltany, M., Azad, R., Aghdam, E.K., Cohen-Adad, J., Merhof, D.: Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6202–6212 (2023)
48. Wu, H., Chen, S., Chen, G., Wang, W., Lei, B., Wen, Z.: Fat-net: Feature adaptive transformers for automated skin lesion segmentation. Med. Image Anal. 76, 102327 (2022)
49. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28 (2015)
50. Song, J., Chen, X., Zhu, Q., Shi, F., Xiang, D., Chen, Z., Fan, Y., Pan, L., Zhu, W.: Global and local feature reconstruction for medical image segmentation. IEEE Trans. Med. Imaging (2022)
51. Zhang, Q.-L., Yang, Y.-B.: Sa-net: Shuffle attention for deep convolutional neural networks. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2235–2239 (2021)
52. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
53. Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
54. Zhao, S., Wang, Y., Yang, Z., Cai, D.: Region mutual information loss for semantic segmentation. Adv. Neural Inf. Process. Syst. 32 (2019)
55. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
56. Tian, X., Zhang, Z., Lin, S., Qu, Y., Xie, Y., Ma, L.: Farewell to mutual information: Variational distillation for cross-modal person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1522–1531 (2021)
57. Wei, J., Wang, S., Huang, Q.: F3net: fusion, feedback and focus for salient object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12321–12328 (2020)
58. Yin, Z., Liang, K., Ma, Z., Guo, J.: Duplex contextual relation network for polyp segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pp. 1–5 (2022). https://doi.org/10.1109/ISBI52829.2022.9761402
59. Fan, D.-P., Ji, G.-P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L.: Pranet: Parallel reverse attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 263–273 (2020). https://doi.org/10.1007/978-3-030-59725-2_26
60. Patel, K., Bur, A.M., Wang, G.: Enhanced u-net: A feature enhancement network for polyp segmentation. In: 2021 18th Conference on Robots and Vision (CRV), pp. 181–188 (2021). https://doi.org/10.1109/CRV52889.2021.00032
61. Ta, N., Chen, H., Lyu, Y., Wu, T.: Ble-net: boundary learning and enhancement network for polyp segmentation. Multimed. Syst. 1–14 (2022)
62. Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., Gil, D., Rodríguez, C., Vilariño, F.: Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111 (2015). https://doi.org/10.1016/j.compmedimag.2015.02.007
63. Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., Lange, T.d., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: International Conference on Multimedia Modeling, pp. 451–462 (2020)
64. Tajbakhsh, N., Gurudu, S.R., Liang, J.: Automated polyp detection in colonoscopy videos using shape and context information. IEEE Trans. Med. Imaging 35(2), 630–644 (2016). https://doi.org/10.1109/TMI.2015.2487997
65. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B.: Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg. 9(2), 283–293 (2014). https://doi.org/10.1007/s11548-013-0926-3
66. Vázquez, D., Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., López, A.M., Romero, A., Drozdzal, M., Courville, A.C.: A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng. (2017)
67. Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A.: Dataset of breast ultrasound images. Data in Brief 28, 104863 (2020)
68. Abraham, N., Khan, N.M.: A novel focal tversky loss function with improved attention u-net for lesion segmentation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687 (2019)
69. Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.-W., Wu, J.: Unet 3+: A full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1055–1059 (2020)