LET-Net: locally enhanced transformer network for medical image segmentation
https://doi.org/10.1007/s00530-023-01165-z
Received: 13 April 2023 / Accepted: 12 August 2023 / Published online: 5 September 2023
© The Author(s) 2023
Abstract
Medical image segmentation has attracted increasing attention due to its practical clinical requirements. However, the
prevalence of small targets still poses great challenges for accurate segmentation. In this paper, we propose a novel locally
enhanced transformer network (LET-Net) that combines the strengths of transformer and convolution to address this issue.
LET-Net utilizes a pyramid vision transformer as its encoder and is further equipped with two novel modules to learn more
powerful feature representation. Specifically, we design a feature-aligned local enhancement module, which encourages
discriminative local feature learning on the condition of adjacent-level feature alignment. Moreover, to effectively recover
high-resolution spatial information, we apply a newly designed progressive local-induced decoder. This decoder contains
three cascaded local reconstruction and refinement modules that dynamically guide the upsampling of high-level features
by their adaptive reconstruction kernels and further enhance feature representation through a split-attention mechanism.
Additionally, to address the severe pixel imbalance for small targets, we design a mutual information loss that maximizes
task-relevant information while eliminating task-irrelevant noises. Experimental results demonstrate that our LET-Net pro-
vides more effective support for small target segmentation and achieves state-of-the-art performance in polyp and breast
lesion segmentation tasks.
Keywords Medical image segmentation · Feature alignment · Local-induced decoder · Mutual information · Transformer
Fig. 1 An illustration of small lesion samples and size distributions for different medical image datasets, including polyp coloscopy images and breast ultrasound images. Ground truth for each image is represented by a green line. In a histogram, the horizontal axis represents the proportion of the entire image occupied by the lesion area, while the vertical axis indicates the proportion of samples with a particular lesion size relative to the total sample
downsampling operations and are hard to recover. Second, there is a significant class imbalance problem in the number of pixels between the foreground and background, leading to a biased network and suboptimal performance. Nevertheless, the ability of computer-aided diagnosis to identify small objects is highly desired, as early detection and diagnosis of small lesions are crucial for successful cancer prevention and treatment.

Nowadays, the development of medical image segmentation has greatly advanced due to the efficient feature extraction ability of convolutional neural networks (CNNs) [5–7]. Modern CNN-based methods typically utilize a U-shaped encoder–decoder structure, where the encoder extracts semantic information and the decoder restores resolution to facilitate segmentation. Additionally, skip connections are employed to compensate for detailed information. Advanced U-shaped works focus on designing novel encoding blocks [8–10] to enhance feature representation ability, adopting attention mechanisms to further recalibrate features [11, 12], extracting and fusing reasonable multi-scale context information to improve accuracy [13–15], and so on. Despite their promising performance, these methods share a common flaw, i.e., lacking the global contexts essential for better recognition of target objects.

Due to their superior ability to model global contexts, Transformer-based architectures have become popular in segmentation tasks while achieving promising performance. Recent works [16–18] utilize vision transformers (ViT) as a backbone to incorporate global information. Despite their good performance, ViT produces single-scale low-resolution features and has a very high computational cost, which hampers its performance in dense prediction. In contrast to ViT, the pyramid vision transformer (PVT) [19] inherits the advantages of both CNN and Transformer and produces hierarchical multi-scale features that are more favorable for segmentation. Unfortunately, Transformer-based methods destroy part of the local features when modeling global contexts, which may result in imprecise predictions for small objects.

In the field of small target segmentation, a couple of approaches have been devised to improve the sensitivity to small objects. They overcome the segmentation difficulties brought by small objects from multiple aspects, such as exploiting the complementarity between low-level spatial details and high-level semantics [20], multi-scale feature learning [21, 22], and augmenting spatial dimension strategies [23–25]. Although their skip connections can compensate for detail loss to some extent and even eliminate some irrelevant noise when additionally equipped with attention mechanisms, these methods are still insufficient, as some local contexts may be overwhelmed by dominant semantics due to feature misalignment issues. In addition, another important factor that has been overlooked is how to effectively restore the spatial information of downsampled features. Most methods adopt common upsampling operations, such as nearest-neighbor interpolation and bilinear interpolation, which may still lack the local spatial awareness needed to handle small object positions. As a result, they are not compatible with the recovery of target objects and produce suboptimal segmentation performance.

In this paper, we propose a novel locally enhanced transformer network (LET-Net) for medical image segmentation. By leveraging the merits of both Transformer and CNN, our LET-Net can accurately segment small objects and precisely sharpen local details. First, the PVT-based encoder produces hierarchical multi-scale features where low-level features tend to retain local details, while high-level features provide strong global representations. Second, to further emphasize detailed local contexts, we propose a feature-aligned local enhancement (FLE) module, which can learn discriminative local cues from adjacent-level features on the condition of feature alignment and then utilize a local enhancement block equipped with local receptive fields to further recalibrate features. Third, we design a progressive
local-induced decoder that contains cascaded local reconstruction and refinement (LRR) modules to achieve effective spatial recovery of high-level features under the adaptive guidance of reconstruction kernels and the optimization of a split-attention mechanism. Moreover, to alleviate the class imbalance between foreground and background, we design a mutual information loss based on an information-theoretic objective, which can impose task-relevant restrictions while reducing task-irrelevant noises.

The contributions of this paper mainly include:

(1) We put forward a novel LET-Net, which combines the strengths of Transformer and CNN for accurate medical image segmentation.
(2) We propose two novel modules, FLE and LRR, to enhance the sensitivity to small objects. FLE can extract discriminative local cues under the alignment of adjacent-level features, while LRR enables effective spatial recovery by guiding the upsampling of high-level features via its adaptive reconstruction kernels and recalibrating features through a split-attention mechanism.
(3) To mitigate the class imbalance caused by small targets, we design a mutual information loss, which enables our model to extract task-relevant information while reducing task-irrelevant noises.
(4) By evaluating our LET-Net on challenging colorectal polyp segmentation and ultrasound breast segmentation, we demonstrate its state-of-the-art segmentation ability and strong generalization capability.

2 Related work

2.1 Medical image segmentation

With the great development of deep learning, especially convolutional neural networks (CNNs), various CNN-based methods, such as U-Net [7], have significantly improved the performance of medical image segmentation. These approaches possess the popular U-shaped encoder–decoder structure. To further assist precise segmentation, a battery of innovative improvements based on the encoder–decoder architecture has emerged [26–30]. One direction is to design a new module for enhancing the encoder or decoder ability. For instance, Dai et al. [26] designed the Ms RED network, which, respectively, employs a multi-scale residual encoding fusion module (MsR-EFM) and a multi-scale residual decoding module (MsR-DFM) in the encoder and decoder stages to improve skin lesion segmentation. In the work [27], a selective receptive field module (SRFM) was designed to obtain suitable sizes of receptive fields, thereby boosting breast mass segmentation. Another direction is optimizing the skip connection to facilitate the recovery of spatial information. UNeXt [28] proposed an encoder–decoder structure involving convolutional stages and tokenized MLP stages, achieving better segmentation performance while also improving the inference speed. However, these methods directly fuse unaligned features from different levels, which may hamper accuracy, especially for small objects. In this paper, we propose a powerful feature-aligned local enhancement module, which ensures that feature maps at adjacent levels can be well aligned and then explores substantial local cues to optimally enhance the discriminative details.

2.2 Feature alignment

Feature alignment has drawn much attention and is now an active research topic in computer vision. Numerous researchers have devoted considerable effort to addressing this challenge [6, 31–37]. For instance, SegNet [6] utilized max-pooling indices computed in the encoder to perform an upsampling operation in the corresponding decoder stage. Mazzini et al. [32] proposed a guided upsampling module (GUM) that generates learnable guided offsets to enhance the upsampling operation. IndexNet [33] built a novel index-guided encoder–decoder structure in which pooling and upsampling operators are guided by self-learned indices. AlignSeg [34] learned 2D transformation offsets by a simple learnable interpolation strategy to alleviate feature misalignment. Huang et al. [35] designed the FaPN framework consisting of feature alignment and feature selection modules, achieving substantial and consistent performance improvements on dense prediction tasks. SFNet [31] presented a flow alignment module that effectively broadcasts high-level semantic features to high-resolution detail features by its semantic flow. Our method shares a similar aspect with the work [31], in which efficient spatial alignment is achieved by learning offsets. However, unlike these methods, we further enhance discriminative representations by subtraction under the premise of aligning low-resolution and high-resolution features, which facilitates excavating imperceptible local cues related to small objects.

2.3 Attention mechanism

Attention-based algorithms have been developed to assist in segmentation. In general, attention mechanisms can be categorized into channel attention, spatial attention, and self-attention according to different focus perspectives. Inspired by the success of SENet [38], various networks [39–41] have incorporated the squeeze-and-excitation (SE) module to recalibrate features by modeling channel relationships, thereby improving segmentation performance. Fu et al. [42] proposed a dual attention network (DANet), which
[Fig. 2: Overall architecture of LET-Net. A Transformer encoder (spatial-reduction multi-head attention, feed-forward, and norm layers repeated ×N, with positional embedding) feeds three FLE (feature-aligned local enhancement) modules producing F1–F3; the progressive local-induced decoder applies three cascaded LRR (local reconstruction and refinement) modules producing D1–D3 and the final prediction, supervised by LPPA and LVSD.]
alignment, should be fully considered in this procedure to prevent local contexts from being overshadowed by global contexts. To this end, we propose a feature-aligned local enhancement (FLE) module, in which informative detailed features are effectively captured under the premise of feature alignment, producing discriminative representation. The internal structure of FLE is illustrated in Fig. 3, and it consists of two steps: feature-aligned discriminative learning and local enhancement.

Feature-aligned discriminative learning Due to the information gap between semantics and resolution, feature representation is still suboptimal when directly upsampling high-level feature maps to guide low-level features. To obtain strong feature representations, more attention and effort should be given to the position offset between low-level and high-level features. Inspired by previous work [31], we propose a feature-aligned discriminative learning (FDL) block that aligns adjacent-level features and further excavates discriminative features, leading to high sensitivity to small objects. Within FDL, two 1 × 1 convolution layers are first employed to compress adjacent-level features (i.e., $E_i$ and $E_{i-1}$) into the same channel depth. Then, a semantic flow field is calculated by a 3 × 3 convolution operation, as described in Eq. 1:

$$\Delta_{i-1} = f_{3\times 3}\big(f_{1\times 1}(E_{i-1}) \,©\, U(f_{1\times 1}(E_i))\big), \tag{1}$$

where $f_{s\times s}(\cdot)$ indicates an s × s convolution layer followed by batch normalization and a ReLU activation function, while © and $U(\cdot)$, respectively, represent concatenation and upsampling operations. Next, according to the learned semantic flow $\Delta_{i-1}$, we obtain a feature-aligned high-resolution feature $\tilde{E}_i$ with semantic cues. Mathematically,

$$\tilde{E}_i = \mathrm{Warp}\big(f_{1\times 1}(E_i), \Delta_{i-1}\big), \tag{2}$$

where $\mathrm{Warp}(\cdot)$ indicates the mapping function, and $E_i$ is a $C_i$-dimensional feature map defined on the spatial grid $\Omega_i$ of the specific size $(H/2^{i+1}, W/2^{i+1})$. Schematically, as shown in Fig. 4, the warp procedure consists of two steps.
Fig. 3 The architecture of the feature-aligned local enhancement module, which performs two steps: First, feature-aligned discriminative learning initially produces a flow field to align adjacent features and then constructs a discriminative representation using subtraction and a residual connection. Second, local enhancement with a dense connection structure is adopted to highlight local details
In the first step, each point $p_{i-1}$ on the spatial grid $\Omega_{i-1}$ is mapped to $p_i$ on the low-resolution feature, which is formulated by Eq. 3:

$$p_i = \frac{p_{i-1} + \Delta_{i-1}(p_{i-1})}{2}. \tag{3}$$

It is worth mentioning that, due to the resolution gap between the flow field and the features (see Fig. 4), Eq. 3 contains a halving operation to reduce the resolution. In the second step, we adopt the differentiable bilinear sampling mechanism [49] to approximate the final feature $\tilde{E}_i$ by linearly interpolating the scores of the four neighboring points (top-right, top-left, bottom-right, and bottom-left) of $p_i$.

After that, to enhance the discriminative local context representation, we further utilize subtraction, absolute value, and residual learning procedures. Conclusively, the final optimized feature $\hat{E}_{i-1}$ can be expressed as follows:

$$\hat{E}_{i-1} = \big| E_{i-1} - \tilde{E}_i \big| + E_{i-1}. \tag{4}$$
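To make the FDL block more concrete, the following PyTorch sketch shows one way Eqs. (1)–(4) could be wired together, with F.grid_sample standing in for the bilinear warp of Eqs. (2)–(3). It is a minimal illustration rather than the authors' implementation: the module name, channel widths, the plain (non-normalized) offset head, and the offset normalization inside warp are assumptions.

```python
# Minimal sketch of feature-aligned discriminative learning (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FDL(nn.Module):
    """Align E_i to E_{i-1} with a predicted flow, then take |E_{i-1} - aligned E_i| + E_{i-1}."""
    def __init__(self, low_ch, high_ch, mid_ch=64):
        super().__init__()
        self.compress_low = conv_bn_relu(low_ch, mid_ch, 1)      # f_1x1 on E_{i-1}
        self.compress_high = conv_bn_relu(high_ch, mid_ch, 1)    # f_1x1 on E_i
        # Offset head kept as a plain conv so negative offsets are possible (sketch choice).
        self.flow = nn.Conv2d(2 * mid_ch, 2, 3, padding=1)       # Eq. 1 -> two-channel flow

    @staticmethod
    def warp(feat_low_res, flow):
        # Sample the low-resolution feature at the offset positions (Eqs. 2-3); the halving in
        # Eq. 3 is implicit in grid_sample's normalized [-1, 1] coordinates.
        n, _, h, w = flow.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=flow.device),
            torch.linspace(-1, 1, w, device=flow.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) / torch.tensor(
            [(w - 1) / 2.0, (h - 1) / 2.0], device=flow.device)  # assumes channel 0 = x, 1 = y
        return F.grid_sample(feat_low_res, grid + offset,
                             mode="bilinear", align_corners=True)

    def forward(self, e_low, e_high):
        low = self.compress_low(e_low)                            # E_{i-1} (high resolution)
        high = self.compress_high(e_high)                         # E_i (low resolution)
        up = F.interpolate(high, size=low.shape[2:],
                           mode="bilinear", align_corners=True)
        delta = self.flow(torch.cat([low, up], dim=1))            # Eq. 1
        aligned = self.warp(high, delta)                          # Eq. 2: feature-aligned E_i
        return torch.abs(low - aligned) + low                     # Eq. 4
```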
Local enhancement In the PVT-based encoder, attention is established between each patch, allowing information to be blended from all other patches, even if their correlation is not high. Meanwhile, since small targets only occupy a portion of the entire image, the global interaction in the transformer architecture cannot fully meet the requirements of small target segmentation, where more detailed local contexts are needed. Considering that the convolution operation with a fixed receptive field can blend the features of each patch's neighboring patches, we construct a local enhancement (LE) block to increase the weights associated with adjacent patches to the center patch using convolution, thereby emphasizing the local features of each patch.

As shown in Fig. 3, LE has a convolution-based structure and consists of four stages. Each stage includes a 3 × 3 convolutional layer followed by batch normalization and a ReLU activation layer (denoted as $f_{3\times 3}(\cdot)$). Additionally, dense connections are added to encourage feature reuse and strengthen local feature propagation. As a result, the feature map obtained by LE contains rich local contexts. Let $x_0$ denote the initial input; the output of the ith stage within LE can be formulated as follows:

$$x_i = \begin{cases} f_{3\times 3}(x_0), & i = 1,\\ f_{3\times 3}\big([x_0, \cdots, x_{i-1}]\big), & 2 \le i \le 4, \end{cases} \tag{5}$$

where $[\cdot]$ represents the concatenation operation. In summary, LE utilizes the local receptive field of the convolution operation and dense connections to achieve local enhancement.
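Assuming the LE block keeps a constant channel width across its four stages, a minimal PyTorch sketch of the densely connected structure in Eq. (5) could look as follows (returning the last stage's output is an assumption, not the authors' stated design):

```python
# Minimal sketch of the local enhancement (LE) block (not the authors' code).
import torch
import torch.nn as nn

class LocalEnhancement(nn.Module):
    """Four 3x3 conv-BN-ReLU stages with dense connections (Eq. 5)."""
    def __init__(self, channels, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for i in range(num_stages)   # stage i sees [x_0, ..., x_{i-1}] concatenated
        )

    def forward(self, x0):
        feats = [x0]
        for stage in self.stages:
            feats.append(stage(torch.cat(feats, dim=1)))   # dense connections
        return feats[-1]
```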
3.3 Progressive local-induced decoder

Efficient recovery of spatial information is critical in medical image segmentation, especially for small objects. Inspired by previous works [50, 51], we propose a progressive local-induced decoder to adaptively restore feature resolution and detailed information. As shown in Fig. 2, the decoder consists of three cascaded local reconstruction and refinement (LRR) modules. The internal structure of LRR is illustrated in Fig. 5, where two steps are performed: local-induced reconstruction (LR) and split-attention-based refinement (SAR).

Fig. 5 The structure of the local reconstruction and refinement module. It contains two blocks: local-induced reconstruction and split-attention-based refinement

Local-induced reconstruction LR aims to transfer the spatial detail information from low-level features into high-level features, thereby facilitating accurate spatial recovery of high-level features. As shown in Fig. 5, LR first produces a reconstruction kernel $\kappa \in \mathbb{R}^{k^2 \times H_{i-1} \times W_{i-1}}$ based on the low-level feature $F_{i-1}$ and the high-level feature $D_i$, in which k indicates the neighborhood size for reconstructing local features. The procedure of generating the reconstruction kernel $\kappa$ can be expressed as follows:
$$\kappa = \mathrm{Soft}\big(f_{3\times 3}\big(U(f_{1\times 1}(D_i)) \,©\, f_{1\times 1}(F_{i-1})\big)\big), \tag{6}$$

where $f_{s\times s}(\cdot)$ represents an s × s convolution layer followed by batch normalization and a ReLU activation function. $U(\cdot)$, ©, and $\mathrm{Soft}(\cdot)$, respectively, indicate upsampling, concatenation, and Softmax activation operations. Meanwhile, another 3 × 3 convolution and upsampling operation are applied on $D_i$ to obtain $\tilde{D}_i$ with the same resolution size as $F_{i-1}$; mathematically, $\tilde{D}_i = U(f_{3\times 3}(D_i))$. Note that $D_4 = E_4$ here. Next, we optimize pixel $\tilde{D}_i[u, v]$ under the guidance of the reconstruction kernel $\kappa_{[u,v]} \in \mathbb{R}^{k\times k}$, producing the refined local feature $\hat{D}_i[u, v]$. This can be written as Eq. 7, where $r = \lfloor k/2 \rfloor$:

$$\hat{D}_i[u, v] = \sum_{m=-r}^{r}\sum_{n=-r}^{r} \kappa_{[u,v]}[m, n] \times \tilde{D}_i[u + m, v + n]. \tag{7}$$

Subsequently, $\hat{D}_i$ and $F_{i-1}$ are concatenated together and then passed through two convolutional layers to produce an optimized feature. Conclusively, LR overcomes the limitations of traditional upsampling operations in precisely recovering pixel-wise prediction, since it takes full advantage of low-level features to adaptively predict the reconstruction kernel and then effectively combines semantic contexts with spatial information toward accurate spatial recovery. This can strengthen the recognition of small objects.
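The kernel-prediction and reassembly steps of Eqs. (6)–(7) can be sketched with unfold, similar in spirit to content-aware reassembly operators. Channel sizes, the placement of the 3 × 3 refinement, and the omission of the final concatenation with F_{i−1} are assumptions made to keep the sketch short; it is not the authors' implementation.

```python
# Minimal sketch of local-induced reconstruction (LR) (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReconstruction(nn.Module):
    """Predict a k x k kernel per pixel (Eq. 6) and reassemble the upsampled feature (Eq. 7)."""
    def __init__(self, low_ch, high_ch, mid_ch=64, k=3):
        super().__init__()
        self.k = k
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, 1)
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, 1)
        self.kernel_pred = nn.Conv2d(2 * mid_ch, k * k, 3, padding=1)   # k^2 weights per pixel
        self.smooth_high = nn.Conv2d(high_ch, low_ch, 3, padding=1)     # f_3x3 before upsampling

    def forward(self, f_low, d_high):
        h, w = f_low.shape[2:]
        up = F.interpolate(self.reduce_high(d_high), size=(h, w),
                           mode="bilinear", align_corners=True)
        kernel = torch.softmax(
            self.kernel_pred(torch.cat([up, self.reduce_low(f_low)], dim=1)), dim=1)   # Eq. 6
        d_tilde = F.interpolate(self.smooth_high(d_high), size=(h, w),
                                mode="bilinear", align_corners=True)                   # U(f_3x3(D_i))
        # Eq. 7: weighted sum over each k x k neighborhood of the upsampled feature.
        n, c, _, _ = d_tilde.shape
        patches = F.unfold(d_tilde, self.k, padding=self.k // 2).view(
            n, c, self.k * self.k, h, w)
        d_hat = (patches * kernel.unsqueeze(1)).sum(dim=2)
        # The module would then concatenate d_hat with f_low and apply two further convolutions.
        return d_hat
```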
Split-attention-based refinement To enhance feature representation, we implement an SAR block in which grouped sub-features are further split and fed into two parallel branches to capture channel dependencies and pixel-level pairwise relationships through two types of attention mechanisms. As shown in Fig. 5, SAR is composed of two basic components: a spatial attention block and a channel attention block. Given an input feature map M, SAR first divides it along the channel dimension to produce $M = \{M_1, M_2, \cdots, M_G\}$. For each $M_i$, valuable responses are specified by attention mechanisms. Specifically, $M_i$ is split into two features, denoted as $M_i^1$ and $M_i^2$, which are separately fed into the channel attention block and the spatial attention block to reconstruct features. This allows our model to focus on “what” and “where” are valuable through these two blocks.

In the channel attention block, global average pooling (denoted as $\mathrm{GAP}(\cdot)$) is performed to produce channel-wise statistics, which can be formulated as

$$S = \mathrm{GAP}(M_i^1) = \frac{1}{H\times W}\sum_{m=1}^{H}\sum_{n=1}^{W} M_i^1(m, n). \tag{8}$$

Then, channel-wise dependencies are captured according to the guidance of a compact feature, which is generated by a Sigmoid function (i.e., $\mathrm{Sig}(\cdot)$). Mathematically,

$$\tilde{M}_i^1 = \mathrm{Sig}\big(W_1 \times S + b_1\big) \times M_i^1, \tag{9}$$

in which the parameters $W_1$ and $b_1$ are used for scaling and shifting S.

In the spatial attention block, spatial-wise statistics are calculated using Group Norm (GN) [52] on $M_i^2$. The pixel-wise representation is then strengthened by another compact feature calculated by two parameters $W_2$ and $b_2$ and a Sigmoid function. This process can be expressed as

$$\tilde{M}_i^2 = \mathrm{Sig}\big(W_2 \times \mathrm{GN}(M_i^2) + b_2\big) \times M_i^2. \tag{10}$$

Next, $\tilde{M}_i^1$ and $\tilde{M}_i^2$ are optimized by an additional consistency embedding path and then concatenated. This procedure is represented as

$$\tilde{M}_i = \big(\tilde{M}_i^1 + M_i^1 \times M_i^2\big) \,©\, \big(\tilde{M}_i^2 + M_i^1 \times M_i^2\big). \tag{11}$$

After aggregating all sub-features, a channel shuffle [53] is performed to facilitate cross-group information exchange along the channel dimension.
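A compact sketch of an SAR-style block along the lines of Eqs. (8)–(11), with grouping and a channel shuffle, is given below. The group count, the sharing of affine parameters across groups, and the exact form of the consistency-embedding term follow one reading of Eq. (11) and are assumptions, not the authors' released code.

```python
# Minimal sketch of split-attention-based refinement (SAR) (not the authors' code).
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class SAR(nn.Module):
    """Grouped split attention (Eqs. 8-11) followed by a channel shuffle."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        half = channels // (2 * groups)            # channels must be divisible by 2 * groups
        self.w1 = nn.Parameter(torch.ones(1, half, 1, 1))   # W_1, b_1 in Eq. 9
        self.b1 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, half, 1, 1))   # W_2, b_2 in Eq. 10
        self.b2 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.gn = nn.GroupNorm(half, half)

    def forward(self, m):
        n, c, h, w = m.shape
        m = m.view(n * self.groups, c // self.groups, h, w)
        m1, m2 = m.chunk(2, dim=1)                                    # split each group M_i
        s = m1.mean(dim=(2, 3), keepdim=True)                         # Eq. 8: GAP
        m1_att = torch.sigmoid(self.w1 * s + self.b1) * m1            # Eq. 9: channel attention
        m2_att = torch.sigmoid(self.w2 * self.gn(m2) + self.b2) * m2  # Eq. 10: spatial attention
        cross = m1 * m2                                               # consistency embedding (Eq. 11)
        out = torch.cat([m1_att + cross, m2_att + cross], dim=1).reshape(n, c, h, w)
        return channel_shuffle(out, self.groups)                      # cross-group exchange
```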
3.4 Mutual information loss

As stated in the previous study [54], training models with only pixel-wise loss may limit segmentation performance, especially resulting in prediction errors for small objects. This is due to the class imbalance between foreground and background, such that task-relevant information is overwhelmed by irrelevant noise. Therefore, to facilitate preserving task-relevant information, we explore novel supervision at the feature level to further assist accurate segmentation. Let X and Y denote the input medical image and its corresponding ground truth, respectively. Z represents the deep feature extracted from input X.

Mutual information (MI) Mutual information is a fundamental quantity that measures the amount of information shared between two random variables. Mathematically, the statistical dependency of Y and Z can be quantified by MI, which is expressed as

$$I(Y;Z) = \mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y,Z)}{p(Y)\,p(Z)}\right], \tag{12}$$

where p(Y, Z) is the joint probability distribution of Y and Z, while p(Z) and p(Y) are their marginals.
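As a quick numerical illustration of Eq. (12), mutual information can be computed directly for a small discrete joint distribution (a toy example, separate from the continuous feature-level estimation used for training):

```python
# Toy illustration of Eq. (12) on a discrete joint distribution table.
import numpy as np

def mutual_information(p_yz):
    """I(Y;Z) in nats for a discrete joint distribution p(Y, Z)."""
    p_y = p_yz.sum(axis=1, keepdims=True)
    p_z = p_yz.sum(axis=0, keepdims=True)
    mask = p_yz > 0
    return float(np.sum(p_yz[mask] * np.log(p_yz[mask] / (p_y @ p_z)[mask])))

# Perfectly correlated binary Y and Z -> I = log 2; independent -> I = 0.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # ~0.693
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```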
Mutual information loss Our primary objective is to maximize the amount of task-relevant information about Y in the latent feature Z while reducing irrelevant information. This is achieved by two mutual information terms [55, 56]. Formally,

$$\mathrm{IB}(Y, Z) = \operatorname{Max}\; I(Z;Y) - I(Z;X). \tag{13}$$

Owing to the notorious difficulty of the conditional MI computations, these terms are estimated by existing MI estimators [56, 57]. In detail, the first term is accomplished through
the use of the Pixel Position Aware (PPA) loss [57] ($L_{PPA}$). Since the PPA loss assigns different weights to different positions, it can better explore task-relevant structure information and give more attention to important details. The second term is estimated by Variational Self-Distillation (VSD) [56] ($L_{VSD}$), which uses KL-divergence to compress Z and remove irrelevant noises, thereby addressing the effect of imbalances in the number of foreground and background pixels caused by small targets. Thus, our total loss can be expressed as

$$L_{total} = L_{PPA} + L_{VSD}. \tag{14}$$
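A rough sketch of how the total objective in Eq. (14) might be assembled is given below. The $L_{PPA}$ term follows the weighted BCE plus weighted IoU formulation used in [57]; the pooling window and weighting factor are assumptions, and the VSD term is treated as an externally computed scalar, since its estimator depends on the feature-distillation setup.

```python
# Hedged sketch of a PPA-style term and the total loss of Eq. (14) (not the authors' code).
import torch
import torch.nn.functional as F

def ppa_loss(logits, mask):
    """Position-aware weighted BCE + weighted IoU; weights stress hard/boundary pixels."""
    weight = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))
    prob = torch.sigmoid(logits)
    inter = (prob * mask * weight).sum(dim=(2, 3))
    union = ((prob + mask) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(logits, mask, l_vsd, vsd_weight=1.0):
    # Eq. 14: L_total = L_PPA + L_VSD (the weighting factor is an assumption).
    return ppa_loss(logits, mask) + vsd_weight * l_vsd
```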
other state-of-the-art methods on polyp benchmarks. For breast lesion segmentation, we adopt four widely used metrics, including Accuracy, Jaccard index, Precision, and Dice, to validate the segmentation performance in our study. Theoretically, high scores for all metrics indicate better results.

4.2 Experimental results

To investigate the effectiveness of our proposed method, we validate LET-Net in two applications: polyp segmentation from coloscopy images and breast lesion segmentation from ultrasound images.
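For reference, the four reported metrics reduce to confusion-matrix ratios on binary masks; a minimal sketch follows (any thresholding policy and the averaging over the test set are left out):

```python
# Minimal sketch of the evaluation metrics on binary masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """Accuracy, Jaccard (IoU), Precision, and Dice for binary numpy masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "Jaccard": tp / (tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```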
Table 1 Quantitative comparison of different methods on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
FCN [5] 0.825 0.747 0.775 0.686 0.578 0.481 0.379 0.313 0.660 0.558
U-Net [7] 0.842 0.775 0.818 0.746 0.512 0.444 0.398 0.335 0.710 0.627
UNet++ [30] 0.846 0.774 0.821 0.743 0.599 0.499 0.456 0.375 0.707 0.624
AttentionU-Net [11] 0.809 0.744 0.782 0.694 0.614 0.524 0.440 0.360 0.686 0.580
DCRNet [58] 0.896 0.844 0.886 0.825 0.704 0.631 0.556 0.496 0.856 0.788
SegNet [8] 0.915 0.857 0.878 0.814 0.647 0.570 0.612 0.529 0.841 0.773
SFA [1] 0.700 0.607 0.723 0.611 0.469 0.347 0.297 0.217 0.467 0.329
PraNet [59] 0.899 0.849 0.898 0.840 0.709 0.640 0.628 0.567 0.871 0.797
ACSNet [39] 0.912 0.858 0.907 0.850 0.709 0.643 0.609 0.537 0.862 0.784
EU-Net [60] 0.902 0.846 0.908 0.854 0.756 0.681 0.687 0.609 0.837 0.765
SANet [20] 0.916 0.859 0.904 0.847 0.753 0.670 0.750 0.654 0.888 0.815
BLE-Net [61] 0.926 0.878 0.905 0.854 0.731 0.658 0.673 0.594 0.879 0.805
CaraNet [22] 0.936 0.887 0.918 0.865 0.773 0.689 0.747 0.672 0.903 0.838
SETR-PUP [18] 0.934 0.885 0.911 0.854 0.773 0.690 0.726 0.646 0.889 0.814
TransUnet [43] 0.935 0.887 0.913 0.857 0.781 0.699 0.731 0.660 0.893 0.824
LET-Net(Ours) 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839
contexts. In addition, we introduce mutual information loss as an assistant to learning task-relevant representation. Furthermore, we find that our LET-Net successfully deals with other challenging cases, including cluttered backgrounds (Fig. 6(b), (c), (g), (i)) and low contrast (Fig. 6(a), (h)). For example, as illustrated in Fig. 6(b), (i), ACSNet [39] and PraNet [59] misidentify background tissues as polyps, but our LET-Net overcomes this drawback. Due to combining the strengths of CNN and Transformer, our LET-Net produces good segmentation performance in these scenarios. Overall, our model achieves leading performance.

4.2.2 Breast lesion segmentation

Quantitative comparison To further evaluate the effectiveness of our method, we conduct extensive experiments in breast lesion segmentation and perform a comparative analysis with ten segmentation approaches. Table 2 presents the detailed quantitative comparison among different methods on the BUSIS dataset. Obviously, our LET-Net exhibits excellent performance in both benign and malignant lesion segmentation. In benign lesion segmentation, LET-Net achieves 97.7% Accuracy, 74% Jaccard, 83.5% Precision, and 81.5% Dice. Compared with other competitors, LET-Net significantly outperforms them by a large margin. In detail, it exceeds C-Net [2], CPF-Net [29], and PraNet [59] by 1.6%, 4.1%, and 4.9%, respectively, in terms of Jaccard. Meanwhile, in malignant lesion segmentation, we obtain an Accuracy score of 93% and a Dice score of 72.7%, demonstrating the superiority of our LET-Net over other methods. In particular, LET-Net presents a significant improvement of 1.8% in Jaccard and 2.8% in Dice compared with C-Net [2]. The reason behind this is that although C-Net constructs a bidirectional attention guidance network to capture both global and local features, long-range dependencies are not fully modeled due to the limitations of convolution.

Visual comparison To intuitively demonstrate the performance of our model, we present segmentation results of different methods in Fig. 7. We observe that other methods often produce segmentation maps with incomplete lesion structures or false positives, while our prediction maps are superior to others. This is mainly due to our FLE's ability to facilitate discriminative local feature learning and the effectiveness of our proposed LRR module for spatial reconstruction. In addition, it is worth noting that our LET-Net performs well in handling various shapes [Fig. 7(a)–(h)] and low-contrast images [Fig. 7(d), (h)], which can be attributed to the powerful and robust feature learning ability of LET-Net.
Fig. 6 Visualization results of our LET-Net and several other methods on five polyp datasets. From top to bottom, the images are from CVC-ClinicDB, Kvasir, CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300, which are separated by red dashed lines
Table 2 Quantitative comparison of different methods on the BUSIS dataset

Method  Benign (Accuracy, Jaccard, Precision, Dice)  Malignant (Accuracy, Jaccard, Precision, Dice)
U-Net [7] 0.966 0.615 0.750 0.705 0.901 0.511 0.650 0.635
STAN [21] 0.969 0.643 0.744 0.723 0.910 0.511 0.647 0.626
AttentionU-Net [11] 0.969 0.650 0.752 0.733 0.912 0.511 0.616 0.630
Abraham et al. [68] 0.969 0.667 0.767 0.748 0.915 0.541 0.675 0.658
UNet++ [30] 0.971 0.683 0.759 0.756 0.915 0.540 0.655 0.655
UNet3+ [69] 0.971 0.676 0.756 0.751 0.916 0.548 0.658 0.662
SegNet [8] 0.972 0.679 0.770 0.755 0.922 0.549 0.638 0.659
PraNet [59] 0.972 0.691 0.799 0.763 0.925 0.582 0.763 0.698
CPF-Net [29] 0.973 0.699 0.801 0.766 0.927 0.605 0.755 0.716
C-Net [2] 0.975 0.724 0.827 0.794 0.926 0.597 0.757 0.699
LET-Net(Ours) 0.977 0.740 0.835 0.815 0.930 0.615 0.772 0.727
4.3 Ablation study

In this section, we conduct a series of ablation studies to verify the effectiveness of each critical component in our proposed LET-Net, including FLE, LRR, and mutual information loss.
Fig. 7 Visual comparison among different methods in breast lesion segmentation, where the segmentation results of benign and malignant
lesions are separated by a red dashed line
4.3.1 Impact of FLE and LRR modules

To validate the effectiveness of the FLE and LRR modules, we remove them individually from our full network, resulting in two variants, namely w/o FLE and w/o LRR. As shown in Table 3, the variant without FLE (w/o FLE) achieves a 93.6% mDice score on the CVC-ClinicDB dataset. When we apply the FLE module, the mDice score increases to 94.5%. Moreover, it boosts mDice by 1.6%, 2.2%, and 2.3% on the CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300 datasets, respectively. These results indicate that our FLE module effectively supports accurate segmentation due to its ability to learn discriminative local features under the feature alignment condition. Furthermore, when comparing the second and third lines of Table 3, it can be seen that the LRR module is also conducive to segmentation, with performance gains of 1.6% and 1.7% in terms of mDice and mIoU on the Kvasir dataset.
Table 3 Ablation study of the FLE and LRR modules on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
w/o FLE 0.936 0.887 0.918 0.871 0.779 0.698 0.751 0.674 0.884 0.816
w/o LRR 0.940 0.894 0.910 0.859 0.790 0.711 0.759 0.681 0.890 0.821
LET-Net 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839

Table 4 Ablation study of the mutual information loss on five polyp datasets (mDice and mIoU)

Method  CVC-ClinicDB (mDice, mIoU)  Kvasir (mDice, mIoU)  CVC-ColonDB (mDice, mIoU)  ETIS-LaribPolypDB (mDice, mIoU)  CVC-300 (mDice, mIoU)
w/o LPPA 0.937 0.888 0.917 0.864 0.782 0.697 0.737 0.663 0.885 0.812
w/o LVSD 0.940 0.892 0.923 0.872 0.785 0.702 0.762 0.688 0.895 0.826
w/o LPPA & LVSD 0.934 0.882 0.914 0.861 0.772 0.692 0.716 0.648 0.879 0.807
LET-Net 0.945 0.899 0.926 0.876 0.795 0.717 0.773 0.698 0.907 0.839
The main reason is that the LRR module is capable of effective spatial recovery via its dynamic reconstruction kernels and split-attention mechanism, thereby facilitating segmentation.

information loss on the CVC-ColonDB dataset. In summary, our experimental results fully demonstrate that the mutual information loss is beneficial for LET-Net.
References

1. Fang, Y., Chen, C., Yuan, Y., Tong, R.K.: Selective feature aggregation network with area-boundary constraints for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 302–310 (2019). https://doi.org/10.1007/978-3-030-32239-7_34
2. Chen, G., Dai, Y., Zhang, J.: C-net: Cascaded convolutional neural network with global guidance and refinement residuals for breast ultrasound images segmentation. Comput. Methods Programs Biomed. 225, 107086 (2022)
3. Thomas, E., Pawan, S., Kumar, S., Horo, A., Niyas, S., Vinayagamani, S., Kesavadas, C., Rajan, J.: Multi-res-attention unet: a cnn model for the segmentation of focal cortical dysplasia lesions from magnetic resonance images. IEEE J. Biomed. Health Informat. 25(5), 1724–1734 (2020)
4. Wang, R., Lei, T., Cui, R., Zhang, B., Meng, H., Nandi, A.K.: Medical image segmentation using deep learning: A survey. IET Image Process. 16(5), 1243–1267 (2022). https://doi.org/10.1049/ipr2.12419
5. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015). https://doi.org/10.1109/CVPR.2015.7298965
6. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). https://doi.org/10.1109/TPAMI.2016.2644615
7. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28
8. Lou, A., Guan, S., Loew, M.: Cfpnet-m: A light-weight encoder-decoder based network for multimodal biomedical image real-time segmentation. Comput. Biol. Med. 154, 106579 (2023)
9. Xie, X., Pan, X., Zhang, W., An, J.: A context hierarchical integrated network for medical image segmentation. Comput. Elect. Eng. 101, 108029 (2022). https://doi.org/10.1016/j.compeleceng.2022.108029
10. Wang, R., Ji, C., Zhang, Y., Li, Y.: Focus, fusion, and rectify: Context-aware learning for covid-19 lung infection segmentation. IEEE Trans. Neural Netw. Learn. Syst. 33(1), 12–24 (2021)
11. Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
12. Cheng, J., Tian, S., Yu, L., Lu, H., Lv, X.: Fully convolutional attention network for biomedical image segmentation. Artif. Intell. Med. 107, 101899 (2020)
13. Wang, X., Jiang, X., Ding, H., Liu, J.: Bi-directional dermoscopic feature learning and multi-scale consistent decision fusion for skin lesion segmentation. IEEE Trans. Image Process. 29, 3039–3051 (2019)
14. Wang, X., Li, Z., Huang, Y., Jiao, Y.: Multimodal medical image segmentation using multi-scale context-aware network. Neurocomputing 486, 135–146 (2022). https://doi.org/10.1016/j.neucom.2021.11.017
15. Liang, X., Li, N., Zhang, Z., Xiong, J., Zhou, S., Xie, Y.: Incorporating the hybrid deformable model for improving the performance of abdominal ct segmentation via multi-scale feature fusion network. Med. Image Anal. 73, 102156 (2021)
16. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
17. Li, Y., Wang, Z., Yin, L., Zhu, Z., Qi, G., Liu, Y.: X-net: a dual encoding–decoding method in medical image segmentation. The Visual Computer, pp. 1–11 (2021)
18. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)
19. Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
20. Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 699–708 (2021). Springer
21. Shareef, B., Xian, M., Vakanski, A.: Stan: Small tumor-aware network for breast ultrasound image segmentation. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1–5 (2020)
22. Lou, A., Guan, S., Ko, H., Loew, M.H.: Caranet: context axial reverse attention network for segmentation of small medical objects. In: Medical Imaging 2022: Image Processing, vol. 12032, pp. 81–92 (2022)
23. Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: Towards accurate segmentation of biomedical images using over-complete representations. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 363–373 (2020). Springer
24. Pang, Y., Zhao, X., Xiang, T.-Z., Zhang, L., Lu, H.: Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2160–2170 (2022)
25. Jia, Q., Yao, S., Liu, Y., Fan, X., Liu, R., Luo, Z.: Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4713–4722 (2022)
26. Dai, D., Dong, C., Xu, S., Yan, Q., Li, Z., Zhang, C., Luo, N.: Ms red: A novel multi-scale residual encoding and decoding network for skin lesion segmentation. Med. Image Anal. 75, 102293 (2022)
27. Xu, C., Qi, Y., Wang, Y., Lou, M., Pi, J., Ma, Y.: Arf-net: An adaptive receptive field network for breast mass segmentation in whole mammograms and ultrasound images. Biomed. Signal Process. Control 71, 103178 (2022)
28. Valanarasu, J.M.J., Patel, V.M.: Unext: Mlp-based rapid medical image segmentation network. arXiv preprint arXiv:2203.04967 (2022)
29. Feng, S., Zhao, H., Shi, F., Cheng, X., Wang, M., Ma, Y., Xiang, D., Zhu, W., Chen, X.: Cpfnet: Context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging 39(10), 3008–3018 (2020). https://doi.org/10.1109/TMI.2020.2983721
30. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 39(6), 1856–1867 (2019)
31. Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., Tong, Y.: Semantic flow for fast and accurate scene parsing. In: European Conference on Computer Vision, pp. 775–793 (2020)
32. Mazzini, D.: Guided upsampling network for real-time semantic segmentation. arXiv preprint arXiv:1807.07466 (2018)
33. Lu, H., Dai, Y., Shen, C., Xu, S.: Indices matter: Learning to index for deep image matting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3266–3275 (2019)
34. Huang, Z., Wei, Y., Wang, X., Liu, W., Huang, T.S., Shi, H.: Alignseg: Feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 550–557 (2021)
35. Huang, S., Lu, Z., Cheng, R., He, C.: Fapn: Feature-aligned pyramid network for dense image prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 864–873 (2021)
36. Wu, J., Pan, Z., Lei, B., Hu, Y.: Fsanet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–17 (2022)
37. Hu, H., Chen, Y., Xu, J., Borse, S., Cai, H., Porikli, F., Wang, X.: Learning implicit feature alignment function for semantic segmentation. In: European Conference on Computer Vision, pp. 487–505 (2022)
38. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
39. Zhang, R., Li, G., Li, Z., Cui, S., Qian, D., Yu, Y.: Adaptive context selection for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–262 (2020). https://doi.org/10.1007/978-3-030-59725-2_25
40. Tomar, N.K., Jha, D., Riegler, M.A., Johansen, H.D., Johansen, D., Rittscher, J., Halvorsen, P., Ali, S.: Fanet: A feedback attention network for improved biomedical image segmentation. IEEE Trans. Neural Netw. Learn. Syst. (2022)
41. Shen, Y., Jia, X., Meng, M.Q.-H.: Hrenet: A hard region enhancement network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–568 (2021)
42. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154 (2019)
43. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
44. Zhang, Y., Liu, H., Hu, Q.: Transfuse: Fusing transformers and cnns for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 14–24 (2021)
45. He, X., Tan, E.-L., Bi, H., Zhang, X., Zhao, S., Lei, B.: Fully transformer network for skin lesion analysis. Med. Image Anal. 77, 102357 (2022)
46. Yuan, F., Zhang, Z., Fang, Z.: An effective cnn and transformer complementary network for medical image segmentation. Pattern Recogn. 136, 109228 (2023)
47. Heidari, M., Kazerouni, A., Soltany, M., Azad, R., Aghdam, E.K., Cohen-Adad, J., Merhof, D.: Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6202–6212 (2023)
48. Wu, H., Chen, S., Chen, G., Wang, W., Lei, B., Wen, Z.: Fat-net: Feature adaptive transformers for automated skin lesion segmentation. Med. Image Anal. 76, 102327 (2022)
49. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28 (2015)
50. Song, J., Chen, X., Zhu, Q., Shi, F., Xiang, D., Chen, Z., Fan, Y., Pan, L., Zhu, W.: Global and local feature reconstruction for medical image segmentation. IEEE Trans. Med. Imaging (2022)
51. Zhang, Q.-L., Yang, Y.-B.: Sa-net: Shuffle attention for deep convolutional neural networks. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2235–2239 (2021)
52. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
53. Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
54. Zhao, S., Wang, Y., Yang, Z., Cai, D.: Region mutual information loss for semantic segmentation. Adv. Neural Inf. Process. Syst. 32 (2019)
55. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
56. Tian, X., Zhang, Z., Lin, S., Qu, Y., Xie, Y., Ma, L.: Farewell to mutual information: Variational distillation for cross-modal person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1522–1531 (2021)
57. Wei, J., Wang, S., Huang, Q.: F3net: fusion, feedback and focus for salient object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12321–12328 (2020)
58. Yin, Z., Liang, K., Ma, Z., Guo, J.: Duplex contextual relation network for polyp segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pp. 1–5 (2022). https://doi.org/10.1109/ISBI52829.2022.9761402
59. Fan, D.-P., Ji, G.-P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L.: Pranet: Parallel reverse attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 263–273 (2020). https://doi.org/10.1007/978-3-030-59725-2_26
60. Patel, K., Bur, A.M., Wang, G.: Enhanced u-net: A feature enhancement network for polyp segmentation. In: 2021 18th Conference on Robots and Vision (CRV), pp. 181–188 (2021). https://doi.org/10.1109/CRV52889.2021.00032
61. Ta, N., Chen, H., Lyu, Y., Wu, T.: Ble-net: boundary learning and enhancement network for polyp segmentation. Multimed. Syst. 1–14 (2022)
62. Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., Gil, D., Rodríguez, C., Vilariño, F.: Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111 (2015). https://doi.org/10.1016/j.compmedimag.2015.02.007
63. Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., Lange, T.d., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: International Conference on Multimedia Modeling, pp. 451–462 (2020)
64. Tajbakhsh, N., Gurudu, S.R., Liang, J.: Automated polyp detection in colonoscopy videos using shape and context information. IEEE Trans. Med. Imaging 35(2), 630–644 (2016). https://doi.org/10.1109/TMI.2015.2487997
65. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B.: Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg. 9(2), 283–293 (2014). https://doi.org/10.1007/s11548-013-0926-3
66. Vázquez, D., Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., López, A.M., Romero, A., Drozdzal, M., Courville, A.C.: A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng. (2017)
67. Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A.: Dataset of breast ultrasound images. Data in Brief 28, 104863 (2020)
68. Abraham, N., Khan, N.M.: A novel focal tversky loss function with improved attention u-net for lesion segmentation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687 (2019)
69. Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.-W., Wu, J.: Unet 3+: A full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1055–1059 (2020)