
This article has been accepted for publication in IEEE Transactions on Medical Imaging. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TMI.2022.3230943.

MISSFormer: An Effective Transformer for 2D Medical Image Segmentation

Xiaohong Huang, Zhifang Deng, Dandan Li, Xueguang Yuan, and Ying Fu

Abstract— Transformer-based methods have recently become popular in vision tasks because of their capability to model global dependencies alone. However, this limits the performance of networks due to the lack of modeling of local context and of the global-local correlations of multi-scale features. In this paper, we present MISSFormer, a Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) the feed-forward network in the transformer block of the U-shaped encoder-decoder structure is redesigned as ReMix-FFN, which explores global dependencies and local context for better feature discrimination by re-integrating the local context and global dependencies; 2) a ReMixed Transformer Context Bridge is proposed to extract the correlations of global dependencies and local context in the multi-scale features generated by our hierarchical transformer encoder. MISSFormer shows a solid capacity to capture more discriminative dependencies and context in medical image segmentation. The experiments on multi-organ, cardiac and retinal vessel segmentation tasks demonstrate the superiority, effectiveness and robustness of our MISSFormer. Specifically, the experimental results of MISSFormer trained from scratch even outperform state-of-the-art methods pre-trained on ImageNet, and the core designs can be generalized to other visual segmentation tasks. The code has been released on GitHub: https://github.com/ZhifangDeng/MISSFormer.

Index Terms— Context bridge, ReMix-FFN, global dependencies, local context, segmentation.

I. INTRODUCTION

With the improvement of medical treatment and people's health awareness, the requirements of precise medical image analysis (such as preoperative evaluation and auxiliary diagnosis) have become more urgent. As a crucial step of such analysis, medical image segmentation arouses broad concern. The purpose of medical image segmentation is to classify the pixels of lesions or organs in a given medical image by algorithms.

There exists much work in the medical image segmentation field. As classical methods, CNN-based methods have made a great contribution to accurate medical image segmentation results. Since fully convolutional networks (FCNs) [1] opened a door for semantic segmentation, the U-shaped networks [2], [3], as one of its variants, achieved promising performance in medical image segmentation through the improvement of the skip connection. Following this elegant U-shaped architecture, the variants of U-Net [4]–[6] achieved good performance and impressive results. Although the above methods brought positive performance and prevalence, they suffer from a limitation in modeling global dependencies because of the locality of the convolution operation [7], [8], which keeps them from the goal of precise medical image analysis. To overcome this limitation, one type of method uses dilated convolution [9], [10] and pyramid pooling [11] to enlarge the receptive field as much as possible, and another type [12], [13] tries to employ a few self-attention layers on high-level semantic feature maps, owing to the quadratic relationship between self-attention computational complexity and feature map size. However, these methods remain insufficient to capture abundant global dependencies.

Recently, the success of transformers [14], which capture global dependencies, has made it possible to solve the above problems. In particular, research on vision transformers [15]–[21] is in full swing and has achieved promising performance in vision tasks, encouraged by the great success of the transformer in natural language processing (NLP). Corresponding to the transformer in NLP, the vision transformer (ViT) [16] feeds an image into a standard transformer with positional embeddings by dividing it into non-overlapping patches, and achieved performance comparable to CNN-based methods. Pyramid vision transformer (PVT) [17] and Swin Transformer [15] proposed hierarchical transformers to explore the vision transformer with spatial reduction attention (SRA) and window-based attention, respectively. Besides, the attempt of the SEgmentation TRansformer (SETR) [21] at semantic segmentation proved the potential of the transformer in visual tasks once again.

Inspired by these works, the transformer has also been studied in the field of medical image segmentation. TransUNet [22] and CoTr [8] complement global dependencies by adding transformer layers to CNNs. Swin-Unet [7] is a pure transformer that captures global dependencies by referring to the Swin Transformer [15]. However, simply adding transformer layers to the high-level semantic layers of a CNN makes it difficult to capture sufficient global dependencies in medical images. Secondly, window-based attention also leads to a lack of global dependencies due to its window mechanism. Thirdly, some transformer blocks use a multi-layer perceptron (MLP) as their feed-forward network (FFN), which ignores local context modeling.

This work is supported by the Research Fund of Innovation and Transformation of Haidian District (No. HDCXZHKC2021201). Authors Xiaohong Huang, Zhifang Deng and Dandan Li are with the School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China. Email: {huangxh, dengzfong, dandl}@bupt.edu.cn. Author Xueguang Yuan is with the School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China. Email: [email protected]. Author Ying Fu is with the Department of Ultrasound, Peking University Third Hospital, Beijing 100191, China. Email: [email protected]. (Corresponding Author: Dandan Li.)


In this paper, MISSFormer, an effective and powerful Medical Image Segmentation tranSFormer, is proposed to integrate global dependencies and local context to produce accurate medical image segmentation results. MISSFormer is based on the U-shaped architecture, whose redesigned transformer block enhances the feature representations. MISSFormer captures global dependencies by applying global attention on the feature map and integrates global information and local context through the redesigned ReMix-FFN layer. Finally, it models the global and local information over full-scale feature maps through the ReMixed Transformer Context Bridge, so as to generate pixel-wise segmentation predictions with the decoder. The main contributions of this paper can be summarized as follows:
• We propose MISSFormer, a position-embedding-free and hierarchical U-shaped transformer for medical image segmentation.
• To explore global dependencies and local context for better feature discrimination, we propose ReMix-FFN, which re-integrates the local context and global dependencies.
• To improve the responsiveness of relevant features, we devise a ReMixed Transformer Context Bridge that captures the global-local correlations of multi-scale features.
• To demonstrate the effectiveness and robustness of the proposed MISSFormer, we carry out experiments and obtain competitive results on medical image segmentation datasets.

The remainder of this paper is organized as follows. Section II gives a review of medical image segmentation methods and transformers. Section III introduces the proposed MISSFormer in detail. Section IV presents the performance of MISSFormer, including the experiment settings, ablation studies and comparisons. Section V draws the conclusion of this paper.

II. RELATED WORK

A. Medical Image Segmentation

The U-shaped networks [2] have played a cornerstone role in medical image segmentation tasks because of their superior performance and elegant structure. Due to the rapid development of computer vision tasks [23], [24], medical image segmentation drew lessons from their key insights. For example, the ResNet [23] architecture became a general encoder backbone for medical image segmentation networks, and dilated convolution and pyramid pooling were utilized to enlarge the receptive field for lesion and organ segmentation [9], [10]. Besides, various attention mechanisms proved effective in promoting segmentation performance: reverse attention [25] was applied to accurate polyp segmentation [26], squeeze-and-excitation attention [27] was integrated to refine channel information for segmenting vessels in retinal images [28], and some works [12], [13] employed the self-attention mechanism to supplement global dependencies for segmentation tasks. These methods, however, still lack enough global dependencies for better performance.

B. Vision Transformers

Recently, vision transformers have attracted extensive attention from researchers. A. Dosovitskiy et al. [16] introduced the transformer [14] into visual tasks for the first time and achieved impressive performance because of the transformer's capacity for global dependencies. Vision tasks developed to a new stage inspired by ViT. For example, H. Touvron et al. proposed Data-efficient image Transformers (DeiT) [29] to explore efficient training strategies for ViT. To reduce the computational complexity, Wang et al. [17] and Liu et al. [15] proposed efficient and effective hierarchical vision transformers with SRA and a window-based mechanism, respectively. Some methods [30]–[32] studied position embeddings and locality in the transformer. For other specific tasks, SETR [21] was a semantic segmentation network based on the transformer that made ViT a backbone, Xie et al. [20] introduced a simple and efficient design for semantic segmentation powered by the transformer, N. Carion et al. [33] proposed an end-to-end object detection framework with the transformer, and Wang et al. [34] built a general U-shaped transformer based on window-based self-attention over non-overlapping image patches for image restoration.

C. Medical Image Segmentation Transformers

Researchers borrowed the transformer for medical image segmentation, inspired by the rapid development of vision transformers. Chen et al. proposed TransUNet [22], which employs some transformer layers on the low-resolution encoder feature maps to capture global dependencies. A. Hatamizadeh et al. [35] applied the transformer to build a powerful encoder for 3D medical image segmentation with a CNN-based decoder. Xie et al. [8] and Wang et al. [36] bridged a CNN-based encoder and decoder with the transformer to improve segmentation performance in the low-resolution stage. Besides these methods, which are combinations of CNN and transformer, Cao et al. [7] proposed Swin-Unet, with an encoder pre-trained based on the Swin Transformer [15], to demonstrate the application potential of a pure transformer in medical image segmentation. Different from these methods, the proposed MISSFormer is designed to explore the global dependencies and local context for more discriminative features with a pure transformer trained on the medical image datasets from scratch.

III. METHOD

This section first describes the overall pipeline and the specific structure of MISSFormer, and then shows the details of the improved transformer block with ReMix-FFN, which is the basic unit of MISSFormer. After that, we introduce the proposed ReMixed Transformer Context Bridge, which models the local and global correlations of hierarchical multi-scale information.

A. Overall Pipeline

The proposed MISSFormer is shown in Fig. 1(a): an encoder-decoder architecture with a ReMixed Transformer Context Bridge module appended between the encoder and decoder.


H  W  class

Overlap Patch N C N C
Linear Projection
H W  C

Embedding
Patch Expanding
H W
 C
4 4 ReMix-FFN
Transformer Block Transformer Block
with ReMix-FFN with ReMix-FFN
x2 x2 H W
 C
4 4
Overlap Patch Merging Patch Expanding
H W Layer Norm
  2C
8 8

ReMixed Transformer
Context Bridge
Transformer Block Transformer Block
with ReMix-FFN with ReMix-FFN
x2 x2 H W
  2C
8 8

Overlap Patch Merging Patch Expanding
H W
  4C
16 16 Efficient Self-Attention
x4
Transformer Block Transformer Block
with ReMix-FFN with ReMix-FFN
x2 x2 H W
  4C
16 16
Overlap Patch Merging Patch Expanding
Layer Norm
H W
  8C
32 32
H W
Transformer Block   8C
32 32
with ReMix-FFN
x2

(a) (b)

Fig. 1. The overall structure of the proposed MISSFormer. (a) The proposed MISSFormer framework. (b) The structure of the Transformer Block with ReMix-FFN.

Specifically, given an input image of size H_in × W_in × C_in, MISSFormer first divides it into overlapping patches of size 4×4 pixels, preserving local continuity by using convolutional layers with a 7×7 kernel and a stride of 4 pixels. Then, the overlapping patches are fed into the encoder to produce the multi-scale features. Here, the encoder is hierarchical, and each stage includes transformer blocks with ReMix-FFN and a patch merging layer. The transformer block learns the global dependencies and local context with limited computational complexity. The patch merging layer is applied to generate the downsampled features; the output size of the i-th block is (H_in/2^{i+1}) × (W_in/2^{i+1}) × 2^{i−1}C_1 (i > 0), where C_1 is the channel dimension of the first block.

After that, MISSFormer passes the generated multi-scale features through the ReMixed Transformer Context Bridge to capture the local and global correlations of the different-scale features. Given the multi-scale features, they are flattened in the spatial dimension and reshaped to make the channel dimension equal to C_1; for example, a feature of size (H_in/2^{i+1}) × (W_in/2^{i+1}) × 2^{i−1}C_1 is rearranged into (H_in·W_in/2^{i+3}) × C_1. We then concatenate them along the flattened spatial dimension and feed them into the ReMixed Transformer Context Bridge of depth d. Last, we split and restore them to their original shape (H_in/2^{i+1}) × (W_in/2^{i+1}) × 2^{i−1}C_1 and obtain discriminative hierarchical multi-scale features.

For the segmentation prediction, MISSFormer takes the discriminative features and skip connections as inputs to the decoder. Each decoder stage includes transformer blocks with ReMix-FFN and a patch expanding layer [7]. Contrary to the patch merging layer, the patch expanding layer upsamples the adjacent feature maps to twice the original resolution, except that the last one upsamples four times. Last, the pixel-wise segmentation prediction is output by a linear projection.

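For concreteness, the following is a minimal PyTorch sketch of the two resampling layers used above: the overlap patch embedding (7×7 convolution, stride 4) and a 2× patch expanding layer in the spirit of [7]. All names are ours, and the channel-folding layout in the expanding layer is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: 7x7 conv with stride 4, producing an
    (H/4 x W/4) token sequence while preserving local continuity."""
    def __init__(self, in_ch=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=7, stride=4, padding=3)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C_in, H_in, W_in)
        x = self.proj(x)                       # (B, C, H_in/4, W_in/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence
        return self.norm(x), H, W

class PatchExpanding(nn.Module):
    """2x decoder upsampling (patch expanding idea of [7]): double the channels
    with a linear layer, then fold them into a 2x2 spatial block, halving the
    channel depth. The exact folding layout is our assumption."""
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        x = self.expand(x)                     # (B, H*W, 2C)
        B, N, C2 = x.shape
        c = C2 // 4                            # output channels = C/2
        x = x.view(B, H, W, 2, 2, c)           # split channels into a 2x2 block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 4 * N, c)  # (B, 2H*2W, C/2)
        return self.norm(x), 2 * H, 2 * W
```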


Fig. 2. The various explorations of locality in the feed-forward network, from left to right: (a) Residual Block in LocalViT [32], (b) proposed ReMix-FFN, (c) proposed ReMix-FFN with recursive step.

Fig. 3. The ReMixed Transformer Context Bridge.

B. Transformer Block

Global dependencies and local context are both effective for accurate medical image segmentation. At present, the transformer and convolution are good choices for modeling global dependencies and locality, respectively. At the same time, the computational complexity of the original transformer block is quadratic in the feature map resolution, making it unsuitable for high-resolution feature maps. Second, the transformer lacks the ability to extract local context [30]–[32]. Although Uformer, SegFormer and PVTv2 tried to overcome this limitation by directly embedding a convolutional layer in the feed-forward network, we argue that this approach sometimes limits the discrimination of features, even though it yields some improved performance.

To solve the above problems, we propose the Transformer Block with ReMix-FFN. As shown in Fig. 1(b), the Transformer Block is composed of LayerNorm, Efficient Self-Attention and ReMix-FFN.

1) Efficient Self-Attention: Efficient self-attention is a spatial-reduction self-attention [17], which can be applied to high-resolution feature maps. Given a feature map F ∈ R^{H×W×C}, where H, W and C are the height, width and channel depth, respectively, the original standard multi-head self-attention makes Q, K, V share the same shape N × C, where N = H × W, and can be formulated as:

Attention(Q, K, V) = SoftMax(QK^T / √d_head) V,   (1)

and its computational complexity is O(N²). The efficient self-attention instead applies a spatial reduction ratio R to reduce the spatial resolution as follows:

K_new = Reshape(N/R, C·R)(K) W(C·R, C),   (2)

that is, it first reshapes K and V to N/R × (C·R), and then a linear projection W is used to restore the channel depth to C. After that, the computational complexity of the self-attention reduces to O(N²/R), and it can be applied to high-resolution feature maps. The spatial reduction operation is commonly implemented by convolution or pooling.
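To make Eqs. (1)-(2) concrete, the following is a single-head PyTorch sketch of the efficient self-attention. The class and attribute names are ours, multi-head handling is omitted for brevity, and N is assumed divisible by R.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Spatial-reduction self-attention per Eqs. (1)-(2); single head."""
    def __init__(self, dim, reduction_ratio=4):
        super().__init__()
        self.scale = dim ** -0.5
        self.R = reduction_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Linear(dim * reduction_ratio, dim)  # W(C*R, C) in Eq. (2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C), N = H*W
        B, N, C = x.shape
        q = self.q(x)                           # (B, N, C)
        # Eq. (2): fold R neighbouring tokens into channels, project back to C
        kv_in = self.sr(x.reshape(B, N // self.R, C * self.R))   # (B, N/R, C)
        k, v = self.kv(kv_in).chunk(2, dim=-1)                   # each (B, N/R, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale            # (B, N, N/R)
        attn = attn.softmax(dim=-1)             # attention cost is now O(N^2/R)
        return self.proj(attn @ v)              # (B, N, C)
```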
2) Mix-FFN: For consistency and integrity of the article, we briefly review Mix-FFN. Xie et al. [20] embedded a convolution layer between the fully-connected layers of the FFN to capture local information as a supplement to the global dependencies, which can be denoted as:

x_out = FC(GELU(Conv_{3×3}(FC(x_in)))) + x_in,   (3)

Although some improvements have been made, we find that the direct embedding of the convolution layer cannot assign weights well and sometimes limits the discrimination of features, judging from the attention heatmaps, as shown in Fig. 4, Fig. 6 and Table II.

3) ReMix-FFN: Different from previous methods, we redesigned the structure of Mix-FFN [20] to align features and make discriminative representations. As shown in Fig. 2(b), we add a skip connection before the depth-wise convolution for better feature fusion. Then, we apply a layer norm after the skip connection to redistribute the features, which can be denoted as:

y_1 = LN(Conv_{3×3}(FC(x_in)) + FC(x_in)),
x_out = FC(GELU(y_1)) + x_in,   (4)

where x_in is the output of the efficient self-attention and Conv_{3×3} is a convolution with a 3 × 3 kernel; we applied depth-wise convolution for efficiency in this paper. We will show that these designs are essential for Mix-FFN in Section IV.B.

Inspired by [37], we extend our design to a general form with the help of the layer norm, which facilitates the optimization of the skip connection [14]. As shown in Fig. 2(c), we embed a ReMix block in the original feed-forward network. We introduce a recursive skip connection in the ReMix block: given an input feature map x_in, a depth-wise convolution layer is applied to capture the local context, followed by a recursive skip connection, which can be defined as:

y_i = LN(FC(x_in) + y_{i−1}),
x_out = FC(GELU(y_i)) + x_in,   (5)

where y_1 = LN(Conv_{3×3}(FC(x_in)) + FC(x_in)). With this, the model gains more expressive power due to the construction of different feature distributions and the consistency brought by each recursive step.

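A minimal PyTorch sketch of ReMix-FFN as written in Eqs. (4)-(5) follows. Names are ours, and a single LayerNorm is shared across the recursive steps for brevity, which is a simplification of the general form.

```python
import torch
import torch.nn as nn

class ReMixFFN(nn.Module):
    """ReMix-FFN per Eqs. (4)-(5): a skip connection around the depth-wise
    convolution, LayerNorm after the fusion, and optional recursive steps."""
    def __init__(self, dim, hidden_dim, recursive_steps=1):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depth-wise 3x3
        self.norm = nn.LayerNorm(hidden_dim)   # shared across steps (simplification)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden_dim, dim)
        self.steps = recursive_steps

    def forward(self, x, H, W):                # x: (B, N, C), N = H*W
        h = self.fc_in(x)                      # FC(x_in), (B, N, hidden)
        B, N, D = h.shape
        conv = self.dwconv(h.transpose(1, 2).reshape(B, D, H, W))
        conv = conv.flatten(2).transpose(1, 2)  # back to (B, N, hidden)
        y = self.norm(conv + h)                # y1 = LN(Conv(FC(x)) + FC(x)), Eq. (4)
        for _ in range(self.steps - 1):        # Eq. (5): y_i = LN(FC(x_in) + y_{i-1})
            y = self.norm(h + y)
        return self.fc_out(self.act(y)) + x    # FC(GELU(y)) + x_in
```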

C. ReMixed Transformer Context Bridge

Multi-scale information fusion has been proved crucial for accurate semantic segmentation in CNN-based methods [13], [24]. In this part, we explore multi-scale feature fusion for the transformer-based method with the aid of the hierarchical structure of MISSFormer. The key motivation is to integrate the local and global correlations between different-scale feature maps by utilizing the detailed features and the high-level semantic features, including global dependencies and local context. Given multi-scale features F_i, we first flatten and rearrange them so that they have the same channel dimension C_1, which equals the greatest common divisor of the channel dimensions of all scales; the resulting size is (H_in·W_in/2^{i+3}) × C_1. We then concatenate them along the spatial dimension and feed the concatenated features into the efficient self-attention module to capture global attention, which is more advantageous for focusing on the target area and suppressing irrelevant information. Secondly, the concatenated attentive features are split into the different-scale feature maps and sent into the ReMix-FFN module, which is responsible for remixing the global dependencies and the different local contexts. After several cycles of the above two steps, more accurate global and local correlations of the multi-scale features are integrated. The process of the ReMixed Transformer Context Bridge is shown in Fig. 3. We then feed the features into our transformer-based decoder with the corresponding skip connections to predict the pixel-wise segmentation map.
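The packing and unpacking around the bridge can be written down concretely. Below is a minimal PyTorch sketch under our own function names (the exact channel-to-token rearrangement of the released code may differ); the bridge's attention and ReMix-FFN layers themselves are omitted.

```python
import torch

def pack_for_bridge(features):
    """Flatten-and-rearrange step described above: each stage's tokens
    (B, N_i, C_i), with C_i a multiple of C1, are reshaped so the channel
    dimension equals C1, then all stages are concatenated along the token
    axis. Returns the packed sequence plus the shapes needed to undo it."""
    c1 = min(f.shape[-1] for f in features)   # C1; equals the GCD since the
                                              # stage dims are C1, 2C1, 4C1, 8C1
    packed, shapes = [], []
    for f in features:                        # f: (B, N_i, C_i)
        B, N, C = f.shape
        shapes.append((N, C))
        packed.append(f.reshape(B, N * (C // c1), c1))
    return torch.cat(packed, dim=1), shapes   # (B, sum_i N_i*C_i/C1, C1)

def unpack_from_bridge(tokens, shapes):
    """Inverse split: restore each stage's original (B, N_i, C_i) shape."""
    B, _, c1 = tokens.shape
    outs, start = [], 0
    for N, C in shapes:
        length = N * (C // c1)
        outs.append(tokens[:, start:start + length, :].reshape(B, N, C))
        start += length
    return outs
```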
D. Relationship to SegFormer

MISSFormer is based on SegFormer, but it is more powerful in medical image segmentation compared with SegFormer [20]:
• SegFormer proposes Mix-FFN, which embeds the convolution between the FC layers directly and limits the network performance (as can be seen from the second part of Section IV.B), while the proposed ReMix-FFN can capture more discriminative features (Fig. 6) and boost performance by remixing the global and local information (Table II), although they share the same overlap patch embedding.
• MISSFormer uses a U-shaped encoder-decoder architecture, while the decoder of SegFormer is an all-MLP decoder, which is worse than the U-shaped architecture in medical image segmentation (Table I).
• We propose the ReMixed Transformer Context Bridge to extract the global-local correlations of multi-scale features, while SegFormer captures multi-scale information only through its all-MLP decoder.

IV. EXPERIMENTS

In this section, we first introduce the datasets and the implementation details briefly, then we conduct ablation studies to validate the effectiveness of each component of MISSFormer, and finally we report the comparison results with previous state-of-the-art methods to demonstrate the superiority of the proposed MISSFormer.

A. Experiments Settings

1) Datasets: We perform experiments on datasets of three different formats: the Synapse multi-organ segmentation dataset (Synapse), the Automated Cardiac Diagnosis Challenge dataset (ACDC) and the Digital Retinal Images for Vessel Extraction (DRIVE) dataset. The Synapse dataset includes 30 abdominal CT scans with 3779 axial abdominal clinical CT images; the dataset is divided randomly into 18 scans for training and 12 for testing, following [7], [22]. We evaluate our method with the average Dice Similarity Coefficient (DSC) and average Hausdorff Distance (HD) on 8 abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach). The ACDC dataset includes 100 MRI scans collected from different patients, each scan labeled with three organs: left ventricle (LV), right ventricle (RV) and myocardium (MYO). Consistent with the previous methods [7], [22], 70 cases are used for training, 10 for validation and 20 for testing, and the average DSC is applied to evaluate the method. The DRIVE dataset is a retinal vessel segmentation dataset containing 40 color fundus images. The whole dataset is divided into 20 images for the training set and 20 images for the test set, and the first manual annotations are chosen as the official ground truth for evaluating performance. The sensitivity (Sen), the accuracy (Acc) and the area under the receiver operating characteristic curve (AUC) are utilized to evaluate performance on the DRIVE dataset.

2) Implementation details: MISSFormer is implemented based on PyTorch and trained on an Nvidia GeForce RTX 3090 GPU with 24 GB of memory. Different from previous works [7], [22], whose models are initialized with models pre-trained on ImageNet, MISSFormer is initialized randomly and trained from scratch, so moderate data augmentation is conducted for all datasets: besides the commonly used random flipping and rotation, Gaussian noise, scaling and contrast transformations are employed as well. We set the input image size to 224×224 for Synapse and ACDC, the initial learning rate to 0.05 with a poly learning-rate policy, and the maximum number of training epochs to 400 with a batch size of 24. The SGD optimizer with momentum 0.9 and weight decay 1e-4 is adopted for MISSFormer. For the DRIVE dataset, the MISSFormer settings are kept consistent with [9]. Besides, we performed 5-fold cross-validation on the Synapse and ACDC datasets to evaluate the performance of our algorithm and guard against overfitting.

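For reference, a minimal sketch of the optimizer and poly learning-rate schedule stated above; the poly power of 0.9, the steps-per-epoch count and the stand-in model are our assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 9, kernel_size=1)   # stand-in for MISSFormer (9 classes on Synapse)
steps_per_epoch = 100                    # assumption; depends on dataset and batch size
max_iters = 400 * steps_per_epoch        # 400 epochs, batch size 24 (as stated above)

# SGD with momentum 0.9 and weight decay 1e-4, base LR 0.05 (as stated above)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=1e-4)

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly policy: lr = base_lr * (1 - iter/max_iter)^power (power assumed)."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

for it in range(max_iters):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(0.05, it, max_iters)
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
```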

B. Ablation Studies

We conduct ablation studies on the Synapse and ACDC datasets to verify the effectiveness of the essential components of our approach. We set the naive U-shaped transformer with mlp-FFN as the baseline method, and the number of transformer blocks in every stage of the encoder and decoder is set to 2 to match the other methods for a fair comparison; DSC is our primary metric, ahead of HD. All experiments are performed with the same hyperparameter settings and trained from scratch.

1) Architecture selection: In order to verify the priority of the U-shaped structure, we compare the baseline and SegFormer B1 with its U-shaped variant, called "U-SegFormer", which uses the transformer block of SegFormer and patch expanding as a decoder to replace the original all-MLP decoder. Besides, SegFormer B5, with a stronger encoder, also participated in the comparison. The results are shown in Table I. As we can see, U-SegFormer achieves better DSC performance than SegFormer because the U-shaped model can fuse more of the corresponding detail information through the skip connection at each stage, although SegFormer integrates multi-level information. SegFormer B5 has not achieved breakthrough results because of its huge number of parameters relative to the medical dataset size. The results in Table I demonstrate, as before, the priority of the transformer-based U-shaped network structure for medical image segmentation, and our further experiments will be based on the U-shaped structure.

Fig. 4. The average L1 norm of the gradients of the second fully connected weight in the FFN for layers 1, 3, 6 and 7: (a) 1st layer, (b) 3rd layer, (c) 6th layer, (d) 7th layer, comparing U-mlpFormer, U-SegFormer and MISSFormer_S.

TABLE I
ACCURACY ON THE SYNAPSE AND ACDC DATASETS OF SEGFORMER AND U-SEGFORMER.

Architecture          Synapse DSC↑   Synapse HD↓   ACDC DSC↑
U-mlpFormer [16]      75.88          27.22         89.04
SegFormer B1 [20]     75.24          25.07         88.95
SegFormer B5 [20]     75.74          21.62         89.34
U-SegFormer B1 [20]   76.10          26.97         89.21

TABLE II
EFFECTIVENESS OF REMIX-FFN. "CAT" DENOTES THE CONCATENATION SKIP CONNECTION, "ADD" DENOTES SUMMATION.

Architecture              skip   LN   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
U-SegFormer [20]          –      –    76.10          26.97         89.21
U-SegFormer w/skip [20]   cat    –    78.14          28.77         89.28
U-SegFormer w/skip [20]   add    –    78.74          20.20         90.09
MISSFormer_S              ✓      ✓    79.73          20.14         90.61

2) Ablation studies of ReMix-FFN: Based on the U-shaped transformer, we further perform experiments to validate the impact of the proposed ReMix-FFN components. We first design different skip connections, concatenation and summation, to explore the fusion of the global and local features around the embedded convolution. Then, to close the gap caused by the direct embedding of convolution, a layer norm is applied to redistribute the features; we integrate it after the skip connection and call the result MISSFormer_S. Table II reports the comparison results. It shows that both skip connections boost the model performance considerably, with the summation skip connection improving DSC by more than 2.6% on the Synapse dataset and 0.88% on the ACDC dataset.


The core reason is that our design fuses the global features produced by self-attention with the local features extracted by convolution, making the features more discriminative. Our MISSFormer_S gains a further 1% and 0.52% over U-SegFormer w/skip on the two datasets because of the feature redistribution. Finally, with the help of the redesigned feed-forward network, we improve the feature distributions and enhance the feature representations, yielding gains of 3.63 DSC on the Synapse dataset and 1.4 DSC on the ACDC dataset compared with the U-SegFormer baseline. In order to explore the reasons for the effectiveness of the above improvements, we observe the tendency of the gradients of the second fully connected weight in the FFN for the different models with 8 encoder layers on Synapse. Fig. 4 shows the average L1 norm of the gradients at different layers. We observe that the gradients of the shallow, middle and deep layers in U-SegFormer are all smaller than the corresponding ones in U-mlpFormer, and that the gradients of the middle layer in U-SegFormer even drop to 1/3 of those in U-mlpFormer in Fig. 4(b). This indicates that directly embedding a 3×3 convolution between the fully connected layers slows the update of the middle layers and may not reach better weights, although it supplements local information and brings slight improvements. In comparison, the two proposed components solve this problem: our method not only retains local information but also promotes network optimization towards better weights, producing better convergence and evaluation results, as shown in Fig. 5.
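The gradient statistic plotted in Fig. 4 can be logged with a few lines of PyTorch; the module path in the usage comment is a hypothetical naming, not the released code.

```python
import torch
import torch.nn as nn

def avg_l1_grad_norm(weight: nn.Parameter) -> float:
    """Average L1 norm of a weight's gradient (the statistic in Fig. 4);
    call after loss.backward() on the second FC weight of an FFN."""
    return weight.grad.abs().mean().item() if weight.grad is not None else 0.0

# Hypothetical usage, assuming `model.encoder_layers[i].ffn.fc_out` names the
# second fully connected layer of the i-th encoder block:
# stats = {i: avg_l1_grad_norm(model.encoder_layers[i].ffn.fc_out.weight)
#          for i in (0, 2, 5, 6)}   # layers 1, 3, 6, 7
```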
Fig. 5. The convergence and evaluation results of different transformer blocks on Synapse: (a) the loss convergence and (b) the dice similarity coefficient of the MLP [16], Mix-FFN [20] and ReMix-FFN blocks. Our ReMix-FFN has better convergence and performance.

3) Comparison of different local modules in Transformer: In order to prove the necessity of supplementing local information and the effectiveness of the proposed method, we compare it with other methods of supplementing local information. The experiment is carried out by replacing the FFN in the transformer block with different modules: Mix-FFN from SegFormer [20], the residual block from LocalViT [32] and the proposed ReMix-FFN of MISSFormer_S. The results are shown in Table III. Our MISSFormer_S holds a large advantage over the other local modules because of the more discriminative representation of ReMix-FFN.

TABLE III
COMPARISON OF DIFFERENT LOCAL MODULES IN THE TRANSFORMER BLOCK.

Architecture       Synapse DSC↑   Synapse HD↓   ACDC DSC↑
U-mlpFormer [16]   75.88          27.22         89.04
U-SegFormer [20]   76.10          26.97         89.21
U-LocalViT [32]    76.92          23.62         86.34
MISSFormer_S       79.73          20.14         90.61

4) Impact of further feature redistribution in Transformer Block: Inspired by the above exploration and [37], we extend the redesigned FFN of MISSFormer_S to make it more general. We call this variant MISSFormer_R, due to the recursive step and the absence of multi-scale feature integration. We design experiments to assess the influence of the feature consistency and redistribution caused by different recursive steps; the results are recorded in Table IV. The performance improves with the increase of the recursive step on Synapse, and the performance on ACDC shows some minor improvements from step 2, but all results of MISSFormer_R are better than U-SegFormer, which further remedies the insufficient feature discrimination that arises when convolution is embedded directly in the FFN.

TABLE IV
IMPACT OF THE RECURSIVE SKIP CONNECTION IN THE TRANSFORMER BLOCK; "STEP" MEANS THE RECURSIVE STEP.

Architecture   step   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer_R   1      79.73          20.14         90.61
MISSFormer_R   2      79.91          21.33         90.40
MISSFormer_R   3      80.74          19.65         90.49

5) Influence of ReMixed Transformer Context Bridge: We conducted experiments to explore the role of multi-scale information in transformer-based methods, building on the hierarchical features generated by the MISSFormer encoder.

TABLE V
IMPACT OF THE REMIXED TRANSFORMER CONTEXT BRIDGE ON THE RECURSIVE SKIP CONNECTION OF MISSFORMER.

Architecture   step   bridge   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer_R   1      –        79.73          20.14         90.61
MISSFormer_R   2      –        79.91          21.33         90.40
MISSFormer_R   3      –        80.74          19.65         90.49
MISSFormer     1      ✓        81.96          18.20         90.98
MISSFormer     2      ✓        80.91          19.48         90.86
MISSFormer     3      ✓        80.72          23.43         90.82


We embed the ReMixed Transformer Context Bridge into MISSFormer_S and MISSFormer_R to integrate the multi-scale information, and call the result MISSFormer. As Table V shows (we list the results of MISSFormer_S for an intuitive comparison), the performance of the model improves to varying degrees, except when the step equals 3 on Synapse. We observe that the model achieves its best performance, a 2.26% DSC and 0.37% DSC improvement on the two datasets, when the step is 1, and that the gains gradually decrease, even turning negative, as the recursive step increases. We conjecture there is a balance between the recursive step and the ReMixed Transformer Context Bridge, or between the number of layer norms and the model capacity, which will be discussed in our future work. Besides, we also investigated how the bridge depth and the multi-scale information integration affect model performance; the results are given in Table VI. Regarding the bridge depth, 4 is a suitable depth in MISSFormer for Synapse and 6 for ACDC, because of the limited medical data. For transformer-based hierarchical features, the more scales of features are fed into the enhanced transformer context bridge, the more comprehensive the global-local correlations the model can learn.

TABLE VI
EXPLORATION OF THE BRIDGE DEPTH AND MULTI-SCALE INFORMATION IN MISSFORMER. THE 4/6 IN DEPTH DENOTES THAT THE BRIDGE DEPTH IS 4 FOR SYNAPSE AND 6 FOR ACDC.

Architecture   depth   stage     Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer     2       4/3/2/1   80.19          18.88         90.97
MISSFormer     4       4/3/2/1   81.96          18.20         90.98
MISSFormer     6       4/3/2/1   81.03          21.36         91.19
MISSFormer     4/6     4/3/2     80.65          18.39         90.69
MISSFormer     4/6     4/3       79.86          20.33         90.50
MISSFormer     4/6     4         79.56          20.95         90.88

6) The necessity of global-local information in Transformer Context Bridge: To further explore the effectiveness of the proposed module and the impact of each feature component in the multi-scale information aggregation, we take MISSFormer with a bridge depth of 4 on Synapse and 6 on ACDC as a basis and replace the FFN module in the Transformer Context Bridge with mlp-FFN, Mix-FFN and ReMix-FFN, respectively. The results are recorded in Table VII. We observe that the mlp-FFN context bridge yields more accurate edge predictions, and Mix-FFN improves the segmentation results due to the supplement of local information, while our ReMix-FFN achieves the best segmentation performance and competitive edge prediction because of its discriminative global and local features.

TABLE VII
COMPARISON OF DIFFERENT MODULES IN THE TRANSFORMER CONTEXT BRIDGE.

Architecture   Context Bridge   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer     –                79.73          20.14         90.61
MISSFormer     mlp-FFN [16]     79.54          17.26         90.72
MISSFormer     Mix-FFN [20]     80.18          20.17         90.65
MISSFormer     ReMix-FFN        81.96          18.20         91.19

7) The qualitative analysis of the proposed method: In order to verify the effectiveness of our proposed method more intuitively, we perform a qualitative analysis and visualize the attention heat maps of U-SegFormer and MISSFormer_S on cases chosen from the Synapse dataset. As shown in Fig. 6, there are some obvious defects in the focus regions of U-SegFormer: the weights assigned to the pancreas and spleen are small, and larger in some irrelevant regions. In contrast, the proposed MISSFormer_S is more correct and discriminative and localizes more accurately with the help of ReMix-FFN.

Fig. 6. The visualization of attention heat maps of U-SegFormer and MISSFormer_S on the Synapse dataset: (a) GroundTruth, (b) U-SegFormer, (c) MISSFormer_S.

C. Comparison with state-of-the-art methods

This section reports the comparison results of MISSFormer and previous state-of-the-art methods on the Synapse, ACDC and DRIVE datasets.

1) Experiment results on Synapse dataset: Table VIII presents the comparison results of the proposed MISSFormer and previous state-of-the-art methods. As shown in Table VIII, the proposed method achieves state-of-the-art performance in almost all measures. It is worth mentioning that the encoders of TransUNet¹ and Swin-Unet² are pre-trained on ImageNet, while MISSFormer is trained on the Synapse dataset from scratch, which indicates that MISSFormer captures better global dependencies and local context to make strong feature representations. We reimplemented the above SOTA methods according to the publicly released code, and these results are recorded in Table VIII. To avoid model overfitting, we perform 5-fold cross-validation on the Synapse dataset; the results are recorded in Table IX. We also conduct an additional paired t-test statistical analysis between the proposed method and the best method in each index based on the 5-fold cross-validation results; the P-values prove the universality of the superiority of the proposed method on these datasets. MISSFormer is significantly ahead of the other two methods, which verifies the superiority and effectiveness of the proposed method once again.

1 https://github.com/Beckschen/TransUNet
2 https://github.com/HuCaoFighting/Swin-Unet


TABLE VIII
COMPARISON WITH OTHER STATE-OF-THE-ART METHODS ON THE SYNAPSE DATASET.

Methods             DSC↑    HD↓     Aorta   Gallbladder   Kidney(L)   Kidney(R)   Liver   Pancreas   Spleen   Stomach
U-Net [2]           76.26   36.62   86.95   64.76         80.98       75.06       93.09   50.24      86.05    72.93
Att-UNet [40]       78.07   32.10   88.41   69.45         83.14       74.55       93.46   54.80      86.61    74.16
R50 U-Net [2]       77.24   31.01   87.30   63.03         80.55       77.59       93.26   53.38      86.34    74.47
R50 Att-Unet [40]   79.25   34.74   87.37   58.84         82.99       78.52       93.98   64.29      87.71    80.32
TransUNet [22]      77.24   31.18   87.32   61.09         80.77       76.15       94.54   55.97      86.15    75.88
Swin-Unet [7]       78.71   23.85   85.29   64.49         83.28       80.62       93.87   56.78      90.42    74.93
MISSFormer          81.96   18.20   86.99   68.65         85.21       82.00       94.41   65.67      91.92    80.81

TABLE IX
THE 5-FOLD CROSS-VALIDATION RESULTS ON THE SYNAPSE AND ACDC DATASETS, REPORTED AS MEAN±STD.

Method           Synapse DSC↑   Synapse HD↓   ACDC DSC↑
TransUNet [22]   75.92±4.48     29.12±8.77    90.31±0.64
Swin-Unet [7]    74.87±7.58     22.25±7.63    90.46±0.58
MISSFormer       79.05±4.28     24.24±7.62    91.00±0.62
P-values         <0.02          >0.05         <0.01
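The paired t-test over fold-wise scores can be reproduced with SciPy; the per-fold numbers below are illustrative placeholders (the paper reports only mean±std in Table IX), chosen to show the API, not real fold results.

```python
from scipy import stats

# Fold-wise Synapse DSC for MISSFormer vs. a competing method; the five
# numbers per method are illustrative placeholders, not the paper's folds.
missformer_dsc = [78.1, 80.2, 74.9, 82.3, 79.8]
swin_unet_dsc  = [73.0, 77.5, 66.2, 80.9, 76.8]

t_stat, p_value = stats.ttest_rel(missformer_dsc, swin_unet_dsc)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")
# compare against the <0.02 threshold reported in Table IX
```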
The visualization results are shown in Fig. 7. It can be seen that our MISSFormer achieves better edge predictions and hard-example segmentations compared to TransUNet and Swin-Unet, even in the bad case. Comparing MISSFormer and MISSFormer_S, MISSFormer has more precise results and less false segmentation because of the integration of multi-scale information.

Fig. 7. The visual comparison with previous state-of-the-art methods on the Synapse dataset: (a) GroundTruth, (b) MISSFormer, (c) MISSFormer_S, (d) Swin-Unet, (e) TransUNet. The pictures in the last row show a failed case. Our MISSFormer performs better than the other methods.

At the same time, we have noticed that our method has not achieved the best results in the segmentation of the relatively small organs, the aorta (deep blue in Fig. 7) and gallbladder (green in Fig. 7). This may be because our method gives a large weight to salient targets, which can get more attention


because they account for a larger proportion of the multi-scale features, as can be seen from Fig. 6. Although we supplement the local context in MISSFormer, the aorta and gallbladder results imply that the local context may still be insufficient for small targets in our MISSFormer. We will study further supplements of local information in future work to improve this situation.

2) Experiment results on ACDC dataset: We evaluate our method on the ACDC dataset, which is in MRI format. Tables IX and X present the 5-fold cross-validation results and the segmentation accuracy, respectively. MISSFormer maintains the first position because of its powerful feature discrimination, which indicates the outstanding generalization and robustness of MISSFormer. The visualization results are shown in Fig. 8.

Fig. 8. The visual comparison with previous state-of-the-art methods on the ACDC dataset: (a) GroundTruth, (b) MISSFormer, (c) Swin-Unet, (d) TransUNet.

TABLE X
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE ACDC DATASET.

Methods             DSC↑    RV      Myo     LV
R50 U-Net [2]       90.16   88.67   86.88   94.92
R50 Att-UNet [40]   91.03   89.90   87.98   95.21
TransUNet [22]      90.44   88.82   87.54   94.95
Swin-Unet [7]       90.41   88.41   87.71   95.13
MISSFormer          91.19   89.85   88.38   95.34

3) Experiment results on DRIVE dataset: In addition to the above two datasets, we also verified the effectiveness of the proposed method on the retinal vessel segmentation dataset DRIVE.


TABLE XI
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE DRIVE DATASET.

Methods           Acc↑    Sen     AUC
B-COSFIRE [38]    94.42   76.55   96.14
WSF [39]          95.80   77.40   97.50
DeepVessel [40]   95.23   76.03   97.89
U-Net [2]         95.31   75.37   96.01
R2U-Net [41]      95.56   77.92   97.84
CE-Net [9]        95.45   83.09   97.79
CS-Net [12]       96.32   81.70   97.98
MISSFormer        96.03   84.69   98.44

As shown in Table XI, the proposed MISSFormer achieved very competitive performance compared with the other methods. As shown in Fig. 9, the proposed method can pay attention to more details of the retinal vessel images, which benefits from the global-local feature extraction of MISSFormer.

Fig. 9. The visual comparison with previous state-of-the-art methods on the DRIVE dataset. Columns: Images, GroundTruth, CS-Net, MISSFormer.

V. CONCLUSION

In this paper, we presented MISSFormer, a U-shaped medical image segmentation transformer, which explores the capture of global dependencies and local context. The proposed ReMix-FFN can enhance the global dependencies and supplement the local context to make discriminative feature representations. Based on these core designs, we further investigated the integration of the multi-scale features generated by our hierarchical transformer encoder with the ReMixed Transformer Context Bridge, which is essential for accurate segmentation. We evaluated our method on three datasets of different formats; the superior results demonstrate the effectiveness and robustness of MISSFormer.

REFERENCES

[1] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[3] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3D U-Net: Learning dense volumetric segmentation from sparse annotation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016.
[4] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, vol. 18, no. 2, pp. 203–211, 2021.
[5] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[6] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu, "UNet 3+: A full-scale connected UNet for medical image segmentation," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1055–1059.
[7] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, "Swin-Unet: Unet-like pure transformer for medical image segmentation," arXiv preprint arXiv:2105.05537, 2021.
[8] Y. Xie, J. Zhang, C. Shen, and Y. Xia, "CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation," arXiv preprint arXiv:2103.03024, 2021.
[9] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, "CE-Net: Context encoder network for 2D medical image segmentation," IEEE Transactions on Medical Imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
[10] S. Feng, H. Zhao, F. Shi, X. Cheng, M. Wang, Y. Ma, D. Xiang, W. Zhu, and X. Chen, "CPFNet: Context pyramid fusion network for medical image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 10, pp. 3008–3018, 2020.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[12] L. Mou, Y. Zhao, L. Chen, J. Cheng, Z. Gu, H. Hao, H. Qi, Y. Zheng, A. Frangi, and J. Liu, "CS-Net: Channel and spatial attention network for curvilinear structure segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 721–730.
[13] A. Sinha and J. Dolz, "Multi-scale self-guided attention for medical image segmentation," IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 1, pp. 121–130, 2020.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," arXiv preprint arXiv:2103.14030, 2021.
[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[17] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," arXiv preprint arXiv:2102.12122, 2021.
[18] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze, "LeViT: A vision transformer in ConvNet's clothing for faster inference," arXiv preprint arXiv:2104.01136, 2021.
[19] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, "Twins: Revisiting the design of spatial attention in vision transformers," arXiv preprint arXiv:2104.13840, vol. 1, no. 2, p. 3, 2021.
[20] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," arXiv preprint arXiv:2105.15203, 2021.
[21] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
[22] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, "TransUNet: Transformers make strong encoders for medical image segmentation," arXiv preprint arXiv:2102.04306, 2021.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[25] S. Chen, X. Tan, B. Wang, and X. Hu, "Reverse attention for salient object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.
[26] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, "PraNet: Parallel reverse attention network for polyp segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 263–273.
[27] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[28] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao, "ET-Net: A generic edge-attention guidance network for medical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 442–450.
[29] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.
[30] M. A. Islam, S. Jia, and N. D. Bruce, "How much position information do convolutional neural networks encode?" arXiv preprint arXiv:2001.08248, 2020.


[31] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, and C. Shen, "Conditional positional encodings for vision transformers," arXiv preprint arXiv:2102.10882, 2021.
[32] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, "LocalViT: Bringing locality to vision transformers," arXiv preprint arXiv:2104.05707, 2021.
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[34] Z. Wang, X. Cun, J. Bao, and J. Liu, "Uformer: A general U-shaped transformer for image restoration," arXiv preprint arXiv:2106.03106, 2021.
[35] A. Hatamizadeh, D. Yang, H. Roth, and D. Xu, "UNETR: Transformers for 3D medical image segmentation," arXiv preprint arXiv:2103.10504, 2021.
[36] W. Wang, C. Chen, M. Ding, J. Li, H. Yu, and S. Zha, "TransBTS: Multimodal brain tumor segmentation using transformer," arXiv preprint arXiv:2103.04430, 2021.
[37] F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou, "Rethinking skip connection with layer normalization," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3586–3598.
[38] G. Azzopardi, N. Strisciuglio, M. Vento, and N. Petkov, "Trainable COSFIRE filters for vessel delineation with application to retinal images," Medical Image Analysis, vol. 19, no. 1, pp. 46–57, 2015.
[39] Y. Zhao, Y. Zheng, Y. Liu, Y. Zhao, L. Luo, S. Yang, T. Na, Y. Wang, and J. Liu, "Automatic 2-D/3-D vessel enhancement in multiple modality images using a weighted symmetry filter," IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 438–450, 2017.
[40] H. Fu, Y. Xu, S. Lin, D. W. Kee Wong, and J. Liu, "DeepVessel: Retinal vessel segmentation via deep learning and conditional random field," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 132–139.
[41] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, "Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation," arXiv preprint arXiv:1802.06955, 2018.
