
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2025.3540454.


MPFNet: Multiscale Prediction Network with Cross Fusion for Real-Time Semantic Segmentation
VAN TOAN QUYEN¹, MIN YOUNG KIM¹,² (Member, IEEE)
¹School of Electronic and Electrical Engineering, IT College, Kyungpook National University, Daegu 41566, Republic of Korea
²Research Center for Neurosurgical Robotic System, IT College, Kyungpook National University, Daegu 41566, Republic of Korea
Corresponding author: Min Young Kim (e-mail: [email protected]).
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A03043144) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C2008133). This work was also supported by a Korea Institute for Advancement of Technology (KIAT) grant (P0020536, The Competency Development Program for Industry Specialist) funded by the Korea Government (MOTIE).

ABSTRACT Semantic segmentation currently plays an important role in computer vision and is widely applied in both industry and everyday life. The self-driving car is one of the most prominent applications, assisting humans in making informed decisions. A self-driving application has to interpret visual information from street scenes. However, effectively segmenting objects over a wide range of sizes remains a challenging problem. A feature pyramid network (FPN) builds an architecture that processes four different features to contribute contextual and spatial information to the final map, and each feature is suited to a specific range of object sizes. Nevertheless, the final feature combination is not optimal, because it raises the computation cost and dilutes the semantic weights. We propose a multi-scale prediction network with cross fusion to address these drawbacks. The prediction module consists of three different predictions that allow the architecture to efficiently extract information at various sizes. Each prediction is generated from a pair of feature pyramid levels and is used to predict object classes. Furthermore, a cross-scale fusion is designed to enhance the weight aggregation of the final score map. The core component of the cross fusion is the selective-attention mechanism, which identifies uncertain weights of the lower prediction and then selects the complement from the adjacent prediction. With this proposed scheme, we achieve 78.3% mIoU at 45 FPS on Cityscapes and 45.9% mIoU on Mapillary Vistas. Our method outperforms the baseline method with a 7.0% mIoU improvement and a 27 FPS speedup on the Cityscapes dataset. The experimental results demonstrate that the proposed model achieves a reasonable balance between performance and efficiency.

INDEX TERMS Real-time semantic segmentation, attention mechanism, multi-scale prediction, context
fusion, feature pyramid network.

I. INTRODUCTION
Semantic segmentation is a promising and challenging task in the computer vision field. Its goal is to deeply understand images at the pixel level and assign each pixel to a corresponding class. Alongside the development of digital technologies, semantic segmentation is widely applied to real-life applications such as self-driving cars [1], robot vision [2], or medical image processing [3]. In the field of autonomous driving, a semantic segmentation method needs to meet two important requirements: real-time processing and high accuracy.

In recent years, image processing tasks have achieved extraordinary progress with Convolutional Neural Networks (CNNs) [4]. A CNN uses a series of convolutions to extract features from the input image and thereby obtains rich information. The final map of a CNN is one-dimensional (1D), which is useful for classification but not suitable for segmentation. The Fully Convolutional Network (FCN) [5] improved on the CNN by adapting the fully connected layer to semantic segmentation. A deep convolutional network is effective at extracting rich context information from the input [6], but when the features are processed deeper, more spatial information is lost.


Some approaches addressed this limitation by utilizing transposed convolution [7] or deep residual learning [8].

Given the aforementioned requirements for autonomous driving, previous methods still consume too much computation time. The ENet [9] method adapts ResNet and reduces the size of shallower features to quickly extract information. It can process 18 times faster than previous methods while achieving similar accuracy. ERFNet [10] further improved the ENet architecture by using residual connections and convolutions with 1D kernels to gain better performance.

Autonomous driving is a tough application because it has to deal with objects of various sizes, which can be thin like poles and traffic lights, or huge such as trucks and bridges. The field of view (FOV) directly affects the extracted information. If the FOV is large enough to capture huge objects, it can also contain many different small objects as inputs to generate a single pixel value. Conversely, if the FOV is small and suitable for gathering narrow structures, it can lose the global information of large objects. Some approaches focus on converting the traditional convolution to dilated convolution, which can obtain different FOVs by changing the rates [11]–[13]. We can also obtain multiple FOVs by adjusting input-image sizes [14], [15]. By deploying multi-scale inference, objects are extracted with suitable FOVs from beginning to end, but the network needs to process a single image multiple times. The feature pyramid network (FPN) is applied not only to object detection but is also effective for multi-class semantic segmentation [16]. This method can obtain spatial and semantic information from different feature layers and has various receptive fields to efficiently extract a wide range of object sizes. However, the decoder part of this method is not optimal, for the following reasons:
• Reduced semantic weights: The encoder of the FPN network effectively extracts different levels of semantic information. However, the decoder concatenates all features together to obtain contextual and spatial information for the final feature, which reduces the semantic weights. The layer with rich semantic weights is averaged with the poorer ones from the other layers.
• Burden on the computational system: this method burdens the system because the decoder uses 3x3 convolutions to continuously extract semantic-weight information and generate the third column of the feature pyramid. Additionally, the final feature map, which has 512 channels at a high resolution, also burdens the computation system. The decoder scheme is illustrated in Figure 3a.

A novel multi-scale prediction network with cross fusion, called MPFNet, is proposed in this paper to overcome the aforementioned limitations. We adapt the backbone from the FPN method [16] and improve the feature combination. We design multi-scale predictions to properly process the characteristics of each feature. One prediction is supplied by two feature layers, one containing rich coarse information and another containing rich fine information. A novel cross-scale fusion with a selective-attention mechanism is employed to combine weights from all predictions.

The contributions of this paper can be summarized as follows:
• We designed the multi-scale prediction (MSP) module to improve the performance and efficiency of semantic segmentation. Prominent characteristics of each layer are effectively processed by incorporating contextual and spatial paths. The module can handle a wide range of object sizes. Additionally, the proposed MSP can also reduce the burden on the computational system.
• The cross-scale fusion (CSF) module is proposed to enhance the fusion of weights from predictions at different scales. Rather than relying on traditional methods such as concatenation or average pooling, the CSF module selects the highlighted weights across all predictions based on the attention mask of the selective-attention mechanism (SAM).
• The selective-attention mechanism (SAM) is the core component of the CSF module. The SAM is used to determine the best and worst areas of the lower-scale prediction and then generates the attention mask. This mask requests complementary information from the higher-scale prediction. Therefore, the SAM module can refine the contribution of the higher-scale prediction.
• MPFNet achieves outstanding results of 78.3% mIoU at 45 FPS on Cityscapes and 45.9% mIoU on Mapillary Vistas. In particular, our method dominates the baseline method in terms of both segmentation performance and inference speed, with a 7.0% mIoU and 27 FPS improvement on the Cityscapes dataset.

Our paper is organized as follows. Related works addressing the same problems in semantic segmentation are reviewed in Section II, our architecture and proposed components are discussed in Section III, experimental results are analyzed in Section IV, and conclusions are given in Section V.

II. RELATED WORKS
A. MULTI-SCALE INFERENCE METHODS
The receptive field is the main factor in extracting information from the input, and it directly affects the quality of the output. A single receptive field can lead to limitations in capturing a wide range of object sizes. In order to obtain multiple FOVs while still remaining lightweight, the MSCFNet method [15] resizes each image to four different sizes as inputs. Each image size is inserted into the pipeline at a different stage to capture multi-scale semantic information. The method can achieve good performance while having a small number of parameters. The MSMA approach [17] deploys an asymptotic neural architecture network to encode the image input and two different sizes to contribute spatial features to the final map. In this way, the method can obtain rich information. The previous approaches have reduced the workload for the system, but their performance needs to be improved. The method of [18] designs a heavy network to extract information.

FIGURE 1: Proposed architecture of the multi-scale prediction network with cross fusion, comprising the FPN backbone, the multi-scale predictions, and the cross-scale fusion (+: addition; C: concatenation; *: multiplication; U: upsample; D: downsample). Insets show the atrous spatial pyramid pooling (ASPP), built from 3x3 atrous convolutions with rates 6, 12, and 18 together with a 1x1 convolution and pooling, and the selective-attention mechanism (SAM).

They use two sizes of images to train the network and can then test the model with three sizes of images. They achieve high accuracy and have flexibility in the inference scales.

B. MULTI-LEVEL FEATURE FUSION METHODS
Semantic segmentation is a task requiring both spatial and semantic information to generate the output. FCN [5] and other deep learning methods [19], [20] are effective at extracting rich contextual information for the final map, but these approaches dramatically lose information about object boundaries at deeper layers. In order to overcome this hindrance, the final map should receive contributions from different stages of features. MLFNet [21] designs two branches to extract information: a context branch is fed into ResNet-18 to extract semantic information, and a spatial branch is simply processed by a series of 3x3 convolutions, max pooling, and average pooling to quickly extract and maintain rich spatial information. SGCPNet [22] utilizes shallow features to guide the context propagation, and the final information is then reconstructed by a scalar-weighted fusion module. The FPN method [16] uses all features of the backbone to generate a feature pyramid and then processes them to contribute to the final score. The MSDNet approach [23] proposed a novel multi-scale decoder to obtain highlighted weights from different resolutions of hierarchical features. On the other hand, multi-scale context intertwining [24] processes each pair of feature maps jointly; the features can share information between them and then enhance the highlighted weights.

C. ATTENTION MECHANISM
We observe that multi-level feature methods can obtain rich spatial and contextual information from different layers of the backbone. However, feature aggregation using a concatenation operation or average pooling affects the prominent weights of each feature. The attention mechanism is proposed to address this drawback. The algorithm uses a biased weight distribution to control the contribution of each feature map. The MSFFM [25] method calculates the gap between two features in the spatial dimension to improve the weights at the boundaries. SABNet [26] deploys an attention framework to decrease the semantic gap and refine information from high to low scales. Compared to previous approaches that only used a single kernel size to generate an attention mask, the MFENet approach [27] used two attention vectors from different-size kernels to refine and fuse feature information.

The kernel sizes of the 3x3 convolutions with different dilation rates determine the most suitable FOV for objects. Lastly, an attention pyramid [28] is used to control the contribution of features; each pixel is processed by three different weight maps to enhance the prediction.

III. OUR METHOD
The proposed architecture of the multi-scale prediction network with cross fusion is illustrated in Figure 1. We devise a novel scheme, called MPFNet, that deploys three different predictions to contribute to the final semantic weights. The overall architecture of MPFNet includes four main parts: the feature pyramid network [16], MSP, CSF, and SAM. We elaborate on the details of these components in turn.

A. FEATURE CHARACTERISTICS
Deep learning is a useful method for extracting features of input images, and it includes several stages of features. There are two main viewpoints at each stage: spatial information and contextual information. In the first stages, the features maintain high spatial information and fundamental patterns. The high-level information is extracted by a series of computations; rich semantics can distinguish between different classes, but the resolution of the features is dramatically reduced. In short, the two characteristics are inversely proportional, and we can build an effective architecture for a specific application. In our paper, we utilize the FPN backbone for semantic segmentation proposed by Seferbekov et al. [16].

The FPN structure uses all feature layers to contribute to the final map, as shown in Figure 1. The four stages of the top-down pathway not only have different dimensions but also contain different characteristics. Firstly, we consider the spatial information of the backbone features. The layers have the same 256 channels, but they obviously differ in width and height. As the backbone, the resolutions of the four stages are 1/4, 1/8, 1/16, and 1/32 with respect to the input size. Secondly, we examine the feature maps at different levels, depicted in Figure 2. FPN 2 contains more coarse information than the others; the heatmap indicates that pixel values at the object boundaries are represented by brighter colors. Oppositely, FPN 5 consists of rich semantic information; this stage has a much higher level of semantic detail, but the boundary information is lost. FPN 3 and FPN 4 are in the middle stages of the backbone, so they contain both contextual and spatial information at medium values.

FIGURE 2: Feature maps at different levels of the FPN backbone on Cityscapes. In the heatmaps, the bright colors represent highlighted weights of the features.

The baseline method [16] concatenates information from all stages and then upsamples to recover the feature resolution. By directly gathering information from four stages, the baseline method causes a drawback: the rich semantic characteristics are reduced by the contribution of a poorer layer.
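For illustration, the following is a minimal PyTorch sketch of such an FPN-style backbone (the module layout and names are ours, not the authors' implementation): a ResNet-50 encoder with 1x1 lateral convolutions and a top-down pathway, producing four 256-channel maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, corresponding to FPN 2 through FPN 5.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.models import resnet50

class FPNBackbone(nn.Module):
    """Minimal FPN over ResNet-50: four 256-channel maps at strides 4/8/16/32."""
    def __init__(self, out_channels=256):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        in_channels = [256, 512, 1024, 2048]
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                      # C2..C5 at strides 4, 8, 16, 32
        # Top-down pathway with lateral connections.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # FPN 2..FPN 5, each with 256 channels.
        return [s(l) for s, l in zip(self.smooth, laterals)]

if __name__ == "__main__":
    fpn2, fpn3, fpn4, fpn5 = FPNBackbone()(torch.randn(1, 3, 512, 1024))
    print([tuple(f.shape) for f in (fpn2, fpn3, fpn4, fpn5)])
    # [(1, 256, 128, 256), (1, 256, 64, 128), (1, 256, 32, 64), (1, 256, 16, 32)]
```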
B. MULTI-SCALE PREDICTION (MSP)
Semantic segmentation is a complex task and requires accu-
Semantic segmentation is a complex task and requires accuracy at the pixel level. In order to meet this requirement, the final map should include both coarse and fine information. The traditional FPN method is a promising structure for extracting information from different stages. In Figure 3a, four layers are processed by 3x3 convolution filters to generate the third feature pyramid. This step provides a deeper understanding of the input content, but it critically burdens the computational system. Inspired by the FPN structure, we devise multi-scale predictions to improve performance and reduce the time consumed by the system, as shown in Figure 3b. As elaborated above, FPN 2 contains the coarsest information, so we take it as the coarse branch for all predictions. The other layers separately combine with the coarse branch to generate one prediction each, as shown in Figure 1.

In the proposed architecture, we design three predictions. The process for each prediction is similar, so we analyze prediction 1 in detail as an example. The feature inputs are FPN 5 and FPN 2. The FPN 5 branch has the main function of contributing contextual information to prediction 1. Therefore, this branch applies atrous spatial pyramid pooling (ASPP) to collect denser feature maps with different receptive fields. The ASPP method utilizes different dilation rates to obtain more receptive fields without increasing the parameter number, as shown at the bottom left of Figure 1. FPN 2 is downsampled to exactly the same size as FPN 5. For the sake of sharing information, the two features are concatenated to complement each other.

FIGURE 3: Architecture comparison for the different number of predictions: (a) a single prediction from the baseline FPN [16]; (b) the proposed multi-scale predictions. In Figure 3b, FC is the fully convolution, and the shared fully convolution includes a series of 3x3 conv, Batch Normalization, ReLU, 3x3 conv, Batch Normalization, ReLU, and 1x1 conv.

Finally, the concatenated feature is processed by a shared fully convolutional network, consisting of a series of 3x3 conv, Batch Normalization, ReLU, 3x3 conv, Batch Normalization, ReLU, and 1x1 conv, to predict the object classes. The procedure can be formulated as equation (1):

x(F) = ρ(Norm(C_{m×n}(Concat[D(F_c); C_{m×n,d}(F_f)])))   (1)

where x(F) represents the result of each prediction (F ∈ R^{C×H×W}), F_c and F_f are the coarse- and fine-information branches, C_{m×n} represents a convolution with kernel size m×n, d is the dilation rate of the ASPP module, ρ is the ReLU activation function, and Norm, Concat, and D are the normalization, concatenation, and downsampling functions, respectively.

The other predictions follow the same process, with FPN 5 replaced by FPN 4 or FPN 3. The dimension of each prediction depends entirely on the size of its contextual branch. In Section III-C, we visualize and analyze the differences between the predictions in detail.
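As a concrete illustration of equation (1), the sketch below implements one prediction branch under our reading of the text and of Figure 1: ASPP with dilated 3x3 convolutions (rates 6, 12, and 18) on the contextual feature, downsampling of the spatial FPN 2 feature, concatenation, and the fully convolutional head described above. The module names, the choice of adaptive average pooling as the downsampling operator, and the per-branch head instantiation are our assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convs plus a 1x1 conv."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class PredictionBranch(nn.Module):
    """One MSP prediction: contextual feature (ASPP) + downsampled coarse FPN 2 feature.
    In the paper the head is shared across the three predictions; sharing can be
    obtained by passing the same head module to every branch."""
    def __init__(self, channels=256, num_classes=19):
        super().__init__()
        self.aspp = ASPP(channels, channels)
        # Fully convolutional head: 3x3-BN-ReLU, 3x3-BN-ReLU, 1x1 classifier.
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_classes, 1))

    def forward(self, fine_ctx, coarse_fpn2):
        # Downsample the spatial branch to the contextual branch's resolution,
        # concatenate, and predict class scores at that scale (equation (1)).
        coarse = F.adaptive_avg_pool2d(coarse_fpn2, fine_ctx.shape[-2:])
        return self.head(torch.cat([self.aspp(fine_ctx), coarse], dim=1))

# Prediction 1 pairs FPN 5 with FPN 2; predictions 2 and 3 swap in FPN 4 or FPN 3.
pred1 = PredictionBranch()(torch.randn(1, 256, 16, 32), torch.randn(1, 256, 128, 256))
print(pred1.shape)  # torch.Size([1, 19, 16, 32])
```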
C. CROSS-SCALE FUSION (CSF)
We present the proposed cross-scale fusion (CSF) module in Figure 4; it takes the three predictions as inputs and effectively aggregates their information. Different fusion strategies can be employed to combine the predictions. In the visualization, bright colors indicate areas with a high probability of being correct. Each prediction is produced from different contextual or spatial features, so each prediction generates good results in different areas or object classes. As shown in Figure 4, prediction 1 performs well in the far-away regions, while the other predictions perform better elsewhere. Specifically, prediction 1 generates the most accurate score map for almost all classes except the front areas. The visualization shows that the best areas of the other predictions correspond to the worst areas of prediction 1. Based on this analysis, we design the proposed CSF module, in which prediction 1 predominantly contributes to the final score. The module incorporates complementary information from the other predictions to refine the results and maximize the probability for each pixel. This approach highlights the key role of the CSF module in leveraging multi-scale predictions for improved segmentation accuracy.

FIGURE 4: Proposed architecture of the cross-scale fusion module (+: element-wise addition; *: element-wise multiplication; U: upsample).

In the first scenario of a standard fusion approach, the three predictions are upsampled to recover the resolution of the input image and then simply added together to aggregate the information. However, this straightforward fusion can reduce the highlighted weights when they are combined with lower-quality predictions. To address this, we apply the CSF module to regulate the contributions of the three predictions; each prediction is refined before contributing to the final map. The core component of our fusion method is the selective-attention mechanism (SAM). The lower-scale prediction is passed through the SAM module to control the output of the higher-scale prediction. The SAM module identifies the best and worst areas of the lower-scale prediction and then adjusts the contribution from the higher-scale prediction. Based on the attention mask generated by the SAM, the module decides to select or neglect the information of the higher-scale prediction for the final map. After this refinement process, the three predictions are combined to obtain prominent and accurate weights.

The procedure of the CSF module can be formulated as equation (2):

y = U[x(F_l)] + U[x(F_m) × U(y_a^l(x(F_l)))] + U[x(F_h) × U(y_a^m(x(F_m)))]   (2)

where y denotes the class probabilities of the final prediction, x(F) denotes the predictions of the MSP module with F ∈ R^{C×H×W}, and l, m, and h refer to the low-, medium-, and high-scale predictions, respectively. y_a is the output of the attention branch, y_a ∈ R^{1×H×W}. Lastly, U is the upsampling function.
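The snippet below mirrors equation (2) under our interpretation: each lower-scale prediction produces a selective-attention mask, represented here by a sam_mask callable (a sketch of that mask is given in Section III-D), which gates the adjacent higher-scale prediction before upsampling and addition. All names are ours, and the bilinear interpolation mode is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_scale_fusion(p_low, p_mid, p_high, sam_mask, out_size):
    """Equation (2): y = U[x(F_l)] + U[x(F_m) * U(a_l)] + U[x(F_h) * U(a_m)].

    p_low/p_mid/p_high are class-score maps from the three predictions,
    ordered from the coarsest (prediction 1) to the finest resolution.
    sam_mask(p) returns a single-channel (1 - alpha) attention mask for p.
    """
    def up(t, size):
        return F.interpolate(t, size=size, mode="bilinear", align_corners=False)

    a_low = up(sam_mask(p_low), p_mid.shape[-2:])    # gate for the mid prediction
    a_mid = up(sam_mask(p_mid), p_high.shape[-2:])   # gate for the high prediction
    return (up(p_low, out_size)
            + up(p_mid * a_low, out_size)
            + up(p_high * a_mid, out_size))

# Toy check with 19 classes and predictions at 1/32, 1/16, and 1/8 scale.
if __name__ == "__main__":
    dummy_mask = lambda p: torch.sigmoid(p.mean(dim=1, keepdim=True))
    y = cross_scale_fusion(torch.randn(1, 19, 16, 32), torch.randn(1, 19, 32, 64),
                           torch.randn(1, 19, 64, 128), dummy_mask, (512, 1024))
    print(y.shape)  # torch.Size([1, 19, 512, 1024])
```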
D. SELECTIVE-ATTENTION MECHANISM (SAM)
In this section, we provide a detailed analysis of the weight-selection procedure. The operating principle of the proposed module is illustrated in the bottom-right block of Figure 1, where an example of the refinement process for prediction 2 is presented. The lower-scale prediction 1 is taken as the input to the SAM module, and the SAM-generated mask is utilized to control the contribution of the higher-scale prediction 2. Initially, we apply a series of functions to reduce the prediction-1 dimension from 3D to 2D. The mask produced by the SAM module has a single-channel depth and retains the best weights of prediction 1; it puts large weights on specific areas, indicated in red within the block. Following this, the pixel values of the mask are subtracted from 1 to update the SAM mask. The updated mask almost removes the best weights of prediction 1 and puts large weights on the other areas. The updated mask is upsampled using linear interpolation with a scaling factor of 2 to align with the resolution of prediction 2. Finally, the SAM mask is applied to prediction 2 through element-wise multiplication. It can directly control the weight contribution of prediction 2 based on the shortage of prediction 1.
IV. EXPERIMENTS
In this section, we provide a detailed analysis of the weight A. DATASETS AND EVALUATION METRICS
selection procedure. The operating principle of the proposed
1) Datasets
module is illustrated in the bottom-right block of Figure 1,
The Cityscapes is a dataset of street scene segmentation.
where an example of the refinement process for prediction_2
Images are collected from 50 different countries around the
is presented. The lower-scale prediction 1 is assumed as the
world and have a high-resolution (1024 × 2048). The data
input to the SAM module, and the SAM-generated mask
contains 20,000 coarse-annotated images and 5,000 fine-
is utilized to control the contribution of the higher-scale
annotated images. In this paper, we use only the fine-
prediction 2. Initially, we apply a series of functions to reduce
annotated set to evaluate the model performance. The fine-
the prediction_1 dimension from 3D to 2D. The SAM mask
annotation has 19 objective classes and is divided into three
produced by the SAM module has a single-channel depth and
subsets which are 2,975 images for the training, 500 images
remains the best weight of prediction 1. The mask puts large
for the validation, and 1,525 images for the test.
weights on specific areas, indicated in red within the block.
Mapillary Vistas is a dataset for computer vision applica-
Following this, the pixel values of the mask are subtracted
tions. The data is a pixel-accurate annotation for semantic
from 1 to update a SAM mask. The updated mask almost re-
segmentation. It is a complex dataset and includes 65 objec-
moves the best weights of prediction 1 and puts large weights
tive classes and has some tough classes such as birds, CCTV
on the other areas. The updated mask is upsampled using
cameras, or potholes. The dataset contains 20,000 images and
linear interpolation with a scaling factor of 2 to align with the
has a wide range of resolutions. The images are split into two
resolution of prediction 2. Finally, the SAM mask is applied
subsets which are a training set with 18,000 images and a
to prediction 2 through element-wise multiplication. It can
validation set with 2,000 images.
directly control the weight contribution of the prediction 2
based on the prediction 1 shortage.
2) Evaluation Metrics
We visualize the prediction_2 refinement in Figure 5. The
prediction_1 directly refines the contribution of the predic- The semantic segmentation performance is measured by
tion_2 output. We have noted some areas by color shapes the intersection-of-union (IoU) algorithm. Accuracy of each
in the figure. First, the green rectangle indicates areas where class is calculated by the following equation 5
prediction_1 provides high-confidence weights, and the pre- T arget ∪ P rediction
diction_2 shows poor performance. In such case, the SAM IoU = (5)
T arget ∩ P rediction
6 VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.


For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2025.3540454

V. T. Quyen, M. Y. Kim: MPFNet: Multiscale Prediction Network with Cross Fusion for Real-Time Semantic Segmentation

TABLE 1: Ablation study for novel components on Cityscapes validation set. MSP represents multi-scale prediction, and CSF
denotes the cross-scale fusion module.
Method BaseNet MSP CSF No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+MSP ResNet50 ✓ 31.4 M 76.5
Ours ResNet50 ✓ ✓ 31.4 M 78.3

TABLE 2: Ablation study for multi-scale prediction on Cityscapes validation set. ASPP is the atrous spatial pyramid pooling
applied to the context branch. Spatial is the spatial FPN 2 branch in the MSP modules, and CSF represents the cross-scale fusion.
Method BaseNet ASPP Spatial CSF No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+Spatial+CSF ResNet50 ✓ ✓ 31.4 M 72.1
FPN+ASPP+CSF ResNet50 ✓ ✓ 31.2 M 74.5
Ours ResNet50 ✓ ✓ ✓ 31.4 M 78.3

TABLE 3: Ablation study for cross-scale fusion on Cityscapes validation set. Concatenation and Addition are the concatenation and addition fusion operations, respectively. SAM represents the selective-attention mechanism of the proposed cross-scale
fusion.
Method BaseNet MSP Concatenation Addition SAM No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+MSP+Concatenate ResNet50 ✓ ✓ 32.0 M 76.1
FPN+MSP+Addition ResNet50 ✓ ✓ 31.4 M 76.5
Ours ResNet50 ✓ ✓ 31.4 M 78.3

The mean intersection-over-union (mIoU) is computed by averaging the accuracy of the classes. It is the main standard used to evaluate the model performance and is given by equation (6):

mIoU = (1/C) Σ_{i=1}^{C} IoU_i   (6)

where C represents the number of dataset classes and i indexes the classes.
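For reference, equations (5) and (6) computed from a confusion matrix are sketched below; this is a common implementation pattern, not the authors' evaluation code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Accumulate a num_classes x num_classes confusion matrix over label maps."""
    mask = target != ignore_index
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou(conf):
    """Equation (5) per class and equation (6) as the mean over classes."""
    inter = np.diag(conf)                              # |Target ∩ Prediction|
    union = conf.sum(0) + conf.sum(1) - inter          # |Target ∪ Prediction|
    iou = inter / np.maximum(union, 1)
    return iou, iou.mean()

if __name__ == "__main__":
    pred = np.random.randint(0, 19, (512, 1024))
    target = np.random.randint(0, 19, (512, 1024))
    per_class, mean = miou(confusion_matrix(pred, target, 19))
    print(per_class.shape, round(float(mean), 4))
```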
B. IMPLEMENTATION DETAILS
Training Loss: we use the cross-entropy algorithm to evaluate the learning process of our model; it measures how well the proposed model is learning. The formula is shown in equation (7):

L = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} [ y_ij log(p_ij) + (1 − y_ij) log(1 − p_ij) ]   (7)

where L denotes the loss value, N is the number of pixels, and C represents the class channels. y is the ground truth and p is the corresponding predicted probability. Lastly, i and j index the pixel position and the class, respectively.
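Equation (7) written out directly over one-hot targets is shown below; in practice a categorical cross-entropy over the logits (for example, nn.CrossEntropyLoss) is the usual equivalent, and this snippet is only a transcription of the stated formula, not the authors' training code.

```python
import torch

def pixelwise_bce(prob, one_hot, eps=1e-7):
    """Equation (7): mean over pixels of the per-class binary cross-entropy.

    prob:    (N, C, H, W) predicted probabilities in (0, 1)
    one_hot: (N, C, H, W) one-hot ground truth
    """
    prob = prob.clamp(eps, 1.0 - eps)
    ce = one_hot * torch.log(prob) + (1.0 - one_hot) * torch.log(1.0 - prob)
    return -ce.sum(dim=1).mean()          # sum over classes, average over pixels

if __name__ == "__main__":
    logits = torch.randn(2, 19, 64, 128)
    target = torch.randint(0, 19, (2, 64, 128))
    one_hot = torch.nn.functional.one_hot(target, 19).permute(0, 3, 1, 2).float()
    print(pixelwise_bce(torch.sigmoid(logits), one_hot).item())
```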
Training Setting: we implement all experiments in the PyTorch framework with CUDA and CuDNN backends. We train Cityscapes on an Nvidia Titan X GPU with 12 GB of memory and Mapillary Vistas on a GeForce RTX 3090 with 24 GB. Training is set up for 150 epochs with a batch size of 2, a polynomial learning-rate schedule with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 1e-4. We use mini-batch stochastic gradient descent to update the training parameters, cross-entropy to compute the loss, and mean intersection-over-union to evaluate the performance.
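A hedged sketch of the described optimization setup follows (SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.01, polynomial decay over 150 epochs with batch size 2); the poly power of 0.9 and the iteration count are our assumptions, since the paper does not state them.

```python
import torch
from torch import nn

model = nn.Conv2d(3, 19, 3, padding=1)        # stand-in for MPFNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

epochs, iters_per_epoch = 150, 1488           # ~2975 training images / batch size 2
total_iters = epochs * iters_per_epoch
poly = lambda it: (1 - it / total_iters) ** 0.9      # assumed poly power of 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

if __name__ == "__main__":
    for it in range(5):                       # real training runs for total_iters steps
        optimizer.step()                      # loss.backward() would precede this
        scheduler.step()
        print(round(optimizer.param_groups[0]["lr"], 6))
```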
C. ABLATION STUDY
1) EFFECTIVENESS OF NOVEL COMPONENTS
To evaluate the effectiveness of our novel modules, we conducted a series of experiments by modifying the decoder architecture and training all variations under identical conditions on the Cityscapes dataset. Based on the FPN architecture [16], we employed the multi-scale prediction (MSP) and cross-scale fusion (CSF) modules to examine their impact. The results of these ablation experiments, with respect to parameter count and segmentation accuracy, are presented in Table 1. The baseline FPN model achieves a performance of 71.3% mIoU with 31.7 M parameters. By deploying the MSP module, the performance increases to 76.5% mIoU; this improvement demonstrates the positive effect of multi-scale predictions. When the FPN backbone is combined with both the MSP and CSF components, the network achieves a significant improvement at 78.3% mIoU. The results highlight the effectiveness of our proposed components in enhancing segmentation performance. In addition to improving accuracy, our architecture has fewer parameters than the baseline method.

2) EFFECTIVENESS OF MULTI-SCALE PREDICTION
To analyze the impact of the proposed MSP module, we performed experiments on the FPN method and three different configurations of MSP. As elaborated in Section III-B, the MSP module comprises two key branches: the contextual path and the spatial path. In our experiments, the network can utilize either a single branch or both branches, which are then fed into the same CSF module. The results demonstrate that all MSP scenarios achieve higher accuracy with fewer parameters than the baseline method, as shown in Table 2.

TABLE 4: The per-category comparison between proposed method and existing approaches on Cityscapes validation set.
Objective categories are road, side walk, building, wall, vegetation, terrain, sky, truck, bus, train, person, rider, car, fence,
pole, traffic light, traffic sign, motorcycle, and bicycle.
Method, followed by per-class IoU in the order Road, swalk, build, wall, veg., terr, sky (uncountable classes); truck, bus, train (large classes); pers, rider, car (medium classes); fence, pole, tlight, tsign, mcle, bicle (small classes); and mIoU in the last column.
ENet [9] 96.3 74.2 85.0 32.2 88.6 61.4 90.6 36.9 50.5 48.1 65.5 38.4 90.6 33.2 43.5 64.1 44.0 38.8 55.4 58.3
AGLNet [29] 97.8 81.0 91.0 51.3 92.3 71.3 94.2 48.4 68.1 42.1 80.1 59.6 93.8 50.6 58.3 63.0 68.5 52.4 67.8 70.1
Edgenet [30] 98.1 83.1 91.6 45.4 92.4 69.7 94.9 50.0 60.9 52.5 80.4 61.1 94.3 50.6 62.6 67.2 71.4 55.3 67.7 71.0
MSCFNet [15] 97.7 82.8 91.0 49.0 92.3 70.2 94.3 50.9 66.1 51.9 82.7 62.7 94.1 52.5 61.2 67.1 71.4 57.6 70.2 71.9
DABNet [31] 98.1 83.0 91.4 51.0 92.7 71.1 94.8 62.5 67.7 61.8 82.7 62.4 94.7 52.8 61.0 66.8 56.3 70.7 71.8 73.8
RefineNet [32] 97.9 81.3 90.3 48.8 91.9 69.4 94.2 56.5 67.5 57.5 79.8 59.8 93.7 47.4 49.6 57.9 67.3 57.7 68.8 73.6
RelaxNet [33] 98.9 84.9 92.2 57.2 93.0 71.8 94.8 58.6 72.7 58.2 83.7 64.4 95.1 54.8 64.3 70.6 74.0 59.9 71.8 74.8
FANet [34] 97.9 83.3 91.6 55.5 91.7 61.8 94.7 76.8 85.1 74.5 78.5 58.1 94.1 55.1 60.3 66.2 74.9 50.7 73.9 75.0
CACNet [35] 98.2 83.4 91.2 50.8 92.4 70.2 94.8 44.7 61.3 48.2 79.8 64.2 95.1 49.1 57.4 67.2 70.3 57.5 69.3 70.8
FPN [16] 97.5 81.6 90.9 46.3 91.3 58.8 93.6 54.0 71.9 54.7 78.4 56.0 93.3 54.2 59.1 63.9 74.3 59.6 74.8 71.3
MPFNet 98.1 84.8 92.4 58.1 92.3 64.5 94.6 80.2 90.2 81.7 81.2 62.4 94.8 62.0 63.1 68.2 76.4 66.5 76.5 78.3

When the spatial path is used and the context path does not pass through the ASPP module, the accuracy is slightly better, by 0.8%. When only a single contextual path with ASPP is utilized, the network performance improves significantly, by 3.2% mIoU, with 0.5 M fewer parameters. When the MSP includes all proposed components, the network improves by 7.0% mIoU compared to the baseline. The results demonstrate that MSP plays an important role in our approach.
3) CROSS-SCALE PREDICTION FOR WEIGHT FUSION
This section analyzes the impact of the cross-scale fusion (CSF) module on our network's performance. We evaluate three different weight-fusion strategies: concatenation, addition, and our proposed SAM. The experimental results are summarized in Table 3. The concatenation fusion approach has 32.0 M parameters and achieves a 4.9% mIoU improvement compared to the baseline method. Alternatively, when the network employs an addition operation, the parameter count slightly decreases to 31.4 M while reaching a performance of 76.5% mIoU. The result shows that the addition operation significantly improves the accuracy. However, addition fusion is not optimal and can reduce the highlighted weights when they are combined with lower-quality predictions. Our proposed SAM is the core component of the CSF module and is used to enhance the weight fusion from the different predictions. When the SAM module is employed in the architecture, the performance improves by 1.8% mIoU over the best result achieved by the other fusion methods. The results demonstrate that the CSF module equipped with SAM can significantly boost accuracy without increasing the model's parameter count.
D. RESULTS ON CITYSCAPES
We conduct quantitative experiments to verify the effectiveness of our proposed method on the Cityscapes dataset. In Table 4, we compare our method with other approaches in terms of per-category accuracy. We group similar categories into the same subset to evaluate the performance: all stuff classes are assigned to the uncountable set, and thing classes are divided into three subsets based on their object sizes. As can be observed from Table 4, the results show that our MPFNet achieves better reliability for all subsets, i.e., for all object sizes. Most approaches achieve satisfactory results for the uncountable categories, with the road class exceeding 96% accuracy. For the large group, we obtain outstanding results of 80.2% for trucks, 90.2% for buses, and 81.7% for trains, while the other methods struggle with these large objects. In spite of their good overall performance, RefineNet [32] and RelaxNet [33] only yield accuracies ranging from 56.5% to 72.7% for the large group. Next, the medium-sized car class reaches around 90% accuracy for all methods. Compared to the well-known ENet [9], our method ranks at the top for the other medium-sized classes, where the ENet method struggles. The most challenging subset consists of the small-object classes, where many methods mislabel approximately 50% of the predictions. Despite this challenge, our proposed method maintains strong performance for these smaller objects. The quantitative results validate the effectiveness of our multi-scale prediction with cross-scale fusion and demonstrate robust performance across a wide range of object sizes.

In Table 5, we compare our proposed method to other existing methods in terms of performance and efficiency. MPFNet obtains impressive results with 78.3% mIoU and 45 FPS on the Cityscapes dataset. Compared to the baseline FPN method [16], our approach demonstrates significant improvements, achieving a 7.0% increase in mIoU and a 27 FPS speedup. MPFNet also outperforms previous state-of-the-art networks. In particular, we surpass the ENet method [9] with a 20% mIoU enhancement and twice the inference speed. When compared to ADANet [28], the method with the second-best performance, MPFNet proves to be far more efficient and achieves three times the inference speed. Although AGLNet [29] addresses time consumption, our method still achieves an 8.2% higher mIoU while nearly reaching its inference speed. Furthermore, MPFNet surpasses EdgeNet [30] with a 7.3% improvement in mIoU and a significantly faster inference speed.

While approaches such as SwiftNet [36], HyperSeg [37], and BiseNetV2 [38] show remarkable performance, our method still achieves higher mIoU scores than them, with 2.9%, 2.1%, and 2.5% improvements, respectively. The results demonstrate that MPFNet achieves an effective balance between performance and efficiency, making it suitable for real-time semantic segmentation applications.
Figure 6 presents a qualitative comparison between the baseline method and our proposed MPFNet on the Cityscapes dataset. To complement the quantitative analysis, we selected examples that represent various object sizes for visualization. Comparative objects are highlighted with yellow boxes, and small-class objects are zoomed in for clarity. The results indicate that MPFNet accurately predicts the entire structure of large objects, whereas the baseline method misclassifies some pixels. Both approaches perform well on medium-sized objects such as cars. When the rider and bicycle classes overlap each other, MPFNet produces sharper and more precise segmentation. This improvement is attributed to the proposed cross-scale fusion and selective-attention mechanism, which refine the predictions before they contribute to the final output. Small classes such as traffic lights, traffic signs, and poles have narrow structures and are challenging to detect; the baseline method fails to recognize these objects or completely loses the relevant information, whereas MPFNet provides smoother and more distinct predictions for them. Overall, the qualitative results demonstrate that MPFNet achieves superior segmentation across multiple object sizes in complex street scenes and proves its robustness and reliability for real-world semantic segmentation tasks.

FIGURE 6: The comparative results of the baseline FPN method and our MPFNet approach on the Cityscapes validation set.
TABLE 5: Performance and efficiency comparison between our approach and other methods on Cityscapes dataset.
Method Resolution mIoU FPS
ENet [9] 1024×2048 58.3 21
AGLNet [29] 1024×512 70.1 52
Edgenet [30] 1024×512 71.0 30
DualNet [39] 1024×2048 75.5 51
SwiftNet [36] 1024×2048 75.4 39
ADANet [28] 1024×2048 77.3 15
HyperSeg [37] 1024×512 76.2 35
BiseNetV2 [38] 1024×512 75.8 47
FPN [16] 1024×2048 71.3 18
MPFNet (ours) 1024×2048 78.3 45
E. RESULTS ON MAPILLARY VISTAS
In this section, we conduct experiments on a complex dataset to evaluate the superiority of our model compared to other state-of-the-art approaches. Table 6 presents the results for all methods on a high-resolution input of 1024×2048 pixels. Our method achieves a performance of 45.9% mIoU. For the baseline model [16], which does not employ our novel components, the performance drops by 5.7% compared to our approach. MPFNet also surpasses the accuracy of the other methods. In particular, our approach surpasses the DABNet method [40] by 16.3%. Although MSSNet [41] achieves competitive accuracy as one of the leading SOTA methods, our method still shows a 3.3% mIoU improvement. Furthermore, our model achieves a 15.2% improvement over AGLNet [29] and a 4.2% improvement over RGPNet [42]. These results highlight the effectiveness of the proposed MPFNet in handling complex datasets and show higher accuracy compared to both the baseline and existing state-of-the-art approaches.

TABLE 6: Performance comparison between our approach and other methods on Mapillary Vistas.
Method Resolution mIoU (%)
DABNet [40] 1024×2048 29.6
AGLNet [29] 1024×2048 30.7
RGPNet [42] 1024×2048 41.7
MSSNet [41] 1024×2048 42.6
FPN [16] 1024×2048 40.2
MPFNet (ours) 1024×2048 45.9

The qualitative comparison between the baseline method and our proposed MPFNet on the Mapillary Vistas dataset is illustrated in Figure 7. Yellow boxes are used to highlight specific objects for a detailed comparison. Because this dataset contains some challenging categories assigned as noise classes, such as ground animals or mailboxes, the overall performance of all methods is relatively limited. However, the visual results demonstrate that MPFNet produces good results for the common classes in street scenes. By employing the SAM module, our approach refines and selects highlighted weights from all predictions, and our segmentation achieves significantly better quality compared to the baseline. For instance, where the baseline method misclassifies large objects like bridges, the MPFNet prediction practically matches the ground-truth labels. For the crosswalk and car classes, our method provides more precise segmentation than the FPN method. Some traffic lights are far away and very narrow; MPFNet is able to produce clearer predictions, whereas the baseline method often fails to detect these objects. In conclusion, the qualitative analysis indicates that MPFNet effectively handles challenging scenarios in complex street scenes.

FIGURE 7: The comparative results of the baseline FPN method and our MPFNet approach on the Mapillary Vistas.

F. DISCUSSION
Based on the feature pyramid network (FPN), we upgraded the decoder part to enhance the semantic segmentation performance and efficiency. The encoder of the FPN can provide both context and spatial information from multi-resolution features, so we reused this encoder backbone. Firstly, we deployed the multi-scale prediction (MSP) module to improve feature extraction by incorporating contextual and spatial branches. This module enables the network to handle a wide range of object sizes in street scenes. The proposed MSP module not only significantly boosts the segmentation accuracy but also slightly reduces the model parameters compared to the original baseline decoder.
Secondly, the cross-scale fusion (CSF) module further refines the segmentation results by effectively fusing weights from the different-scale predictions. When the CSF module is equipped with the selective-attention mechanism (SAM), our approach selects highlighted features and adjusts the contribution of each prediction to the final feature map while remaining computationally efficient. The limitation of our proposed method is that the selective-attention mechanism is used multiple times, so it can consume more resources. However, the proposed approach outperforms the baseline method and remains efficient for deployment in real-world scenarios.
V. CONCLUSION
A novel multiscale prediction network with cross fusion, called MPFNet, is proposed in this study. The proposed method adopts the FPN framework and uses four feature layers to obtain both coarse and fine-grained information. The architecture deploys the MSP module to process the prominent characteristics of the feature pyramid; the MSP component improves the model performance and reduces the system computation. The CSF is designed to collect weights across the multiple-scale predictions. Within the CSF mechanism, the SAM is utilized to directly control the contribution of the higher adjacent predictions; the SAM module can remove bad values and select highlighted weights for the final output. We have evaluated the method on the Cityscapes and Mapillary Vistas datasets. The experimental results show that MPFNet is effective compared with the baseline and other SOTA approaches. Our method not only delivers high performance over a wide range of object sizes but also accelerates the speed for real-time applications. In the future, we intend to work on the feature extraction in order to further enhance the semantic accuracy.

REFERENCES
[1] G. Rossolini, F. Nesti, G. D'Amico, S. Nair, A. Biondi, and G. Buttazzo, "On the real-world adversarial robustness of real-time semantic segmentation models for autonomous driving," IEEE Transactions on Neural Networks and Learning Systems, 2023.
[2] K. Li, W. Tao, and L. Liu, "Online semantic object segmentation for vision robot collected video," IEEE Access, vol. 7, pp. 107602–107615, 2019.
[3] H. J. Lee, J. U. Kim, S. Lee, H. G. Kim, and Y. M. Ro, "Structure boundary preserving segmentation for medical image with ambiguous boundary," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4817–4826.
[4] K. O'Shea and R. Nash, "An introduction to convolutional neural networks," arXiv preprint arXiv:1511.08458, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.


[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[7] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, "Denseaspp for semantic segmentation in street scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
[10] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[12] J. Liu, F. Zhang, Z. Zhou, and J. Wang, "Bfmnet: Bilateral feature fusion network with multi-scale context aggregation for real-time semantic segmentation," Neurocomputing, vol. 521, pp. 27–40, 2023.
[13] M. Zhuang, X. Zhong, D. Gu, L. Feng, X. Zhong, and H. Hu, "Lrdnet: A lightweight and efficient network with refined dual attention decorder for real-time semantic segmentation," Neurocomputing, vol. 459, pp. 349–360, 2021.
[14] M. Shi, J. Shen, Q. Yi, J. Weng, Z. Huang, A. Luo, and Y. Zhou, "Lmffnet: a well-balanced lightweight network for fast and accurate semantic segmentation," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[15] G. Gao, G. Xu, Y. Yu, J. Xie, J. Yang, and D. Yue, "Mscfnet: a lightweight network with multi-scale context fusion for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, 2021.
[16] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, "Feature pyramid network for multi-class land segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 272–275.
[17] B. Xie, Z. Yang, L. Yang, R. Luo, A. Wei, X. Weng, and B. Li, "Multi-scale fusion with matching attention model: A novel decoding network cooperated with nas for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 12622–12632, 2021.


[18] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
[19] Z. Wu, C. Shen, and A. v. d. Hengel, “High-performance semantic segmentation using very deep fully convolutional networks,” arXiv preprint arXiv:1604.04339, 2016.
[20] Z. Zhong, J. Li, W. Cui, and H. Jiang, “Fully convolutional networks for building and road extraction: Preliminary results,” in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016, pp. 1591–1594.
[21] J. Fan, F. Wang, H. Chu, X. Hu, Y. Cheng, and B. Gao, “Mlfnet: Multi-level fusion network for real-time semantic segmentation of autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 756–767, 2022.
[22] S. Hao, Y. Zhou, Y. Guo, R. Hong, J. Cheng, and M. Wang, “Real-time semantic segmentation via spatial-detail guided context propagation,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[23] A. Fateh, M. R. Mohammadi, and M. R. J. Motlagh, “Msdnet: Multi-scale decoder for few-shot semantic segmentation via transformer-guided prototyping,” arXiv preprint arXiv:2409.11316, 2024.
[24] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, “Multi-scale context intertwining for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 603–619.
[25] J. Fan, M. J. Bocus, B. Hosking, R. Wu, Y. Liu, S. Vityazev, and R. Fan, “Multi-scale feature fusion: Learning better semantic segmentation for road pothole detection,” in 2021 IEEE International Conference on Autonomous Systems (ICAS). IEEE, 2021, pp. 1–5.
[26] X. Ding, C. Shen, T. Zeng, and Y. Peng, “Sab net: A semantic attention boosting framework for semantic segmentation,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[27] B. Zhang, W. Li, Y. Hui, J. Liu, and Y. Guan, “Mfenet: Multi-level feature enhancement network for real-time semantic segmentation,” Neurocomputing, vol. 393, pp. 54–65, 2020.
[28] W. Wang, S. Wang, Y. Li, and Y. Jin, “Adaptive multi-scale dual attention network for semantic segmentation,” Neurocomputing, vol. 460, pp. 39–49, 2021.
[29] Q. Zhou, Y. Wang, Y. Fan, X. Wu, S. Zhang, B. Kang, and L. J. Latecki, “Aglnet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network,” Applied Soft Computing, vol. 96, p. 106682, 2020.
[30] H.-Y. Han, Y.-C. Chen, P.-Y. Hsiao, and L.-C. Fu, “Using channel-wise attention for deep cnn based real-time semantic segmentation with class-aware edge information,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 2, pp. 1041–1051, 2020.
[31] G. Li, S. Jiang, I. Yun, J. Kim, and J. Kim, “Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes,” IEEE Access, vol. 8, pp. 27495–27506, 2020.
[32] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[33] J. Liu, X. Xu, Y. Shi, C. Deng, and M. Shi, “Relaxnet: Residual efficient learning and attention expected fusion network for real-time semantic segmentation,” Neurocomputing, vol. 474, pp. 115–127, 2022.
[34] P. Hu, F. Perazzi, F. C. Heilbron, O. Wang, Z. Lin, K. Saenko, and S. Sclaroff, “Real-time semantic segmentation with fast attention,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 263–270, 2020.
[35] X. Xu, S. Huang, and H. Lai, “Lightweight semantic segmentation network leveraging class-aware contextual information,” IEEE Access, 2023.
[36] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, “In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12607–12616.
[37] Y. Nirkin, L. Wolf, and T. Hassner, “Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
[38] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
[39] Q. Van Toan and M. Y. Kim, “Dual-inferences mechanism for real-time semantic segmentation,” in 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2022, pp. 12–17.
[40] G. Li, I. Yun, J. Kim, and J. Kim, “Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation,” arXiv preprint arXiv:1907.11357, 2019.
[41] Q. Van Toan and M. Y. Kim, “Multi-scale synergy approach for real-time semantic segmentation,” in 2022 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). IEEE, 2022, pp. 216–220.
[42] E. Arani, S. Marzban, A. Pata, and B. Zonooz, “Rgpnet: A real-time general purpose semantic segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3009–3018.

Toan Van Quyen received the B.S. degree in Electrical Engineering from Thai Nguyen University of Technology, Thai Nguyen, Vietnam, in 2019, and the M.S. degree in Electronic and Electrical Engineering from the IT College, Kyungpook National University, Daegu, South Korea, in 2022. He is currently pursuing the Ph.D. degree with the School of Electronic and Electrical Engineering, Computer Science, Kyungpook National University. His research interests include computer vision, deep learning, and semantic segmentation.

Min Young Kim (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the Korea Advanced Institute of Science and Technology, South Korea, in 1996, 1998, and 2004, respectively. He was a Senior Researcher with Mirae Corporation from 2004 to 2005. He was also a Chief Research Engineer in artificial vision systems for intelligent machines and robots with Kohyoung Corporation from 2005 to 2009. Since 2009, he has been with the School of Electronic and Electrical Engineering, Computer Science, Kyungpook National University, as an Assistant Professor. He was a Visiting Associate Professor with the Department of Electrical and Computer Engineering and the School of Medicine, Johns Hopkins University, from 2014 to 2016. He is currently an Associate Professor with the School of Electronics Engineering, Kyungpook National University. He is also a Deputy Director with the KNU-LG Convergence Research Center and the Director of the Research Center for Neurosurgical Robotic Systems. His research interests include visual intelligence for robotic perception and the recognition of autonomous unmanned ground and aerial vehicles.
