MPFNet: Multiscale Prediction Network with Cross Fusion for Real-Time Semantic Segmentation
ABSTRACT Semantic segmentation currently plays an important role in computer vision and is widely applied in both industry and daily life. Self-driving cars are among the most prominent applications, assisting humans in making informed decisions. A self-driving application has to interpret visual information from street scenes. However, effectively segmenting objects over a wide range of sizes remains a challenging problem. A feature pyramid network (FPN) builds an architecture that processes four different features to contribute contextual and spatial information to the final map. Each feature suitably processes a specific range of object sizes. Nevertheless, the final feature combination is not optimal: it raises the computation cost and dilutes the semantic weights. We propose a multi-scale prediction network with cross-fusion to address these drawbacks. The prediction module consists of three different predictions that allow the architecture to efficiently extract information at various sizes. Each prediction is generated from a pair of feature pyramid levels and is used to predict object classes. Furthermore, the cross-scale fusion is designed to enhance the weight aggregation of the final score map. The core component of the cross-fusion is the selective-attention mechanism, which determines uncertain weights of the lower prediction and then selects the complement from the adjacent prediction. With the proposed scheme, we achieve 78.3% mIoU at 45 FPS on Cityscapes and 45.9% mIoU on Mapillary Vistas. Our method outperforms the baseline with a 7.0% mIoU improvement and a 27 FPS speedup on the Cityscapes dataset. The experimental results demonstrate that the proposed model achieves a reasonable balance between performance and efficiency.
INDEX TERMS Real-time semantic segmentation, attention mechanism, multi-scale prediction, context
fusion, feature pyramid network.
…tion. Some approaches addressed this limitation by utilizing transposed convolution [7] or deep residual learning [8].

Given the aforementioned requirements of autonomous driving, previous methods still consume too much computation time. The ENet [9] method adapts ResNet and reduces the size of shallower features to quickly extract information. It processes 18 times faster than previous methods while achieving similar accuracy. ERFNet [10] further improved the ENet architecture by using residual connections and convolutions with 1D kernels to gain better performance.

Autonomous driving is a tough application because it has to deal with objects of various sizes, which can be as thin as poles and traffic lights or as huge as trucks and bridges. The field of view (FOV) directly affects the extracted information. When the FOV is large enough to capture huge objects, it may also contain many different small objects as inputs to generate a one-pixel value. Conversely, a small FOV is suitable for gathering narrow structures but can lose the global information of large objects. Some approaches focused on converting traditional convolution into dilated convolution, which can obtain different FOVs by changing the dilation rates [11]–[13]. Multiple FOVs can also be obtained by adjusting the input-image sizes [14], [15]. By deploying multi-scale inference, objects are extracted by suitable FOVs from beginning to end, but the network needs to process a single image multiple times. The feature pyramid network (FPN) is applied not only to object detection but is also effective for multi-class semantic segmentation [16]. This method obtains spatial and semantic information from different feature layers and has various receptive fields to efficiently extract a wide range of object sizes. However, the decoder part of this method is not optimal, for the following reasons:

• Reduced semantic weights: The encoder of the FPN network effectively extracts different levels of semantic information. However, the decoder concatenates all features together to obtain contextual and spatial information for the final feature, which dilutes the semantic weights. The semantically rich layer is averaged with the poorer ones from the other layers.

• Burden on the computational system: the decoder burdens the system when it uses 3x3 convolutions to continuously extract semantic weights and generate the third column of the feature pyramid. Additionally, the final feature map, which has 512 channels at a high resolution, also burdens the computation system. The decoder scheme is illustrated in Figure 3a.

A novel multi-scale prediction network with cross fusion, called MPFNet, is proposed in this paper to overcome the aforementioned limitations. We adapt the backbone from the FPN method [16] and improve the feature combination. We design multi-scale predictions to properly process the characteristics of each feature. Each prediction is supplied by two feature layers, one containing rich coarse information and the other containing rich fine information. A novel cross-scale fusion with a selective-attention mechanism is employed to combine weights from all predictions.

The contributions of this paper can be summarized as follows:
• We designed the multi-scale prediction (MSP) module to improve the performance and efficiency of semantic segmentation. The prominent characteristics of each layer are effectively processed by incorporating contextual and spatial paths. The module can handle a wide range of object sizes. Additionally, the proposed MSP also reduces the burden on the computational system.
• The cross-scale fusion (CSF) module is proposed to enhance the fusion of weights from predictions at different scales. Rather than relying on traditional methods such as concatenation or average pooling, the CSF module selects the highlighted weights across all predictions based on the attention mask of the selective-attention mechanism (SAM).
• The selective-attention mechanism (SAM) is the core component of the CSF module. The SAM determines the best and worst areas of the lower-scale prediction and then generates the attention mask. This mask requests complementary information from the higher-scale prediction. Therefore, the SAM module can refine the contribution of the higher-scale prediction.
• MPFNet achieves outstanding results of 78.3% mIoU at 45 FPS on Cityscapes and 45.9% mIoU on Mapillary Vistas. In particular, our method dominates the baseline in terms of both segmentation performance and inference speed, with a 7.0% mIoU and 27 FPS improvement on the Cityscapes dataset.

Our paper is organized as follows. Related works addressing the same problems in semantic segmentation are reviewed in Section II, our architecture and proposed components are discussed in Section III, experimental results are analyzed in Section IV, and conclusions are drawn in Section V.

II. RELATED WORKS
A. MULTI-SCALE INFERENCE METHODS
The receptive field is the main factor in extracting information from the input, and it directly affects the quality of the output. A single receptive field leads to limitations in capturing a wide range of object sizes. In order to obtain multiple FOVs while remaining lightweight, the MSCFNet method [15] resizes each image to four different sizes as inputs. Each image size is inserted into the pipeline at a different stage to capture multi-scale semantic information. The method achieves good performance with a small number of parameters. The MSMA approach [17] deploys an asymptotic neural architecture network to encode the image input and two additional sizes to contribute spatial features to the final map. In this way, the method obtains rich information. These approaches reduce the workload for the system, but their performance still needs to be improved. Method [18] designs a heavy network to extract information.
[Figure 1 diagram: FPN backbone features FPN2–FPN5 (reduced to 256 channels each) feed three predictions; one inset details the atrous spatial pyramid pooling (ASPP) with 3x3 atrous convolutions at rates 6, 12, and 18, a 1x1 convolution, and image pooling; another inset details the selective-attention mechanism (SAM), in which Prediction 1 yields attention masks Attn and 1 − Attn that gate Prediction 2. Legend: + addition, C concatenation, × multiplication, U upsample, D downsample.]
FIGURE 1: Proposed architecture of the multi-scale prediction network with cross fusion.
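Since the ASPP inset specifies the context branch (3x3 atrous convolutions at rates 6, 12, and 18, a 1x1 convolution, and image pooling), it can be sketched directly. The following PyTorch snippet is our reading of the diagram rather than the authors' released code; the 256-channel width and the final 1x1 projection are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Context branch sketched from Figure 1: one 1x1 conv, three 3x3
    atrous convs at rates 6/12/18, and a global image-pooling branch."""
    def __init__(self, in_ch: int = 256, out_ch: int = 256):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.atrous = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
            for r in (6, 12, 18)  # dilation rates labelled in the figure
        ])
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        # Assumed projection back to out_ch after concatenating 5 branches.
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [self.conv1x1(x)] + [conv(x) for conv in self.atrous]
        feats.append(F.interpolate(self.pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```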
It uses two sizes of images to train the network and can then test the model with three sizes of images. It achieves high accuracy and provides flexibility in inference scales.

B. MULTI-LEVEL FEATURE FUSION METHODS
Semantic segmentation requires both spatial and semantic information to generate outputs. FCN [5] and other deep learning methods [19], [20] effectively extract rich contextual information for the final map, but these approaches dramatically lose object-boundary information at deeper layers. To overcome this hindrance, the final map should receive contributions from features at different stages. MLFNet [21] designs two branches to extract information. A context branch is fed into ResNet-18 to extract semantic information, and a spatial branch is simply processed by a series of 3x3 convolutions, max pooling, and average pooling to quickly extract and maintain rich spatial information. SGCPNet [22] utilizes shallow features to guide context propagation, and the final information is then reconstructed by a scalar-weighted fusion module. The FPN method [16] uses all features of the backbone to generate a feature pyramid and then processes them to contribute to the final score. The MSDNet approach [23] proposed a novel multi-scale decoder to obtain highlighted weights from different resolutions of hierarchical features. On the other hand, multi-scale context intertwining [24] jointly processes each pair of feature maps; the features share information with each other and thereby enhance the highlighted weights.

C. ATTENTION MECHANISM
We note that multi-level feature methods can obtain rich spatial and contextual information from different layers of the backbone. However, feature aggregation using a concatenation operation or average pooling weakens the prominent weights of each feature. The attention mechanism is proposed to address this drawback. The algorithm uses a biased weight distribution to control the contribution of each feature map. The MSFFM method [25] calculates the gap between two features in the spatial dimension to improve the weights at the boundaries. SABNet [26] deploys an attention framework to decrease the semantic gap and refine information from high to low scales. Compared to previous approaches that only use a single kernel size to generate an attention mask, the MFENet approach [27] uses two attention vectors from…
[Figure 3 diagrams: (a) shows the FPN backbone feeding a single 512-channel decoder head; (b) shows each 128-channel pyramid feature (FPN5, FPN4, FPN3) concatenated (C) with a downsampled (D) spatial feature and passed through a shared FC head to produce Predictions 1–3.]
FIGURE 3: Architecture comparison for different numbers of predictions: (a) a single prediction from the baseline FPN [16]; (b) the proposed multi-scale predictions. In Figure 3b, FC is the fully convolutional head, and the shared FC head consists of a series of 3x3 conv, batch normalization, ReLU, 3x3 conv, batch normalization, ReLU, and 1x1 conv.
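Since the caption fully specifies the shared FC head, a minimal PyTorch sketch follows. Only the layer sequence comes from the caption; the 256-channel input (two concatenated 128-channel features) and the 19-class Cityscapes output are our assumptions.

```python
import torch.nn as nn

def shared_fc_head(in_ch: int = 256, mid_ch: int = 128,
                   num_classes: int = 19) -> nn.Sequential:
    """Shared prediction head from the Figure 3 caption:
    3x3 conv, BN, ReLU, 3x3 conv, BN, ReLU, 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, num_classes, 1),  # per-pixel class scores
    )
```

Sharing one head across the three concatenated feature pairs is consistent with the parameter reduction over the baseline's 512-channel decoder reported in Table 1.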
[Figure diagram: inside the SAM, Prediction 1 passes through a 3x3 conv, ReLU, another 3x3 conv, and a sigmoid; the complement (1 − α) of the resulting mask gates the adjacent prediction before it contributes to the final map.]
The procedure of the CSF module can be formulated as Equation 2:

y = U[x(F_l)] + U[x(F_m) × U(y_a^l(x(F_l)))] + U[x(F_h) × U(y_a^m(x(F_m)))]   (2)

where y denotes the class probabilities of the final prediction, x(F) is a prediction of the MSP module with F ∈ R^(C×H×W), and 'l', 'm', 'h' denote the low-, medium-, and high-scale predictions, respectively. y_a is the output of the attention branch, y_a ∈ R^(1×H×W), with y_a^l and y_a^m computed from the low- and medium-scale predictions. Lastly, U is the upsampling function.

The procedure of the SAM module can be formulated as Equation 3:

y_rf(F) = U(y_a(F_lap)) × x(F)   (3)

Where prediction_1 is already confident, the SAM module almost neglects these regions in prediction_2 by degrading the pixel values to nearly zero. Conversely, when prediction_1 contains some low-confidence values, marked with the blue square, the SAM module requests complementary information from the higher-scale prediction_2. These weights from prediction_2 can enhance the overall prediction accuracy. This analysis provides a clear understanding of the SAM module and how it effectively controls the contribution of the higher-scale prediction_2. The SAM module is further validated by the quantitative results in the ablation study section.
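To make Equations 2 and 3 concrete, here is a hedged PyTorch sketch of the SAM attention branch and the cross-scale fusion. The 3x3 conv, ReLU, 3x3 conv, sigmoid sequence and the complement mask follow the SAM figure; the hidden width, class count, and interpolation mode are assumptions, so this illustrates the fusion rule rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAM(nn.Module):
    """Attention branch y_a: outputs are high where the lower-scale
    prediction is uncertain, so the adjacent prediction can complement it."""
    def __init__(self, num_classes: int = 19, mid_ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(num_classes, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, pred: torch.Tensor) -> torch.Tensor:
        alpha = self.body(pred)   # confidence mask of the lower prediction
        return 1.0 - alpha        # complement gates the next-scale prediction

def up(x: torch.Tensor, size) -> torch.Tensor:
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)

def cross_scale_fusion(p_l, p_m, p_h, sam_l: SAM, sam_m: SAM):
    """Equation 2: y = U[x(F_l)] + U[x(F_m) x U(y_a^l)] + U[x(F_h) x U(y_a^m)]."""
    out_size = p_h.shape[-2:]                                    # finest grid
    y = up(p_l, out_size)                                        # low term
    y = y + up(p_m * up(sam_l(p_l), p_m.shape[-2:]), out_size)   # medium term
    y = y + p_h * up(sam_m(p_m), out_size)                       # high term
    return y
```

Each `p_*` is a class-score map from the MSP module at increasing resolution; the gated products realize Equation 3 before the sums of Equation 2 aggregate them at the finest prediction scale.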
TABLE 1: Ablation study for novel components on Cityscapes validation set. MSP represents multi-scale prediction, and CSF
denotes the cross-scale fusion module.
Method BaseNet MSP CSF No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+MSP ResNet50 ✓ 31.4 M 76.5
Ours ResNet50 ✓ ✓ 31.4 M 78.3
TABLE 2: Ablation study for multi-scale prediction on the Cityscapes validation set. ASPP is the atrous spatial pyramid pooling applied to the context branch, Spatial is the spatial FPN 2 branch in the MSP module, and CSF represents the cross-scale fusion.
Method BaseNet ASPP Spatial CSF No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+Spatial+CSF ResNet50 ✓ ✓ 31.4 M 72.1
FPN+ASPP+CSF ResNet50 ✓ ✓ 31.2 M 74.5
Ours ResNet50 ✓ ✓ ✓ 31.4 M 78.3
TABLE 3: Ablation study for cross-scale fusion on the Cityscapes validation set. Concatenation and addition are the concatenation and addition fusion operations, respectively. SAM represents the selective-attention mechanism of the proposed cross-scale fusion.
Method BaseNet MSP Concatenation Addition SAM No. params mIoU (%)
FPN [16] ResNet50 31.7 M 71.3
FPN+MSP+Concatenate ResNet50 ✓ ✓ 32.0 M 76.1
FPN+MSP+Addition ResNet50 ✓ ✓ 31.4 M 76.5
Ours ResNet50 ✓ ✓ 31.4 M 78.3
TABLE 4: Per-category comparison between the proposed method and existing approaches on the Cityscapes validation set. The object categories are road, sidewalk, building, wall, vegetation, terrain, sky, truck, bus, train, person, rider, car, fence, pole, traffic light, traffic sign, motorcycle, and bicycle.
Method | Uncountable: road, swalk, build, wall, veg., terr, sky | Large: truck, bus, train | Medium: pers, rider, car | Small: fence, pole, tlight, tsign, mcle, bicle | mIoU
ENet [9] 96.3 74.2 85.0 32.2 88.6 61.4 90.6 36.9 50.5 48.1 65.5 38.4 90.6 33.2 43.5 64.1 44.0 38.8 55.4 58.3
AGLNet [29] 97.8 81.0 91.0 51.3 92.3 71.3 94.2 48.4 68.1 42.1 80.1 59.6 93.8 50.6 58.3 63.0 68.5 52.4 67.8 70.1
Edgenet [30] 98.1 83.1 91.6 45.4 92.4 69.7 94.9 50.0 60.9 52.5 80.4 61.1 94.3 50.6 62.6 67.2 71.4 55.3 67.7 71.0
MSCFNet [15] 97.7 82.8 91.0 49.0 92.3 70.2 94.3 50.9 66.1 51.9 82.7 62.7 94.1 52.5 61.2 67.1 71.4 57.6 70.2 71.9
DABNet [31] 98.1 83.0 91.4 51.0 92.7 71.1 94.8 62.5 67.7 61.8 82.7 62.4 94.7 52.8 61.0 66.8 56.3 70.7 71.8 73.8
RefineNet [32] 97.9 81.3 90.3 48.8 91.9 69.4 94.2 56.5 67.5 57.5 79.8 59.8 93.7 47.4 49.6 57.9 67.3 57.7 68.8 73.6
RelaxNet [33] 98.9 84.9 92.2 57.2 93.0 71.8 94.8 58.6 72.7 58.2 83.7 64.4 95.1 54.8 64.3 70.6 74.0 59.9 71.8 74.8
FANet [34] 97.9 83.3 91.6 55.5 91.7 61.8 94.7 76.8 85.1 74.5 78.5 58.1 94.1 55.1 60.3 66.2 74.9 50.7 73.9 75.0
CACNet [35] 98.2 83.4 91.2 50.8 92.4 70.2 94.8 44.7 61.3 48.2 79.8 64.2 95.1 49.1 57.4 67.2 70.3 57.5 69.3 70.8
FPN [16] 97.5 81.6 90.9 46.3 91.3 58.8 93.6 54.0 71.9 54.7 78.4 56.0 93.3 54.2 59.1 63.9 74.3 59.6 74.8 71.3
MPFNet 98.1 84.8 92.4 58.1 92.3 64.5 94.6 80.2 90.2 81.7 81.2 62.4 94.8 62.0 63.1 68.2 76.4 66.5 76.5 78.3
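For reference, the mIoU reported in Tables 1–6 is the mean over classes of the intersection-over-union. A minimal NumPy sketch written from the standard definition (not the authors' evaluation code) is:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray,
             num_classes: int = 19, ignore: int = 255) -> float:
    """Per-class IoU from a confusion matrix, averaged over classes that
    appear; 19 classes and ignore label 255 follow the Cityscapes setup."""
    valid = gt != ignore
    pred, gt = pred[valid], gt[valid]
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, -1)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return float(iou[union > 0].mean())
```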
When the spatial path is used and the context path does not pass through the ASPP module, it brings a slight 0.8% accuracy gain. When only a single contextual path with ASPP is utilized, the network performance improves significantly by 3.2% mIoU with 0.5M fewer parameters. When MSP includes all proposed components, the network shows a critical increase of 7.0% mIoU compared to the baseline. The results demonstrate that MSP plays an important role in our approach.

3) CROSS-SCALE PREDICTION FOR WEIGHT FUSION
This section analyzes the impact of the cross-scale fusion (CSF) module on our network's performance. We evaluate three different weight-fusion strategies: concatenation, addition, and our proposed SAM. The experimental results are summarized in Table 3. The concatenation fusion approach has 32.0M parameters and achieves a 4.8% mIoU improvement over the baseline method. Alternatively, when the network employs an addition operation, the parameter count is slightly reduced to 31.4M while reaching a performance of 76.5% mIoU. The result shows that the addition operation significantly improves the accuracy. However, addition fusion is not optimal and can reduce the highlighted weights when they are combined with lower-quality predictions. Our proposed SAM is the core component of the CSF module and is used to enhance the weight fusion from different predictions. When the SAM module is employed in the architecture, the performance improves by 1.8% mIoU over the best result achieved by the other fusion methods. The results demonstrate that the CSF module equipped with SAM can significantly boost accuracy without increasing the model's parameter count.

D. RESULTS ON CITYSCAPES
We conduct quantitative experiments to verify the effectiveness of our proposed method on the Cityscapes dataset. In Table 4, we compare our method with other approaches regarding per-category accuracy. We group similar categories into the same subset to evaluate performance. All stuff classes are assigned to the uncountable set, and thing classes are divided into three subsets based on object size. As can be observed from Table 4, the results show that our MPFNet achieves better reliability across all subsets, i.e., across different object sizes. Most approaches achieve satisfactory results for uncountable categories, with the road class exceeding 96% accuracy. For the large group, we obtain outstanding results of 80.2% for trucks, 90.2% for buses, and 81.7% for trains, while the other methods struggle with these large objects. In spite of their good overall performance, RefineNet [32] and RelaxNet [33] only yield accuracies ranging from 56.5% to 72.7% for the large group. The medium-sized car class achieves around 90% accuracy for all methods. Compared to the well-known ENet [9], our method competes for the top place on the other medium-sized classes, where ENet struggles to segment these inputs. The most challenging subset consists of small-object classes, where many methods mislabel approximately 50% of the predictions. Despite this challenge, our proposed method maintains strong performance for these smaller objects. The quantitative experimental results validate the effectiveness of our multi-scale prediction with cross-scale fusion and demonstrate robust performance across a wide range of object sizes.

In Table 5, we compare our proposed method to other existing methods in terms of performance and efficiency. MPFNet obtains impressive results with 78.3% mIoU at 45 FPS on the Cityscapes dataset. Compared to the baseline FPN method [16], our approach demonstrates significant improvements, achieving a 7.0% increase in mIoU and a 27 FPS speedup. MPFNet also outperforms previous state-of-the-art networks. In particular, we surpass the ENet method [9] with a 20% mIoU enhancement and twice the inference speed. Compared to ADANet [28], the method with the second-best performance, MPFNet proves to be far more efficient, achieving three times the inference speed. Although AGLNet [29] addresses time consumption, our method still achieves an 8.2% higher mIoU while nearly matching their inference speed. Furthermore, MPFNet surpasses EdgeNet [30] with a 7.3% improvement in mIoU and a significantly faster inference speed.
While approaches such as SwiftNet [36], HyperSeg [37], and BiseNetV2 [38] have remarkable performance, our method still achieves higher mIoU scores, with improvements of 2.9%, 2.1%, and 2.5%, respectively. The results demonstrate that MPFNet achieves an effective balance between performance and efficiency, making it suitable for real-time semantic segmentation applications.

TABLE 5: Performance and efficiency comparison between our approach and other methods on the Cityscapes dataset.
Method Resolution mIoU FPS
ENet [9] 1024×2048 58.3 21
AGLNet [29] 1024×512 70.1 52
EdgeNet [30] 1024×512 71.0 30
DualNet [39] 1024×2048 75.5 51
SwiftNet [36] 1024×2048 75.4 39
ADANet [28] 1024×2048 77.3 15
HyperSeg [37] 1024×512 76.2 35
BiseNetV2 [38] 1024×512 75.8 47
FPN [16] 1024×2048 71.3 18
MPFNet (ours) 1024×2048 78.3 45

Figure 6 presents a qualitative comparison between the baseline method and our proposed MPFNet on the Cityscapes dataset. To complement the quantitative analysis, we selected examples that represent various object sizes for visualization. Comparative objects are highlighted with yellow boxes, and small-class objects are zoomed in for clarity. The results indicate that MPFNet accurately predicts the entire structure of large-size objects whereas the baseline method misclassifies some pixels. Both approaches perform well on medium-size objects such as cars. When the rider and bicycle classes overlap each other, MPFNet produces sharper and more precise segmentation. This improvement is attributed to the proposed cross-scale fusion and selective-attention mechanism, which refine predictions before they contribute to the final output. Small classes such as traffic lights, traffic signs, and poles have narrow structures and are challenging to detect. The baseline method fails to recognize these objects or completely loses the relevant information. On the other hand, MPFNet provides smoother and more distinct predictions for them. Overall, the qualitative results demonstrate that MPFNet achieves superior segmentation performance across multiple object sizes in complex street scenes, proving its robustness and reliability for real-world semantic segmentation tasks.

FIGURE 6: Comparative results of the baseline FPN method and our MPFNet approach on the Cityscapes validation set.

E. RESULTS ON MAPILLARY VISTAS
In this section, we conduct experiments on a complex dataset to evaluate the superiority of our model compared to other state-of-the-art approaches. Table 6 presents the results for all methods on a high-resolution input of 1024×2048 pixels. Our method achieves a performance of 45.9% mIoU. The baseline model [16], which does not employ our novel components, performs 5.7% lower than our approach. MPFNet also surpasses the accuracy of the other methods. In particular, our approach surpasses the DABNet method [40] by 16.3%. Although MSSNet [41] achieves competitive accuracy as one of the leading SOTA methods, our method still obtains a 3.3% mIoU improvement. Furthermore, our model achieves a 15.2% improvement over AGLNet [29] and a 4.2% improvement over RGPNet [42]. These results highlight the effectiveness of our proposed MPFNet in handling complex datasets and show higher accuracy compared to both the baseline and existing state-of-the-art approaches.

TABLE 6: Performance comparison between our approach and other methods on Mapillary Vistas.
Method Resolution mIoU (%)
DABNet [40] 1024×2048 29.6
AGLNet [29] 1024×2048 30.7
RGPNet [42] 1024×2048 41.7
MSSNet [41] 1024×2048 42.6
FPN [16] 1024×2048 40.2
MPFNet (ours) 1024×2048 45.9

The qualitative comparison between the baseline method and our proposed MPFNet on the Mapillary Vistas dataset is illustrated in Figure 7. Yellow boxes highlight specific objects for a detailed comparison. Because this dataset contains some challenging categories assigned as noise classes, such as ground animals or mailboxes, the overall performance across all methods is relatively limited. However, the visual results demonstrate that MPFNet produces good results for common classes in street scenes. By employing the SAM module, our approach refines and selects highlighted weights from all predictions. Our segmentation achieves significantly better quality compared to the baseline. For instance, where the baseline method struggles with misclassifying large objects like bridges, the MPFNet visualization practically matches the ground-truth labels. For the crosswalk and car classes, our method provides more precise segmentation than the FPN method. Some traffic lights are far from the camera and very narrow, yet MPFNet is able to produce clearer predictions whereas the baseline method often fails to detect these objects. Conclusively, the qualitative analysis indicates that MPFNet effectively handles challenging scenarios in complex street scenes.

FIGURE 7: Comparative results of the baseline FPN method and our MPFNet approach on Mapillary Vistas.

F. DISCUSSION
Based on the feature pyramid network (FPN), we upgraded the decoder part to enhance semantic segmentation performance and efficiency. The encoder of the FPN can provide both context and spatial information from multi-resolution features, so we re-used this encoder backbone. Firstly, we deployed the multi-scale prediction (MSP) module to improve feature extraction by incorporating contextual and spatial branches. This module enables the network to handle a wide range of object sizes in street scenes. The proposed MSP module not only significantly boosted segmentation accuracy but also slightly reduced model parameters compared to the original baseline decoder. Secondly, the cross-scale fusion (CSF) module further refines the segmentation results by effectively fusing weights from predictions at different scales.
When the CSF module is equipped with the selective-attention mechanism (SAM), our approach selects highlighted features and adjusts the contribution of each prediction to the final feature map while remaining computationally efficient. The limitation of our proposed method is that the selective-attention mechanism is used multiple times, so it can consume more resources. However, the proposed approach outperforms the baseline method and remains efficient for deployment in real-world scenarios.

V. CONCLUSION
A novel multiscale prediction network with cross-fusion, called MPFNet, is proposed in this study. The proposed method adopts the FPN framework and uses four feature layers to obtain both coarse and fine-grained information. The architecture deploys the MSP module to process the prominent characteristics of the feature pyramid. The MSP component improves model performance and reduces system computation. The CSF is designed to collect weights across multi-scale predictions. Within the CSF mechanism, the SAM is utilized to directly control the contribution of the higher adjacent predictions. The SAM module can remove bad values and select highlighted weights for the final output. We evaluated MPFNet on the Cityscapes and Mapillary Vistas datasets, and the experimental results show its effectiveness over the baseline and other SOTA approaches. Our method not only delivers high performance across a wide range of object sizes but also accelerates inference for real-time applications. In the future, we intend to improve the feature extraction in order to enhance semantic accuracy.

REFERENCES
[1] G. Rossolini, F. Nesti, G. D'Amico, S. Nair, A. Biondi, and G. Buttazzo, "On the real-world adversarial robustness of real-time semantic segmentation models for autonomous driving," IEEE Transactions on Neural Networks and Learning Systems, 2023.
[2] K. Li, W. Tao, and L. Liu, "Online semantic object segmentation for vision robot collected video," IEEE Access, vol. 7, pp. 107602–107615, 2019.
[3] H. J. Lee, J. U. Kim, S. Lee, H. G. Kim, and Y. M. Ro, "Structure boundary preserving segmentation for medical image with ambiguous boundary," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4817–4826.
[4] K. O'Shea and R. Nash, "An introduction to convolutional neural networks," arXiv preprint arXiv:1511.08458, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[7] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, "Denseaspp for semantic segmentation in street scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
[10] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[12] J. Liu, F. Zhang, Z. Zhou, and J. Wang, "Bfmnet: Bilateral feature fusion network with multi-scale context aggregation for real-time semantic segmentation," Neurocomputing, vol. 521, pp. 27–40, 2023.
[13] M. Zhuang, X. Zhong, D. Gu, L. Feng, X. Zhong, and H. Hu, "Lrdnet: A lightweight and efficient network with refined dual attention decorder for real-time semantic segmentation," Neurocomputing, vol. 459, pp. 349–360, 2021.
[14] M. Shi, J. Shen, Q. Yi, J. Weng, Z. Huang, A. Luo, and Y. Zhou, "Lmffnet: A well-balanced lightweight network for fast and accurate semantic segmentation," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[15] G. Gao, G. Xu, Y. Yu, J. Xie, J. Yang, and D. Yue, "Mscfnet: A lightweight network with multi-scale context fusion for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, 2021.
[16] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, "Feature pyramid network for multi-class land segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 272–275.
[17] B. Xie, Z. Yang, L. Yang, R. Luo, A. Wei, X. Weng, and B. Li, "Multi-scale fusion with matching attention model: A novel decoding network cooperated with nas for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 12622–12632, 2021.
[18] A. Tao, K. Sapra, and B. Catanzaro, "Hierarchical multi-scale attention for semantic segmentation," arXiv preprint arXiv:2005.10821, 2020.
[19] Z. Wu, C. Shen, and A. v. d. Hengel, "High-performance semantic segmentation using very deep fully convolutional networks," arXiv preprint arXiv:1604.04339, 2016.
[20] Z. Zhong, J. Li, W. Cui, and H. Jiang, "Fully convolutional networks for building and road extraction: Preliminary results," in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016, pp. 1591–1594.
[21] J. Fan, F. Wang, H. Chu, X. Hu, Y. Cheng, and B. Gao, "Mlfnet: Multi-level fusion network for real-time semantic segmentation of autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 756–767, 2022.
[22] S. Hao, Y. Zhou, Y. Guo, R. Hong, J. Cheng, and M. Wang, "Real-time semantic segmentation via spatial-detail guided context propagation," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[23] A. Fateh, M. R. Mohammadi, and M. R. J. Motlagh, "Msdnet: Multi-scale decoder for few-shot semantic segmentation via transformer-guided prototyping," arXiv preprint arXiv:2409.11316, 2024.
[24] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, "Multi-scale context intertwining for semantic segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 603–619.
[25] J. Fan, M. J. Bocus, B. Hosking, R. Wu, Y. Liu, S. Vityazev, and R. Fan, "Multi-scale feature fusion: Learning better semantic segmentation for road pothole detection," in 2021 IEEE International Conference on Autonomous Systems (ICAS). IEEE, 2021, pp. 1–5.
[26] X. Ding, C. Shen, T. Zeng, and Y. Peng, "Sab net: A semantic attention boosting framework for semantic segmentation," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[27] B. Zhang, W. Li, Y. Hui, J. Liu, and Y. Guan, "Mfenet: Multi-level feature enhancement network for real-time semantic segmentation," Neurocomputing, vol. 393, pp. 54–65, 2020.
[28] W. Wang, S. Wang, Y. Li, and Y. Jin, "Adaptive multi-scale dual attention network for semantic segmentation," Neurocomputing, vol. 460, pp. 39–49, 2021.
[29] Q. Zhou, Y. Wang, Y. Fan, X. Wu, S. Zhang, B. Kang, and L. J. Latecki, "Aglnet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network," Applied Soft Computing, vol. 96, p. 106682, 2020.
[30] H.-Y. Han, Y.-C. Chen, P.-Y. Hsiao, and L.-C. Fu, "Using channel-wise attention for deep cnn based real-time semantic segmentation with class-aware edge information," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 2, pp. 1041–1051, 2020.
[31] G. Li, S. Jiang, I. Yun, J. Kim, and J. Kim, "Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes," IEEE Access, vol. 8, pp. 27495–27506, 2020.
[32] G. Lin, A. Milan, C. Shen, and I. Reid, "Refinenet: Multi-path refinement networks for high-resolution semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[33] J. Liu, X. Xu, Y. Shi, C. Deng, and M. Shi, "Relaxnet: Residual efficient learning and attention expected fusion network for real-time semantic segmentation," Neurocomputing, vol. 474, pp. 115–127, 2022.
[34] P. Hu, F. Perazzi, F. C. Heilbron, O. Wang, Z. Lin, K. Saenko, and S. Sclaroff, "Real-time semantic segmentation with fast attention," IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 263–270, 2020.
[35] X. Xu, S. Huang, and H. Lai, "Lightweight semantic segmentation network leveraging class-aware contextual information," IEEE Access, 2023.
[36] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12607–12616.
[37] Y. Nirkin, L. Wolf, and T. Hassner, "Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
[38] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, "Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation," International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
[39] Q. Van Toan and M. Y. Kim, "Dual-inferences mechanism for real-time semantic segmentation," in 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2022, pp. 12–17.
[40] G. Li, I. Yun, J. Kim, and J. Kim, "Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation," arXiv preprint arXiv:1907.11357, 2019.
[41] Q. Van Toan and M. Y. Kim, "Multi-scale synergy approach for real-time semantic segmentation," in 2022 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). IEEE, 2022, pp. 216–220.
[42] E. Arani, S. Marzban, A. Pata, and B. Zonooz, "Rgpnet: A real-time general purpose semantic segmentation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3009–3018.

TOAN VAN QUYEN received the B.S. degree in Electrical Engineering from Thai Nguyen University of Technology, Thai Nguyen, Vietnam, in 2019, and the M.S. degree in Electronic and Electrical Engineering from the IT College, Kyungpook National University, Daegu, South Korea, in 2022. He is currently pursuing a Ph.D. degree in the School of Electronic and Electrical Engineering, Computer Science, Kyungpook National University. His research interests include computer vision, deep learning, and semantic segmentation.

MIN YOUNG KIM (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the Korea Advanced Institute of Science and Technology, South Korea, in 1996, 1998, and 2004, respectively. He was a Senior Researcher with Mirae Corporation from 2004 to 2005. He was also a Chief Research Engineer in artificial vision systems for intelligent machines and robots with Kohyoung Corporation from 2005 to 2009. Since 2009, he has been with the School of Electronic and Electrical Engineering, Computer Science, Kyungpook National University, as an Assistant Professor. He was a Visiting Associate Professor with the Department of Electrical and Computer Engineering and the School of Medicine, Johns Hopkins University, from 2014 to 2016. He is currently an Associate Professor with the School of Electronics Engineering, Kyungpook National University. He is also a Deputy Director with the KNU-LG Convergence Research Center and the Director of the Research Center for Neurosurgical Robotic Systems. His research interests include visual intelligence for robotic perception and recognition for autonomous unmanned ground and aerial vehicles.