Abstract
Semantic segmentation networks are often too large to deploy on memory-constrained edge devices. To address this problem, a lightweight real-time semantic segmentation algorithm based on BiLevelNet is proposed. First, dilated convolutions are employed to enlarge the receptive field, combined with a feature reuse strategy to strengthen the network's region awareness. Next, a two-stage PBRA (Partial Bi-Level Routing Attention) mechanism is embedded to establish dependencies between distant objects, enhancing the network's global perception. Finally, the FADE operator, which incorporates shallow features, is introduced to improve the quality of image upsampling. Experimental results show that, at an input resolution of 512×1024, the proposed network achieves a mean intersection over union (mIoU) of 75.1% on the Cityscapes dataset at 121 frames per second, with a model size of only 0.7 M. At an input resolution of 360×480, it achieves 68.2% mIoU on the CamVid dataset. Compared with other real-time semantic segmentation methods, the network strikes a balance between speed and accuracy, meeting the real-time requirements of applications such as autonomous driving.
Key words:
- real-time semantic segmentation
- autonomous driving
- deep learning
- self-attention
- upsampling
Overview
In response to the challenge posed by the large parameter sizes of semantic segmentation networks, which complicate deployment on memory-constrained edge devices, a lightweight real-time semantic segmentation algorithm based on BiLevelNet is proposed. First, dilated convolutions are utilized to broaden the receptive field, combined with a feature reuse strategy to strengthen the network's region awareness. Then, a two-stage PBRA (Partial Bi-Level Routing Attention) mechanism is adopted to form connections between distant objects, enhancing the network's ability to perceive global context. Finally, the FADE operator is introduced to merge shallow features, improving the quality of image upsampling.
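As context for the receptive-field claim above: a $k \times k$ convolution with dilation rate $d$ covers the same spatial extent as a kernel of effective size

$$k_{\mathrm{eff}} = k + (k-1)(d-1),$$

so the 3×3 kernels with dilation 8 used in stage 3 (Table 1) span 17×17 pixels without adding any parameters.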
The AFR module depicted in Fig. 4 presents feature maps at several levels, together with descriptions of their characteristics and roles: the input feature map, the local feature map produced by a 3×3 depthwise convolution, and the contextual feature map produced by a dilated convolution. The final fused feature map combines these features effectively, showing strong activation in both local and global contexts. In addition, a gradually decreasing channel reduction factor is employed, as detailed in Table 3. By gradually adjusting the reduction factor, it is observed that at r = 1/4 the PBRA module improves mIoU by 1.3% (75.5% vs. 74.2%) and speed by 12 FPS (128 vs. 116) compared with standard BRA (r = 1).
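To make the fusion concrete, the following PyTorch sketch pairs a 3×3 depthwise branch (local detail) with a dilated depthwise branch (context) and fuses them with a pointwise projection. The class name `AFRBlock`, the channel widths, and the concatenation-plus-residual fusion are illustrative assumptions, not the paper's exact AFR design:

```python
import torch
import torch.nn as nn

class AFRBlock(nn.Module):
    """Illustrative sketch of an AFR-style block: a local depthwise branch
    plus a dilated depthwise branch, fused and projected back. The exact
    AFR topology in the paper may differ."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        # Local branch: 3x3 depthwise convolution captures fine detail.
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Context branch: dilated depthwise convolution enlarges the receptive field.
        self.context = nn.Conv2d(channels, channels, 3, padding=dilation,
                                 dilation=dilation, groups=channels)
        # Pointwise projection mixes the two branches across channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.local(x), self.context(x)], dim=1)
        return x + self.fuse(y)  # residual connection enables feature reuse

x = torch.randn(1, 64, 128, 256)          # stage-2 feature map size from Table 1
print(AFRBlock(64, dilation=2)(x).shape)  # torch.Size([1, 64, 128, 256])
```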
Moreover, when bilinear interpolation is used for upsampling, discontinuities and missing pixels appear in the segmentation results. Inspection of the deep feature maps before bilinear upsampling reveals that the features of roads and sidewalks are similar, which can lead to misclassification. To counteract this, shallow features that preserve edge information are merged into the FADE upsampling process, improving edge segmentation. This effectively compensates for the loss of spatial information and yields smoother, better-defined segmentation boundaries.
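A minimal sketch of this idea, under an assumed gated-fusion formulation: the published FADE operator [27] instead predicts content-aware reassembly kernels from both encoder and decoder features, so the module below (`GatedShallowFusion`, with assumed channel sizes) only illustrates how shallow, edge-preserving features can be blended into upsampled deep features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedShallowFusion(nn.Module):
    """Simplified stand-in for FADE-style decoding: upsample deep features,
    then let a learned gate decide where shallow edge features should
    dominate. Not the official FADE operator."""
    def __init__(self, deep_ch: int, shallow_ch: int, out_ch: int):
        super().__init__()
        self.proj_deep = nn.Conv2d(deep_ch, out_ch, 1)
        self.proj_shallow = nn.Conv2d(shallow_ch, out_ch, 1)
        # Gate predicted from both sources; values in (0, 1).
        self.gate = nn.Sequential(nn.Conv2d(2 * out_ch, 1, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        deep = F.interpolate(self.proj_deep(deep), size=shallow.shape[-2:],
                             mode='bilinear', align_corners=False)
        shallow = self.proj_shallow(shallow)
        g = self.gate(torch.cat([deep, shallow], dim=1))
        # Near boundaries the gate can favour shallow features that keep edges sharp.
        return g * shallow + (1 - g) * deep

deep = torch.randn(1, 128, 64, 128)     # stage-3 output size (Table 1)
shallow = torch.randn(1, 32, 256, 512)  # stage-1 output with edge detail
print(GatedShallowFusion(128, 32, 32)(deep, shallow).shape)  # [1, 32, 256, 512]
```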
Experimental results show that, at an input resolution of 512×1024, the network attains a mean intersection over union (mIoU) of 75.1% on the Cityscapes dataset at 121 frames per second, with a model size of only 0.7 M. At an input resolution of 360×480, the network achieves 68.2% mIoU on the CamVid dataset. Compared with other real-time semantic segmentation methods, this network maintains a balance between speed and accuracy, fulfilling the real-time requirements of applications such as autonomous driving.
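For reference, the mIoU figures above are the per-class IoU averaged over the evaluation classes (19 for Cityscapes, 11 for CamVid); a minimal computation from a pixel-level confusion matrix looks like this:

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    inter = np.diag(conf).astype(float)                  # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter  # TP + FP + FN
    iou = inter / np.maximum(union, 1)                   # guard against empty classes
    return float(iou.mean())

# Toy 2-class example: perfect on class 0, half of class 1 missed.
conf = np.array([[100,  0],
                 [ 50, 50]])
print(mean_iou(conf))  # (100/150 + 50/100) / 2 ≈ 0.583
```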
Table 1. Network framework of BiLevelNet
| Stage | Operator | Mode | Output size |
|---|---|---|---|
| Stage 1 | 3×3 Conv | Stride 2 | 32×256×512 |
| | 3×3 Conv | Stride 1 | 32×256×512 |
| | 3×3 Conv | Stride 1 | 32×256×512 |
| Stage 2 | AFR-S | | 64×128×256 |
| | 2 × AFR | Dilated 2 | 64×128×256 |
| Stage 3 | AFR-S | | 128×64×128 |
| | 4 × AFR | Dilated 4 | 128×64×128 |
| | 5 × AFR | Dilated 8 | 128×64×128 |
| Decoder | DAF | | 32×256×512 |
| | 1×1 Conv | Stride 1 | 19×256×512 |
| | Bilinear | | 19×512×1024 |
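Reading Table 1 top to bottom, the encoder could be assembled roughly as below, reusing the illustrative `AFRBlock` sketched earlier; the downsampling AFR-S module and the DAF decoder are replaced by plain strided-convolution and 1×1 stand-ins, so this is a structural sketch only, not the paper's implementation:

```python
import torch.nn as nn

def conv_bn(cin, cout, stride=1):
    # 3x3 convolution + BN + ReLU, the pattern used in stage 1 of Table 1
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

stage1 = nn.Sequential(conv_bn(3, 32, stride=2), conv_bn(32, 32), conv_bn(32, 32))
stage2 = nn.Sequential(conv_bn(32, 64, stride=2),                  # AFR-S stand-in
                       *[AFRBlock(64, dilation=2) for _ in range(2)])
stage3 = nn.Sequential(conv_bn(64, 128, stride=2),                 # AFR-S stand-in
                       *[AFRBlock(128, dilation=4) for _ in range(4)],
                       *[AFRBlock(128, dilation=8) for _ in range(5)])
head = nn.Conv2d(32, 19, 1)  # 19 Cityscapes classes; follows the DAF decoder,
                             # then bilinear 2x upsampling restores 512x1024
```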
Table 2. Performance comparison of different feature extraction modules on the Cityscapes dataset
| Module | Params/M | FLOPs/G | FPS | mIoU/% |
|---|---|---|---|---|
| SSnbt | 0.83 | 11.61 | 132 | 67.1 |
| DAB | 0.75 | 10.78 | 140 | 71.8 |
| AFR | 0.68 | 9.64 | 128 | 75.5 |
Table 3. Experimental results of different reduction factors on the Cityscapes validation set
| Ratio r | Params/M | FLOPs/G | FPS | mIoU/% |
|---|---|---|---|---|
| 0 | 0.67 | 9.59 | 135 | 74.0 |
| 1 | 0.74 | 10.25 | 116 | 74.2 |
| 1/2 | 0.69 | 9.75 | 120 | 75.0 |
| 1/4 | 0.68 | 9.64 | 128 | 75.5 |
| 1/8 | 0.67 | 9.61 | 130 | 75.1 |
| 1/16 | 0.67 | 9.6 | 131 | 74.1 |
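One plausible reading of the reduction factor r, in the spirit of partial convolution [17], is that only ⌊rC⌋ of the C channels are routed through attention while the rest pass through unchanged; r = 0 then disables attention and r = 1 recovers full BRA. The sketch below uses standard multi-head self-attention as a stand-in for bi-level routing attention, so it illustrates only the channel split, not the paper's PBRA:

```python
import torch
import torch.nn as nn

class PartialAttention(nn.Module):
    """Attend over only a fraction r of the channels (illustrative stand-in
    for PBRA; plain self-attention replaces bi-level routing attention)."""
    def __init__(self, channels: int, r: float = 0.25, heads: int = 4):
        super().__init__()
        self.ca = int(channels * r)  # channels routed through attention
        self.attn = nn.MultiheadAttention(self.ca, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xa, xp = x[:, :self.ca], x[:, self.ca:]  # attended / passthrough split
        seq = xa.flatten(2).transpose(1, 2)      # (B, H*W, r*C) token sequence
        out, _ = self.attn(seq, seq, seq)
        xa = out.transpose(1, 2).reshape(b, self.ca, h, w)
        return torch.cat([xa, xp], dim=1)        # attention cost scales with r

x = torch.randn(1, 128, 16, 32)                  # small map keeps the demo cheap
print(PartialAttention(128, r=0.25)(x).shape)    # torch.Size([1, 128, 16, 32])
```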
Table 4. Ablation study of the FADE upsampling operator on the Cityscapes validation set
| Upsampling | Params/M | FLOPs/G | FPS | mIoU/% |
|---|---|---|---|---|
| Bilinear | 0.68 | 9.64 | 128 | 75.5 |
| FADE | 0.7 | 10.4 | 121 | 75.9 |
Table 5. Performance comparison of different models on the Cityscapes dataset
| Algorithm | Size | Params/M | FLOPs/G | FPS | mIoU/% |
|---|---|---|---|---|---|
| ENet | 512×1024 | 0.36 | 4.35 | 42 | 58.3 |
| ERFNet | 512×1024 | 2.10 | 26.8 | 59 | 68.0 |
| LEDNet | 512×1024 | 0.94 | 11.5 | 71 | 69.2 |
| DABNet | 512×1024 | 0.76 | - | 104 | 70.1 |
| ELANet [29] | 512×1024 | 0.67 | 9.7 | 93 | 74.7 |
| RELAXNet | 512×1024 | 1.90 | 22.84 | 64 | 74.8 |
| DALNet [30] | 512×1024 | 0.48 | - | 74 | 71.1 |
| BiSeNet V2 | 512×1024 | 3.40 | 21.2 | 156 | 72.6 |
| MIFNet [31] | 512×1024 | 0.82 | 12.03 | 74 | 73.1 |
| Ref. [32] | 512×1024 | 6.22 | 12.5 | 154.7 | 74.2 |
| Ours | 512×1024 | 0.70 | 10.4 | 121 | 75.1 |
Table 6. Evaluation results of per-class IoU /% on the Cityscapes dataset
| Class | ERFNet | DABNet | LEDNet | FDDWNet | Ours |
|---|---|---|---|---|---|
| Road | 97.9 | 96.8 | 97.1 | 98.0 | 98.0 |
| Sidewalk | 82.1 | 78.5 | 78.6 | 82.4 | 82.2 |
| Building | 90.7 | 90.9 | 90.4 | 91.1 | 91.8 |
| Wall | 45.2 | 45.3 | 46.5 | 52.5 | 54.8 |
| Fence | 50.4 | 50.1 | 48.1 | 51.2 | 56.5 |
| Pole | 59.0 | 59.1 | 60.9 | 59.9 | 63.2 |
| Traffic light | 62.6 | 65.2 | 60.4 | 64.4 | 68.4 |
| Traffic sign | 68.4 | 70.7 | 71.1 | 68.9 | 72.1 |
| Vegetation | 91.9 | 92.5 | 91.2 | 92.5 | 92.8 |
| Terrain | 69.4 | 68.1 | 60.0 | 70.3 | 70.5 |
| Sky | 94.2 | 94.6 | 93.2 | 94.4 | 94.5 |
| Pedestrian | 78.5 | 80.5 | 74.3 | 80.8 | 82.3 |
| Rider | 59.8 | 58.5 | 51.8 | 59.8 | 65.2 |
| Car | 93.4 | 92.7 | 92.3 | 94.0 | 94.3 |
| Truck | 52.5 | 52.7 | 61.0 | 56.5 | 59.2 |
| Bus | 60.8 | 67.2 | 72.4 | 68.9 | 78.5 |
| Train | 53.7 | 50.9 | 51.0 | 48.6 | 73.9 |
| Motorcycle | 49.9 | 50.4 | 43.3 | 55.7 | 57.9 |
| Bicycle | 64.2 | 65.7 | 70.2 | 67.7 | 70.2 |
Table 7. Performance comparison on the CamVid dataset
| Algorithm | Size | Pretrain | Params/M | mIoU/% |
|---|---|---|---|---|
| ENet | 360×480 | N | 0.36 | 51.3 |
| CGNet | 360×480 | N | 0.5 | 64.7 |
| DALNet | 360×480 | N | 0.47 | 66.1 |
| LEDNet | 360×480 | N | 0.94 | 66.6 |
| DABNet | 360×480 | N | 0.76 | 66.4 |
| MIFNet | 360×480 | N | 0.81 | 67.7 |
| ELANet | 360×480 | N | 0.67 | 67.9 |
| BiSeNet V2 | 360×480 | Y | 5.8 | 68.7 |
| Ours | 360×480 | N | 0.7 | 68.2 |
References
[1] Li L H, Qian B, Lian J, et al. Traffic scene segmentation based on RGB-D image and deep learning[J]. IEEE Trans Intell Transp Syst, 2017, 19(5): 1664−1669. doi: 10.1109/TITS.2017.2724138
[2] Liang L M, Lu B H, Long P W, et al. Adaptive feature fusion cascade transformer retinal vessel segmentation algorithm[J]. Opto-Electron Eng, 2023, 50(10): 230161. doi: 10.12086/oee.2023.230161 (in Chinese)
[3] Min F, Peng W M, Kuang Y G, et al. A remote sensing ground object segmentation algorithm based on non-subsampled contourlet transform[J]. Electron Opt Control, 2023, 30(11): 49−55. doi: 10.3969/j.issn.1671-637X.2023.11.008 (in Chinese)
[4] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2881–2890. https://doi.org/10.1109/CVPR.2017.660.
[5] Zhang W B, Qu J, Wang W, et al. An improved Deeplab v3+ image semantic segmentation algorithm incorporating multi-scale features[J]. Electron Opt Control, 2022, 29(11): 12−16,30. doi: 10.3969/j.issn.1671-637X.2022.11.003 (in Chinese)
[6] Howard A, Sandler M, Chen B, et al. Searching for MobileNetV3[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 1314–1324. https://doi.org/10.1109/ICCV.2019.00140.
[7] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3213–3223. https://doi.org/10.1109/CVPR.2016.350.
[8] Brostow G J, Fauqueur J, Cipolla R. Semantic object classes in video: a high-definition ground truth database[J]. Pattern Recognit Lett, 2009, 30(2): 88−97. doi: 10.1016/j.patrec.2008.04.005
[9] Yu C Q, Gao C X, Wang J B, et al. BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation[J]. Int J Comput Vis, 2021, 129(11): 3051−3068. doi: 10.1007/s11263-021-01515-2
[10] Zhuang M X, Zhong X Y, Gu D B, et al. LRDNet: a lightweight and efficient network with refined dual attention decorder for real-time semantic segmentation[J]. Neurocomputing, 2021, 459: 349−360. doi: 10.1016/j.neucom.2021.07.019
[11] Romera E, Álvarez J M, Bergasa L M, et al. ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation[J]. IEEE Trans Intell Transp Syst, 2018, 19(1): 263−272. doi: 10.1109/TITS.2017.2750080
[12] Liu J, Zhou Q, Qiang Y, et al. FDDWNet: a lightweight convolutional neural network for real-time semantic segmentation[C]//Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 2373–2377. https://doi.org/10.1109/ICASSP40776.2020.9053838.
[13] Liu J, Xu X Q, Shi Y Q, et al. RELAXNet: residual efficient learning and attention expected fusion network for real-time semantic segmentation[J]. Neurocomputing, 2022, 474: 115−127. doi: 10.1016/j.neucom.2021.12.003
[14] Lin S L, Peng X L, Lin J P, et al. Object detection of steel surface defect based on multi-scale enhanced feature fusion[J]. Opt Precision Eng, 2024, 32(7): 1076−1086. doi: 10.37188/OPE.20243207.1075 (in Chinese)
[15] Wang Y, Zhou Q, Liu J, et al. Lednet: a lightweight encoder-decoder network for real-time semantic segmentation[C]//Proceedings of 2019 IEEE International Conference on Image Processing, 2019: 1860–1864. https://doi.org/10.1109/ICIP.2019.8803154.
[16] Wei H R, Liu X, Xu S C, et al. DWRSeg: dilation-wise residual network for real-time semantic segmentation[Z]. arXiv: 2212.01173, 2023. https://arxiv.org/abs/2212.01173v1.
[17] Chen J R, Kao S H, He H, et al. Run, don't walk: chasing higher FLOPS for faster neural networks[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 12021–12031. https://doi.org/10.1109/CVPR52729.2023.01157.
[18] Ma N N, Zhang X Y, Zheng H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 116–131. https://doi.org/10.1007/978-3-030-01264-9_8.
[19] Woo S, Park J, Lee J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 3–19. https://doi.org/10.1007/978-3-030-01234-2_1.
[20] Zhang C, Huang Y P, Guo Z Y, et al. Real-time lane detection method based on semantic segmentation[J]. Opto-Electron Eng, 2022, 49(5): 210378. doi: 10.12086/oee.2022.210378 (in Chinese)
[21] Huang Z L, Wang X G, Huang L C, et al. CCNet: criss-cross attention for semantic segmentation[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 603–612. https://doi.org/10.1109/ICCV.2019.00069.
[22] Wu G, Ge Y, Chu J, et al. Cascade pooling self-attention research for remote sensing image retrieval[J]. Opto-Electron Eng, 2022, 49(12): 220029. doi: 10.12086/oee.2022.220029 (in Chinese)
[23] Xia Z F, Pan X R, Song S J, et al. Vision transformer with deformable attention[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 4794–4803. https://doi.org/10.1109/CVPR52688.2022.00475.
[24] Zhu L, Wang X J, Ke Z H, et al. BiFormer: vision transformer with Bi-level routing attention[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 10323–10333. https://doi.org/10.1109/CVPR52729.2023.00995.
[25] Wang J Q, Chen K, Xu R, et al. CARAFE: content-aware ReAssembly of FEatures[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 3007–3016. https://doi.org/10.1109/ICCV.2019.00310.
[26] Liu C J, Qiao Z, Yan H W, et al. Semantic segmentation network for remote sensing image based on multi-scale mutual attention[J]. J Zhejiang Univ (Eng Sci), 2023, 57(7): 1335−1344. doi: 10.3785/j.issn.1008-973X.2023.07.008 (in Chinese)
[27] Lu H, Liu W Z, Fu H T, et al. FADE: fusing the assets of decoder and encoder for task-agnostic upsampling[C]//Proceedings of the 17th European Conference on Computer Vision, 2022: 231–247. https://doi.org/10.1007/978-3-031-19812-0_14.
[28] Li H C, Xiong P F, Fan H Q, et al. DFANet: deep feature aggregation for real-time semantic segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9522–9531. https://doi.org/10.1109/CVPR.2019.00975.
[29] Yi Q M, Dai G S, Shi M, et al. ELANet: effective lightweight attention-guided network for real-time semantic segmentation[J]. Neural Process Lett, 2023, 55(5): 6425−6442. doi: 10.1007/s11063-023-11145-z
[30] Shi M, Shen J L, Yi Q M, et al. Rapid and ultra-lightweight semantic segmentation in urban traffic scene[J]. J Front Comput Sci Technol, 2022, 16(10): 2377−2386. doi: 10.3778/j.issn.1673-9418.2203015 (in Chinese)
[31] Yi Q M, Zhang W T, Shi M, et al. Semantic segmentation for road scene based on multiscale feature fusion[J]. Laser Optoelectron Prog, 2023, 60(12): 1210006. doi: 10.3788/LOP220914 (in Chinese)
[32] Lan J P, Dong F L, Yang Y H, et al. Real-time image semantic segmentation network algorithm based on improved STDC-Seg[J]. Transducer Microsyst Technol, 2023, 42(11): 110−113,118. doi: 10.13873/J.1000-9787(2023)11-0110-04 (in Chinese)