Design of Swin Transformer for semantic segmentation of road scenes

Citation: Hang H, Huang Y P, Zhang X R, et al. Design of Swin Transformer for semantic segmentation of road scenes[J]. Opto-Electron Eng, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304

Fund Project: Supported by the National Natural Science Foundation of China (62276167)
Abstract: Semantic segmentation of road scenes is an important task in environment perception for autonomous driving. In recent years, the Transformer architecture has been applied to computer vision with very good results. To address the low segmentation accuracy on complex scene images and the weak recognition of small objects, this paper proposes a road-scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion. The network adopts an encoder-decoder structure: the encoder extracts features from road-scene images with an improved Swin Transformer feature extractor, and the decoder consists of an attention fusion module and a feature pyramid network, which fully fuse multi-scale semantic features. Validation on the Cityscapes urban road-scene dataset shows that, compared with several existing semantic segmentation algorithms, the proposed method achieves a considerable improvement in segmentation accuracy.
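
The feature pyramid network named in the abstract follows, in its usual form, a top-down design: deep, coarse features are upsampled and fused with shallower, finer ones through lateral 1×1 connections. The following is a minimal sketch of that standard scheme only; the channel sizes are illustrative (matching Swin-T stage widths), and the paper's actual decoder additionally interleaves its attention fusion module, which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Standard top-down feature pyramid over multi-scale encoder features."""
    def __init__(self, in_chs=(96, 192, 384, 768), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_chs)

    def forward(self, feats):          # feats: finest to coarsest resolution
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        outs = [lat[-1]]
        for f in reversed(lat[:-1]):   # upsample the coarser map, add the lateral feature
            up = F.interpolate(outs[0], size=f.shape[2:], mode="nearest")
            outs.insert(0, f + up)
        return [s(o) for s, o in zip(self.smooth, outs)]

# Toy multi-scale features at strides 4/8/16/32 of a 256x512 input
feats = [torch.randn(1, c, 64 // 2**i, 128 // 2**i)
         for i, c in enumerate((96, 192, 384, 768))]
print([tuple(o.shape) for o in FPN()(feats)])
```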

Overview: Semantic segmentation of road scenes is a crucial task in environment perception for autonomous driving. In recent years, deep learning has advanced research in semantic segmentation and produced numerous new algorithms. Deep learning methods train models on large amounts of data, extract features automatically, and have become the mainstream approach to semantic segmentation. Current deep learning algorithms for image semantic segmentation fall primarily into two categories: those based on CNNs and those based on Transformers. CNN-based algorithms such as FCN, PSPNet, U-Net, and DeepLab have made significant contributions to the field. The Transformer is a novel architecture based on self-attention, initially applied in the NLP domain. With its powerful feature extraction capability, the Transformer can capture long-range dependencies between feature vectors and thereby acquire richer contextual information. Researchers have gradually adapted Transformers to computer vision, producing various vision Transformers. Among these, the Swin Transformer stands out: it employs a hierarchical structure that outputs multi-scale features, computes local self-attention within windows, achieves information interaction between windows through shifted-window operations, and performs excellently across a range of visual tasks. Despite extensive research on semantic segmentation for road scenes, existing methods still face challenges in practice, such as low segmentation accuracy on complex scene images and inadequate recognition of small targets. To address these issues, this paper proposes a road-scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion. The network adopts an encoder-decoder structure. The encoder uses an improved Swin Transformer feature extractor that reduces information loss during downsampling and retains as many edge features as possible. The decoder consists of an attention fusion module and a feature pyramid network, which effectively integrate multi-scale semantic features and efficiently restore fine-grained details of urban road images. We conduct quantitative and qualitative experiments on the Cityscapes urban road-scene dataset. The results show that, compared with various existing semantic segmentation algorithms, our method achieves significant improvements in segmentation accuracy. However, the network structure is relatively complex, with a large number of parameters and computations; practical deployment will require further refinement, optimization of the network structure, and lightweight processing to reduce the parameter count and computational cost.
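
As a concrete illustration of the window mechanics described above, the sketch below partitions a feature map into non-overlapping windows, computes plain self-attention inside each window, and uses a cyclic shift (torch.roll) for the shifted-window pass. It follows the standard Swin Transformer formulation in simplified form (single head, no learned Q/K/V projections, no relative position bias, and no attention mask for cross-boundary pairs in the shifted pass), not the authors' exact code; all shapes and names are illustrative.

```python
import torch

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows*B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # Inverse of window_partition: (num_windows*B, ws*ws, C) -> (B, H, W, C)
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

def window_self_attention(x, ws, shift=0):
    """Plain single-head self-attention inside each (optionally shifted) window."""
    B, H, W, C = x.shape
    if shift:  # cyclic shift lets adjacent windows exchange information
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    win = window_partition(x, ws)                                  # (nW*B, ws*ws, C)
    attn = torch.softmax(win @ win.transpose(1, 2) / C ** 0.5, dim=-1)
    out = window_reverse(attn @ win, ws, H, W)
    if shift:  # undo the shift
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

x = torch.randn(1, 8, 8, 32)                 # (B, H, W, C), H and W divisible by ws
y = window_self_attention(x, ws=4, shift=2)  # SW-MSA-style pass
print(y.shape)                               # torch.Size([1, 8, 8, 32])
```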

Figure 1. Network architecture

Figure 2. Swin Transformer architecture

Figure 3. Swin Transformer block

Figure 4. Patch Merging module

Figure 5. Feature compression module (FCM)

Figure 6. Attention fusion module (AFM)

Figure 7. Comparison of segmentation results of multiple methods on Cityscapes scenes

Figure 8. Comparison of ablation experiment results
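
For reference, the Patch Merging module shown in Figure 4 is, in the baseline Swin Transformer design, a downsampling step that concatenates each 2×2 neighborhood of patches along the channel axis and reduces the result with a linear layer. The sketch below shows only that baseline module under illustrative shapes; the paper's improved extractor modifies this stage to reduce downsampling information loss, and those modifications are not reproduced here.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Baseline Swin patch merging: concatenate each 2x2 patch neighborhood
    (4C channels) and linearly reduce to 2C, halving spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):             # x: (B, H, W, C), H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                          # (B, H/2, W/2, 2C)

x = torch.randn(1, 8, 8, 96)
print(PatchMerging(96)(x).shape)      # torch.Size([1, 4, 4, 192])
```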

Table 1. Experimental environment

Item                    | Configuration   | Item            | Configuration
CPU                     | AMD 5600X       | CPU cores       | 6
GPU                     | NVIDIA RTX 3070 | Clock frequency | 3.7 GHz
RAM                     | 32 GB           | GPU memory      | 11 GB
Operating system        | Ubuntu 18.04    | Language        | Python 3.7
Deep learning framework | PyTorch 1.10.0  | CUDA            | 10.2

Table 2. IoU and MIoU of various models on the Cityscapes dataset

Classes       | FCN   | PSPNet | UNet  | DeepLabv3 | SwinT | Ours
Road          | 97.1  | 98.0   | 98.0  | 98.1      | 98.0  | 98.1
Sidewalk      | 79.9  | 81.8   | 84.2  | 84.5      | 84.7  | 86.2
Building      | 89.3  | 91.1   | 91.1  | 91.7      | 91.4  | 91.6
Wall          | 44.2  | 48.2   | 48.7  | 51.2      | 54.4  | 55.5
Fence         | 48.3  | 50.3   | 51.5  | 53.6      | 57.3  | 59.9
Pole          | 30.6  | 45.7   | 48.2  | 50.3      | 55.5  | 57.2
Traffic Light | 44.7  | 50.0   | 51.7  | 53.7      | 61.9  | 63.2
Traffic Sign  | 56.8  | 62.3   | 65.8  | 68.2      | 73.5  | 74.4
Vegetation    | 87.1  | 89.2   | 90.1  | 90.1      | 90.2  | 92.4
Terrain       | 60.4  | 62.8   | 65.3  | 64.2      | 61.3  | 63.2
Sky           | 90.8  | 94.2   | 93.8  | 95.3      | 94.2  | 95.1
Person        | 64.1  | 71.2   | 72.6  | 74.5      | 75.5  | 76.9
Rider         | 38.2  | 45.6   | 46.1  | 49.5      | 55.7  | 55.9
Car           | 90.4  | 92.0   | 92.2  | 92.6      | 93.8  | 93.5
Truck         | 51.3  | 68.5   | 63.4  | 74.4      | 73.6  | 72.5
Bus           | 72.0  | 80.3   | 77.6  | 83.2      | 79.4  | 79.9
Train         | 74.4  | 77.4   | 78.5  | 81.5      | 77.7  | 78.1
Motorcycle    | 52.5  | 50.1   | 55.5  | 53.5      | 56.5  | 59.2
Bicycle       | 59.1  | 60.1   | 63.4  | 64.2      | 71.2  | 73.2
MIoU/%        | 64.92 | 69.28  | 70.45 | 73.71     | 73.17 | 75.18

Table 3. PA and MPA of various models on the Cityscapes dataset

Classes       | FCN   | PSPNet | UNet  | DeepLabv3 | SwinT | Ours
Road          | 98.1  | 98.5   | 98.8  | 99.1      | 99.1  | 99.1
Sidewalk      | 89.9  | 89.3   | 90.2  | 92.0      | 91.2  | 92.7
Building      | 96.3  | 94.7   | 96.1  | 96.2      | 96.5  | 96.8
Wall          | 52.2  | 72.1   | 60.7  | 73.1      | 71.4  | 72.3
Fence         | 60.3  | 69.3   | 68.5  | 72.5      | 71.4  | 74.6
Pole          | 36.6  | 74.7   | 59.2  | 74.3      | 74.1  | 77.7
Traffic Light | 56.7  | 72.0   | 62.7  | 69.2      | 70.4  | 72.1
Traffic Sign  | 68.8  | 79.3   | 75.8  | 76.5      | 76.7  | 79.3
Vegetation    | 94.1  | 93.2   | 95.1  | 93.6      | 95.3  | 97.7
Terrain       | 74.4  | 79.8   | 78.3  | 78.1      | 79.2  | 80.3
Sky           | 95.8  | 97.2   | 97.8  | 97.5      | 97.5  | 97.9
Person        | 77.1  | 82.2   | 84.6  | 84.2      | 86.3  | 87.9
Rider         | 58.2  | 68.6   | 55.1  | 71.2      | 72.4  | 73.7
Car           | 96.4  | 96.0   | 96.2  | 96.3      | 97.6  | 97.6
Truck         | 62.3  | 79.5   | 76.4  | 75.5      | 73.5  | 76.2
Bus           | 85.0  | 87.3   | 89.6  | 91.7      | 85.6  | 87.7
Train         | 78.4  | 83.4   | 92.5  | 88.4      | 79.3  | 82.9
Motorcycle    | 66.5  | 73.5   | 67.5  | 77.5      | 77.3  | 79.2
Bicycle       | 77.1  | 73.1   | 80.4  | 76.2      | 80.3  | 84.2
MPA/%         | 74.64 | 79.97  | 80.06 | 82.31     | 81.59 | 84.83
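
The per-class IoU and PA scores in Tables 2 and 3 follow the standard definitions and can be computed from a class confusion matrix. A minimal sketch with hypothetical variable names (absent classes with zero ground-truth pixels would need to be masked out before averaging):

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # TP / (TP + FP + FN)
    pa = tp / conf.sum(axis=1)                             # per-class pixel accuracy
    return iou, iou.mean(), pa, pa.mean()                  # IoU, MIoU, PA, MPA

# Toy 3-class example
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 45]])
iou, miou, pa, mpa = segmentation_metrics(conf)
print(f"MIoU = {100 * miou:.2f}%, MPA = {100 * mpa:.2f}%")
```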

Table 4. Performance comparison of various semantic segmentation algorithms

Method    | MIoU/% | MPA/% | Param/M | FLOPs/G | FPS
FCN       | 64.92  | 74.64 | 34.90   | 66.38   | 58.61
PSPNet    | 69.28  | 79.97 | 51.86   | 152.97  | 81.25
UNet      | 70.45  | 80.06 | 49.10   | 166.92  | 54.52
DeepLabv3 | 73.71  | 82.31 | 68.37   | 235.37  | 36.59
SwinT     | 73.17  | 81.59 | 121.25  | 297.57  | 12.22
Ours      | 75.18  | 84.83 | 123.77  | 305.46  | 14.83
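
Figures like the Param/M and FPS columns of Table 4 can be obtained with standard PyTorch bookkeeping. A minimal sketch with a stand-in model and an assumed input size (not the authors' benchmarking script; on GPU, torch.cuda.synchronize() calls would be needed around the timed loop for accurate numbers):

```python
import time
import torch
import torch.nn as nn

def count_params_m(model):
    # Total trainable parameters, in millions (the Param/M column)
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 512, 1024), warmup=10, iters=50):
    # Average inference throughput in frames per second (the FPS column)
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):
        model(x)
    start = time.time()
    for _ in range(iters):
        model(x)
    return iters / (time.time() - start)

model = nn.Conv2d(3, 19, 3, padding=1)  # stand-in for a real segmentation network
print(f"Params: {count_params_m(model):.2f} M, FPS: {measure_fps(model):.2f}")
```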

Table 5. Ablation experiment

No. | AFM | FCM | ASPP | MIoU/% | MPA/%
1   | ×   | ×   | ×    | 73.1   | 81.6
2   | √   | ×   | ×    | 73.8   | 80.5
3   | √   | √   | ×    | 74.9   | 83.3
4   | √   | √   | √    | 75.2   | 84.8

Note: "√" indicates that the structure is included in the network; "×" indicates that it is removed from the network.
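
ASPP in Table 5 denotes atrous spatial pyramid pooling from the DeepLab family: parallel dilated convolutions at several rates plus a global-pooling branch, concatenated and projected back. A minimal sketch of the standard module, with hypothetical channel sizes and rates, not necessarily the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: a 1x1 branch, several dilated 3x3
    branches, and a global-pooling branch, fused by a 1x1 projection."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))

y = ASPP(256, 64)(torch.randn(1, 256, 32, 64))
print(y.shape)  # torch.Size([1, 64, 32, 64])
```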

Publication history
Received: 2023-12-14
Revised: 2024-01-23
Accepted: 2024-01-24
Published: 2024-01-25
