Design of Swin Transformer for semantic segmentation of road scenes

Citation: Hang H, Huang Y P, Zhang X R, et al. Design of Swin Transformer for semantic segmentation of road scenes[J]. Opto-Electron Eng, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304

Fund Project: Supported by the National Natural Science Foundation of China (62276167)
Abstract: Semantic segmentation of road scenes is an important task in environment perception for autonomous driving. In recent years, the Transformer architecture has been applied to computer vision with very good results. To address the low segmentation accuracy on complex scene images and the weak recognition of small objects, this paper proposes a road-scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion. The network adopts an encoder-decoder structure: the encoder extracts features from road-scene images with an improved Swin Transformer feature extractor, and the decoder consists of an attention fusion module and a feature pyramid network, which fully fuse multi-scale semantic features. Validation on the Cityscapes urban road-scene dataset shows that, compared with several existing semantic segmentation algorithms, the proposed method achieves a considerable improvement in segmentation accuracy.
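
The feature pyramid network named in the abstract follows, in its usual form, a top-down design: deep, coarse features are upsampled and fused with shallower, finer ones through lateral 1×1 connections. The following is a minimal sketch of that standard scheme only; the channel sizes are illustrative (matching Swin-T stage widths), and the paper's actual decoder additionally interleaves its attention fusion module, which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Standard top-down feature pyramid over multi-scale encoder features."""
    def __init__(self, in_chs=(96, 192, 384, 768), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_chs)

    def forward(self, feats):          # feats: finest to coarsest resolution
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        outs = [lat[-1]]
        for f in reversed(lat[:-1]):   # upsample the coarser map, add the lateral feature
            up = F.interpolate(outs[0], size=f.shape[2:], mode="nearest")
            outs.insert(0, f + up)
        return [s(o) for s, o in zip(self.smooth, outs)]

# Toy multi-scale features at strides 4/8/16/32 of a 256x512 input
feats = [torch.randn(1, c, 64 // 2**i, 128 // 2**i)
         for i, c in enumerate((96, 192, 384, 768))]
print([tuple(o.shape) for o in FPN()(feats)])
```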

Overview: Semantic segmentation of road scenes is a crucial task in environment perception for autonomous driving. In recent years, deep learning has advanced research in semantic segmentation and produced numerous new algorithms. Deep learning methods train models on large amounts of data, extract features automatically, and have become the mainstream approach to semantic segmentation. Current deep learning algorithms for image semantic segmentation fall primarily into two categories: those based on CNNs and those based on Transformers. CNN-based algorithms such as FCN, PSPNet, U-Net, and DeepLab have made significant contributions to the field. The Transformer is a novel architecture based on self-attention, initially applied in the NLP domain. With its powerful feature extraction capability, the Transformer can capture long-range dependencies between feature vectors and thereby acquire richer contextual information. Researchers have gradually adapted Transformers to computer vision, producing various vision Transformers. Among these, the Swin Transformer stands out: it employs a hierarchical structure that outputs multi-scale features, computes local self-attention within windows, achieves information interaction between windows through shifted-window operations, and performs excellently across a range of visual tasks. Despite extensive research on semantic segmentation for road scenes, existing methods still face challenges in practice, such as low segmentation accuracy on complex scene images and inadequate recognition of small targets. To address these issues, this paper proposes a road-scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion. The network adopts an encoder-decoder structure. The encoder uses an improved Swin Transformer feature extractor that reduces information loss during downsampling and retains as many edge features as possible. The decoder consists of an attention fusion module and a feature pyramid network, which effectively integrate multi-scale semantic features and efficiently restore fine-grained details of urban road images. We conduct quantitative and qualitative experiments on the Cityscapes urban road-scene dataset. The results show that, compared with various existing semantic segmentation algorithms, our method achieves significant improvements in segmentation accuracy. However, the network structure is relatively complex, with a large number of parameters and computations; practical deployment will require further refinement, optimization of the network structure, and lightweight processing to reduce the parameter count and computational cost.
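
As a concrete illustration of the window mechanics described above, the sketch below partitions a feature map into non-overlapping windows, computes plain self-attention inside each window, and uses a cyclic shift (torch.roll) for the shifted-window pass. It follows the standard Swin Transformer formulation in simplified form (single head, no learned Q/K/V projections, no relative position bias, and no attention mask for cross-boundary pairs in the shifted pass), not the authors' exact code; all shapes and names are illustrative.

```python
import torch

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows*B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # Inverse of window_partition: (num_windows*B, ws*ws, C) -> (B, H, W, C)
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

def window_self_attention(x, ws, shift=0):
    """Plain single-head self-attention inside each (optionally shifted) window."""
    B, H, W, C = x.shape
    if shift:  # cyclic shift lets adjacent windows exchange information
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    win = window_partition(x, ws)                                  # (nW*B, ws*ws, C)
    attn = torch.softmax(win @ win.transpose(1, 2) / C ** 0.5, dim=-1)
    out = window_reverse(attn @ win, ws, H, W)
    if shift:  # undo the shift
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

x = torch.randn(1, 8, 8, 32)                 # (B, H, W, C), H and W divisible by ws
y = window_self_attention(x, ws=4, shift=2)  # SW-MSA-style pass
print(y.shape)                               # torch.Size([1, 8, 8, 32])
```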

Figure 1. Network architecture

Figure 2. Swin Transformer architecture

Figure 3. Swin Transformer block

Figure 4. Patch Merging module

Figure 5. Feature compression module (FCM)

Figure 6. Attention fusion module (AFM)

Figure 7. Comparison of segmentation results of multiple methods on Cityscapes scenes

Figure 8. Comparison of ablation experiment results
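
For reference, the Patch Merging module shown in Figure 4 is, in the baseline Swin Transformer design, a downsampling step that concatenates each 2×2 neighborhood of patches along the channel axis and reduces the result with a linear layer. The sketch below shows only that baseline module under illustrative shapes; the paper's improved extractor modifies this stage to reduce downsampling information loss, and those modifications are not reproduced here.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Baseline Swin patch merging: concatenate each 2x2 patch neighborhood
    (4C channels) and linearly reduce to 2C, halving spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):             # x: (B, H, W, C), H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                          # (B, H/2, W/2, 2C)

x = torch.randn(1, 8, 8, 96)
print(PatchMerging(96)(x).shape)      # torch.Size([1, 4, 4, 192])
```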

Table 1. Experimental environment

Item                    | Configuration   | Item            | Configuration
CPU                     | AMD 5600X       | CPU cores       | 6
GPU                     | NVIDIA RTX 3070 | Clock frequency | 3.7 GHz
RAM                     | 32 GB           | GPU memory      | 11 GB
Operating system        | Ubuntu 18.04    | Language        | Python 3.7
Deep learning framework | PyTorch 1.10.0  | CUDA            | 10.2

Table 2. IoU and MIoU of various models on the Cityscapes dataset

Classes       | FCN   | PSPNet | UNet  | DeepLabv3 | SwinT | Ours
Road          | 97.1  | 98.0   | 98.0  | 98.1      | 98.0  | 98.1
Sidewalk      | 79.9  | 81.8   | 84.2  | 84.5      | 84.7  | 86.2
Building      | 89.3  | 91.1   | 91.1  | 91.7      | 91.4  | 91.6
Wall          | 44.2  | 48.2   | 48.7  | 51.2      | 54.4  | 55.5
Fence         | 48.3  | 50.3   | 51.5  | 53.6      | 57.3  | 59.9
Pole          | 30.6  | 45.7   | 48.2  | 50.3      | 55.5  | 57.2
Traffic Light | 44.7  | 50.0   | 51.7  | 53.7      | 61.9  | 63.2
Traffic Sign  | 56.8  | 62.3   | 65.8  | 68.2      | 73.5  | 74.4
Vegetation    | 87.1  | 89.2   | 90.1  | 90.1      | 90.2  | 92.4
Terrain       | 60.4  | 62.8   | 65.3  | 64.2      | 61.3  | 63.2
Sky           | 90.8  | 94.2   | 93.8  | 95.3      | 94.2  | 95.1
Person        | 64.1  | 71.2   | 72.6  | 74.5      | 75.5  | 76.9
Rider         | 38.2  | 45.6   | 46.1  | 49.5      | 55.7  | 55.9
Car           | 90.4  | 92.0   | 92.2  | 92.6      | 93.8  | 93.5
Truck         | 51.3  | 68.5   | 63.4  | 74.4      | 73.6  | 72.5
Bus           | 72.0  | 80.3   | 77.6  | 83.2      | 79.4  | 79.9
Train         | 74.4  | 77.4   | 78.5  | 81.5      | 77.7  | 78.1
Motorcycle    | 52.5  | 50.1   | 55.5  | 53.5      | 56.5  | 59.2
Bicycle       | 59.1  | 60.1   | 63.4  | 64.2      | 71.2  | 73.2
MIoU/%        | 64.92 | 69.28  | 70.45 | 73.71     | 73.17 | 75.18

Table 3. PA and MPA of various models on the Cityscapes dataset

Classes       | FCN   | PSPNet | UNet  | DeepLabv3 | SwinT | Ours
Road          | 98.1  | 98.5   | 98.8  | 99.1      | 99.1  | 99.1
Sidewalk      | 89.9  | 89.3   | 90.2  | 92.0      | 91.2  | 92.7
Building      | 96.3  | 94.7   | 96.1  | 96.2      | 96.5  | 96.8
Wall          | 52.2  | 72.1   | 60.7  | 73.1      | 71.4  | 72.3
Fence         | 60.3  | 69.3   | 68.5  | 72.5      | 71.4  | 74.6
Pole          | 36.6  | 74.7   | 59.2  | 74.3      | 74.1  | 77.7
Traffic Light | 56.7  | 72.0   | 62.7  | 69.2      | 70.4  | 72.1
Traffic Sign  | 68.8  | 79.3   | 75.8  | 76.5      | 76.7  | 79.3
Vegetation    | 94.1  | 93.2   | 95.1  | 93.6      | 95.3  | 97.7
Terrain       | 74.4  | 79.8   | 78.3  | 78.1      | 79.2  | 80.3
Sky           | 95.8  | 97.2   | 97.8  | 97.5      | 97.5  | 97.9
Person        | 77.1  | 82.2   | 84.6  | 84.2      | 86.3  | 87.9
Rider         | 58.2  | 68.6   | 55.1  | 71.2      | 72.4  | 73.7
Car           | 96.4  | 96.0   | 96.2  | 96.3      | 97.6  | 97.6
Truck         | 62.3  | 79.5   | 76.4  | 75.5      | 73.5  | 76.2
Bus           | 85.0  | 87.3   | 89.6  | 91.7      | 85.6  | 87.7
Train         | 78.4  | 83.4   | 92.5  | 88.4      | 79.3  | 82.9
Motorcycle    | 66.5  | 73.5   | 67.5  | 77.5      | 77.3  | 79.2
Bicycle       | 77.1  | 73.1   | 80.4  | 76.2      | 80.3  | 84.2
MPA/%         | 74.64 | 79.97  | 80.06 | 82.31     | 81.59 | 84.83
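
The per-class IoU and PA scores in Tables 2 and 3 follow the standard definitions and can be computed from a class confusion matrix. A minimal sketch with hypothetical variable names (absent classes with zero ground-truth pixels would need to be masked out before averaging):

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # TP / (TP + FP + FN)
    pa = tp / conf.sum(axis=1)                             # per-class pixel accuracy
    return iou, iou.mean(), pa, pa.mean()                  # IoU, MIoU, PA, MPA

# Toy 3-class example
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 45]])
iou, miou, pa, mpa = segmentation_metrics(conf)
print(f"MIoU = {100 * miou:.2f}%, MPA = {100 * mpa:.2f}%")
```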

Table 4. Performance comparison of various semantic segmentation algorithms

Method    | MIoU/% | MPA/% | Param/M | FLOPs/G | FPS
FCN       | 64.92  | 74.64 | 34.90   | 66.38   | 58.61
PSPNet    | 69.28  | 79.97 | 51.86   | 152.97  | 81.25
UNet      | 70.45  | 80.06 | 49.10   | 166.92  | 54.52
DeepLabv3 | 73.71  | 82.31 | 68.37   | 235.37  | 36.59
SwinT     | 73.17  | 81.59 | 121.25  | 297.57  | 12.22
Ours      | 75.18  | 84.83 | 123.77  | 305.46  | 14.83
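
Figures like the Param/M and FPS columns of Table 4 can be obtained with standard PyTorch bookkeeping. A minimal sketch with a stand-in model and an assumed input size (not the authors' benchmarking script; on GPU, torch.cuda.synchronize() calls would be needed around the timed loop for accurate numbers):

```python
import time
import torch
import torch.nn as nn

def count_params_m(model):
    # Total trainable parameters, in millions (the Param/M column)
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 512, 1024), warmup=10, iters=50):
    # Average inference throughput in frames per second (the FPS column)
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):
        model(x)
    start = time.time()
    for _ in range(iters):
        model(x)
    return iters / (time.time() - start)

model = nn.Conv2d(3, 19, 3, padding=1)  # stand-in for a real segmentation network
print(f"Params: {count_params_m(model):.2f} M, FPS: {measure_fps(model):.2f}")
```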

Table 5. Ablation experiment

No. | AFM | FCM | ASPP | MIoU/% | MPA/%
1   | ×   | ×   | ×    | 73.1   | 81.6
2   | √   | ×   | ×    | 73.8   | 80.5
3   | √   | √   | ×    | 74.9   | 83.3
4   | √   | √   | √    | 75.2   | 84.8

Note: "√" indicates that the structure is included in the network; "×" indicates that it is removed from the network.
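
ASPP in Table 5 denotes atrous spatial pyramid pooling from the DeepLab family: parallel dilated convolutions at several rates plus a global-pooling branch, concatenated and projected back. A minimal sketch of the standard module, with hypothetical channel sizes and rates, not necessarily the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: a 1x1 branch, several dilated 3x3
    branches, and a global-pooling branch, fused by a 1x1 projection."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))

y = ASPP(256, 64)(torch.randn(1, 256, 32, 64))
print(y.shape)  # torch.Size([1, 64, 32, 64])
```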

Publication history
Received: 2023-12-14
Revised: 2024-01-23
Accepted: 2024-01-24
Published: 2024-01-25
