Real-time object detection for UAV images based on improved YOLOv5s

Citation: Chen X, Peng D L, Gu Y. Real-time object detection for UAV images based on improved YOLOv5s[J]. Opto-Electron Eng, 2022, 49(3): 210372. doi: 10.12086/oee.2022.210372

  • Fund project: Natural Science Foundation of Zhejiang Province, China (LY21F030010)
  • *Corresponding author: Gu Yu, guyu@hdu.edu.cn
  • CLC number: TP391.41

  • Abstract: To address the complex backgrounds, high resolution, and large target-scale differences typical of UAV images, a real-time object detection algorithm named YOLOv5sm+ is proposed. First, the influence of network width and depth on UAV-image detection performance is analyzed, and an improved shallow network, YOLOv5sm, is designed on the basis of YOLOv5s: a residual dilated-convolution module that enlarges the receptive field is introduced to raise the utilization of spatial features and improve detection accuracy on UAV images. Second, a feature fusion module, SCAM, is designed, which improves the utilization of detail information through local-feature self-supervision and raises the classification accuracy of medium and large targets through effective multi-scale feature fusion. Finally, a detection-head structure that decouples target-position regression from classification is designed, further improving classification accuracy. Experiments on the VisDrone UAV aerial dataset show that, on the validation set, the proposed YOLOv5sm+ model reaches a mean average precision of 60.6% at an intersection-over-union threshold of 0.5 (mAP50), 4.8% higher than YOLOv5s and exceeding the accuracy of YOLOv5m, while also inferring faster. Transfer experiments on the DIOR remote sensing dataset also verify the effectiveness of the improved model. The proposed model has a low false-alarm rate and a high recognition rate for overlapping targets, and is well suited to object detection tasks in UAV images.
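The residual dilated-convolution module mentioned above is specified in Fig. 3; as a rough illustration of the idea (not the paper's exact block: the module name, normalization, and activation below are assumptions), a PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn

class ResDilatedConv(nn.Module):
    """Illustrative residual dilated-convolution block.

    A 3x3 convolution with dilation d covers the same area as a
    (2d+1)x(2d+1) kernel, enlarging the receptive field with no extra
    parameters, while the residual path preserves fine spatial detail.
    """
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x)  # residual sum keeps spatial detail

# Shape check: spatial size is preserved, so the skip connection is valid.
x = torch.randn(1, 96, 80, 80)
print(ResDilatedConv(96)(x).shape)  # torch.Size([1, 96, 80, 80])
```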

  • Overview: Real-time object detection from unmanned aerial vehicles (UAVs) has a wide range of military and civilian applications, including traffic monitoring and power-line inspection. Because UAV images have complex backgrounds, high resolution, and large scale differences between targets, satisfying detection-accuracy and real-time requirements simultaneously is one of the key problems to be solved. A balanced real-time detection algorithm based on YOLOv5s, named YOLOv5sm+, is therefore proposed in this paper. First, the influence of network width and depth on UAV-image detection performance was analyzed. Experiments on the VisDrone dataset show that, because the small targets in UAV images produce fewer internal feature mappings, detection performance improves with model width rather than model depth; in generic object-detection scenarios, by contrast, the semantic information gained as model depth grows improves detection accuracy. An improved shallow network based on YOLOv5s, named YOLOv5sm, was designed to improve detection accuracy on UAV images: a residual dilated-convolution module enlarges the receptive field and improves the utilization of the extracted spatial features. Then, a cross-stage attention feature-fusion module (SCAM) was designed, which improves the utilization of detail information through local-feature self-supervision and the classification accuracy of medium and large targets through effective feature fusion. Finally, a detection-head structure with decoupled regression and classification heads was proposed to further improve classification accuracy: the first stage completes the regression task, and the second stage uses a cross-stage convolution module to assist the classification task, easing the conflict between regression and classification and improving fine-grained classification accuracy. Combining the balanced lightweight feature-extraction network (YOLOv5sm), the cross-stage attention feature-fusion module (SCAM), and the improved detection head yields the proposed algorithm, YOLOv5sm+. Experiments on the VisDrone dataset show that the mean average precision at an intersection-over-union threshold of 0.5 (mAP50) of the proposed YOLOv5sm+ model reaches 60.6%; compared with YOLOv5s, mAP50 increases by 4.8%, and YOLOv5sm+ also has a higher detection speed. Transfer experiments on the DIOR remote sensing dataset also verify the effectiveness of the proposed model. The improved model has a low false-alarm rate and a high recognition rate for overlapping targets, and is well suited to object detection in UAV images.
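The two-stage decoupled head described in the overview can likewise be sketched. The version below is a generic decoupled head with parallel regression and classification branches; it reduces the paper's cross-stage convolution module (SDCM) in the classification stage to a plain convolution stack, and all layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative decoupled detection head for one feature level.

    Box regression (+ objectness) and classification run in separate
    branches so the two tasks do not compete for the same features.
    """
    def __init__(self, in_ch: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        # Stage 1: bounding-box regression (4 values) + objectness (1).
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * 5, 1),
        )
        # Stage 2: per-class scores (simplified stand-in for SDCM).
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * num_classes, 1),
        )

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.reg_branch(x), self.cls_branch(x)

head = DecoupledHead(in_ch=192, num_classes=10)  # VisDrone has 10 classes
reg, cls = head(torch.randn(1, 192, 40, 40))
print(reg.shape, cls.shape)  # [1, 15, 40, 40] and [1, 30, 40, 40]
```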

  • Figure 1. YOLOv5 backbone network architecture

    Figure 2. Structure of the feature fusion module

    Figure 3. (a) Res-DConv module; (b) receptive-field mapping

    Figure 4. Improved module structure

    Figure 5. YOLOv5sm+ model architecture

    Figure 6. (a) Total number of instances per category in the VisDrone dataset; (b) class confusion matrix of the YOLOv5m algorithm

    Figure 7. Detection examples of different algorithms in VisDrone UAV scenes

    Figure 8. Comparison of the detection results of three algorithms in dense-vehicle scenes

    Figure 9. Detection comparison of the improved algorithm on the DIOR dataset

    Table 1. Receptive-field analysis

    YOLOv5s module | RF | Channels | YOLOv5sm module | RF | Channels
    Focus | 6 | 32 | Conv 3*3 (stride 2) | 3 | 24
    – | – | – | Conv 3*3 (dilation 2) | 15 | 48
    Downsample | 10 | 64 | Conv 3*3 (stride 2) | 19 | 96
    – | – | – | Res-Block | 27 | 96
    C3_x1 | 18 | 64 | Res-DConv | 51 | 96
    Downsample | 26 | 128 | Conv 3*3 (stride 2) | 59 | 192
    C3_x3 | 74 | 128 | C3_x3 | 107 | 192
    Downsample | 90 | 256 | Conv 3*3 (stride 2) | 123 | 384
    C3_x3 | 186 | 256 | C3_x3 | 219 | 384
    Downsample | 218 | 512 | Conv 3*3 (stride 2) | 251 | 768
    SPP | 218~634 | 512 | SPP | 251~667 | 768
    C3_x1 | 282~698 | 512 | C3_x1 | 315~731 | 768
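The receptive-field values in Table 1 follow the standard recursion: a layer with kernel k, stride s, and dilation d grows the receptive field by (k − 1)·d·j, where the jump j is the product of the strides of all preceding layers. The helper below reproduces the first YOLOv5sm entries; modelling the dilated stage as a plain 3*3 followed by a dilated 3*3 is an assumption that matches the table's value of 15:

```python
def receptive_field(layers):
    """Receptive-field recursion over (kernel, stride, dilation) layers:
    r += (kernel - 1) * dilation * jump; jump *= stride."""
    r, j = 1, 1
    for kernel, stride, dilation in layers:
        r += (kernel - 1) * dilation * j
        j *= stride
    return r

# First entries of the YOLOv5sm column in Table 1:
print(receptive_field([(3, 2, 1)]))               # 3  <- Conv 3*3 (stride 2)
print(receptive_field([(3, 2, 1), (3, 1, 1),
                       (3, 1, 2)]))               # 15 <- plain + dilated 3*3
```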

    Table 2. Anchor presets matched to the receptive field and downsampling

    Downsampling level | 3 | 4 | 5
    Max receptive field/pixel | 111 | 255 | 731
    Anchor range | 8*8~37*37 | 32*32~85*85 | 96*96~365*365
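Table 2's design rule is that the anchor presets at each detection level should stay within that level's maximum receptive field. The short check below encodes the table, reading the downsampling factors 3 to 5 as pyramid strides 8/16/32 (an assumption):

```python
# Anchor presets per detection level, taken from Table 2.
levels = {
    3: {"max_rf": 111, "largest_anchor": (37, 37)},
    4: {"max_rf": 255, "largest_anchor": (85, 85)},
    5: {"max_rf": 731, "largest_anchor": (365, 365)},
}
for k, lvl in levels.items():
    w, h = lvl["largest_anchor"]
    assert max(w, h) <= lvl["max_rf"], f"level {k}: anchor exceeds RF"
    print(f"level {k} (stride {2 ** k}): largest anchor {w}x{h} "
          f"fits inside max RF {lvl['max_rf']} px")
```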

    Table 3. Statistics of object counts by size

    Object type | Small (0×0~32×32) | Mid (32×32~96×96) | Large (96×96~)
    Count (×10⁴) | 44.44 | 18.63 | 1.704

    Table 4. Performance comparison of models with different depth and width

    Depth | Width | mAP50 | mAP | BFLOPs
    0.33 | 0.5 | 0.502 | 0.288 | 16.5
    0.33 | 0.75 | 0.540 | 0.319 | 36.3
    1.33 | 0.5 | 0.525 | 0.311 | 35.4

    Table 5. Verification experiment results for the Res-DConv module

    Baseline | Res-DConv | mAP50 | mAP | BFLOPs
    ✓ | – | 0.502 | 0.288 | 16.5
    ✓ | ✓ | 0.516 | 0.299 | 19.8

    Table 6. Ablation experiment results of the proposed modules on the VisDrone dataset

    Baseline | SM | SCAM | SDCM | mAP | mAP50 | BFLOPs | Infer/ms | AP-small | AP-medium | AP-large
    YOLOv5s | – | – | – | 0.319 | 0.548 | 16.5 | 4.8 | 0.220 | 0.437 | 0.495
    ✓ | ✓ | – | – | 0.358 | 0.589 | 30.1 | 8.3 | 0.280 | 0.476 | 0.495
    ✓ | – | ✓ | – | 0.324 | 0.555 | 14.7 | 3.8 | 0.225 | 0.446 | 0.511
    ✓ | – | – | ✓ | 0.333 | 0.555 | 19.5 | 4.9 | 0.250 | 0.448 | 0.482
    ✓ | ✓ | – | ✓ | 0.356 | 0.593 | 38.0 | 9.0 | 0.278 | 0.475 | 0.512
    ✓ | ✓ | ✓ | ✓ | 0.360 | 0.596 | 30.8 | 7.7 | 0.281 | 0.479 | 0.505
    Note: bold in the original table marks the best value in each column.

    Table 7. Detection performance of different algorithms on the VisDrone dataset

    Algorithm | mAP50 | mAP | mAP75 | AP-small | AP-mid | AP-large | BFLOPs | Infer/ms
    YOLOv3 | 0.609 | 0.389 | 0.417 | 0.297 | 0.496 | 0.545 | 154.9 | 27.8
    Scaled-YOLOv4 | 0.620 | 0.400 | 0.428 | 0.305 | 0.514 | 0.626 | 119.4 | 27.1
    ClusDet[1] | 0.562 | 0.324 | 0.316 | – | – | – | – | –
    HRDNet[1] | 0.620 | 0.355 | 0.351 | – | – | – | – | –
    YOLOv5s | 0.548 | 0.319 | 0.317 | 0.220 | 0.437 | 0.495 | 16.5 | 4.8
    YOLOv5m | 0.595 | 0.365 | 0.372 | 0.285 | 0.482 | 0.525 | 50.4 | 9.8
    YOLOX-s | 0.535 | 0.314 | 0.317 | 0.225 | 0.415 | 0.485 | 41.6 | 55.1
    MobileNetv3 | 0.554 | 0.329 | 0.329 | 0.245 | 0.443 | 0.495 | 23.8 | 8.0
    MobileViT | 0.555 | 0.333 | 0.337 | 0.249 | 0.442 | 0.418 | – | 13.7
    YOLOv5sm+ | 0.596 | 0.360 | 0.369 | 0.281 | 0.479 | 0.505 | 30.8 | 7.7
    YOLOv5sm+* | 0.606 | 0.367 | 0.378 | 0.295 | 0.478 | 0.439 | – | –
    Note: + denotes the model with the improved modules added; * denotes multi-scale testing results; results for cited methods are taken from the literature.

    Table 8. Detection performance of different algorithms on the DIOR dataset

    Model | Backbone | mAP50
    Faster R-CNN[33] | VGG16 | 0.541
    PANet[20] | ResNet50 | 0.638
    RetinaNet[24] | ResNet50 | 0.685
    Ref. [32] | ResNet50 | 0.732
    CAT-Net[34] | ResNet50 | 0.763
    YOLOv5sm+ (ours) | – | 0.667
    Note: bold in the original table marks the best value in the column; results for other methods are taken from the cited literature.
  • [1] Wu X, Li W, Hong D F, et al. Deep learning for UAV-based object detection and tracking: a survey[EB/OL]. arXiv: 2110.12638. https://arxiv.org/abs/2110.12638.
    [2] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580–587. doi: 10.1109/CVPR.2014.81.
    [3] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779–788. doi: 10.1109/CVPR.2016.91.
    [4] Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 21–37. doi: 10.1007/978-3-319-46448-0_2.
    [5] Du D W, Zhu P F, Wen L Y, et al. VisDrone-DET2019: the vision meets drone object detection in image challenge results[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop, 2019: 213–226. doi: 10.1109/ICCVW.2019.00030.
    [6] Du D W, Qi Y K, Yu H Y, et al. The unmanned aerial vehicle benchmark: object detection and tracking[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 375–391. doi: 10.1007/978-3-030-01249-6_23.
    [7] Cai Z W, Vasconcelos N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 6154–6162.
    [8] Yang F, Fan H, Chu P, et al. Clustered object detection in aerial images[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 8310–8319. doi: 10.1109/ICCV.2019.00840.
    [9] Singh B, Najibi M, Davis L S. SNIPER: efficient multi-scale training[C]//Proceedings of the Annual Conference on Neural Information Processing Systems 2018, 2018: 9333–9343.
    [10] Wei Z W, Duan C Z, Song X H, et al. AMRNet: chips augmentation in aerial images object detection[EB/OL]. arXiv: 2009.07168. https://arxiv.org/abs/2009.07168.
    [11] Duan K W, Bai S, Xie L X, et al. CenterNet: keypoint triplets for object detection[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 6568–6577. doi: 10.1109/ICCV.2019.00667.
    [12] Howard A G, Zhu M L, Chen B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. arXiv: 1704.04861. https://arxiv.org/abs/1704.04861.
    [13] Wang R J, Li X, Ao S, et al. Pelee: a real-time object detection system on mobile devices[C]//Proceedings of the 6th International Conference on Learning Representations, 2018.
    [14] Tan M X, Pang R M, Le Q V. EfficientDet: scalable and efficient object detection[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10778–10787.
    [15] Han K, Wang Y H, Tian Q, et al. GhostNet: more features from cheap operations[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 1580–1589. doi: 10.1109/CVPR42600.2020.00165.
    [16] Li K, Wan G, Cheng G, et al. Object detection in optical remote sensing images: a survey and a new benchmark[J]. ISPRS J Photogramm Remote Sens, 2020, 159: 296−307. doi: 10.1016/j.isprsjprs.2019.11.023.
    [17] Wang C Y, Liao H Y M, Wu Y H, et al. CSPNet: a new backbone that can enhance learning capability of CNN[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020: 1571–1580. doi: 10.1109/CVPRW50498.2020.00203.
    [18] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Trans Pattern Anal Mach Intell, 2015, 37(9): 1904−1916. doi: 10.1109/TPAMI.2015.2389824.
    [19] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 936–944.
    [20] Liu S, Qi L, Qin H F, et al. Path aggregation network for instance segmentation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 8759–8768. doi: 10.1109/CVPR.2018.00913.
    [21] Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. arXiv: 2004.10934. https://arxiv.org/abs/2004.10934.
    [22] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection[C]//Proceedings of the 13th European Conference on Computer Vision, 2014: 834–849. doi: 10.1007/978-3-319-10590-1_54.
    [23] Woo S, Park J, Lee J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 3–19. doi: 10.1007/978-3-030-01234-2_1.
    [24] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, 2017: 2999–3007. doi: 10.1109/ICCV.2017.324.
    [25] Wu Y, Chen Y P, Yuan L, et al. Rethinking classification and localization for object detection[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10183–10192. doi: 10.1109/CVPR42600.2020.01020.
    [26] Chen X L, Fang H, Lin T Y, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL]. arXiv: 1504.00325. https://arxiv.org/abs/1504.00325.
    [27] Wang C Y, Bochkovskiy A, Liao H Y M. Scaled-YOLOv4: scaling cross stage partial network[C]//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 13024–13033. doi: 10.1109/CVPR46437.2021.01283.
    [28] Redmon J, Farhadi A. YOLOv3: an incremental improvement[EB/OL]. arXiv: 1804.02767. https://arxiv.org/abs/1804.02767.
    [29] Howard A, Sandler M, Chen B, et al. Searching for MobileNetV3[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 1314–1324. doi: 10.1109/ICCV.2019.00140.
    [30] Mehta S, Rastegari M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer[EB/OL]. arXiv: 2110.02178. https://arxiv.org/abs/2110.02178.
    [31] Ge Z, Liu S T, Wang F, et al. YOLOX: exceeding YOLO series in 2021[EB/OL]. arXiv: 2107.08430. https://arxiv.org/abs/2107.08430.
    [32] Yao Y Q, Cheng G, Xie X X, et al. Optical remote sensing image object detection based on multi-resolution feature fusion[J]. J Remote Sens, 2021, 25(5): 1124−1137. doi: 10.11834/jrs.20210505.
    [33] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Trans Pattern Anal Mach Intell, 2017, 39(6): 1137−1149. doi: 10.1109/TPAMI.2016.2577031.
    [34] Liu Y, Li H F, Hu C, et al. CATNet: context AggregaTion network for instance segmentation in remote sensing images[EB/OL]. arXiv: 2111.11057. https://arxiv.org/abs/2111.11057.

Publication history
Received: 2021-11-22
Revised: 2022-01-04
Published: 2022-03-25
