A multi-target semantic segmentation method for millimetre-wave SAR images based on a dual-branch multi-scale fusion network
-
Keywords:
- millimetre-wave synthetic aperture radar
- contraband detection
- deep learning
- semantic segmentation
- dual-branch multi-scale fusion network
Abstract: There are several major challenges in the detection and identification of contraband in millimetre-wave synthetic aperture radar (SAR) security imaging: small target sizes, partially occluded targets, and overlap between multiple targets, all of which hinder the accurate identification of contraband. To address these problems, a contraband detection method based on a dual-branch multi-scale fusion network (DBMFnet) is proposed. The overall architecture of the DBMFnet follows the encoder-decoder framework. In the encoder stage, a dual-branch parallel feature extraction network (DBPFEN) is proposed to enhance feature extraction. In the decoder stage, a multi-scale fusion module (MSFM) is proposed to enhance the detection ability for targets. The experimental results show that the proposed method outperforms existing semantic segmentation methods in mean intersection over union (mIoU) and reduces the incidence of missed and false detections of targets.
-
Overview: With the advancement of millimetre-wave technology, millimetre-wave security inspection systems have reached a high level of maturity. Compared with traditional security inspection technologies such as X-ray, infrared, and metal detectors, millimetre-wave security imaging not only enables the detection of metallic objects hidden under fabrics, but also identifies dangerous items such as plastic firearms, knives, and explosives. Crucially, millimetre waves are non-ionizing and do not harm the human body. Millimetre-wave security inspection therefore yields precise image information and significantly reduces false alarms, which is why millimetre-wave imaging equipment is extensively employed in human-body security inspection.
There are several major challenges in the detection and identification of contraband in millimetre-wave synthetic aperture radar (SAR) security imaging: small target sizes, partially occluded targets, and overlap between multiple targets, all of which hinder the accurate identification of contraband. To address these problems, a contraband detection method based on a dual-branch multi-scale fusion network (DBMFnet) is proposed. The overall architecture of the DBMFnet follows the encoder-decoder framework. In the encoder stage, a dual-branch parallel feature extraction network (DBPFEN) is proposed to enhance feature extraction. During feature extraction, one branch preserves the high resolution while the other extracts rich semantic information through multiple downsampling operations. Bilateral connections are established between the high-resolution and low-resolution branches to enable repeated feature exchange, so that the feature maps of the two branches are fused across different scales. This combination of rich semantic information and fine-grained detail improves the detection of small and mutually interfering targets in the images. In the decoder stage, a multi-scale fusion module (MSFM) is proposed to enhance the detection ability for targets. The module is built around the feature alignment module (FAM), which merges multiple low-resolution feature maps into the high-resolution map. The FAM is inspired by the optical flow used for motion alignment between adjacent video frames: feature maps $F_h$ and $F_l$ of different resolutions are taken as input and projected to the same number of channels by separate 1×1 convolutional layers; the low-resolution map $F_l$ is then upsampled by bilinear interpolation and concatenated with the high-resolution map $F_h$.
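As a rough illustration, the following PyTorch-style sketch shows one way the FAM described above could be realised. The module name, the flow-prediction convolution, the warping step, and the final fusion are assumptions inspired by the optical-flow alignment mentioned in the text, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignmentModule(nn.Module):
    """Sketch of the FAM: align a low-resolution map F_l to a high-resolution
    map F_h before fusion (flow prediction and warping are assumed details)."""

    def __init__(self, ch_high, ch_low, ch_out):
        super().__init__()
        # 1x1 convolutions project both inputs to the same channel count
        self.proj_h = nn.Conv2d(ch_high, ch_out, kernel_size=1)
        self.proj_l = nn.Conv2d(ch_low, ch_out, kernel_size=1)
        # Assumed: a 3x3 convolution predicts a 2-channel flow field from the
        # concatenated features, in the spirit of optical-flow alignment
        self.flow = nn.Conv2d(2 * ch_out, 2, kernel_size=3, padding=1)

    def forward(self, f_h, f_l):
        f_h = self.proj_h(f_h)
        # Bilinear up-sampling brings F_l to the spatial size of F_h
        f_l = F.interpolate(self.proj_l(f_l), size=f_h.shape[-2:],
                            mode="bilinear", align_corners=False)
        flow = self.flow(torch.cat([f_h, f_l], dim=1))
        f_l = self._warp(f_l, flow)
        # Concatenate the aligned low-resolution features with F_h;
        # a later convolution in the decoder is assumed to fuse them
        return torch.cat([f_h, f_l], dim=1)

    @staticmethod
    def _warp(x, flow):
        # Shift a normalised sampling grid by the predicted flow and resample
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=x.device),
            torch.linspace(-1.0, 1.0, w, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) * 2.0 / torch.tensor(
            [w, h], dtype=x.dtype, device=x.device)
        return F.grid_sample(x, grid + offset, mode="bilinear",
                             align_corners=False)
```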
The experimental results show that, on the HM-SAR dataset, the proposed model improves mIoU by 2.54% over the best-performing existing semantic segmentation model. The ablation study shows that the proposed MSFM effectively improves the mIoU value.
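For reference, the mIoU reported in Tables 2 and 3 is the mean of the per-class intersection over union. A minimal sketch of how it can be computed from predicted and ground-truth label maps is shown below; the class count and label convention are illustrative, not taken from the paper.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Compute per-class IoU and their mean (mIoU) from integer label maps.

    pred and target have the same shape and values in [0, num_classes);
    the background is treated as an ordinary class.
    """
    # Confusion matrix: rows = ground truth, columns = prediction
    conf = np.bincount(num_classes * target.reshape(-1) + pred.reshape(-1),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)  # avoid division by zero
    return iou, iou.mean()


# Illustrative use with 5 classes (background, hammer, wrench, pistol, knife)
if __name__ == "__main__":
    pred = np.random.randint(0, 5, size=(512, 512))
    target = np.random.randint(0, 5, size=(512, 512))
    per_class_iou, miou = mean_iou(pred, target, num_classes=5)
    print(per_class_iou, miou)
```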
-
Figure 6. Test results of each model. Each row shows the results for the same test image, and each column the results of the same model. Black denotes the background, red the hammer, green the wrench, yellow the pistol, and blue the knife.
Table 1. Architectures of DBFEN
Stage | Output | DBFEN
Conv1 | 256×256 | 3×3, 64, stride 2
Conv2 | 128×128 | 3×3, 64, stride 2
Conv3 | 64×64 | [3×3, 64; 3×3, 128] × 2
Conv4 | 64×64 | [3×3, 128; 3×3, 128] × 2
Conv5 | 32×32 | [3×3, 128; 3×3, 256] × 2
Conv6 | 64×64 | [3×3, 128; 3×3, 128] × 2
Conv7 | 16×16 | [3×3, 256; 3×3, 512] × 2
Conv8 | 64×64 | [3×3, 128; 3×3, 256] × 2
Conv9 | 8×8 | [3×3, 512; 3×3, 1024] × 2
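To make the stage layout of Table 1 easier to read, here is a rough PyTorch sketch of the two branches it implies: Conv4/Conv6/Conv8 keep the 64×64 resolution while Conv5/Conv7/Conv9 keep halving it. The input size (single-channel 512×512), stride placement, BatchNorm/ReLU usage, the omission of the ×2 block repetition, and the omission of the bilateral cross-branch fusion are all simplifying assumptions, not the authors' implementation.

```python
import torch.nn as nn


def conv_block(c_in, c_mid, c_out, stride=1):
    # Assumed realisation of one "[3x3, a; 3x3, b] x 2" entry (single pass only)
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 3, stride=stride, padding=1),
        nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )


class DualBranchStem(nn.Module):
    """Sketch of the stage layout in Table 1 (512x512, 1-channel input assumed).

    The high-resolution branch (Conv4, Conv6, Conv8) stays at 64x64, while the
    low-resolution branch (Conv5, Conv7, Conv9) downsamples to 32x32, 16x16
    and 8x8; the repeated cross-branch feature exchange is omitted here.
    """

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 3, stride=2, padding=1)   # -> 256x256
        self.conv2 = nn.Conv2d(64, 64, 3, stride=2, padding=1)  # -> 128x128
        self.conv3 = conv_block(64, 64, 128, stride=2)          # -> 64x64
        # High-resolution branch
        self.conv4 = conv_block(128, 128, 128)                  # 64x64
        self.conv6 = conv_block(128, 128, 128)                  # 64x64
        self.conv8 = conv_block(128, 128, 256)                  # 64x64
        # Low-resolution branch
        self.conv5 = conv_block(128, 128, 256, stride=2)        # -> 32x32
        self.conv7 = conv_block(256, 256, 512, stride=2)        # -> 16x16
        self.conv9 = conv_block(512, 512, 1024, stride=2)       # -> 8x8

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))
        high = self.conv8(self.conv6(self.conv4(x)))
        low = self.conv9(self.conv7(self.conv5(x)))
        return high, low
```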
Table 2. Comparisons of the segmentation performance of each model in the HM-SAR dataset
Network model | MPA/% | mIoU/% | F1/%
U-net | 80.29 | 70.35 | 81.87
Pspnet | 82.98 | 72.32 | 83.28
FCN-8s | 81.29 | 72.11 | 83.11
Deeplabv3+ | 81.05 | 70.58 | 82.00
HRnet-v2 | 82.33 | 72.90 | 83.69
DBMFnet (ours) | 85.01 | 75.44 | 85.21
Table 3. Comparisons of the objects segmentation performance of each model in the HM-SAR dataset
Class | U-net Pre/IoU | Pspnet Pre/IoU | Deeplabv3+ Pre/IoU | HRnet-v2 Pre/IoU | FCN-8s Pre/IoU | DBMFnet (ours) Pre/IoU
Hammer | 80.74/61.98 | 76.49/63.70 | 80.15/63.99 | 79.93/67.35 | 79.16/65.17 | 81.91/69.33
Wrench | 82.66/66.78 | 82.88/71.84 | 80.61/66.57 | 78.80/66.15 | 84.04/69.56 | 84.22/75.24
Pistol | 75.63/63.77 | 77.30/64.21 | 75.45/62.65 | 85.71/69.47 | 81.07/65.81 | 87.89/70.56
Knife | 78.59/59.40 | 81.36/62.01 | 78.82/59.84 | 81.67/61.68 | 80.06/60.16 | 82.55/66.15
Table 4. Calculation complexity and inference speed of each model
Network model | Params/M | GFLOPs | Speed/(f/s)
U-net | 24.89 | 452.31 | 32
Pspnet | 46.70 | 118.43 | 33.5
FCN-8s | 32.95 | 277.74 | 16
Deeplabv3+ | 54.71 | 166.87 | 21
HRnet | 29.55 | 80.18 | 11.5
DBMFnet (ours) | 19.54 | 47.36 | 26
Table 5. Comparisons of models using different decoder modules
Network model | mIoU/% | Params/M | GFLOPs
Baseline | 72.61 | 23.15 | 38.78
Deeplabv3+ (FCM) | 70.58 | 54.71 | 166.87
FCN-8s (FDM) | 72.11 | 32.95 | 277.74
Baseline+FCM | 74.10 | 22.44 | 100.80
Baseline+FDM | 73.16 | 21.65 | 45.27
Baseline+MSFM | 75.44 | 23.06 | 47.86