Abstract: Infrared-visible person re-identification is widely used in video surveillance, intelligent transportation, security, and related fields. However, the discrepancy between image modalities makes the task highly challenging. Existing methods mainly focus on mitigating the modality discrepancy to obtain more discriminative features, but they ignore the relationship between adjacent-level features and the influence of multi-scale information on global features. To address these shortcomings, this paper proposes an infrared-visible person re-identification method based on multi-feature aggregation (MFANet). First, adjacent-level features are fused in the feature extraction stage, guiding low-level feature information into the high-level features to strengthen them and make them more robust. Then, multi-scale features from different receptive fields are aggregated to obtain rich contextual information. Finally, the multi-scale features serve as guidance to enhance the features and make them more discriminative. Experimental results on the SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method; on SYSU-MM01, it reaches a rank-1 accuracy of 71.77% in the most difficult all-search single-shot mode.
Key words:
- person re-identification
- infrared
- multi-scale
- adjacent-level features
Overview: Infrared-visible person re-identification is a prominent research topic in computer vision. It draws on multi-modal perception technology, the general challenges of person re-identification, practical application demands, and the development of datasets and evaluation metrics. With the emergence of multi-modal perception technology, its primary objective is to fuse information from different modalities effectively so as to improve the accuracy and robustness of person re-identification. Person re-identification already faces variations in viewpoint, pose, occlusion, and lighting conditions; as a cross-modal task, infrared-visible person re-identification must additionally cope with the gap between modalities. The technology has broad application prospects in video surveillance, security, intelligent transportation, and related fields, and it is particularly well suited to low-light or nighttime environments. The development of relevant datasets and evaluation metrics has driven ongoing improvement in algorithms and systems, and this broad base of supporting work provides a foundation for improving the performance and practical effectiveness of person re-identification. As researchers continue to explore the problem, accuracy has steadily improved. However, the discrepancy between image modalities still poses great challenges. Existing methods mainly focus on mitigating the modality discrepancy to obtain more discriminative features, but they ignore the relationship between adjacent-level features and the influence of multi-scale information on global features.
To address these shortcomings, an infrared-visible person re-identification method based on multi-feature aggregation (MFANet) is proposed. First, adjacent-level features are fused in the feature extraction stage, guiding low-level feature information into the high-level features to strengthen them and make them more robust. Then, multi-scale features from different receptive fields are aggregated to obtain rich contextual information. Finally, the multi-scale features serve as guidance to enhance the features and make them more discriminative. Experimental results on the SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method: on SYSU-MM01, rank-1 accuracy reaches 71.77% in the all-search single-shot mode and 78.24% in the indoor-search single-shot mode.
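The multi-scale aggregation step described above can be illustrated with a toy sketch. This is not the MFANet implementation: it assumes a 1-D feature sequence, uses mean pooling over odd window sizes as a stand-in for branches with different receptive fields, and assumes the branches are combined by element-wise averaging. The function names `pool_with_window` and `aggregate_multi_scale` are hypothetical, and the default windows (1, 3) only echo the setting label reported in Table 4; whether that label denotes kernel sizes or dilation rates is not stated here.

```python
from typing import List, Sequence


def pool_with_window(features: Sequence[float], window: int) -> List[float]:
    """Mean-pool a 1-D feature sequence with an odd window, keeping its length.

    Positions near the border average only the values that exist (no padding).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    pooled = []
    for i in range(len(features)):
        lo = max(0, i - half)
        hi = min(len(features), i + half + 1)
        neighbourhood = features[lo:hi]
        pooled.append(sum(neighbourhood) / len(neighbourhood))
    return pooled


def aggregate_multi_scale(
    features: Sequence[float], windows: Sequence[int] = (1, 3)
) -> List[float]:
    """Run one pooling branch per window and average the branches element-wise.

    Averaging is an assumed aggregation rule chosen for illustration only.
    """
    branches = [pool_with_window(features, w) for w in windows]
    return [sum(values) / len(branches) for values in zip(*branches)]


if __name__ == "__main__":
    # A window of 1 keeps the input; a window of 3 smooths it with neighbours.
    print(aggregate_multi_scale([0.0, 3.0, 6.0, 3.0]))
```

A larger window lets each output position see more of its neighbourhood, which is the sense in which the combined output carries context from several scales at once.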
Table 1. Comparison results on the SYSU-MM01 dataset (%)

| Method | All search rank-1 | All search rank-10 | All search rank-20 | All search mAP | All search mINP | Indoor search rank-1 | Indoor search rank-10 | Indoor search rank-20 | Indoor search mAP | Indoor search mINP |
|---|---|---|---|---|---|---|---|---|---|---|
| One-stream[14] | 12.04 | 49.68 | 66.74 | 13.67 | - | 16.94 | 63.55 | 82.10 | 22.95 | - |
| Two-stream[14] | 11.65 | 47.99 | 65.50 | 12.85 | - | 15.60 | 61.18 | 81.02 | 21.49 | - |
| Zero-Padding[14] | 14.80 | 54.12 | 71.33 | 15.59 | - | 20.58 | 68.38 | 85.79 | 26.92 | - |
| HCML[18] | 14.32 | 53.16 | 69.17 | 16.16 | - | 24.52 | 73.25 | 86.73 | 30.08 | - |
| BDTR[17] | 27.32 | 66.96 | 81.07 | 27.32 | - | 31.92 | 77.18 | 89.28 | 41.86 | - |
| D2RL[4] | 28.90 | 70.60 | 82.40 | 29.20 | - | - | - | - | - | - |
| AlignGAN[19] | 42.03 | 85.25 | 93.73 | 41.48 | - | 45.86 | 90.17 | 95.39 | 55.18 | - |
| AGW[16] | 47.58 | 84.45 | 92.11 | 47.69 | 35.30 | 54.29 | 91.14 | 95.99 | 63.02 | 59.23 |
| Xmodal[20] | 49.92 | 89.79 | 95.96 | 50.73 | - | - | - | - | - | - |
| DDAG[8] | 53.61 | 89.17 | 95.30 | 52.02 | 39.62 | 58.37 | 91.92 | 97.42 | 65.44 | 62.61 |
| CM-NAS[21] | 62.04 | 92.92 | 97.31 | 60.00 | - | 67.03 | 97.02 | 99.34 | 72.97 | - |
| CAJ[9] | 68.23 | 95.59 | 98.49 | 65.32 | 53.61 | 74.01 | 97.79 | 99.67 | 78.52 | 76.79 |
| MPANet[6] | 70.07 | 95.39 | 98.39 | 67.07 | - | 76.35 | 97.56 | 99.48 | 80.16 | - |
| PIC[23] | 57.5 | - | - | 55.1 | - | 60.4 | - | - | 67.7 | - |
| DART[24] | 68.72 | 96.39 | 98.96 | 66.29 | 53.26 | 72.52 | 97.84 | 99.46 | 78.17 | 74.94 |
| SPOT[10] | 65.34 | 92.73 | 97.04 | 62.25 | 48.86 | 69.42 | 96.22 | 99.12 | 74.63 | 70.48 |
| DML[7] | 58.40 | 91.20 | 95.80 | 56.10 | - | 62.40 | 95.20 | 98.70 | 69.50 | - |
| PMT[12] | 67.53 | 95.36 | 98.64 | 64.98 | 51.86 | 71.66 | 96.73 | 99.25 | 76.52 | 72.74 |
| SFANet[25] | 65.74 | 92.98 | 97.05 | 60.83 | - | 71.60 | 96.60 | 99.45 | 80.05 | - |
| SIDA[26] | 68.36 | 95.91 | 98.56 | 64.19 | - | 73.28 | 97.35 | 99.52 | 77.49 | - |
| MTMFE[27] | 69.47 | 96.42 | 99.11 | 66.41 | - | 72.56 | 96.98 | 99.20 | 76.58 | - |
| Ours | 71.77 | 96.15 | 98.7 | 68.43 | 55.21 | 78.24 | 98.23 | 99.49 | 81.9 | 78.44 |
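Table 2. Comparison results on the RegDB dataset (%)

| Method | Visible to infrared rank-1 | Visible to infrared rank-10 | Visible to infrared rank-20 | Visible to infrared mAP | Visible to infrared mINP | Infrared to visible rank-1 | Infrared to visible rank-10 | Infrared to visible rank-20 | Infrared to visible mAP | Infrared to visible mINP |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Padding[14] | 17.75 | 34.21 | 44.35 | 18.90 | - | 16.63 | 34.68 | 44.25 | 17.82 | - |
| HCML[18] | 24.44 | 47.53 | 56.78 | 20.08 | - | 21.70 | 45.02 | 55.58 | 22.24 | - |
| BDTR[17] | 33.56 | 58.61 | 67.43 | 32.76 | - | 32.92 | 58.46 | 68.43 | 31.96 | - |
| D2RL[4] | 43.40 | 66.10 | 76.30 | 44.10 | - | - | - | - | - | - |
| AGW[16] | 70.05 | 86.21 | 91.55 | 66.37 | 50.19 | 70.49 | 87.21 | 91.84 | 65.90 | 51.24 |
| Xmodal[20] | 62.21 | 83.13 | 91.72 | 60.18 | - | - | - | - | - | - |
| DDAG[8] | 69.34 | 85.77 | 89.98 | 63.19 | 49.24 | 64.77 | 83.85 | 88.90 | 58.54 | 48.62 |
| CM-NAS[21] | 84.54 | 95.18 | 97.85 | 80.32 | - | 82.56 | 94.52 | 97.37 | 78.31 | - |
| MCLNet[22] | 80.31 | 92.70 | 96.03 | 73.07 | - | 75.93 | 90.93 | 94.59 | 69.49 | - |
| CAJ[9] | 84.72 | 95.17 | 97.38 | 78.70 | 65.33 | 84.09 | 94.79 | 97.11 | 77.25 | 61.56 |
| PIC[23] | 83.6 | - | - | 79.6 | - | 79.5 | - | - | 77.4 | - |
| DART[24] | 78.23 | - | - | 67.04 | 48.36 | 75.04 | - | - | 64.38 | 43.32 |
| SPOT[10] | 80.35 | 93.48 | 96.44 | 72.46 | 56.19 | 79.37 | 92.79 | 96.01 | 72.26 | 56.06 |
| DML[7] | 77.60 | - | - | 84.30 | - | 77.00 | - | - | 83.60 | - |
| PMT[12] | 84.83 | - | - | 76.55 | - | 84.16 | - | - | 75.13 | - |
| SFANet[25] | 76.31 | 91.02 | 94.27 | 68.00 | - | 70.15 | 85.24 | 89.27 | 63.77 | - |
| SIDA[26] | 81.73 | - | 96.55 | 75.07 | - | 79.71 | - | 95.47 | 72.60 | - |
| MTMFE[27] | 85.04 | 94.38 | 97.22 | 82.52 | - | 81.11 | 92.35 | 96.19 | 79.59 | - |
| Ours | 85.38 | 95.39 | 97.54 | 79.49 | 65.72 | 84.58 | 95.27 | 97.23 | 78.02 | 62.22 |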
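Table 3. Ablation study of four different settings on the SYSU-MM01 dataset (%)

| AFAM | MSAM | All search rank-1 | All search rank-10 | All search rank-20 | All search mAP | All search mINP | Indoor search rank-1 | Indoor search rank-10 | Indoor search rank-20 | Indoor search mAP | Indoor search mINP |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  | 68.23 | 95.59 | 98.49 | 65.32 | 51.90 | 74.01 | 97.79 | 99.67 | 78.52 | 74.78 |
| √ |  | 69.30 | 95.69 | 98.41 | 65.95 | 52.14 | 75.27 | 97.84 | 99.48 | 79.51 | 75.79 |
|  | √ | 70.89 | 95.88 | 98.52 | 67.61 | 54.3 | 77.69 | 97.43 | 99.25 | 81.09 | 77.48 |
| √ | √ | 71.77 | 96.15 | 98.70 | 68.43 | 55.21 | 78.24 | 98.23 | 99.49 | 81.90 | 78.44 |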
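Table 4. Receptive field analysis of the multi-scale feature aggregation module (%)

| Receptive field | All search rank-1 | All search rank-10 | All search rank-20 | All search mAP | All search mINP | Indoor search rank-1 | Indoor search rank-10 | Indoor search rank-20 | Indoor search mAP | Indoor search mINP |
|---|---|---|---|---|---|---|---|---|---|---|
| 1,3,5,7 | 69.41 | 95.82 | 98.54 | 66.27 | 52.96 | 75.34 | 97.96 | 99.66 | 79.98 | 76.53 |
| 1,2,3,4 | 70.17 | 95.51 | 98.51 | 67.14 | 54.1 | 76.44 | 98.09 | 99.6 | 80.65 | 77.22 |
| 1,3,5 | 70.5 | 95.77 | 98.54 | 67.11 | 53.68 | 76.66 | 97.81 | 99.51 | 80.54 | 76.99 |
| 1,2,3 | 70.53 | 95.86 | 98.67 | 67.21 | 53.77 | 76.59 | 97.88 | 99.37 | 80.68 | 77.22 |
| 1,3 | 71.77 | 96.15 | 98.70 | 68.43 | 55.21 | 78.24 | 98.23 | 99.49 | 81.90 | 78.44 |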
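The tables report rank-k accuracy, mAP, and mINP. As a rough guide to how such numbers are typically computed, here is a minimal sketch under stated assumptions: each query's gallery ranking has already been reduced to a list of booleans (True where the gallery image shares the query identity), and the camera filtering used by dataset protocols is omitted. The function names are hypothetical, and the definitions follow common re-identification usage rather than the evaluation code behind these tables.

```python
from typing import Dict, List, Sequence


def rank_k_hit(matches: Sequence[bool], k: int) -> float:
    """Return 1.0 if a correct gallery image appears among the top-k results."""
    return 1.0 if any(matches[:k]) else 0.0


def average_precision(matches: Sequence[bool]) -> float:
    """Mean of the precision values taken at each correct match position."""
    hits = 0
    precisions = []
    for position, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def inverse_negative_penalty(matches: Sequence[bool]) -> float:
    """Number of correct matches divided by the rank of the hardest (last) one."""
    positions = [p for p, m in enumerate(matches, start=1) if m]
    return len(positions) / positions[-1] if positions else 0.0


def evaluate(
    queries: List[Sequence[bool]], ks: Sequence[int] = (1, 10, 20)
) -> Dict[str, float]:
    """Average the per-query scores and report them as percentages."""
    n = len(queries)
    report = {
        f"rank-{k}": 100.0 * sum(rank_k_hit(q, k) for q in queries) / n
        for k in ks
    }
    report["mAP"] = 100.0 * sum(average_precision(q) for q in queries) / n
    report["mINP"] = 100.0 * sum(inverse_negative_penalty(q) for q in queries) / n
    return report


if __name__ == "__main__":
    # Two toy queries, each with a four-image ranked gallery.
    toy_queries = [[True, False, True, False], [False, True, False, False]]
    print(evaluate(toy_queries, ks=(1, 2)))
```

Rank-k only asks whether any correct match appears early, mAP rewards placing all correct matches high, and mINP penalizes the position of the hardest correct match, which is why a method can lead on one metric and trail on another.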
[1] Liu L, Li X, Lei X M. A person re-identification method with multi-scale and multi-feature fusion[J]. J Comput-Aided Des Comput Graphics, 2022, 34(12): 1868–1876. doi: 10.3724/SP.J.1089.2022.19218
[2] Shi Y X, Zhou Y. Person re-identification based on stepped feature space segmentation and local attention mechanism[J]. J Electron Inf Technol, 2022, 44(1): 195–202. doi: 10.11999/JEIT201006
[3] Wang S, Ji P, Zhang Y Z, et al. Adaptive receptive network for person re-identification[J]. Control Decis, 2022, 37(1): 119–126. doi: 10.13195/j.kzyjc.2020.0505
[4] Wang Z X, Wang Z, Zheng Y Q, et al. Learning to reduce dual-level discrepancy for infrared-visible person re-identification[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 618–626. doi: 10.1109/CVPR.2019.00071
[5] Zhong X, Lu T Y, Huang W X, et al. Grayscale enhancement colorization network for visible-infrared person re-identification[J]. IEEE Trans Circ Syst Video Technol, 2022, 32(3): 1418–1430. doi: 10.1109/TCSVT.2021.3072171
[6] Wu Q, Dai P Y, Chen J, et al. Discover cross-modality nuances for visible-infrared person re-identification[C]//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 4330–4339. doi: 10.1109/CVPR46437.2021.00431
[7] Zhang D M, Zhang Z Z, Ju Y, et al. Dual mutual learning for cross-modality person re-identification[J]. IEEE Trans Circ Syst Video Technol, 2022, 32(8): 5361–5373. doi: 10.1109/TCSVT.2022.3144775
[8] Ye M, Shen J B, Crandall D J, et al. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification[C]//Proceedings of the 16th European Conference on Computer Vision, 2020: 229–247. doi: 10.1007/978-3-030-58520-4_14
[9] Ye M, Ruan W J, Du B, et al. Channel augmented joint learning for visible-infrared recognition[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 13567–13576. doi: 10.1109/ICCV48922.2021.01331
[10] Chen C Q, Ye M, Qi M B, et al. Structure-aware positional transformer for visible-infrared person re-identification[J]. IEEE Trans Image Process, 2022, 31: 2352–2364. doi: 10.1109/TIP.2022.3141868
[11] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000–6010.
[12] Lu H, Zou X Z, Zhang P P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification[C]//Proceedings of the 37th AAAI Conference on Artificial Intelligence, 2023: 1835–1843. doi: 10.1609/aaai.v37i2.25273
[13] Lin B B, Zhang S L, Yu X. Gait recognition via effective global-local feature representation and local temporal aggregation[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 14648–14656. doi: 10.1109/ICCV48922.2021.01438
[14] Wu A C, Zheng W S, Yu H X, et al. RGB-infrared cross-modality person re-identification[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, 2017: 5380–5389. doi: 10.1109/ICCV.2017.575
[15] Nguyen D T, Hong H G, Kim K W, et al. Person recognition system based on a combination of body images from visible light and thermal cameras[J]. Sensors, 2017, 17(3): 605. doi: 10.3390/s17030605
[16] Ye M, Shen J B, Lin G J, et al. Deep learning for person re-identification: a survey and outlook[J]. IEEE Trans Pattern Anal Mach Intell, 2022, 44(6): 2872–2893. doi: 10.1109/TPAMI.2021.3054775
[17] Ye M, Lan X Y, Wang Z, et al. Bi-directional center-constrained top-ranking for visible thermal person re-identification[J]. IEEE Trans Inf Forensics Secur, 2020, 15: 407–419. doi: 10.1109/TIFS.2019.2921454
[18] Ye M, Lan X Y, Li J W, et al. Hierarchical discriminative learning for visible thermal person re-identification[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 919. doi: 10.1609/aaai.v32i1.12293
[19] Wang G A, Zhang T Z, Cheng J, et al. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 3623–3632. doi: 10.1109/ICCV.2019.00372
[20] Li D G, Wei X, Hong X P, et al. Infrared-visible cross-modal person re-identification with an X modality[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020: 4610–4617. doi: 10.1609/aaai.v34i04.5891
[21] Fu C Y, Hu Y B, Wu X, et al. CM-NAS: cross-modality neural architecture search for visible-infrared person re-identification[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 11823–11832. doi: 10.1109/ICCV48922.2021.01161
[22] Hao X, Zhao S Y, Ye M, et al. Cross-modality person re-identification via modality confusion and center aggregation[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 16403–16412. doi: 10.1109/ICCV48922.2021.01609
[23] Zheng X T, Chen X M, Lu X Q. Visible-infrared person re-identification via partially interactive collaboration[J]. IEEE Trans Image Process, 2022, 31: 6951–6963. doi: 10.1109/TIP.2022.3217697
[24] Yang M X, Huang Z Y, Hu P, et al. Learning with twin noisy labels for visible-infrared person re-identification[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 14308–14317. doi: 10.1109/CVPR52688.2022.01391
[25] Liu H J, Ma S, Xia D X, et al. SFANet: a spectrum-aware feature augmentation network for visible-infrared person reidentification[J]. IEEE Trans Neural Netw Learn Syst, 2023, 34(4): 1958–1971. doi: 10.1109/TNNLS.2021.3105702
[26] Gong J H, Zhao S Y, Lam K M, et al. Spectrum-irrelevant fine-grained representation for visible–infrared person re-identification[J]. Comput Vis Image Underst, 2023, 232: 103703. doi: 10.1016/j.cviu.2023.103703
[27] Huang N C, Liu J N, Luo Y J, et al. Exploring modality-shared appearance features and modality-invariant relation features for cross-modality person Re-IDentification[J]. Pattern Recogn, 2023, 135: 109145. doi: 10.1016/j.patcog.2022.109145
[28] Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[J]. Int J Comput Vis, 2020, 128(2): 336–359. doi: 10.1007/s11263-019-01228-7