GLCrowd: a weakly supervised global-local attention model for congested crowd counting

Citation: Zhang H M, Tian Q Q, Yan D D, et al. GLCrowd: a weakly supervised global-local attention model for congested crowd counting[J]. Opto-Electron Eng, 2024, 51(10): 240174. doi: 10.12086/oee.2024.240174

  • Fund Project: Chongqing Natural Science Foundation General Program (cstc2021jcyj-msxmX0525, CSTB2022NSCQ-MSX0786, CSTB2023NSCQ-MSX0911); Science and Technology Research Project of Chongqing Municipal Education Commission (KJQN202201109)
  • Abstract: To address the complex backgrounds and large scale variations that hinder crowd counting in congested scenes, this paper proposes GLCrowd, a weakly supervised crowd counting model that combines global and local attention. First, a local attention module incorporating depthwise convolution is designed, which strengthens local features through contextual weights and shares feature weights to obtain high-frequency local information. Second, the self-attention mechanism of the Vision Transformer (ViT) is used to capture low-frequency global information. Finally, the global and local attention are fused, and counting is completed through a regression token. The model was tested on the Shanghai Tech Part A, Shanghai Tech Part B, UCF-QNRF, and UCF_CC_50 datasets, achieving MAE values of 64.887, 8.958, 95.523, and 209.660 and MSE values of 104.411, 16.202, 173.453, and 282.217, respectively. The results show that the proposed GLCrowd model delivers strong crowd counting performance in dense scenes.
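    As a concrete reading of the weak supervision described above, a minimal PyTorch sketch follows; the encoder is a stand-in ViT-style Transformer, and all module names and sizes (CountTokenRegressor, dim=256, four layers) are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: weak supervision via a regression token. Only the
# image-level count is used as the label; no point annotations are needed.
import torch
import torch.nn as nn

class CountTokenRegressor(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.count_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable regression token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)                 # regression head on the token

    def forward(self, patch_tokens):                  # patch_tokens: (B, N, dim)
        b = patch_tokens.size(0)
        x = torch.cat([self.count_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)         # predicted total count per image

model = CountTokenRegressor()
pred = model(torch.randn(2, 196, 256))                # two images, 196 patch tokens each
loss = nn.functional.l1_loss(pred, torch.tensor([450.0, 1210.0]))  # count-level labels only
```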

  • Overview: Crowd counting aims to estimate the number of people in an image using computer algorithms and has wide applications in public safety, urban planning, advertising, and tourism management. However, counting in dense crowds still faces significant challenges due to complex backgrounds and large scale variations. The Vision Transformer (ViT) models global context more effectively than convolutional neural networks (CNNs), but it captures fine local detail less well, which limits its performance in dense scenes. To address this issue, and to reduce the reliance on extensive annotated data, this paper proposes a weakly supervised crowd counting method based on a Transformer that combines global and local attention, leveraging the complementary strengths of the two mechanisms to improve counting performance.

    First, the Vision Transformer’s self-attention captures global features, focusing on low-frequency global information to understand the overall context and background layout of the image. Next, local attention captures high-frequency local information, using depthwise convolution (DWconv) to aggregate local features from the values (V). Depthwise convolution is also applied to queries (Q) and keys (K), and attention maps are generated through the Hadamard product and Tanh function. This approach aims to achieve higher-quality contextual weights with stronger nonlinearity for identifying specific objects and detailed features in the image. Global and local attention outputs are then concatenated along the channel dimension. To integrate local information into the feed-forward network (FFN), the paper replaces the original FFN in ViT with a feed-forward network that includes DWconv. Finally, a regression head is used to complete the crowd counting task.
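    A rough PyTorch sketch of this local attention branch and the DWconv feed-forward network is given below; the kernel size, the 1×1 projections, and the fusion layer are assumptions, while the overall recipe (depthwise convolution on Q, K, and V, a Hadamard product passed through Tanh as the context weight, and channel-wise concatenation of the global and local outputs) follows the description above:

```python
# Illustrative sketch, not the authors' code.
import torch
import torch.nn as nn

def dwconv(dim, k=3):
    # Depthwise convolution: one filter per channel aggregates a local neighborhood.
    return nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

class LocalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Conv2d(dim, dim * 3, 1)
        self.dw_q, self.dw_k, self.dw_v = dwconv(dim), dwconv(dim), dwconv(dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        # Hadamard product of locally aggregated Q and K, squashed by Tanh,
        # yields a nonlinear context-weight map for the high-frequency branch.
        context = torch.tanh(self.dw_q(q) * self.dw_k(k))
        return self.proj(context * self.dw_v(v))      # weight the locally aggregated values

class ConvFFN(nn.Module):
    # Feed-forward network with a depthwise convolution inserted between the
    # two pointwise layers, so the FFN also mixes local spatial information.
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.dw = dwconv(hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        return self.fc2(self.act(self.dw(self.fc1(x))))

class GlobalLocalFusion(nn.Module):
    # Concatenate global (e.g., ViT self-attention output) and local features
    # along the channel dimension, then project back to `dim`.
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Conv2d(dim * 2, dim, 1)

    def forward(self, global_feat, local_feat):       # both (B, C, H, W)
        return self.fuse(torch.cat([global_feat, local_feat], dim=1))
```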

    The proposed model was validated on the Shanghai Tech, UCF-QNRF, and UCF_CC_50 datasets and compared with current mainstream weakly supervised crowd counting models. On Shanghai Tech Part A, the proposed model improved MAE by 1.2 and MSE by 0.7 over the second-best model, Transcrowd_gap. Although the OT_M model shows a slight advantage on Shanghai Tech Part B, this is mainly because OT_M uses point annotations, whereas the proposed model uses coarse annotations that provide only the total number of people in each image. On UCF-QNRF, the proposed model improved MAE by 1.7 over Transcrowd_gap. On UCF_CC_50, it improved MAE by 9.2 and MSE by a substantial 48.2 over the CrowdFormer model. These results demonstrate that the proposed network outperforms other mainstream weakly supervised crowd counting models.
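    For reference, the MAE and MSE figures quoted throughout follow the standard crowd counting definitions, in which the reported "MSE" is in fact a root mean squared error over the per-image count errors; a small sketch:

```python
import numpy as np

def count_metrics(pred_counts, gt_counts):
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(pred - gt).mean()                # mean absolute count error
    mse = np.sqrt(((pred - gt) ** 2).mean())      # root mean squared error, reported as "MSE"
    return mae, mse
```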

  • Figure 1. GLCrowd network structure

    Figure 2. Global-local attention module

    Figure 3. Comparison of different methods. (a) Traditional convolution; (b) Standard self-attention; (c) Local attention

    Figure 4. ConvFFN module

    Figure 5. Partial visualization results

    Table 1. Information of datasets

    |                   | Shanghai Tech Part A | Shanghai Tech Part B | UCF-QNRF  | UCF_CC_50 |
    | Number of images  | 482                  | 716                  | 1535      | 50        |
    | Average size      | 589×868              | 768×1024             | 2013×2902 | 2101×2888 |
    | Average count     | 501                  | 123.6                | 1279      | 1278      |
    | Maximum count     | 3139                 | 578                  | 12865     | 4543      |
    | Total count       | 241677               | 88488                | 1251642   | 63974     |

    Table 2. Information of experimental environment

    | Configuration    | Specification                                            |
    | Operating system | Ubuntu 20.04.3 LTS (GNU/Linux 5.15.0-97-generic x86_64)  |
    | GPU              | NVIDIA GeForce RTX 4090 (×1)                             |
    | GPU memory       | 24 GB                                                    |

    Table 3. Partial experimental parameters

    | Parameter             | Value   |
    | Weight decay          | 5×10⁻⁴  |
    | Total training epochs | 1200    |
    | Optimizer momentum    | 0.95    |
    | Learning rate         | 1×10⁻⁵  |
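    Read as a PyTorch optimizer configuration, the parameters in Table 3 might translate into the snippet below; the table does not name the optimizer, so SGD is assumed here purely for illustration:

```python
import torch

model = torch.nn.Linear(8, 1)        # placeholder for the GLCrowd network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-5,                         # learning rate
    momentum=0.95,                   # optimizer momentum
    weight_decay=5e-4,               # weight decay
)
num_epochs = 1200                    # total training epochs
```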

    Table 4. Experimental results on the Shanghai Tech dataset

    | Model                 | Part A MAE | Part A MSE | Part B MAE | Part B MSE |
    | HLNet [40]            | 71.5       | 108.6      | 11.3       | 20.0       |
    | Transcrowd_gap [29]   | 66.1       | 105.1      | 9.3        | 16.1       |
    | Transcrowd_token [29] | 69.0       | 116.5      | 10.6       | 19.7       |
    | MATT [41]             | 80.1       | 129.4      | 11.7       | 17.5       |
    | OT_M [20]             | 70.7       | 114.5      | 8.1        | 13.1       |
    | SUA_crowd [42]        | 68.5       | 121.9      | 14.1       | 20.6       |
    | PDDNet [43]           | 72.6       | 112.2      | 10.3       | 17.0       |
    | GLCrowd               | 64.887     | 104.411    | 8.958      | 16.202     |

    Table 5. Experimental results on the UCF-QNRF dataset

    | Model                 | MAE    | MSE     |
    | HLNet [40]            | 100.4  | 182.6   |
    | Transcrowd_token [29] | 98.9   | 176.1   |
    | Transcrowd_gap [29]   | 97.2   | 168.5   |
    | OT_M [20]             | 100.6  | 167.6   |
    | SUA_crowd [42]        | 130.3  | 226.3   |
    | PDDNet [43]           | 130.2  | 246.6   |
    | GLCrowd               | 95.523 | 173.453 |

    Table 6. Experimental results on the UCF_CC_50 dataset

    | Model             | MAE     | MSE     |
    | MATT [41]         | 355.0   | 550.0   |
    | CCTrans [31]      | 245.0   | 343.0   |
    | SSGP_Crowd [44]   | 355.0   | 505.0   |
    | SE Cycle GAN [45] | 373.4   | 528.8   |
    | CDPL_crowd [46]   | 336.5   | 486.1   |
    | Transcrowd [29]   | 272.2   | 395.3   |
    | CrowdFormer [47]  | 218.8   | 330.4   |
    | PC-Net [48]       | 217.3   | 309.7   |
    | GLCrowd           | 209.660 | 282.217 |

    Table 7. Comparative results of ablation experiments

    |         | Local attention | ConvFFN | Regression token | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE |
    | Group 1 | ×               | ×       | ✓                | 75.260     | 125.441    | 10.003     | 19.315     | 240.120       | 327.796       |
    | Group 2 | ✓               | ×       | ✓                | 67.403     | 108.298    | 9.428      | 18.446     | 236.233       | 305.122       |
    | Group 3 | ✓               | ✓       | ✓                | 64.887     | 104.411    | 8.955      | 16.202     | 209.660       | 282.217       |
  • [1] Tian Y Y, Deng M L, Gao H, et al. Review of crowd counting algorithms based on deep learning[J]. Electron Meas Technol, 2022, 45(7): 152−159. doi: 10.19651/j.cnki.emt.2108231
    [2] Xiong H P, Lu H, Liu C X, et al. From open set to closed set: supervised spatial divide-and-conquer for object counting[J]. Int J Comput Vis, 2023, 131(7): 1722−1740. doi: 10.1007/s11263-023-01782-1
    [3] Wu S K, Yang F Y. Boosting detection in crowd analysis via underutilized output features[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 15609–15618. doi: 10.1109/CVPR52729.2023.01498
    [4] Yu Y, Cai Z, Miao D Q, et al. An interactive network based on transformer for multimodal crowd counting[J]. Appl Intell, 2023, 53(19): 22602−22614. doi: 10.1007/s10489-023-04721-2
    [5] Wang M J, Zhou J, Cai H, et al. CrowdMLP: weakly-supervised crowd counting via multi-granularity MLP[J]. Pattern Recognit, 2023, 144: 109830. doi: 10.1016/j.patcog.2023.109830
    [6] Lu Z K, Liu S, Zhong L, et al. Survey on research of crowd counting[J]. Comput Eng Appl, 2022, 58(11): 33−46. doi: 10.3778/j.issn.1002-8331.2111-0281
    [7] Guo A X, Xia Y F, Wang D W, et al. A multi-scale crowd counting algorithm with removing background interference[J]. Comput Eng, 2022, 48(5): 251−257. doi: 10.19678/j.issn.1000-3428.0061423
    [8] Zhang Q M, Xu Y F, Zhang J, et al. ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond[J]. Int J Comput Vis, 2023, 131(5): 1141−1162. doi: 10.1007/s11263-022-01739-w
    [9] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000–6010.
    [10] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale[C]//Proceedings of the 9th International Conference on Learning Representations, 2021.
    [11] Lin H, Ma Z H, Ji R R, et al. Boosting crowd counting via multifaceted attention[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 19628–19637. doi: 10.1109/CVPR52688.2022.01901
    [12] Liang D K, Xu W, Bai X. An end-to-end transformer model for crowd localization[C]//Proceedings of the 17th European Conference on Computer Vision, 2022: 38–54. doi: 10.1007/978-3-031-19769-7_3
    [13] Lin H, Ma Z H, Hong X P, et al. Gramformer: learning crowd counting via graph-modulated transformer[C]//Proceedings of the 38th AAAI Conference on Artificial Intelligence, 2024. doi: 10.1609/aaai.v38i4.28126
    [14] Li B, Zhang Y, Xu H H, et al. CCST: crowd counting with swin transformer[J]. Vis Comput, 2023, 39(7): 2671−2682. doi: 10.1007/s00371-022-02485-3
    [15] Wang F S, Liu K, Long F, et al. Joint CNN and transformer network via weakly supervised learning for efficient crowd counting[Z]. arXiv: 2203.06388, 2022. https://arxiv.org/abs/2203.06388
    [16] Wang C A, Song Q Y, Zhang B S, et al. Uniformity in heterogeneity: diving deep into count interval partition for crowd counting[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 3234–3242. doi: 10.1109/ICCV48922.2021.00322
    [17] Song Q Y, Wang C A, Jiang Z K, et al. Rethinking counting and localization in crowds: a purely point-based framework[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 3365–3374. doi: 10.1109/ICCV48922.2021.00335
    [18] Shivapuja S V, Khamkar M P, Bajaj D, et al. Wisdom of (binned) crowds: a Bayesian stratification paradigm for crowd counting[C]//Proceedings of the 29th ACM International Conference on Multimedia, 2021: 3574–3582. doi: 10.1145/3474085.3475522
    [19] Wang X Y, Zhang B F, Yu L M, et al. Hunting sparsity: density-guided contrastive learning for semi-supervised semantic segmentation[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 3114–3123. doi: 10.1109/CVPR52729.2023.00304
    [20] Lin W, Chan A B. Optimal transport minimization: crowd localization on density maps for semi-supervised counting[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 21663–21673. doi: 10.1109/CVPR52729.2023.02075
    [21] Gao H, Zhao W J, Zhang D X, et al. Application of improved transformer based on weakly supervised in crowd localization and crowd counting[J]. Sci Rep, 2023, 13(1): 1144. doi: 10.1038/s41598-022-27299-0
    [22] Liu Y T, Wang Z, Shi M J, et al. Discovering regression-detection bi-knowledge transfer for unsupervised cross-domain crowd counting[J]. Neurocomputing, 2022, 494: 418−431. doi: 10.1016/j.neucom.2022.04.107
    [23] Xu C F, Liang D K, Xu Y C, et al. AutoScale: learning to scale for crowd counting[J]. Int J Comput Vis, 2022, 130(2): 405−434. doi: 10.1007/s11263-021-01542-z
    [24] Liang D K, Xie J H, Zou Z K, et al. CrowdCLIP: unsupervised crowd counting via vision-language model[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 2893–2903. doi: 10.1109/CVPR52729.2023.00283
    [25] Zhang C Y, Zhang Y, Li B, et al. CrowdGraph: weakly supervised crowd counting via pure graph neural network[J]. ACM Trans Multimedia Comput Commun Appl, 2024, 20(5): 135. doi: 10.1145/3638774
    [26] Yang Y F, Li G R, Wu Z, et al. Weakly-supervised crowd counting learns from sorting rather than locations[C]//Proceedings of the 16th European Conference on Computer Vision, 2020: 1–17. doi: 10.1007/978-3-030-58598-3_1
    [27] Liu Y T, Ren S C, Chai L Y, et al. Reducing spatial labeling redundancy for active semi-supervised crowd counting[J]. IEEE Trans Pattern Anal Mach Intell, 2022, 45(7): 9248−9255. doi: 10.1109/TPAMI.2022.3232712
    [28] Deng M F, Zhao H L, Gao M. CLFormer: a unified transformer-based framework for weakly supervised crowd counting and localization[J]. Vis Comput, 2024, 40(2): 1053−1067. doi: 10.1007/s00371-023-02831-z
    [29] Liang D K, Chen X W, Xu W, et al. TransCrowd: weakly-supervised crowd counting with transformers[J]. Sci China Inf Sci, 2022, 65(6): 160104. doi: 10.1007/s11432-021-3445-y
    [30] Sun G L, Liu Y, Probst T, et al. Rethinking global context in crowd counting[Z]. arXiv: 2105.10926, 2021. https://arxiv.org/abs/2105.10926
    [31] Tian Y, Chu X X, Wang H P. CCTrans: simplifying and improving crowd counting with transformer[Z]. arXiv: 2109.14483, 2021. https://arxiv.org/abs/2109.14483
    [32] Chen X S, Lu H T. Reinforcing local feature representation for weakly-supervised dense crowd counting[Z]. arXiv: 2202.10681, 2022. https://arxiv.org/abs/2202.10681v1
    [33] Gao J Y, Gong M G, Li X L. Congested crowd instance localization with dilated convolutional swin transformer[J]. Neurocomputing, 2022, 513: 94−103. doi: 10.1016/j.neucom.2022.09.113
    [34] Wang F S, Sang J, Wu Z Y, et al. Hybrid attention network based on progressive embedding scale-context for crowd counting[J]. Inf Sci, 2022, 591: 306−318. doi: 10.1016/j.ins.2022.01.046
    [35] Yang S P, Guo W Y, Ren Y H. CrowdFormer: an overlap patching vision transformer for top-down crowd counting[C]//Proceedings of the 31st International Joint Conference on Artificial Intelligence, 2022. doi: 10.24963/ijcai.2022/215
    [36] Zhang Y Y, Zhou D S, Chen S Q, et al. Single-image crowd counting via multi-column convolutional neural network[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 589–597. doi: 10.1109/CVPR.2016.70
    [37] Idrees H, Tayyab M, Athrey K, et al. Composition loss for counting, density map estimation and localization in dense crowds[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 532–546. doi: 10.1007/978-3-030-01216-8_33
    [38] Idrees H, Saleemi I, Seibert C, et al. Multi-source multi-scale counting in extremely dense crowd images[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013: 2547–2554. doi: 10.1109/CVPR.2013.329
    [39] Patwal A, Diwakar M, Tripathi V, et al. Crowd counting analysis using deep learning: a critical review[J]. Proc Comput Sci, 2023, 218: 2448−2458. doi: 10.1016/j.procs.2023.01.220
    [40] Chen Y Q, Zhao H L, Gao M, et al. A weakly supervised hybrid lightweight network for efficient crowd counting[J]. Electronics, 2024, 13(4): 723. doi: 10.3390/electronics13040723
    [41] Lei Y J, Liu Y, Zhang P P, et al. Towards using count-level weak supervision for crowd counting[J]. Pattern Recognit, 2021, 109: 107616. doi: 10.1016/j.patcog.2020.107616
    [42] Meng Y D, Zhang H R, Zhao Y T, et al. Spatial uncertainty-aware semi-supervised crowd counting[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, 2021: 15549–15559. doi: 10.1109/ICCV48922.2021.01526
    [43] Liang L J, Zhao H L, Zhou F B, et al. PDDNet: lightweight congested crowd counting via pyramid depth-wise dilated convolution[J]. Appl Intell, 2023, 53(9): 10472−10484. doi: 10.1007/s10489-022-03967-6
    [44] Sindagi V A, Yasarla R, Babu D S, et al. Learning to count in the crowd from limited labeled data[C]//Proceedings of the 16th European Conference on Computer Vision, 2020: 212–229. doi: 10.1007/978-3-030-58621-8_13
    [45] Wang Q, Gao J Y, Lin W, et al. Learning from synthetic data for crowd counting in the wild[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 8198–8207. doi: 10.1109/CVPR.2019.00839
    [46] Liu W Z, Durasov N, Fua P. Leveraging self-supervision for cross-domain crowd counting[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 5341–5352. doi: 10.1109/CVPR52688.2022.00527
    [47] Savner S S, Kanhangad V. CrowdFormer: weakly-supervised crowd counting with improved generalizability[J]. J Vis Commun Image Represent, 2023, 94: 103853. doi: 10.1016/j.jvcir.2023.103853
    [48] Li Y C, Jia R S, Hu Y X, et al. A weakly-supervised crowd density estimation method based on two-stage linear feature calibration[J]. IEEE/CAA J Autom Sin, 2024, 11(4): 965−981. doi: 10.1109/JAS.2023.123960

Publication history
Received: 2024-07-24
Revised: 2024-10-07
Accepted: 2024-10-08
Published: 2024-10-25
