基于自适应模板更新与多特征融合的视频目标分割算法

汪水源,侯志强,王囡,等. 基于自适应模板更新与多特征融合的视频目标分割算法[J]. 光电工程,2021,48(10): 210193. doi: 10.12086/oee.2021.210193
引用本文: 汪水源,侯志强,王囡,等. 基于自适应模板更新与多特征融合的视频目标分割算法[J]. 光电工程,2021,48(10): 210193. doi: 10.12086/oee.2021.210193
Wang S Y, Hou Z Q, Wang N, et al. Video object segmentation algorithm based on adaptive template updating and multi-feature fusion[J]. Opto-Electron Eng, 2021, 48(10): 210193. doi: 10.12086/oee.2021.210193
Citation: Wang S Y, Hou Z Q, Wang N, et al. Video object segmentation algorithm based on adaptive template updating and multi-feature fusion[J]. Opto-Electron Eng, 2021, 48(10): 210193. doi: 10.12086/oee.2021.210193

基于自适应模板更新与多特征融合的视频目标分割算法

  • 基金项目:
    国家自然科学基金资助项目(62072370)
详细信息
    作者简介:
    通讯作者: 侯志强(1973-),男,博士,教授,博士生导师,主要从事图像处理、计算机视觉和信息融合的研究。E-mail:hzq@xupt.edu.cn
  • 中图分类号: TP391

Video object segmentation algorithm based on adaptive template updating and multi-feature fusion

  • Fund Project: National Natural Science Foundation of China (62072370)
More Information
  • 针对SiamMask不能很好地适应目标外观变化,特征信息利用不足导致生成掩码较为粗糙等问题,本文提出一种基于自适应模板更新与多特征融合的视频目标分割算法。首先,算法利用每一帧的分割结果对模板进行自适应更新;其次,使用混合池化模块对主干网络第四阶段提取的特征进行增强,将增强后的特征与粗略掩码进行融合;最后,使用特征融合模块对粗略掩码进行逐阶段细化,该模块能够对拼接后的特征进行有效的加权组合。实验结果表明,与SiamMask相比,本文算法性能有明显提升。在DAVIS2016数据集上,本文算法的区域相似度和轮廓相似度分别为0.727和0.696,比基准算法提升了1.0%和1.8%,速度达到40.2 f/s;在DAVIS2017数据集上,本文算法的区域相似度和轮廓相似度分别为0.567和0.615,比基准算法提升了2.4%和3.0%,速度达到42.6 f/s。

  • Overview: In recent years, video object segmentation (VOS) has been widely used in video surveillance, autopilot, intelligent robot, and other fields, and it has attracted more and more researchers' attention. According to the degree of human participation, video object segmentation can be divided into interactive video object segmentation, unsupervised video object segmentation, and semi-supervised video object segmentation. Semi-supervised VOS is the most concerned task in the field of video object segmentation, and it is also the research direction of this paper. Semi-supervised VOS gives the real mask of the target in the first frame of the video, and its purpose is to segment the target mask automatically in the remaining frames. However, in the whole video sequence, the target to be segmented may experience great appearance changes, occlusion, and fast movement, so it is a very challenging task to segment the target robust in the video sequence.

    SiamMask forms is a multi-branch twin network framework by adding Mask branches to SiamRPN. In the field of video object segmentation, SiamMask achieves competitive segmentation accuracy on DAVIS2016 and DAVIS2017 data-sets. At the same time, the speed is nearly an order of magnitude faster than the method in the same period. Compared with the classical OSVOS, SiamMask is two orders of magnitude faster, so the video object segmentation can be applied in practice. However, due to the lack of template update, SiamMask is prone to tracking drift in complex videos. In addition, in the process of mask generation, SiamMask uses a lot of feature information loss, the fusion process is relatively rough, and does not use the feature map of the whole stage of the backbone network to refine the mask. In order to solve the above problems, this paper proposes a video object segmentation algorithm based on the adaptive template update and the multi-feature fusion. First of all, the proposed algorithm uses an adaptive update strategy to process the template, which can update the template using the segmentation results of each frame. Secondly, in order to use more feature information to refine the mask, this algorithm uses the hybrid pooling module to enhance the features extracted in the fourth stage of the backbone network, and fuses the enhanced features with the rough mask. Finally, in order to generate a more fine mask, this algorithm uses the feature fusion module to participate in the mask thinning process of intermediate features with richer spatial information in each stage of the backbone network. The experimental results show that the proposed algorithm significantly improves the tracking drift caused by occlusion and similar background interference, the performances on DAVIS2016 and DAVIS2017 data-sets are significantly improved, and the running speed meets the real-time requirements.

  • 加载中
  • 图 1  SiamMask算法整体框架

    Figure 1.  Overall framework of siamMask algorithm

    图 2  模板更新模块与模板更新流程

    Figure 2.  Template update module and template update process

    图 3  混合池化模块(MPM)

    Figure 3.  Mixed pooling module (MPM)

    图 4  特征融合模块(FFM)

    Figure 4.  Feature fusion module (FFM)

    图 5  本文算法整体框架

    Figure 5.  The overall framework of the algorithm in this paper

    图 6  定性实验结果

    Figure 6.  Qualitative experimental results

    表 1  DAVIS2016验证集上不同算法之间的性能对比

    Table 1.  Performance comparison between different algorithms on the DAVIS2016 verification set

    Method J-mean F-mean Speed/(f/s)
    VPN[24] 0.702 0.655 1.6
    BVS[25] 0.600 0.588 2.7
    PLM[26] 0.702 0.625 6.7
    MuG-W[2] 0.657 0.636 1.4
    SiamMask[16] 0.717 0.678 55.0
    SiamMask_R 0.692 0.665 55.0
    Ours 0.727 0.696 40.2
    下载: 导出CSV

    表 2  DAVIS2017验证集上不同算法之间的性能对比

    Table 2.  Performance comparison between different algorithms on the DAVIS2017 verification set

    Method J-mean F-mean Speed/(f/s)
    OSVOS[3] 0.566 0.639 0.1
    FAVOS[8] 0.546 0.618 0.8
    OSMN[13] 0.525 0.571 8.0
    MuG-W[2] 0.541 0.580 1.4
    SiamMask[16] 0.543 0.585 55.0
    SiamMask_R 0.543 0.585 55.0
    Ours 0.567 0.615 42.6
    下载: 导出CSV

    表 3  本文算法在DAVIS2017上的消融实验

    Table 3.  Ablation experiment of this algorithm on the DAVIS2017

    SiamMask MPM FFM Update J F
    0.543 0.585
    0.546 0.591
    0.550 0.597
    0.559 0.602
    0.552 0.598
    0.562 0.608
    0.565 0.612
    0.567 0.615
    下载: 导出CSV
  • [1]

    Miao J X, Wei Y C, Yang Y. Memory aggregation networks for efficient interactive video object segmentation[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10366-10375.

    [2]

    Lu X K, Wang W G, Shen J B, et al. Learning video object segmentation from unlabeled videos[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 8957-8967.

    [3]

    Caelles S, Maninis K K, Pont-Tuset J, et al. One-shot video object segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 5320-5329.

    [4]

    Perazzi F, Khoreva A, Benenson R, et al. Learning video object segmentation from static images[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3491-3500.

    [5]

    Voigtlaender P, Leibe B. Online adaptation of convolutional neural networks for video object segmentation[Z]. arXiv: 1706.09364, 2017.

    [6]

    Luiten J, Voigtlaender P, Leibe B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[C]//Proceedings of the 14th Asian Conference on Computer Vision, 2018: 565-580.

    [7]

    Li X X, Loy C C. Video object segmentation with joint re-identification and attention-aware mask propagation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 93-110.

    [8]

    Cheng J C, Tsai Y H, Hung W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 7415-7424.

    [9]

    Chen Y H, Pont-Tuset J, Montes A, et al. Blazingly fast video object segmentation with pixel-wise metric learning[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 1189-1198.

    [10]

    Hu Y T, Huang J B, Schwing A G. VideoMatch: matching based video object segmentation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 56-73.

    [11]

    Voigtlaender P, Chai Y N, Schroff F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9473-9482.

    [12]

    Johnander J, Danelljan M, Brissman E, et al. A generative appearance model for end-to-end video object segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 8945-8954.

    [13]

    Yang L J, Wang Y R, Xiong X H, et al. Efficient video object segmentation via network modulation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 6499-6507.

    [14]

    Oh S W, Lee J Y, Sunkavalli K, et al. Fast video object segmentation by reference-guided mask propagation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 7376-7385.

    [15]

    Oh S W, Lee J Y, Xu N, et al. Video object segmentation using space-time memory networks[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 9225-9234.

    [16]

    Wang Q, Zhang L, Bertinetto L, et al. Fast online object tracking and segmentation: a unifying approach[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 1328-1338.

    [17]

    Li B, Yan J J, Wu W, et al. High performance visual tracking with Siamese region proposal network[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980.

    [18]

    Perazzi F, Pont-Tuset J, McWilliams B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 724-732.

    [19]

    Pont-Tuset J, Perazzi F, Caelles S, et al. The 2017 DAVIS challenge on video object segmentation[Z]. arXiv: 1704.00675, 2018.

    [20]

    Zhang L C, Gonzalez-Garcia A, Van De Weijer J, et al. Learning the model update for Siamese trackers[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 4009-4018.

    [21]

    Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6230-6239.

    [22]

    Hou Q B, Zhang L, Cheng M M, et al. Strip pooling: rethinking spatial pooling for scene parsing[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4002-4011.

    [23]

    Yu C Q, Wang J B, Peng C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 334-349.

    [24]

    Jampani V, Gadde R, Gehler P V. Video propagation networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3154-3164.

    [25]

    Märki N, Perazzi F, Wang O, et al. Bilateral space video segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 743-751.

    [26]

    Yoon J S, Rameau F, Kim J, et al. Pixel-level matching for video object segmentation using convolutional neural networks[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, 2017: 2186-2195.

    [27]

    Chen X, Yan B, Zhu J W, et al. Transformer tracking[Z]. arXiv: 2103.15436, 2021.

  • 加载中

(6)

(3)

计量
  • 文章访问数:  2678
  • PDF下载数:  986
  • 施引文献:  0
出版历程
收稿日期:  2021-06-06
修回日期:  2021-09-09
刊出日期:  2021-10-15

目录

/

返回文章
返回