基于自适应模板更新与多特征融合的视频目标分割算法

汪水源; 侯志强; 王囡; 李富成; 蒲磊; 马素刚

doi:10.12086/oee.2021.210193

基于自适应模板更新与多特征融合的视频目标分割算法

- 1.
  西安邮电大学计算机学院，陕西西安 710121
- 2.
  西安邮电大学陕西省网络数据分析与智能处理重点实验室，陕西西安 710121
- 3.
  火箭军工程大学作战保障学院，陕西西安 710025
基金项目:
国家自然科学基金资助项目(62072370)

详细信息

作者简介:
汪水源(1996-)，男，硕士研究生，主要从事计算机视觉、视频目标分割的研究。E-mail：wsy_wang1@163.com

通讯作者: 侯志强(1973-)，男，博士，教授，博士生导师，主要从事图像处理、计算机视觉和信息融合的研究。E-mail：hzq@xupt.edu.cn

中图分类号: TP391

收稿日期: 2021-06-06

修回日期: 2021-09-09

刊出日期: 2021-10-15

Video object segmentation algorithm based on adaptive template updating and multi-feature fusion

- 1.
  Institute of Computer, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi 710121, China
- 2.
  Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi 710121, China
- 3.
  Rocket Force University of Engineering, Operational Support School, Xi'an, Shaanxi 710025, China
Fund Project: National Natural Science Foundation of China (62072370)

More Information

Corresponding author: Hou Zhiqiang,E-mail: hzq@xupt.edu.cn

Received Date 06 June 2021

Revised Date 09 September 2021

Published Date 15 October 2021

摘要

摘要:
针对SiamMask不能很好地适应目标外观变化，特征信息利用不足导致生成掩码较为粗糙等问题，本文提出一种基于自适应模板更新与多特征融合的视频目标分割算法。首先，算法利用每一帧的分割结果对模板进行自适应更新；其次，使用混合池化模块对主干网络第四阶段提取的特征进行增强，将增强后的特征与粗略掩码进行融合；最后，使用特征融合模块对粗略掩码进行逐阶段细化，该模块能够对拼接后的特征进行有效的加权组合。实验结果表明，与SiamMask相比，本文算法性能有明显提升。在DAVIS2016数据集上，本文算法的区域相似度和轮廓相似度分别为0.727和0.696，比基准算法提升了1.0%和1.8%，速度达到40.2 f/s；在DAVIS2017数据集上，本文算法的区域相似度和轮廓相似度分别为0.567和0.615，比基准算法提升了2.4%和3.0%，速度达到42.6 f/s。
- 视频目标分割 /
- 模板更新 /
- 特征融合 /
- 掩码细化
Abstract:
In order to solve the problem that SiamMask cannot adapt to the change of target appearance and the lack of use of feature information leads to rough mask generation, this paper proposes a video object segmentation algorithm based on the adaptive template update and the multi-feature fusion. First of all, the algorithm adaptively updates the template using the segmentation results of each frame; secondly, the hybrid pooling module is used to enhance the features extracted in the fourth stage of the backbone network, and the enhanced features are fused with the rough mask; finally, the feature fusion module is used to refine the rough mask stage by stage, which can effectively combine the spliced features. Experimental results show that, compared with SiamMask, the performance of the proposed algorithm is significantly improved. On the DAVIS2016 data-set, the region similarity and contour similarity of this algorithm are 0.727 and 0.696, respectively, which is 1.0% and 1.8% higher than that of the benchmark algorithm, and the speed reaches 40.2 f/s. On the DAVIS2017 data-set, the region similarity and contour similarity of this algorithm are 0.567 and 0.615, respectively, which is 2.4% and 3.0% higher than that of the benchmark algorithm, and the speed reaches 42.6 f/s.
- video object segmentation /
- template update /
- feature fusion /
- mask thinning

Overview

Overview: In recent years, video object segmentation (VOS) has been widely used in video surveillance, autopilot, intelligent robot, and other fields, and it has attracted more and more researchers' attention. According to the degree of human participation, video object segmentation can be divided into interactive video object segmentation, unsupervised video object segmentation, and semi-supervised video object segmentation. Semi-supervised VOS is the most concerned task in the field of video object segmentation, and it is also the research direction of this paper. Semi-supervised VOS gives the real mask of the target in the first frame of the video, and its purpose is to segment the target mask automatically in the remaining frames. However, in the whole video sequence, the target to be segmented may experience great appearance changes, occlusion, and fast movement, so it is a very challenging task to segment the target robust in the video sequence.
SiamMask forms is a multi-branch twin network framework by adding Mask branches to SiamRPN. In the field of video object segmentation, SiamMask achieves competitive segmentation accuracy on DAVIS2016 and DAVIS2017 data-sets. At the same time, the speed is nearly an order of magnitude faster than the method in the same period. Compared with the classical OSVOS, SiamMask is two orders of magnitude faster, so the video object segmentation can be applied in practice. However, due to the lack of template update, SiamMask is prone to tracking drift in complex videos. In addition, in the process of mask generation, SiamMask uses a lot of feature information loss, the fusion process is relatively rough, and does not use the feature map of the whole stage of the backbone network to refine the mask. In order to solve the above problems, this paper proposes a video object segmentation algorithm based on the adaptive template update and the multi-feature fusion. First of all, the proposed algorithm uses an adaptive update strategy to process the template, which can update the template using the segmentation results of each frame. Secondly, in order to use more feature information to refine the mask, this algorithm uses the hybrid pooling module to enhance the features extracted in the fourth stage of the backbone network, and fuses the enhanced features with the rough mask. Finally, in order to generate a more fine mask, this algorithm uses the feature fusion module to participate in the mask thinning process of intermediate features with richer spatial information in each stage of the backbone network. The experimental results show that the proposed algorithm significantly improves the tracking drift caused by occlusion and similar background interference, the performances on DAVIS2016 and DAVIS2017 data-sets are significantly improved, and the running speed meets the real-time requirements.

HTML全文

图 1 SiamMask算法整体框架

Figure 1. Overall framework of siamMask algorithm

下载: 全尺寸图片幻灯片

图 2 模板更新模块与模板更新流程

Figure 2. Template update module and template update process

下载: 全尺寸图片幻灯片

图 3 混合池化模块(MPM)

Figure 3. Mixed pooling module (MPM)

下载: 全尺寸图片幻灯片

图 4 特征融合模块(FFM)

Figure 4. Feature fusion module (FFM)

下载: 全尺寸图片幻灯片

图 5 本文算法整体框架

Figure 5. The overall framework of the algorithm in this paper

下载: 全尺寸图片幻灯片

图 6 定性实验结果

Figure 6. Qualitative experimental results

下载: 全尺寸图片幻灯片

表 1 DAVIS2016验证集上不同算法之间的性能对比

Table 1. Performance comparison between different algorithms on the DAVIS2016 verification set

Method	J-mean	F-mean	Speed/(f/s)
VPN^[24]	0.702	0.655	1.6
BVS^[25]	0.600	0.588	2.7
PLM^[26]	0.702	0.625	6.7
MuG-W^[2]	0.657	0.636	1.4
SiamMask^[16]	0.717	0.678	55.0
SiamMask_R	0.692	0.665	55.0
Ours	0.727	0.696	40.2

下载: 导出CSV

表 2 DAVIS2017验证集上不同算法之间的性能对比

Table 2. Performance comparison between different algorithms on the DAVIS2017 verification set

Method	J-mean	F-mean	Speed/(f/s)
OSVOS^[3]	0.566	0.639	0.1
FAVOS^[8]	0.546	0.618	0.8
OSMN^[13]	0.525	0.571	8.0
MuG-W^[2]	0.541	0.580	1.4
SiamMask^[16]	0.543	0.585	55.0
SiamMask_R	0.543	0.585	55.0
Ours	0.567	0.615	42.6

下载: 导出CSV

表 3 本文算法在DAVIS2017上的消融实验

Table 3. Ablation experiment of this algorithm on the DAVIS2017

SiamMask	MPM	FFM	Update	J	F
√				0.543	0.585
√	√			0.546	0.591
√		√		0.550	0.597
√			√	0.559	0.602
√	√	√		0.552	0.598
√	√		√	0.562	0.608
√		√	√	0.565	0.612
√	√	√	√	0.567	0.615

下载: 导出CSV

参考文献(27)

[1]	Miao J X, Wei Y C, Yang Y. Memory aggregation networks for efficient interactive video object segmentation[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10366-10375.
[2]	Lu X K, Wang W G, Shen J B, et al. Learning video object segmentation from unlabeled videos[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 8957-8967.
[3]	Caelles S, Maninis K K, Pont-Tuset J, et al. One-shot video object segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 5320-5329.
[4]	Perazzi F, Khoreva A, Benenson R, et al. Learning video object segmentation from static images[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3491-3500.
[5]	Voigtlaender P, Leibe B. Online adaptation of convolutional neural networks for video object segmentation[Z]. arXiv: 1706.09364, 2017.
[6]	Luiten J, Voigtlaender P, Leibe B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[C]//Proceedings of the 14th Asian Conference on Computer Vision, 2018: 565-580.
[7]	Li X X, Loy C C. Video object segmentation with joint re-identification and attention-aware mask propagation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 93-110.
[8]	Cheng J C, Tsai Y H, Hung W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 7415-7424.
[9]	Chen Y H, Pont-Tuset J, Montes A, et al. Blazingly fast video object segmentation with pixel-wise metric learning[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 1189-1198.
[10]	Hu Y T, Huang J B, Schwing A G. VideoMatch: matching based video object segmentation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 56-73.
[11]	Voigtlaender P, Chai Y N, Schroff F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9473-9482.
[12]	Johnander J, Danelljan M, Brissman E, et al. A generative appearance model for end-to-end video object segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 8945-8954.
[13]	Yang L J, Wang Y R, Xiong X H, et al. Efficient video object segmentation via network modulation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 6499-6507.
[14]	Oh S W, Lee J Y, Sunkavalli K, et al. Fast video object segmentation by reference-guided mask propagation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 7376-7385.
[15]	Oh S W, Lee J Y, Xu N, et al. Video object segmentation using space-time memory networks[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 9225-9234.
[16]	Wang Q, Zhang L, Bertinetto L, et al. Fast online object tracking and segmentation: a unifying approach[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 1328-1338.
[17]	Li B, Yan J J, Wu W, et al. High performance visual tracking with Siamese region proposal network[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980.
[18]	Perazzi F, Pont-Tuset J, McWilliams B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 724-732.
[19]	Pont-Tuset J, Perazzi F, Caelles S, et al. The 2017 DAVIS challenge on video object segmentation[Z]. arXiv: 1704.00675, 2018.
[20]	Zhang L C, Gonzalez-Garcia A, Van De Weijer J, et al. Learning the model update for Siamese trackers[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019: 4009-4018.
[21]	Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6230-6239.
[22]	Hou Q B, Zhang L, Cheng M M, et al. Strip pooling: rethinking spatial pooling for scene parsing[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4002-4011.
[23]	Yu C Q, Wang J B, Peng C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 334-349.
[24]	Jampani V, Gadde R, Gehler P V. Video propagation networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3154-3164.
[25]	Märki N, Perazzi F, Wang O, et al. Bilateral space video segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 743-751.
[26]	Yoon J S, Rameau F, Kim J, et al. Pixel-level matching for video object segmentation using convolutional neural networks[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, 2017: 2186-2195.
[27]	Chen X, Yan B, Zhu J W, et al. Transformer tracking[Z]. arXiv: 2103.15436, 2021.

施引文献

资源附件(0)

访问统计

点击扫一扫

图(6)

表(3)

计量

文章访问数: 2678
PDF下载数: 986
施引文献: 0

基于自适应模板更新与多特征融合的视频目标分割算法

作者简介:
汪水源(1996-)，男，硕士研究生，主要从事计算机视觉、视频目标分割的研究。E-mail：wsy_wang1@163.com

通讯作者: 侯志强(1973-)，男，博士，教授，博士生导师，主要从事图像处理、计算机视觉和信息融合的研究。E-mail：hzq@xupt.edu.cn

Video object segmentation algorithm based on adaptive template updating and multi-feature fusion

Corresponding author: Hou Zhiqiang,E-mail: hzq@xupt.edu.cn

计量

目录

作者须知

其他内容

条款和政策

基于自适应模板更新与多特征融合的视频目标分割算法

作者简介: 汪水源(1996-)，男，硕士研究生，主要从事计算机视觉、视频目标分割的研究。E-mail：wsy_wang1@163.com

通讯作者: 侯志强(1973-)，男，博士，教授，博士生导师，主要从事图像处理、计算机视觉和信息融合的研究。E-mail：hzq@xupt.edu.cn

Video object segmentation algorithm based on adaptive template updating and multi-feature fusion

Corresponding author: Hou Zhiqiang,E-mail: hzq@xupt.edu.cn

计量

出版历程

目录

作者须知

其他内容

条款和政策

作者简介:
汪水源(1996-)，男，硕士研究生，主要从事计算机视觉、视频目标分割的研究。E-mail：wsy_wang1@163.com