Robust visual object tracking combining double template fusion with a Siamese network
Abstract
Objective Visual object tracking algorithms mainly fall into two categories: correlation filter based and Siamese network based. The former achieves high accuracy but runs slowly and cannot meet real-time requirements. The latter delivers excellent tracking performance in both speed and accuracy; however, the vast majority of Siamese network based trackers still use a single fixed template, which makes it difficult to handle target occlusion, appearance changes and similar distractors. To address these shortcomings of current Siamese trackers, we propose an efficient and robust double template fusion tracking method (Siamese tracker with double template fusion, Siam-DTF). Method The annotated box in the first frame serves as the initial template. The appearance template branch then uses an appearance template search module to obtain an appropriate, high-quality appearance template for the target during tracking. Finally, a double template fusion module performs response-map fusion and feature fusion. The fusion module combines the respective advantages of the initial template and the appearance template, improving the robustness of the algorithm. Result Experiments compare our method with nine state-of-the-art methods on three mainstream public tracking datasets. On OTB2015 (object tracking benchmark 2015), our method achieves an AUC (area under curve) score of 0.701 and a precision of 0.918, improvements of 0.6% and 1.3% over the second best method, SiamRPN++ (Siamese region proposal network++). On VOT2016 (visual object tracking 2016), our method achieves the highest expected average overlap (EAO) of 0.477 and the fewest failures, 0.172; its EAO score is 1.6% higher than the baseline SiamRPN++ and 1.1% higher than the second best SiamMask_E. On VOT2018, our method achieves an EAO of 0.403 and an accuracy of 0.608, ranking 2nd and 1st among all algorithms, respectively. The method runs at an average of 47 frames per second, well above the real-time requirement for tracking. Conclusion The proposed double template fusion tracking method effectively overcomes the shortcomings of current Siamese network based trackers, improving tracking accuracy and robustness while maintaining speed, and is suitable for engineering deployment and applications.
Keywords
Double template fusion based Siamese network for robust visual object tracking
Chen Zhiliang, Shi Fanhuai (College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China)
Abstract
Objective Visual object tracking (VOT) is a long-standing challenge in computer vision research. Current trackers can be roughly divided into two categories: correlation filter trackers and Siamese network based trackers. Correlation filter trackers train a circular-correlation-based regressor in the Fourier domain. Siamese network based trackers improve both speed and accuracy by exploiting deep features. A Siamese network consists of two branches that implicitly encode the original patches into a common embedding space and then fuse them through a correlation operation to generate a single response output. However, most Siamese network based trackers rely on a single fixed template, which makes it difficult to handle occlusion, appearance change and distractors. We propose an efficient and robust Siamese network based tracker via double template fusion, referred to as Siam-DTF (Siamese tracker with double template fusion). The double template mechanism of Siam-DTF yields strong robustness. Method Siam-DTF takes three inputs: the initial template z, the appearance template za and the search area x. First, we design an appearance template search module (ATSM), which fully utilizes the information of historical frames to efficiently obtain an appropriate, high-quality appearance template when the initial template is no longer consistent with the current frame. The appearance template, which is flexible and adaptive to the appearance changes of the object, can represent the object well under hard tracking challenges. We choose the frame with the highest confidence among the historical frames to crop the appearance template. To filter out low-quality templates, we drop an appearance template if its predicted box has a low intersection over union (IoU) or if its confidence score is lower than that of the initial template. To balance the accuracy and speed of our tracker, we apply a sparse update strategy to the appearance template.
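The template selection and quality filtering performed by the ATSM can be sketched as follows. This is a minimal illustration only; the names (`HistoryEntry`, `select_appearance_template`) and the IoU threshold value are our own assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the ATSM selection/filtering step: pick the
# highest-confidence historical frame, then reject it if its predicted
# box has a low IoU or its confidence is below the initial template's.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryEntry:
    frame_patch: object   # cropped template patch from a historical frame
    confidence: float     # tracker confidence score for that frame
    box_iou: float        # IoU of the frame's predicted box (quality cue)


def select_appearance_template(history: List[HistoryEntry],
                               init_confidence: float,
                               iou_threshold: float = 0.5  # placeholder value
                               ) -> Optional[HistoryEntry]:
    """Return the appearance template candidate, or None if no
    historical frame passes the quality filter (in which case the
    tracker keeps using its current templates)."""
    if not history:
        return None
    best = max(history, key=lambda e: e.confidence)
    if best.box_iou < iou_threshold or best.confidence < init_confidence:
        return None  # low-quality candidate: drop it
    return best
```

Returning `None` on a failed filter mirrors the paper's idea that a low-quality appearance template is worse than none, since it could drift the tracker toward a distractor.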
Through theoretical analysis and experimental validation, we show that the change in the tracker's confidence score better reflects the tracking quality. When the maximum confidence of the current frame is lower than the average confidence of the historical N frames by a certain margin m, we invoke the ATSM to update the appearance template. Next, our fusion module achieves more robust results based on the two templates: the initial template branch and the appearance template branch are integrated through fusion of score maps and fusion of features. Result Nine state-of-the-art trackers, including correlation filter trackers and Siamese network based trackers, are compared on three public tracking datasets: object tracking benchmark 2015 (OTB2015), VOT2016 and VOT2018. On OTB2015, the quantitative evaluation metrics are area under curve (AUC) and precision. The proposed Siam-DTF ranks 1st in terms of both AUC and precision. Compared with the baseline tracker Siamese region proposal network++ (SiamRPN++), Siam-DTF improves AUC by 0.6% and precision by 1.3%. By exploiting the deep features of historical frames, Siam-DTF outperforms the correlation filter tracker efficient convolution operators (ECO) in precision by 0.8%. On VOT2016 and VOT2018, the quantitative evaluation metrics are accuracy (average overlap during successful tracking) and robustness (failure times). Overall performance is evaluated by the expected average overlap (EAO), which accounts for both accuracy and robustness. On VOT2016, Siam-DTF achieves the best EAO score of 0.477 and the lowest failure rate of 0.172. In terms of EAO, our method outperforms the baseline tracker SiamRPN++ and the second best tracker SiamMask_E by 1.6% and 1.1%, respectively. Our method also decreases the failure rate from 0.200 (SiamRPN++) to 0.172, indicating the strong robustness of Siam-DTF.
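The sparse update condition above (trigger the ATSM only when the current maximum confidence drops below the running average of the last N frames by margin m) can be illustrated as follows. The abstract does not give values for N and m, so the defaults below are placeholders.

```python
# Hypothetical sketch of the sparse update trigger. n (history length)
# and margin are placeholder values, not the paper's hyperparameters.
from collections import deque


def make_update_trigger(n: int = 5, margin: float = 0.1):
    """Return a closure that, given the current frame's max confidence,
    decides whether the ATSM should be invoked to refresh the
    appearance template."""
    history = deque(maxlen=n)  # confidences of the last n frames

    def should_update(current_max_confidence: float) -> bool:
        # Trigger only once a full window of n historical frames exists
        # and the current confidence falls below their mean by `margin`.
        trigger = (len(history) == n and
                   current_max_confidence < sum(history) / n - margin)
        history.append(current_max_confidence)
        return trigger

    return should_update
```

Comparing against a windowed average rather than a fixed threshold matches the paper's observation that the *change* in confidence, not its absolute value, is the more reliable signal of degrading tracking quality.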
On VOT2018, Siam-DTF obtains the best accuracy of 0.608 and the second best EAO score of 0.403. As for tracking speed, Siam-DTF not only achieves a substantial performance improvement but also runs efficiently at 47 frames per second (FPS). These consistent results demonstrate the strong generalization ability of Siam-DTF. Conclusion We propose an efficient and robust Siamese tracker with double template fusion, referred to as Siam-DTF. Siam-DTF fully utilizes the information of historical frames to obtain an appearance template with good adaptability. Experiments on all three benchmarks demonstrate the effectiveness of Siam-DTF.
Keywords