Real-time target tracking with a Siamese guided anchor RPN
Shang Xinru1,2, Wen Yaole1,2, Xi Xuefeng1,3, Hu Fuyuan1,3 (1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009; 2. Suzhou Key Laboratory for Big Data and Information Service, Suzhou University of Science and Technology, Suzhou 215009; 3. Suzhou Key Laboratory of Virtual Reality Intelligent Interaction and Application Technology, Suzhou University of Science and Technology, Suzhou 215009) Abstract
Objective Combining the region proposal network (RPN) with the Siamese network for video target tracking has demonstrated high accuracy. However, the Siamese RPN (SiamRPN) tracker relies on a dense anchor box strategy, which produces a large number of redundant anchor boxes and degrades both tracking accuracy and speed. To address this problem, this paper proposes a Siamese guided anchor RPN (Siamese GA-RPN). Method The core idea of Siamese GA-RPN is to use semantic features to guide anchor generation. The guided anchoring network consists of a location prediction module and a shape prediction module, which use the semantic features produced by the convolutional neural network (CNN) in the Siamese network to predict, respectively, the locations of the anchor boxes and their width and height, reducing the generation of redundant anchors. A feature adaptation module is then designed that uses the shape information of each anchor box to refine the original feature map of the tracked target through a deformable convolution layer, reducing the inconsistency between target features and anchor information and improving tracking accuracy. Result Tracking experiments were conducted on three challenging video tracking benchmark datasets, VOT (visual object tracking) 2015, VOT2016, and VOT2017, testing the algorithm's performance in complex scenes such as fast target motion, occlusion, and illumination changes, and comparing it quantitatively with several strong algorithms on two evaluation metrics: accuracy and robustness. On VOT2015, the proposed algorithm improves accuracy by 1.72% and robustness by 5.17% over SiamRPN; on VOT2016, it improves accuracy by 3.6% and robustness by 6.6% over SiamRPN; in real-time experiments on VOT2017, it shows good real-time tracking performance. Conclusion The Siamese guided anchor RPN improves the effectiveness of anchor generation, ensures consistency between features and anchor boxes, achieves precise target localization, and largely resolves the influence of anchor box size on tracking accuracy. It remains robust and adaptive in complex scenes such as target scale changes, occlusion, illumination changes, and fast target motion.
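The guided-anchoring idea described above (keep only locations the location branch scores highly, and take each anchor's width and height from the shape branch instead of tiling fixed anchors everywhere) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature-map size, stride, threshold, and the random values standing in for the two learned prediction branches are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature-map size and branch outputs.
H, W = 17, 17
loc_prob = rng.random((H, W))                    # location branch: per-cell objectness probability
shape_wh = rng.uniform(16, 128, size=(H, W, 2))  # shape branch: predicted (w, h) per cell

# Guided anchoring: keep only cells whose predicted probability exceeds a
# threshold, instead of tiling k fixed anchors at every cell.
thresh = 0.9
ys, xs = np.nonzero(loc_prob > thresh)

stride = 8  # hypothetical feature stride (feature cell -> image pixels)
anchors = np.stack([
    xs * stride - shape_wh[ys, xs, 0] / 2,  # x1
    ys * stride - shape_wh[ys, xs, 1] / 2,  # y1
    xs * stride + shape_wh[ys, xs, 0] / 2,  # x2
    ys * stride + shape_wh[ys, xs, 1] / 2,  # y2
], axis=1)

dense_count = H * W * 5  # SiamRPN-style dense strategy: 5 anchors per cell
print(f"dense anchors: {dense_count}, guided anchors: {len(anchors)}")
```

The point of the sketch is the count comparison on the last line: the dense strategy always emits `H * W * k` boxes, while the guided strategy emits one box per retained location, with per-location shapes.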
Keywords
Target tracking system based on the Siamese guided anchor region proposal network
Shang Xinru1,2, Wen Yaole1,2, Xi Xuefeng1,3, Hu Fuyuan1,3(1.School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China;2.Suzhou Key Laboratory for Big Data and Information Service, Suzhou University of Science and Technology, Suzhou 215009, China;3.Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou University of Science and Technology, Suzhou 215009, China) Abstract
Objective Since the region proposal network (RPN) was combined with the Siamese network for video target tracking, improved trackers have been proposed in succession, all demonstrating relatively high accuracy. Through analysis and comparison, we found that the anchor box strategy of the RPN module in a Siamese RPN (SiamRPN) generates a large number of anchor boxes through a sliding window. The intersection over union (IoU) between anchor boxes is then computed to generate candidate regions, the target position is determined by the classifier, and the position is refined through bounding box regression. Although this method improves target tracking accuracy, it does not consider the semantic features of the target image, resulting in inconsistency between the anchor boxes and the features. It also generates a large number of redundant anchor boxes, which affects tracking accuracy and considerably increases the amount of computation. Method To solve this problem, this study proposes a Siamese guided anchor RPN (Siamese GA-RPN). The primary idea is to use semantic features to guide anchor generation; the template features are then convolved with the frame to be detected to obtain a response score map, and the tracking network is trained end to end. The guided anchoring network is designed with location and shape prediction branches. The two branches use the semantic features extracted by the convolutional neural network (CNN) in the Siamese network to predict the locations where the centers of objects of interest may exist and the scales and aspect ratios at those locations, reducing the generation of redundant anchors. A feature adaptation module is then designed.
This module uses a deformable convolution layer to modify the original feature map of the tracked target on the basis of the anchor shape information at each position, reducing the inconsistency between the features and the anchors and improving target tracking accuracy. Result Tracking experiments were performed on three challenging video tracking benchmark datasets: VOT (visual object tracking) 2015, VOT2016, and VOT2017. The algorithm's tracking performance was tested in complex scenes, such as fast target movement, occlusion, and lighting changes, and a quantitative comparison was made on two evaluation indexes: accuracy and robustness. On the VOT2015 dataset, the accuracy of the algorithm improved by 1.72% and its robustness by 5.17% compared with those of SiamRPN. On the VOT2016 dataset, accuracy improved by 3.6% and robustness by 6.6% over SiamRPN. Real-time experiments were performed on the VOT2017 dataset, where the proposed algorithm demonstrated a good real-time tracking effect. The algorithm was also compared with the fully convolutional Siamese network (SiamFC) and SiamRPN on four video sequences: rainy day, underwater, target occlusion, and poor lighting; it exhibited good tracking performance in all four scenarios. Conclusion The guided anchor RPN proposed in this study improves the effectiveness of anchor generation, ensures the consistency of features and anchors, achieves accurate target localization, and mitigates the influence of anchor box size on target tracking accuracy. The experimental results on the three video tracking benchmark datasets are better than those of several top-ranking video tracking algorithms with comprehensive performance and show good real-time behavior. The algorithm can still track the target accurately in complex video scenes involving changes in target scale, occlusion, changes in lighting conditions, and fast target movement, demonstrating strong robustness and adaptability.
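The candidate-region step in the Objective relies on intersection over union (IoU) between boxes. A self-contained sketch of that measure, with hypothetical box coordinates in (x1, y1, x2, y2) form:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # top-left of intersection
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # bottom-right of intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = np.array([10.0, 10.0, 50.0, 50.0])      # hypothetical ground-truth box
anchor = np.array([20.0, 20.0, 60.0, 60.0])  # hypothetical anchor box
print(round(iou(anchor, gt), 4))  # → 0.3913
```

Anchors whose IoU with the target exceeds a chosen threshold are treated as positive candidates; the rest are discarded or treated as background.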
Keywords