RGBT tracking based on dynamic modal interaction and adaptive feature fusion

Wang Futian1,2, Zhang Shuyun1, Li Chenglong1, Luo Bin1(1. Multimodal Cognitive Computation Laboratory, School of Computer Science and Technology, Anhui University, Hefei 230000, China; 2. Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230000, China)

Abstract
Objective Visible and thermal infrared data are strongly complementary, and RGBT (RGB-thermal) tracking has therefore attracted increasing attention. Traditional RGBT tracking methods simply fuse the features of the two modalities, which limits tracking performance to a certain extent. This paper proposes a dynamic interaction and fusion based method that collaboratively learns modality-specific and complementary representations for RGBT tracking.

Method First, the features of the different modalities interact to generate multi-modal features, and an attention mechanism is used in the modality-specific feature learning of each modality to improve discriminability (see the sketch below). Second, rich spatial and semantic information is obtained by fusing multi-modal features from different layers, and a complementary feature learning module is designed to learn the complementary features of the different modalities. Finally, a dynamic weighting loss is proposed to adaptively optimize the parameters of the whole network by constraining the consistency and uncertainty of the prediction results of the two modality-specific branches.

Result Experiments on two benchmark RGBT tracking datasets show that our method achieves a precision rate (PR) of 79.2% and a success rate (SR) of 55.8% on the RGBT234 dataset, and a PR of 86.1% and an SR of 70.9% on the GTOT (grayscale-thermal object tracking) dataset. Comparative experiments on RGBT234 and GTOT further verify the effectiveness of the algorithm and show that our method improves RGBT tracking results.

Conclusion The proposed RGBT tracking algorithm effectively exploits the complementarity between the two modalities and achieves good tracking accuracy.
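To make the interaction and attention steps concrete, here is a minimal PyTorch sketch. It is our own illustration, not the authors' released code: the module name ModalityInteraction, the element-wise multiplicative interaction, and the squeeze-and-excitation style channel attention are assumptions about one plausible realization, with feature maps of shape (B, C, H, W).

```python
import torch
import torch.nn as nn


class ModalityInteraction(nn.Module):
    """Sketch of the interaction step: element-wise multiplication of the
    RGB and thermal features yields the multi-modal (cross-modality)
    features, while channel attention sharpens each modality-specific
    branch. Hypothetical design, not the authors' code."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()

        def channel_attention() -> nn.Sequential:
            # Squeeze-and-excitation style attention over channels.
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        self.attn_rgb = channel_attention()
        self.attn_t = channel_attention()

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor):
        # Multiplicative interaction keeps responses supported by both
        # modalities and suppresses single-modality clutter noise.
        f_cross = f_rgb * f_t
        # Attention improves the discriminability of each
        # modality-specific feature map.
        f_rgb = f_rgb * self.attn_rgb(f_rgb)
        f_t = f_t * self.attn_t(f_t)
        return f_rgb, f_t, f_cross
```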
RGBT tracking based on dynamic modal interaction and adaptive feature fusion

Wang Futian1,2, Zhang Shuyun1, Li Chenglong1, Luo Bin1(1.Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230000, China;2.Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230000, China)

Abstract
Objective Visual target tracking has wide applications in computer vision, such as video surveillance, autonomous driving systems, and human-computer interaction. Thermal infrared cameras have the advantages of a long operating range, strong penetrating ability, and the ability to reveal hidden objects. As a branch of visual tracking, RGBT (RGB-thermal) tracking aims to estimate the state of the target in a video sequence by aggregating complementary data from two different modalities, given the ground-truth bounding box in the first frame of the sequence. Previous RGBT tracking algorithms are either constrained by traditional handcrafted features or insufficient in exploring and utilizing the complementary information of the different modalities. To exploit the complementary information between the two modalities, we propose a dynamic interaction and fusion method for RGBT tracking.

Method In general, RGB images capture the visual appearance of the target (e.g., colors and textures), while thermal images acquire temperature information that is robust to illumination conditions and background clutter. To obtain more powerful representations, we introduce useful information from the other modality. However, simply fusing the modalities by addition or concatenation, as is common, is suboptimal because the obtained modality features contain noisy information. First, a modality interaction module based on a multiplication operation is designed to suppress clutter noise. Second, a fusion module is designed to gather the cross-modality features of all layers; it captures different abstractions of the target representation for more accurate localization. Third, a gate-mechanism-guided learning structure computes the complementary features of the different modalities. The gate takes as input the modality-specific features and the cross-modality features obtained from the fusion module, and outputs a numerical value; the complementary features are obtained by multiplying this value with the cross-modality features (a sketch is given below). Finally, a dynamic weighting loss is presented to adaptively optimize the network parameters under constraints on the consistency and uncertainty of the prediction results of the two modality-specific branches. Our method is evaluated on two standard RGBT tracking datasets, GTOT (grayscale-thermal object tracking) and RGBT234, using two evaluation metrics, precision rate (PR) and success rate (SR), to measure tracking performance. Our model is built with the open-source PyTorch toolbox and optimized with stochastic gradient descent; the implementation runs on a platform with a 4.2 GHz Intel Core i7-7700K CPU and an NVIDIA GeForce GTX 1080Ti GPU.
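The gate-guided complementary learning and the cross-layer fusion can be sketched as follows. This is illustrative code under stated assumptions, not the authors' implementation: the names fuse_layers and ComplementaryGate, the pooling-plus-MLP gate design, and bilinear resizing are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_layers(cross_feats, out_size):
    """Gather the cross-modality features of all layers: resize each map
    to a common resolution and concatenate along the channel axis."""
    resized = [
        F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
        for f in cross_feats
    ]
    return torch.cat(resized, dim=1)


class ComplementaryGate(nn.Module):
    """The gate reads a modality-specific feature together with the fused
    cross-modality feature and emits one value in (0, 1); multiplying the
    cross-modality feature by this value gives the complementary feature."""

    def __init__(self, specific_channels: int, cross_channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(specific_channels + cross_channels, cross_channels),
            nn.ReLU(inplace=True),
            nn.Linear(cross_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_specific: torch.Tensor, f_cross: torch.Tensor):
        # The two inputs must share spatial size before concatenation.
        g = self.gate(torch.cat([f_specific, f_cross], dim=1))  # (B, 1)
        return g.view(-1, 1, 1, 1) * f_cross
```

In use, fuse_layers would merge the cross-modality features produced at each backbone stage, and one ComplementaryGate per modality would then weight the fused feature to form that modality's complementary representation.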
Result We conduct extensive comparative experiments on the RGBT234 and GTOT datasets. On the GTOT dataset, our method (86.1%, 70.9%) exceeds the baseline tracker (80.6%, 65.6%) by 5.5% in PR and 5.3% in SR. On the RGBT234 dataset, our method (79.2%, 55.8%) is 7.0% higher in PR and 6.3% higher in SR than the baseline tracker. Compared with the second-best tracking method on RGBT234, our method is 2.6% higher in PR than DAPNet (76.6%) and 2.1% higher in SR than DAPNet (53.7%). We also conduct component analysis experiments on both datasets, and the results illustrate that each module improves tracking performance.

Conclusion Our RGBT target tracking algorithm obtains rich semantic and spatial information through the modality interaction and fusion modules, and uses the gate mechanism to explore the complementarity between the different modalities. The dynamic weighting loss adaptively optimizes the parameters of the model according to the constraints on the prediction results of the two modality-specific branches.
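The abstract does not give the exact form of the dynamic weighting loss, so the following is only a plausible sketch: it assumes prediction entropy as the uncertainty measure and a symmetric KL term as the consistency constraint, and the function name dynamic_weighting_loss and the whole formulation are our assumptions.

```python
import torch
import torch.nn.functional as F


def dynamic_weighting_loss(logits_rgb, logits_t, target):
    """Hypothetical dynamic weighting loss: the branch with lower
    prediction entropy (less uncertainty) gets a larger weight, and a
    symmetric KL term constrains the two branches to stay consistent."""
    p_rgb = F.softmax(logits_rgb, dim=1)
    p_t = F.softmax(logits_t, dim=1)

    # Per-sample uncertainty of each branch, measured by entropy.
    h_rgb = -(p_rgb * p_rgb.clamp_min(1e-8).log()).sum(dim=1)
    h_t = -(p_t * p_t.clamp_min(1e-8).log()).sum(dim=1)

    # Lower entropy -> higher weight (softmax over negative entropies).
    w = F.softmax(torch.stack([-h_rgb, -h_t]), dim=0)  # (2, B)

    ce_rgb = F.cross_entropy(logits_rgb, target, reduction="none")
    ce_t = F.cross_entropy(logits_t, target, reduction="none")
    weighted = (w[0] * ce_rgb + w[1] * ce_t).mean()

    # Consistency between the two branch predictions (symmetric KL).
    consistency = 0.5 * (
        F.kl_div(p_rgb.clamp_min(1e-8).log(), p_t, reduction="batchmean")
        + F.kl_div(p_t.clamp_min(1e-8).log(), p_rgb, reduction="batchmean")
    )
    return weighted + consistency
```

Here logits_rgb and logits_t stand for the classification scores that the two modality-specific branches produce for the same candidate samples.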