Transformer object tracking method fusing context-aware attention

Xu Han, Dong Shihao, Zhang Jiawei, Zheng Yuhui (School of Computer Science, Nanjing University of Information Science and Technology)

Abstract
Objective In recent years, Transformer trackers have made breakthrough progress, in which the self-attention mechanism plays an important role. However, the independent correlation computation in the self-attention mechanism tends to produce ambiguous weights, which limits tracking performance. To address this, a Transformer object tracking method fusing context-aware attention is proposed. Method First, SwinTransformer (hierarchical vision transformer using shifted windows) is introduced to extract visual features, and a cross-scale strategy is used to integrate deep and shallow feature information, improving the network's ability to represent targets in complex scenes. Second, an encoder-decoder based on context-aware attention is constructed to fully fuse template features and search features. The context-aware attention uses nested attention computation and incorporates a weight-assigning target mask, which effectively suppresses the noise caused by inaccurate correlation computation. Finally, a corner prediction head is used to estimate the target bounding box, and the template image is updated according to the similarity score. Results Extensive tests on several public datasets, including TrackingNet (large-scale object tracking dataset), LaSOT (large-scale single object tracking), and GOT-10K (generic object tracking benchmark), all show excellent performance. On GOT-10K, the average overlap reaches 73.9%, ranking first among all compared methods. On LaSOT, the AUC (area under curve) score and precision are 0.687 and 0.749, which are 1.1% and 2.7% higher than ToMP (transforming model prediction for tracking), the second-best method. On TrackingNet, the AUC score and precision are 0.831 and 0.807, 0.8% and 0.3% higher than the second-best method. Conclusion The proposed method uses context-aware attention to focus on the target information in the feature sequence, improving the accuracy of vector interaction; it effectively handles fast motion, interference from similar objects, and other challenges, and improves tracking performance.
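As an illustration of the nested attention computation and target-mask weighting described in the Method above, the following is a minimal PyTorch sketch. The abstract does not give the exact formulation, so the second-level "row agreement" term and the logarithmic mask re-weighting here are assumptions for illustration only:

    import math
    import torch
    import torch.nn.functional as F

    def context_aware_attention(q, k, v, target_mask=None):
        # q, k, v: (B, N, C) token sequences from the fused template/search features.
        # target_mask: (B, N) weights in [0, 1] marking likely target tokens.
        d = q.size(-1)
        # First-level, independent query-key correlation (standard attention logits).
        logits = q @ k.transpose(-2, -1) / math.sqrt(d)      # (B, N, N)
        # Nested second-level correlation: compare entire correlation rows, so a
        # query-key pair is reinforced only when its surrounding context agrees.
        ctx = F.softmax(logits, dim=-1)
        logits = logits + ctx @ ctx.transpose(-2, -1)        # (B, N, N)
        # The target mask re-weights keys, suppressing background noise.
        if target_mask is not None:
            logits = logits + torch.log(target_mask.unsqueeze(1) + 1e-6)
        return F.softmax(logits, dim=-1) @ v                 # (B, N, C)

    # Example: two sequences of 64 tokens with width 256.
    out = context_aware_attention(torch.randn(2, 64, 256), torch.randn(2, 64, 256),
                                  torch.randn(2, 64, 256), torch.rand(2, 64))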
Keywords

Context-aware attention fused transformer tracking

Xu Han, Dong Shihao, Zhang Jiawei, Zheng Yuhui (School of Computer Science, Nanjing University of Information Science and Technology)

Abstract
Objective Visual object tracking, one of the key tasks in computer vision, aims to predict the size and position of a target in a given video sequence. In recent years, object tracking has been widely applied in autonomous driving, unmanned aerial vehicles (UAVs), military activities, intelligent surveillance, and other fields. Although numerous excellent methods have emerged in this field, multifaceted challenges remain, including but not limited to shape variation, occlusion, motion blur, and interference from nearby objects. Current tracking methods fall into two main categories: correlation filtering-based and deep learning-based. The former approximates the tracking process as a computation in the signal domain of the search image. However, its hand-crafted features make it difficult to fully exploit the representational information of images, which greatly limits tracking performance. Deep learning, by virtue of its powerful visual representation capability, has made significant progress in object tracking. In particular, Transformer trackers have achieved breakthroughs, in which the self-attention mechanism plays an important role. Currently, the independent correlation calculation in the self-attention mechanism is prone to producing ambiguous weights, which hampers the tracker's overall performance. For this reason, a Transformer object tracking method incorporating context-aware attention is proposed.

Method First, the hierarchical vision Transformer using shifted windows (SwinTransformer) is introduced to extract visual features, and a cross-scale strategy is utilized to integrate deep and shallow feature information, improving the network's ability to characterize targets in complex scenes. The cross-scale fusion strategy obtains key information at different scales and captures the diverse texture features of the template and search images, which helps the tracking network better understand the relationship between the target and the background. Second, an encoder-decoder based on context-aware attention is constructed to fully fuse the template features and search features. To address the inaccurate correlation computation that arises in the attention mechanism, a nested computation is applied to query-key pairs so as to focus on the target information in the input sequence, and a target mask that assigns weights is incorporated; this effectively suppresses the noise caused by inaccurate correlation computation, seeks consistency among the feature vectors, and promotes better interaction of feature information. The encoder takes the features output by the backbone as input and uses global contextual information to reinforce the original features, enabling the model to learn discriminative features for object localization. The decoder takes the target query and the sequence of enhanced features from the encoder as input, adopting a two-branch cross-attention design, in which one branch computes cross attention between the target query and the encoder output, attending to features at all locations of the template and search regions. Finally, a corner prediction head is used to estimate the target bounding box, and the template image is updated according to the similarity score. Specifically, the decoded features are fed into a fully convolutional network that outputs two probability maps, one for the top-left and one for the bottom-right corner of the target bounding box; the predicted box coordinates are then obtained by computing the expectation of the two corner probability distributions.
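The corner expectation just described amounts to a soft-argmax over each probability map. The following is a minimal sketch, assuming PyTorch and softmax-normalized H × W maps; the function and tensor names are hypothetical:

    import torch

    def box_from_corner_probmaps(p_tl, p_br):
        # p_tl, p_br: (B, H, W) softmax-normalized probability maps for the
        # top-left and bottom-right corners produced by the corner head.
        B, H, W = p_tl.shape
        ys = torch.arange(H, dtype=p_tl.dtype, device=p_tl.device)
        xs = torch.arange(W, dtype=p_tl.dtype, device=p_tl.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")  # coordinate grids

        def expectation(p):
            # E[x] and E[y] under the corner's probability distribution.
            return (p * gx).flatten(1).sum(1), (p * gy).flatten(1).sum(1)

        x1, y1 = expectation(p_tl)
        x2, y2 = expectation(p_br)
        return torch.stack([x1, y1, x2, y2], dim=1)  # (B, 4) in map coordinates

Taking an expectation rather than an argmax keeps the box prediction differentiable, so the corner head can be trained end-to-end.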
Results To train the tracking model, training pairs are randomly sampled from the common objects in context (COCO), large-scale object tracking dataset (TrackingNet), large-scale single object tracking (LaSOT), and generic object tracking benchmark (GOT-10K) datasets. The minimum training unit of the model is a triplet consisting of two template images and one search image. The model was trained for 500 epochs with 6 × 10^4 triplets per epoch. The initial learning rates of the backbone network and the remaining parameters are 10^(-5) and 10^(-4), respectively; after 400 epochs, the learning rates are reduced by a factor of 10. Extensive tests were conducted on the TrackingNet, LaSOT, GOT-10K, online object tracking benchmark (OTB100), UAV tracking benchmark and simulator (UAV123), and need for speed (NfS) public datasets, and the proposed method was compared with several current state-of-the-art trackers, achieving excellent performance throughout. On GOT-10K, the average overlap reaches 73.9%, and SR_0.5 and SR_0.75 reach 84.6% and 69.8%, where SR_0.5 and SR_0.75 denote the success rates at overlap thresholds of 0.5 and 0.75, respectively. On LaSOT, the area under curve (AUC) is 68.7%, and the precision rate (PR) and normalized precision rate (NPR) are 78.7% and 74.3%. On TrackingNet, the success rate is 83.1%, the normalized precision rate is 87.7%, and the precision rate is 80.7%. The success rates on the NfS, OTB100, and UAV123 datasets are 68.1%, 69.6%, and 68.3%, respectively. These experimental results demonstrate that the proposed method generalizes well. To verify the effectiveness of each component, ablation experiments were carried out on the GOT-10K, LaSOT, and TrackingNet datasets. Using three feature extraction networks (ResNet-50, SwinTrack-Base, and the cross-scale fusion SwinTransformer), configurations with and without the context-aware attention module were compared. The comparison of the final results shows that adding the context-aware attention module to the SwinTransformer effectively improves tracking performance.

Conclusion The proposed method utilizes context-aware attention to focus on the target information in the feature sequence, which improves the accuracy of vector interaction. It effectively copes with fast motion, interference from similar objects, and related challenges, and improves tracking performance. However, the proposed method uses Transformers in both the feature extraction and feature fusion stages, which leads to a large number of parameters and longer training time, resulting in low computational efficiency. In future work, merging the two stages to integrate feature extraction and fusion will be considered.
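For concreteness, the two-group learning-rate schedule reported in the Results above could be set up as sketched below. This assumes PyTorch; the AdamW optimizer and the module names are assumptions, as the abstract does not specify them:

    import torch
    from torch import nn, optim

    # Hypothetical stand-ins for the tracker's backbone and remaining modules.
    backbone, rest = nn.Linear(8, 8), nn.Linear(8, 4)

    # Two parameter groups: 10^-5 for the backbone, 10^-4 for the remainder.
    optimizer = optim.AdamW([
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": rest.parameters(), "lr": 1e-4},
    ])

    # Both rates drop by a factor of 10 after epoch 400 of the 500-epoch run.
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[400], gamma=0.1)

    for epoch in range(500):
        # ... train on 6 x 10^4 (template, template, search) triplets here ...
        scheduler.step()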
Keywords
