Multi-modal visual tracking: a survey

Li Chenglong1, Lu Andong2, Liu Lei2, Tang Jin2 (1. School of Artificial Intelligence, Anhui University, Hefei 230601, China; 2. School of Computer Science and Technology, Anhui University, Hefei 230601, China)

Abstract
Object tracking is a frontier and hot topic in computer vision research, with important applications in fields such as security surveillance and autonomous driving. However, current visual tracking methods based on visible-light data struggle to achieve robust tracking under illumination changes and adverse weather because of limited data quality. Some researchers have therefore proposed multi-modal visual tracking, which introduces additional modalities, including thermal infrared, depth, event, and language data, to compensate to some extent for the weaknesses of the visible modality under adverse weather, occlusion, fast motion, and appearance ambiguity. Multi-modal visual tracking aims to exploit the complementary advantages of visible and other modalities to locate targets robustly in videos; it is of great value for all-day, all-weather perception and has attracted increasing research attention. Since mainstream multi-modal tracking methods target visible-infrared tracking, this survey focuses on visible-infrared tracking methods. From the perspective of information fusion, we divide existing methods into combinative fusion and discriminative fusion, introduce and analyze each category in detail, and compare the strengths and weaknesses of the different methods. We then review research on the other multi-modal visual tracking tasks and compare their advantages and disadvantages. Finally, we summarize multi-modal visual tracking methods and discuss future directions.
Visual tracking has been one of the fundamental tasks in computer vision over the past decades, with applications in surveillance, robotics, and autonomous driving. However, trackers that rely on visible-light data alone are still challenged by the limited quality of visible images in adverse scenes such as low illumination, background clutter, haze, and smog. To overcome these imaging constraints, current research commonly introduces additional modalities, including thermal infrared, depth, event, and language data, and fuses them with the visible modality to improve tracking performance. Benefiting from the complementary strengths of visible and auxiliary modalities, multi-modal trackers have developed rapidly for complicated scenarios such as low illumination, occlusion, fast motion, and semantic ambiguity.

Since visible-infrared tracking is the most studied multi-modal tracking task, this survey focuses on RGB and thermal infrared (RGBT) tracking algorithms. Existing surveys usually categorize tracking algorithms by tracking framework or by fusion level (pixel, feature, and decision). Because information fusion plays the key role in multi-modal visual tracking, we instead divide and analyze existing RGBT tracking methods from the perspective of information fusion, into combinative fusion and discriminative fusion.

Combinative fusion combines all multi-modal information through different fusion models, including 1) sparse representation fusion, 2) collaborative graph representation fusion, 3) modality-shared and modality-specific information fusion, and 4) attribute-based feature decoupling fusion. First, sparse representation fusion suppresses feature noise well, but most of these algorithms are limited by the time-consuming online optimization of sparse representation models; in addition, they represent the target directly with pixel values and are therefore not robust in complex scenes. Second, collaborative graph representation fusion suppresses background clutter through modality weights and local image-patch weights; however, it requires iterative optimization over multiple variables, so tracking efficiency is quite low, and its color and gradient features, although better than raw pixel values, still struggle in challenging scenarios. Third, modality-shared and modality-specific information fusion models shared and specific representations with separate sub-networks and provides an effective fusion strategy for tracking (see the sketch below); however, it lacks cross-modal interaction when learning modality-specific representations and thus easily introduces noise and redundancy. Fourth, attribute-based feature decoupling fusion models target representations under different attributes and alleviates the dependence on large-scale training data, but it is difficult to cover all challenges that arise in practical applications. Although combinative fusion methods achieve good tracking performance, combining all information from multiple modalities inevitably introduces information redundancy and feature noise.
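To make the third category concrete, the following minimal PyTorch sketch shows one way to combine a modality-shared branch with two modality-specific branches. It is illustrative code under simple assumptions (single convolutional blocks, spatially aligned RGB and thermal inputs), not the implementation of any particular tracker, and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class SharedSpecificFusion(nn.Module):
    """Sketch of modality-shared and modality-specific fusion.

    One branch is shared by both modalities to capture common cues,
    while each modality keeps a private branch for its specific cues.
    The outputs are concatenated and compressed into a single fused
    feature map that a tracking head could consume.
    """

    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        def conv_block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.shared = conv_block(in_ch, feat_ch)        # applied to both modalities
        self.rgb_specific = conv_block(in_ch, feat_ch)  # RGB-only cues
        self.tir_specific = conv_block(in_ch, feat_ch)  # thermal-only cues
        # 4 * feat_ch: shared(RGB) + shared(TIR) + specific(RGB) + specific(TIR)
        self.fuse = conv_block(4 * feat_ch, feat_ch)

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        feats = [
            self.shared(rgb), self.shared(tir),              # modality-shared
            self.rgb_specific(rgb), self.tir_specific(tir),  # modality-specific
        ]
        return self.fuse(torch.cat(feats, dim=1))

# Usage: fuse an aligned RGB/thermal pair (toy sizes; thermal replicated to 3 channels).
if __name__ == "__main__":
    rgb = torch.randn(1, 3, 128, 128)
    tir = torch.randn(1, 3, 128, 128)
    fused = SharedSpecificFusion()(rgb, tir)
    print(fused.shape)  # torch.Size([1, 64, 128, 128])
```

Note that the specific branches here never see the other modality, which mirrors the limitation mentioned above: without cross-modal interaction, low-quality modality-specific features pass into the fusion unchecked.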
To address these problems, other research explores discriminative fusion for RGBT tracking. This line of work mines the discriminative features of each modality for effective and efficient information fusion, including 1) feature selection fusion, 2) attention-based adaptive fusion, 3) mutual enhancement fusion, and 4) other discriminative fusion schemes. Feature selection fusion selects discriminative features according to predefined rules; it avoids interference from data noise and eliminates data redundancy, which benefits tracking performance, but suitable selection criteria are hard to design, and an unsuitable criterion often removes useful information from low-quality data and thus limits performance. Attention-based adaptive fusion estimates the reliability of multi-modal data with attention mechanisms, including modality, spatial, and channel reliabilities, and thereby fuses multi-modal information adaptively (a minimal sketch is given below); however, there is no explicit supervision for these reliability weights, so the estimates can be misled in complex scenarios. Mutual enhancement fusion suppresses the data noise of a low-quality modality and enhances its features with the discriminative information of the other modality; such methods mine modality-specific information and improve the target representation of low-quality modalities, but they are complicated and have low tracking efficiency.

Besides RGBT tracking, multi-modal visual tracking covers three further sub-tasks: 1) visible-depth tracking (RGB and depth, RGBD), 2) visible-event tracking (RGB and event, RGBE), and 3) visible-language tracking (RGB and language, RGBL). We briefly review these three tasks as well. Finally, we discuss remaining challenges and future directions for multi-modal visual tracking.
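As an illustration of attention-based adaptive fusion, the sketch below predicts per-modality reliability weights from globally pooled features and re-weights the two feature maps before summing them. This is again illustrative code under simple assumptions (modality weights only; channel and spatial attention would follow the same pattern), with hypothetical names, not a specific published method.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionFusion(nn.Module):
    """Sketch of attention-based adaptive fusion via modality weights.

    A small network maps pooled RGB and thermal descriptors to two
    reliability weights, which scale the corresponding feature maps.
    As noted in the text, these weights have no explicit supervision:
    they are learned only through the downstream tracking loss.
    """

    def __init__(self, feat_ch: int = 64):
        super().__init__()
        # Maps the concatenated pooled descriptors to two reliability logits.
        self.weight_net = nn.Sequential(
            nn.Linear(2 * feat_ch, feat_ch),
            nn.ReLU(inplace=True),
            nn.Linear(feat_ch, 2),
        )

    def forward(self, f_rgb: torch.Tensor, f_tir: torch.Tensor) -> torch.Tensor:
        # Global average pooling -> one (B, C) descriptor per modality.
        g_rgb = f_rgb.mean(dim=(2, 3))
        g_tir = f_tir.mean(dim=(2, 3))
        # Softmax turns the logits into reliability weights that sum to 1.
        w = torch.softmax(self.weight_net(torch.cat([g_rgb, g_tir], dim=1)), dim=1)
        w_rgb, w_tir = w[:, 0], w[:, 1]
        # Broadcast the scalar weights over channels and spatial positions.
        return (w_rgb.view(-1, 1, 1, 1) * f_rgb
                + w_tir.view(-1, 1, 1, 1) * f_tir)

# Usage: weight two aligned feature maps by estimated modality reliability.
if __name__ == "__main__":
    f_rgb = torch.randn(2, 64, 32, 32)
    f_tir = torch.randn(2, 64, 32, 32)
    fused = AdaptiveAttentionFusion()(f_rgb, f_tir)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])
```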
Keywords
