Proposal-free video grounding based on motion excitation
Abstract
Objective  Video grounding is an important and challenging task in video understanding. Given a query described in natural language, the task is to localize the video segment described by the text in an untrimmed video. Because the language modality and the video modality differ greatly in feature representation, building an appropriate video-text multi-modal feature representation and localizing the target segment accurately and efficiently is the key difficulty of this task. To address this problem, this paper focuses on constructing an optimized video-text multi-modal feature representation: the motion information in the video is used to excite the motion-related semantics in the multi-modal features, and video grounding is performed in a proposal-free manner. Method  Multiple phrase features are extracted from the natural language query with a self-attention based method and fused with the video features across modalities, yielding multiple multi-modal features that attend to different semantic phrases. To optimize the multi-modal feature representation, it is modeled along two aspects, the temporal dimension and the feature channels: 1) on the temporal dimension, skip-connection convolution, i.e., 1D temporal convolution, models the local context of motion and aligns the semantic phrases with video clips; 2) on the feature channels, motion excitation computes the differences between temporally adjacent multi-modal feature vectors to construct a channel weight distribution that responds to motion, thereby exciting the channels that carry motion information in the multi-modal features. To aggregate the multi-modal features attending to different semantic phrases, a non-local neural network models the dependencies among the semantic phrases, a temporal attentive pooling module fuses the multi-modal features into a single vector, and the start and end times of the target segment are regressed from it. Result  The effectiveness of the proposed method is verified on multiple datasets. On the Charades-STA and ActivityNet Captions datasets, the mean intersection over union (mIoU) of the model reaches 52.36% and 42.97%, respectively, and the recall R@1 (Recall@1) at IoU thresholds of 0.3, 0.5, and 0.7 reaches 73.79%, 61.16%, and 52.36% on Charades-STA and 60.54%, 43.68%, and 25.43% on ActivityNet Captions. Compared with methods such as LGI (local-global video-text interactions) and CPNet (contextual pyramid network), the proposed method achieves clear performance gains. Conclusion  This paper proposes to optimize the video-text multi-modal feature representation for video grounding with motion excitation. Experimental results on multiple datasets demonstrate that motion-excited features better represent the matching information between video segments and language queries.
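To make the motion-excitation step concrete, the following is a minimal PyTorch sketch of the idea described in the Method above: channel attention weights are derived from differences between temporally adjacent multi-modal feature vectors and then used to re-weight (excite) the channels. The module name, the channel-reduction ratio, and the zero-padding of the last time step are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class MotionExcitation(nn.Module):
    """Sketch of channel-wise motion excitation over multi-modal features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Linear(channels, mid)   # channel reduction
        self.expand = nn.Linear(mid, channels)    # restore channel dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) multi-modal features over T video clips
        z = self.squeeze(x)                                            # (B, T, C/r)
        diff = z[:, 1:] - z[:, :-1]                                    # differences of adjacent steps
        diff = torch.cat([diff, torch.zeros_like(z[:, :1])], dim=1)    # zero-pad the last step
        attn = torch.sigmoid(self.expand(diff))                        # (B, T, C) channel weights
        return x + x * attn                                            # residual channel excitation


# Usage: excite a batch of 2 samples with 64 clips and 512-d features
feats = torch.randn(2, 64, 512)
out = MotionExcitation(512)(feats)    # same shape: (2, 64, 512)
```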
Keywords  video grounding; motion excitation; multi-modal feature representation; proposal-free; computer vision; video understanding
Proposal-free video grounding based on motion excitation
Guo Yichen1,2, Li Kun1,2, Guo Dan1,2
1. School of Computer and Information Engineering, Hefei University of Technology, Hefei 230601, China; 2. Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230601, China
Abstract
Objective  Video grounding is an essential and challenging task in video understanding. Given a natural language query that describes a particular segment of an untrimmed video, the goal of video grounding is to locate the corresponding action segment in that video. As a high-level semantic understanding task in computer vision, video grounding faces many challenges because it requires joint modeling of the visual and linguistic modalities. First, compared with static images, real-world videos usually contain more complicated scenes: a video of a few minutes is often composed of several action scenarios, each an interplay of actors, objects, and motions. Second, natural language is inevitably ambiguous and subjective to some extent, and descriptions of the same activity may diverge. Intuitively, there is a large semantic gap between the visual and textual modalities. Therefore, an appropriate video-text multi-modal feature representation must be built for accurate grounding. To address these challenges, we propose a novel proposal-free method that learns such a representation with motion excitation. Specifically, motion excitation is exploited to highlight motion cues in the multi-modal features for accurate grounding. Method  The proposed method consists of three key modules: 1) feature extraction, 2) feature optimization, and 3) boundary prediction. First, in the feature extraction module, a 3D convolutional neural network (CNN) and a bi-directional long short-term memory (BiLSTM) layer are used to obtain the video and query features. To capture fine-grained semantic cues from the language query, we extract attention-based phrase-level features of the query. Video-text multi-modal features that focus on multiple semantic phrases are then obtained by fusing the phrase-level features with the video features. Subsequently, the feature optimization module highlights the motion information in these multi-modal features. The features contain contextual motion cues along the temporal dimension; meanwhile, some of their channels represent the dynamic motion pattern of the target moment, whereas the remaining channels carry irrelevant, redundant information. To optimize the multi-modal feature representation with motion information, skip-connection convolution and motion excitation are used in this module. 1) For skip-connection convolution, a 1D temporal convolution network models the local context of motion and aligns it with the query along the temporal dimension. 2) For motion excitation, the differences between temporally adjacent multi-modal feature vectors are calculated, an attention weight distribution responding to motion channels is constructed, and the motion-sensitive channels are activated. Finally, we aggregate the multi-modal features that focus on different semantic phrases: a non-local neural network models the dependencies among the semantic phrases, a temporal attentive pooling module aggregates the features into a single vector, and a multilayer perceptron (MLP) regresses the temporal boundaries. Result  Extensive experiments on two public datasets, the Charades-STA dataset and the ActivityNet Captions dataset, verify the effectiveness of the proposed method. In terms of mean intersection over union (mIoU), the method reaches 52.36% and 42.97% on these two datasets, respectively. In addition, R@1 at IoU thresholds of {0.3, 0.5, 0.7} reaches 73.79%, 61.16%, and 52.36% on Charades-STA and 60.54%, 43.68%, and 25.43% on ActivityNet Captions. The method is also compared with local-global video-text interactions (LGI) and contextual pyramid network (CPNet), and the experimental results show that it achieves significant performance improvements over these methods. Conclusion  To handle the complicated scenes of videos and bridge the gap between video and language, we enhance the motion patterns relevant to video grounding. Skip-connection convolution and motion excitation effectively optimize the video-text multi-modal feature representation. In this way, the model can more accurately represent the semantic matching information between video clips and text queries.
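As a rough illustration of the proposal-free boundary prediction summarized above, the sketch below applies temporal attentive pooling to the aggregated multi-modal features and regresses normalized (start, end) times with an MLP. The hidden sizes, the attention parameterization, and the sigmoid-bounded output are assumptions made for this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class AttentivePoolingRegressor(nn.Module):
    """Temporal attentive pooling followed by MLP regression of (start, end)."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) aggregated multi-modal features
        w = torch.softmax(self.attn(x), dim=1)    # (B, T, 1) temporal attention weights
        pooled = (w * x).sum(dim=1)               # (B, D) attentive pooling over time
        return self.mlp(pooled)                   # (B, 2) normalized start/end times in [0, 1]
```

In practice such a regression head is typically trained with a smooth-L1 or IoU-style loss against the ground-truth normalized boundaries; the specific loss used by the authors is not stated in this abstract.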
Keywords
video grounding; motion excitation; multi-modal feature representation; proposal-free; computer vision; video understanding