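Video action quality assessment with label distribution and spatio-temporal attention awareness
Zhang Yu1,2, Xu Tianyu1,2, Mi Siya3,4 (1. School of Computer Science and Engineering, Southeast University, Nanjing 211189; 2. School of Software Engineering, Southeast University, Nanjing 211189; 3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189; 4. Purple Mountain Laboratory, Nanjing 211111)
Abstract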
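Objective Video action quality assessment aims to evaluate how well a specific action in a video is executed and completed. Automated action quality assessment can effectively reduce the human effort required and evaluate video content more accurately and impartially. Traditional action quality assessment methods mainly suffer from the following problems: 1) the multi-scale spatio-temporal features of the action subject in a video; 2) the inherent ambiguity of labels caused by cognitive differences; 3) the redundancy of attention heads in the multi-head self-attention mechanism. To address these problems, we propose SALDL (self-attention and label distribution learning), an action quality assessment model that attends to different spatio-temporal locations in a video sequence and generates fine-grained labels. Method SALDL introduces the Attention-Inc (attention-inception) structure, which uses embedding, multi-head self-attention, and a multi-layer perceptron to progressively incorporate the self-attention mechanism into the Inception structure, so that the model can capture contextual information between convolutional features at different scales. A pos-neg temporal attention (PNTA) module is also proposed; it mines temporal attention features through a PNTA loss, thereby reducing the redundancy of self-attention heads and extracting attention features from different segments. SALDL generates fine-grained action quality labels through label enhancement and label distribution learning. Result Extensive comparison and ablation experiments were conducted on datasets including MTL-AQA (multitask learning-action quality assessment) and JIGSAWS (JHU-ISI gesture and skill assessment working set), with Spearman rank correlation coefficients of 0.941 6 and 0.818 3, respectively. Conclusion SALDL addresses the multi-scale spatio-temporal feature problem by fully mining spatio-temporal features at different scales, and it addresses the inherent ambiguity of labels and the redundancy of attention heads by introducing prior knowledge that labels follow a distribution and using it for label enhancement.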
Keywords
Label distribution learning and spatio-temporal attentional awareness for video action quality assessment
Zhang Yu1,2, Xu Tianyu1,2, Mi Siya3,4 (1. School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; 2. School of Software Engineering, Southeast University, Nanjing 211189, China; 3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China; 4. Purple Mountain Laboratory, Nanjing 211111, China)
Abstract
Objective Video action quality assessment aims to evaluate how well specific actions in a video are executed and completed. Automated action quality assessment can substantially reduce the human effort required and produce more accurate and impartial evaluations of video content. Traditional video action quality assessment methods mainly suffer from three problems. First, they struggle with multi-scale spatial and temporal features. The spatial and temporal location of the action in a video is critical for action quality assessment, yet a sample video contains much information unrelated to the action. In the spatial dimension, the action subject may appear at different scales in different videos, which makes action information harder to capture. In the temporal dimension, actions differ in duration and execution rate, and different time segments correlate with the label to different degrees. Second, existing methods ignore the inherent ambiguity of labels caused by cognitive differences. They tend to focus on a single score label and overlook that different judges may give different scores and that each score is subjective. For example, diving scores are given by seven different judges rather than determined by a single label. Third, current attention mechanisms suffer from redundancy among their self-attention heads. Previous studies have employed many self-attention heads, but these heads become redundant during training, and removing the majority of them does not significantly affect model performance.
Experiments show that increasing the number of heads only degrades action quality assessment performance. To address these problems, this paper proposes self-attention and label distribution learning (SALDL), an action quality assessment model that focuses on different spatio-temporal locations in video sequences and generates fine-grained labels. Method SALDL focuses on action information at different spatio-temporal locations in video sequences and generates fine-grained labels via label distribution learning, thus addressing label ambiguity. SALDL comprises three main parts: the video representation, pos-neg temporal attention (PNTA), and label distribution learning (LDL) modules. In the video representation module, SALDL employs an inflated 3D ConvNet (I3D) network structure with multi-receptive-field convolution kernels to extract the spatial features within video clips. The model also introduces an Attention-Inc structure that uses embedding, multi-head self-attention (MHSA), and a multi-layer perceptron (MLP) to progressively incorporate the self-attention mechanism into the Inception module, enabling the model to obtain contextual information between convolutional features at different scales. In the PNTA module, a temporal attention module with positive and negative attention heads exploits temporal attention features through a PNTA loss, thereby reducing the redundancy of self-attention heads and extracting attention features from different time segments. In the LDL module, SALDL uses label distribution learning to generate fine-grained action quality labels, thereby resolving the inherent ambiguity of the labels. We also introduce the prior knowledge that the score label fits a certain distribution and then apply label enhancement to convert single labels into label distributions.
A Kullback-Leibler divergence loss drives the predicted label distribution toward the ground-truth label distribution. Result Extensive comparison experiments were performed on the multitask learning-action quality assessment (MTL-AQA) and JHU-ISI gesture and skill assessment working set (JIGSAWS) datasets. The Spearman rank correlation coefficient (Sp. Corr) was 0.941 6 on the MTL-AQA dataset and 0.818 3 on the JIGSAWS dataset; values of 0.836 4, 0.866 0, and 0.753 1 are also reported, and all of these results are state of the art. Extensive ablation experiments were also performed for the PNTA, LDL, and Attention-Inc structures in the SALDL model. In the regression-based variant of SALDL, the output dimension of the fully connected layer is changed to 1 and the softmax function is removed, so the model directly generates a prediction score; this variant obtained an Sp. Corr of 0.932 0. SALDL-w/o PNTA, the SALDL model without the PNTA module, obtained an Sp. Corr of 0.938 4, while SALDL-w/o Attention-Inc, the SALDL model without the Attention-Inc structure, obtained an Sp. Corr of 0.939 9. These results show that each module contributes to the performance of SALDL. We also conducted ablation experiments on the choice of segmentation strategy and distribution function. The results show that both choices should be made dynamically, according to the dataset type. Future research should therefore investigate the ideal distribution function, the fusion of different distribution functions, and other methods for achieving adaptive label enhancement. Conclusion The proposed SALDL model addresses the multi-scale spatio-temporal feature problem by fully mining spatio-temporal features at different scales.
The model also addresses the intrinsic ambiguity of labels and the redundancy of self-attention heads: it introduces the prior knowledge that labels follow a certain distribution, uses it for label enhancement, and thereby enables label distribution learning. The proposed SALDL model achieves state-of-the-art performance on several action quality assessment datasets, which validates its effectiveness.
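As an illustration of the label enhancement and Kullback-Leibler objective summarized in the abstract, the following Python sketch converts a single score into a label distribution and compares it with a predicted distribution. It is a minimal sketch and not the SALDL implementation: the Gaussian shape, the width `sigma`, the 0 to 10 score grid, the stand-in logits, and all function names are assumptions made for illustration. The abstract states only that the score label fits a certain distribution.

```python
import math


def enhance_label(score, bins, sigma=1.0):
    """Turn one scalar score into a discrete distribution over score bins.

    Assumption: a Gaussian centred on the score. The source only says the
    label fits "a certain distribution".
    """
    weights = [math.exp(-0.5 * ((b - score) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def decode_score(dist, bins):
    """Read a score back as the bin with the highest probability."""
    best = max(range(len(bins)), key=lambda i: dist[i])
    return bins[best]


# Hypothetical score grid 0.0, 0.5, ..., 10.0 (not taken from the paper).
bins = [i * 0.5 for i in range(21)]

# Ground-truth distribution built from a single judge-style score of 7.5.
target = enhance_label(7.5, bins, sigma=0.8)

# Stand-in for network logits: peaked at 7.0 instead of 7.5.
predicted = softmax([-(b - 7.0) ** 2 for b in bins])

# Training would minimize this quantity.
loss = kl_divergence(target, predicted)
```

Reading the score from the most probable bin is one simple decoding choice; the abstract does not say how SALDL maps a predicted distribution back to a score.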
Keywords
action quality assessment (AQA); Inception module; self-attention mechanism; label distribution learning; Spearman rank correlation coefficient
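The Spearman rank correlation coefficient listed among the keywords is the evaluation metric reported in the abstract. A minimal pure-Python sketch of it follows, computed as the Pearson correlation of average ranks, with tied values sharing their mean rank. The function names and the four example scores are hypothetical and are not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Constant inputs (zero rank variance) are not handled in this sketch.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted and ground-truth scores for four videos.
predicted_scores = [82.1, 65.0, 90.3, 71.2]
true_scores = [80.0, 70.5, 88.0, 66.0]
rho = spearman(predicted_scores, true_scores)
```

For these four hypothetical videos, the two rankings disagree only on the two lowest-scored items, and `rho` evaluates to 0.8.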