Human similar action recognition by fusing saliency image semantic features
Abstract
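Objective Skeleton-based action recognition has become a research hotspot because it is more robust to illumination changes, dynamic viewpoints, and complex backgrounds. When skeleton/joint data are used to recognize similar human actions, however, the joint features of the actions differ only slightly and no other image semantic information is available, so the actions are easily confused. To address this problem, a saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) model is proposed. Method First, a skeleton center-connection topology is designed that connects every joint to the skeleton center in order to capture subtle differences in joint motion between similar actions. Second, a Gaussian mixture background modeling algorithm compares each frame with a background model updated in real time, segments the dynamic image region, and removes background interference to produce a saliency image; a pre-trained VGG-Net (Visual Geometry Group network) extracts feature maps from it, and action semantic features are matched and classified. Finally, a fusion algorithm uses this classification result to strengthen and correct the recognition result of the center-connected graph convolutional network, improving the recognition of similar actions. In addition, a skeleton-based method for computing action similarity is proposed, and a similar action dataset is built. Result The model is compared with other methods on the similar action dataset and on the NTU RGB+D 60/120 (Nanyang Technological University RGB+D 60/120) datasets. On the similar action dataset, recognition accuracy exceeds that of the second-best model by 4.6% and 6.0% on the cross-subject (X-Sub) and cross-view (X-View) benchmarks, respectively; on NTU RGB+D 60, by 1.4% and 0.6% on X-Sub and X-View, respectively; on NTU RGB+D 120, by 1.7% and 1.1% on X-Sub and cross-setup (X-Set), respectively. Multiple comparative experiments further verify the effectiveness of the center-connected graph convolutional network, the saliency image extraction method, and the fusion algorithm. Conclusion The proposed method recognizes and classifies similar actions accurately and effectively, and the overall recognition performance and robustness of the model are also improved.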
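The center-connection topology described in the Method part can be illustrated with a small sketch. It builds an adjacency matrix in which every joint is linked to the skeleton center in addition to its natural bones, then applies the symmetric normalization commonly used in GCNs. The seven-joint skeleton, the joint indices, the choice of the spine as center, and the normalization are assumptions made for illustration; they are not the NTU RGB+D layout or the exact formulation of the paper.

```python
# Toy 7-joint skeleton. Joint names, indices, and bones are hypothetical
# and do not reproduce the NTU RGB+D joint layout.
JOINTS = ["hip", "spine", "head", "l_hand", "r_hand", "l_foot", "r_foot"]
BONES = [(0, 1), (1, 2), (1, 3), (1, 4), (0, 5), (0, 6)]
CENTER = 1  # assumed skeleton center (the spine joint)


def center_connected_adjacency(num_joints, bones, center):
    """Return a symmetric 0/1 adjacency with self-loops, bones, and center links."""
    adj = [[0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1  # self-loop keeps each joint's own feature
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1  # natural bone connections
    for j in range(num_joints):
        adj[j][center] = adj[center][j] = 1  # extra link from every joint to the center
    return adj


def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (an assumed, common GCN choice)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]


A = center_connected_adjacency(len(JOINTS), BONES, CENTER)
A_hat = normalize(A)
```

In this toy skeleton the feet are attached only to the hip by bones; the center links add a direct edge from each foot to the spine, so one graph convolution step can already relate their motion to the center.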
Keywords
Human similar action recognition by fusing saliency image semantic features
Bai Zhongyu, Ding Qichuan, Xu Hongli, Wu Chengdong (Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China)
Abstract
Objective Human action recognition is a valuable research area in computer vision, with applications in security monitoring, intelligent monitoring, human-computer interaction, and virtual reality. Skeleton-based action recognition first extracts the position coordinates of the major body joints from video or images by hardware or software means and then uses this skeleton information for recognition. In recent years, it has received increasing attention because of its robustness in dynamic environments, complex backgrounds, and occlusion situations. Early methods usually relied on hand-crafted features, which generalize poorly because the extracted features lack diversity. Deep learning has since become the mainstream approach because of its powerful automatic feature extraction. Traditional deep learning methods arrange skeleton data as joint coordinate vectors or pseudo-images and feed them directly into recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for action classification. However, RNN-based and CNN-based methods lose the spatial structure information of skeleton data because they are limited to Euclidean data structures. Moreover, they cannot extract the natural correlation between human joints, so distinguishing subtle differences between similar actions becomes difficult. Human joints are naturally organized as a graph in non-Euclidean space, and several works have adopted graph convolutional networks (GCNs) to achieve state-of-the-art performance in skeleton-based action recognition. In these methods, however, the subtle differences between joints, which are crucial for recognizing similar actions, are not explicitly learned. Moreover, the skeleton data extracted from video discard the objects that the person interacts with and retain only the primary joint coordinates. The lack of image semantics and the reliance on joint sequences alone make similar actions considerably harder to recognize.
Method Given the above factors, the saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) is proposed in this work for skeleton-based similar action recognition. The proposed model is based on GCN, which can fully utilize the spatial and temporal dependence information between human joints. First, the CGCN is proposed for skeleton-based similar action recognition. In the spatial dimension, a center-connection skeleton topology is designed to establish connections between all human joints and the skeleton center, capturing the small differences in joint movements between similar actions. In the temporal dimension, each joint in a frame is associated with the same joint in the previous and subsequent frames of the sequence, so the number of temporally adjacent nodes is fixed at 2, and a regular 1D convolution along the temporal dimension serves as the temporal graph convolution. A basic co-occurrence graph convolution unit comprises a spatial graph convolution, a temporal graph convolution, and a dropout layer. For training stability, a residual connection is added to each unit. The proposed network is formed by stacking nine such units. A batch normalization (BN) layer is added at the beginning of the network to standardize the input data, and a global average pooling layer is added at the end to unify the feature dimensions. A dual-stream architecture uses the joint and bone information of the skeleton data simultaneously to extract features from multiple angles. Because each joint plays a different role in different actions, an attention map is added to focus on the main moving joints of an action. Second, the saliency image in the video is selected using the Gaussian mixture background modeling method. Each image frame is compared with the background model, which is updated in real time, to segment the image area with considerable changes and eliminate background interference. The effective extraction of semantic feature maps from saliency images is the key to distinguishing similar actions. The Visual Geometry Group network (VGG-Net) can effectively extract the spatial structure features of objects from images. In this work, the feature map is extracted through a pre-trained VGG-Net, and a fully connected layer is used for feature matching. Finally, the feature map matching result is used to strengthen and revise the recognition result of the CGCN and improve the recognition of similar actions. In addition, a similarity calculation method for skeleton sequences is proposed, and a similar action dataset is established in this work.
Result The proposed model is compared with state-of-the-art models on the proposed similar action dataset and the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets. The methods compared include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and cross-view (X-View) benchmarks of the proposed similar action dataset, the recognition accuracy of the proposed model reaches 80.3% and 92.1%, which is 4.6% and 6.0% higher than that of the second-best algorithm, respectively. On the X-Sub and X-View benchmarks of the NTU RGB+D 60 dataset, the recognition accuracy of the proposed model reaches 91.7% and 96.9%, an improvement of 1.4% and 0.6% over the second-best algorithm, respectively. Compared with the second-best model, the feedback graph convolutional network (FGCN), the proposed model improves the recognition accuracy by 1.7% and 1.1% on the X-Sub and cross-setup (X-Set) benchmarks of the NTU RGB+D 120 dataset, respectively. In addition, a series of comparative experiments demonstrates the effectiveness of the proposed CGCN, the saliency image extraction method, and the fusion algorithm.
Conclusion In this study, we propose SIFE-CGCN to resolve the confusion that arises when recognizing similar actions, which is caused by the ambiguity between skeleton features and the lack of image semantic information. The experimental results show that the proposed method can effectively recognize similar actions, and the overall recognition performance and robustness of the model are improved.
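The abstract does not specify the fusion rule that lets the image branch strengthen and revise the CGCN result. As a rough illustration only, the sketch below assumes a simple weighted late fusion of softmax scores. The function names, the weight `alpha`, the class labels, and the logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(skeleton_logits, image_logits, alpha=0.5):
    """Weighted late fusion of the two branches.

    alpha is a hypothetical weight for the image branch; the paper's
    actual fusion algorithm may differ from this weighted sum.
    """
    p_skel = softmax(skeleton_logits)
    p_img = softmax(image_logits)
    return [(1 - alpha) * s + alpha * i for s, i in zip(p_skel, p_img)]


# Made-up example: the skeleton branch barely separates two similar
# actions, while the image branch sees the object being handled.
CLASSES = ["reading", "writing", "clapping"]  # illustrative labels
skeleton_logits = [2.0, 1.9, -1.0]
image_logits = [0.5, 3.0, -0.5]

skeleton_only = max(range(len(CLASSES)), key=skeleton_logits.__getitem__)
fused = fuse_scores(skeleton_logits, image_logits, alpha=0.5)
predicted = max(range(len(CLASSES)), key=fused.__getitem__)
```

With these made-up numbers the skeleton branch alone slightly prefers the first class, while the fused score prefers the second. This is the kind of correction between similar actions that the image semantics are meant to provide.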
Keywords
action recognition; skeleton sequence; similar action; graph convolutional network (GCN); image salient features