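A review of skeleton-based human action recognition
Abstract
Skeleton-based human action recognition aims to correctly identify the action classes in an input skeleton sequence that contains one or more actions, and it is one of the active research topics in computer vision. Compared with image-based human action recognition, skeleton-based methods are not affected by interference factors such as background and human appearance, and they offer higher accuracy, robustness, and computational efficiency. Given the importance and timeliness of skeleton-based human action recognition, a comprehensive and systematic summary and analysis of these methods is of great significance. This paper first reviews nine widely used skeleton-based action recognition datasets, divides them into single-view and multi-view datasets according to the data collection viewpoint, and focuses on the characteristics and uses of the different datasets. Second, according to the backbone network used by each algorithm, skeleton-based action recognition methods are divided into methods based on handcrafted features, recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers, with an emphasis on the principles, advantages, and disadvantages of each. Among these, graph convolutional methods have become the most widely used because of their strong ability to capture spatial relationships. A new categorization scheme is adopted to review graph convolutional methods comprehensively, with the aim of providing researchers with more ideas and approaches. Finally, the problems of existing methods are summarized from eight aspects, and corresponding directions for future work are proposed.
Keywords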
A review of skeleton-based human action recognition
Lu Jian, Li Xuanfeng, Zhao Bo, Zhou Jian (School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710600, China)
Abstract
Skeleton-based human action recognition aims to correctly identify the action classes in skeleton sequences that contain one or more actions. It has recently emerged as an active research topic in computer vision. Because actions are used both to carry out tasks and to express human emotions, action recognition can be applied in many fields, such as intelligent monitoring systems, human-computer interaction, virtual reality, and smart healthcare. Compared with RGB-based human action recognition, skeleton-based methods are less affected by interference factors such as background and human appearance, and they achieve higher accuracy and robustness. In addition, these methods require only a small amount of data and are computationally efficient, which improves their prospects in practical applications. A comprehensive and systematic summary and analysis of skeleton-based human action recognition methods is therefore needed. Compared with other reviews on skeleton-based action recognition, our contributions are as follows: we provide a more comprehensive summary of skeleton-based action datasets; we provide a more comprehensive summary of skeleton-based action recognition methods, including recent Transformer-based methods; we offer a more instructive classification of graph convolutional methods; and we not only summarize the existing problems but also discuss the prospects for future research. First, we introduce nine datasets that are commonly used for skeleton-based action recognition: MSR Action3D, MSR Daily Activity 3D, 3D Action Pairs, SYSU 3DHOI, UTD-MHAD, Northwestern-UCLA, NTU RGB+D 60, Skeleton-Kinetics, and NTU RGB+D 120.
To highlight the characteristics of these datasets, we divide them into single-view and multi-view datasets according to the data collection perspective and then explore the traits and uses of each category. Second, based on the backbone network used by the models, we categorize skeleton-based action recognition methods into those based on handcrafted features, recurrent neural networks (RNNs), convolutional neural networks (CNNs), graph convolutional networks (GCNs), and Transformers. Before the rise of deep learning, traditional algorithms built on handcrafted features were often used to model human skeleton data. The key problem for such methods is how to create an effective feature representation of human skeleton sequences. After deep learning demonstrated excellent performance in fields such as face recognition, image classification, and image super-resolution, researchers began using deep networks to model skeleton data. Among them, RNNs effectively process continuous time series and are adept at learning temporal dependencies in sequence data, while CNNs can effectively learn high-level semantic information from skeleton data. Training a CNN-based model is also computationally cheaper than training an RNN-based one. Unlike RNN-based methods, CNN-based methods require the skeleton data to be reshaped into pseudo-images first. The columns of the pseudo-image represent the features of all joints in one frame, while the rows represent the features of a given joint across all frames. However, when RNN or CNN methods are used to model skeleton data, the topological structure of the human skeleton is ignored. Transforming the skeleton data into sequences of joint-coordinate vectors or into a 2D grid cannot accurately describe the dynamic skeleton of the human body.
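The pseudo-image layout described above can be illustrated with a minimal sketch. It assumes each frame is a list of (x, y, z) joint coordinates and that every frame has the same number of joints; the function name `skeleton_to_pseudo_image` and the type aliases are hypothetical and are not taken from any method covered in this review.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # assumed (x, y, z) coordinates of one joint
Frame = List[Joint]                  # all joints observed in one frame
Sequence = List[Frame]               # T frames, each with J joints


def skeleton_to_pseudo_image(sequence: Sequence) -> List[List[Joint]]:
    """Arrange a skeleton sequence as a J x T grid with 3 channels per cell.

    Row j holds joint j across all frames; column t holds all joints of frame t.
    """
    if not sequence:
        return []
    num_joints = len(sequence[0])
    if any(len(frame) != num_joints for frame in sequence):
        raise ValueError("every frame must contain the same number of joints")
    # Transpose the frames-by-joints layout into joints-by-frames.
    return [[frame[j] for frame in sequence] for j in range(num_joints)]


if __name__ == "__main__":
    # Toy sequence: 3 frames, 2 joints per frame.
    toy = [
        [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
        [(2.0, 2.0, 2.0), (3.0, 3.0, 3.0)],
        [(4.0, 4.0, 4.0), (5.0, 5.0, 5.0)],
    ]
    image = skeleton_to_pseudo_image(toy)
    print(len(image), "rows (joints) x", len(image[0]), "columns (frames)")
```

The coordinate triple plays the role of the color channels, so the resulting grid can be passed to an ordinary 2D CNN. Real pipelines typically also normalize the coordinates and resize the grid, which this sketch omits.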
Previous studies show that graph convolution is highly effective at modeling topological graph structures, which makes it particularly suitable for modeling the human skeleton. Following this success, graph convolutional methods have been widely used in skeleton-based action recognition. This paper adopts a new categorization scheme and provides a comprehensive review of GCN-based methods, classifying them according to the problems they target so as to provide researchers with additional ideas and methods. These studies can be divided into optimization of the graph structure, network lightweighting, optimization of temporal and spatial features, and handling of missing and noisy joints. This paper also provides a comprehensive summary of the issues faced by currently available methods. It points out their limitations and challenges, evaluates future development trends, and outlines prospects for the field. In doing so, this review helps readers gain a deep understanding of the current state of the task and provides guidance for future research in this area.
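To make the idea of graph convolution over skeleton topology concrete, the following is a minimal sketch of one spatial graph-convolution layer. It assumes the generic symmetric normalization D^{-1/2}(A + I)D^{-1/2} with self-loops and a ReLU activation; this is an illustrative choice and not the formulation of any specific method covered in this review. The function names and the toy three-joint chain are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def normalized_adjacency(num_joints: int, bones: Sequence[Tuple[int, int]]) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    # Start from the identity so that every joint has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
        for i in range(num_joints)
    ]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def graph_conv(features: Matrix, adj: Matrix, weight: Matrix) -> Matrix:
    """One layer: ReLU(adj @ features @ weight), features shaped joints x channels."""
    out = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


if __name__ == "__main__":
    # Toy skeleton: three joints connected in a chain 0 - 1 - 2.
    adj = normalized_adjacency(3, [(0, 1), (1, 2)])
    features = [[1.0], [2.0], [3.0]]   # one input channel per joint
    weight = [[1.0]]                   # identity projection, chosen for readability
    print(graph_conv(features, adj, weight))
```

Each joint's output mixes its own features with those of the joints it is connected to by a bone, which is the structural information that sequence-vector and 2D-grid representations discard.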
Keywords