A survey of deep learning-based two-dimensional human pose estimation methods

Kong Yinghui; Qin Yinfeng; Zhang Ke

Published: 2023-07-19
Abstract views: 1474
Full-text downloads: 1866
DOI: 10.11834/jig.220436
2023 | Volume 28 | Number 7

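Kong Yinghui^1,2, Qin Yinfeng^1, Zhang Ke^1,2 (1. Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China; 2. Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China)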

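Abstract

Human pose estimation is an important task in computer vision. Traditional pose estimation methods struggle to separate targets from backgrounds in complex scenes, are easily affected by manually specified prior information, and are inefficient. With the development of artificial intelligence, deep learning has matured, and deep learning-based human pose estimation methods outperform traditional methods in both precision and speed. In recent years, as the foundation of three-dimensional human pose estimation, two-dimensional human pose estimation models have made considerable progress in handling crowding and occlusion, but most network models use convolutional neural network (CNN) models with too many layers, which greatly reduces network speed. Driven by the practical need for deployment at the edge, lightweight two-dimensional human pose estimation networks have become a research hotspot with potential value for innovative applications. According to the development history and optimization trends of deep learning-based two-dimensional human pose estimation models, they can be divided into three categories: single-person pose estimation, multi-person pose estimation, and lightweight human pose estimation. This paper summarizes the different CNN models used in each category, analyzes the characteristics of each type of neural network model, and compares the performance of the various estimation methods. Although the structural design of deep convolutional neural network (DCNN) models is becoming increasingly diverse, deep learning network models still have certain limitations when handling human pose estimation tasks. This paper discusses in depth the technical methods adopted by two-dimensional human pose estimation models and their existing problems, and suggests possible directions for future research.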

Keywords

deep learning; human pose estimation; model structure; model optimization; lightweight

Deep learning based two-dimension human pose estimation: a critical analysis

Kong Yinghui^1,2, Qin Yinfeng^1, Zhang Ke^1,2 (1. Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China; 2. Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China)

Abstract

In computer vision, human pose estimation locates the human skeleton in an image or video; the positional relationships between key areas of the body then provide pose information that can be used to analyze a specific pose or action. Action recognition and pose tracking built on human pose estimation are currently developing rapidly. Conventional pose estimation methods can be divided into two parts: object detection and pose estimation. Object detection relies on segmentation, matching, or statistical learning; it struggles to separate targets from backgrounds in complex scenes and is sensitive to prior information. Constructing training sample libraries and classifiers is also time-consuming and labor-intensive. Pose estimation uses model-based or model-free methods; it propagates errors from the object detection stage, requires considerable manually specified constraint information, and remains inefficient. With the development of artificial intelligence (AI), deep learning-based human pose estimation methods have improved recognition precision and speed to a certain extent. Human pose estimation is generally divided into two-dimensional and three-dimensional human pose estimation. Two-dimensional human pose estimation is the basis of three-dimensional human pose estimation, and two-dimensional models have made progress in handling crowding and occlusion. However, most network models are built on very deep convolutional neural networks (CNNs), and this depth slows the networks down. Lightweight two-dimensional human pose estimation networks are attracting growing attention because of the need for deployment at the edge. We review the literature on the development and optimization trends of deep learning-based two-dimensional human pose estimation models.
These models can be divided into three categories: single-person pose estimation, multi-person pose estimation, and lightweight human pose estimation. Single-person pose estimation, the basis of multi-person pose estimation, can be divided into keypoint regression-based and heatmap detection-based methods, and there is a trend toward combining the two. Multi-person pose estimation network models can be divided into top-down, bottom-up, and other methods. Top-down network models are more precise but less time-efficient, especially on crowded input: the more people in the input, the longer estimation takes. Bottom-up network models are slightly less precise but much more efficient, and their estimation time is independent of the number of people in the input. The two approaches are in effect duals of each other. A top-down method first uses a human body detector to locate each person in the input and then performs pose estimation on each detected person; some top-down methods must also crop each person accurately and center the crop before each estimation. A bottom-up method first detects all body keypoints in the input and then assigns them to individual people. The emergence of single-stage networks also means that researchers need to pay more attention to the computational cost of network models. A small number of networks combine top-down and bottom-up methods and have achieved good results.
We summarize the CNN models used in each category of human pose estimation, analyze the characteristics of the neural network models, and compare the performance of the pose estimation methods. The structural design of deep convolutional neural network models is becoming more and more diverse, but deep learning network models still have certain limitations when dealing with human pose estimation tasks. We discuss the technical methods adopted by two-dimensional human pose estimation models and their existing problems, and suggest possible future research directions. First, preprocessing of input data deserves attention: the clarity of the input directly affects pose estimation results, so effective image or video preprocessing may offer a new way to improve the precision and efficiency of pose estimation. Second, most existing methods process video as static frames cut from the video, which is in essence still image-based pose estimation. Real-time pose estimation on video is essential for pose tracking and action recognition, and a few methods have been proposed that combine deep learning-based pose estimation with temporal information such as optical flow, pose flow, and long short-term memory (LSTM). Third, images in practical applications are often more crowded and more heavily occluded, and these problems still need to be resolved. Finally, recent pose estimation network models have been improved with lightweight methods, which show potential and may become one of the key directions for pose estimation.
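
To make the distinction between the two single-person approaches named in the abstract concrete: a keypoint regression method outputs coordinates directly, whereas a heatmap detection method outputs one score map per keypoint and the coordinates are read off afterwards. The following is a minimal sketch of that read-off step only, assuming the simplest possible decoding rule (take the highest-scoring cell). The function name `decode_heatmaps` and the toy values are hypothetical and do not come from the surveyed models, which typically add refinements such as sub-pixel offsets.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one H x W score map per keypoint


def decode_heatmaps(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) of the highest-scoring cell in each heatmap."""
    keypoints = []
    for heatmap in heatmaps:
        best = (0, 0, float("-inf"))
        for y, row in enumerate(heatmap):
            for x, score in enumerate(row):
                if score > best[2]:
                    best = (x, y, score)
        keypoints.append(best)
    return keypoints


# Toy input: two 3 x 4 heatmaps with made-up scores (illustration only).
toy_heatmaps = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.1, 0.9, 0.2, 0.0],
     [0.0, 0.2, 0.1, 0.0]],
    [[0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.1, 0.8],
     [0.0, 0.0, 0.0, 0.2]],
]
print(decode_heatmaps(toy_heatmaps))  # [(1, 1, 0.9), (3, 1, 0.8)]
```

In a top-down pipeline this decoding would run once per detected person crop, which is consistent with the abstract's remark that top-down estimation time grows with the number of people in the input.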

Keywords

deep learning; human pose estimation; model structure; model optimization; lightweight
