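Badminton video action recognition combining pose estimation and temporal segment network analysis
Tao Shu1, Wang Meili1,2,3 (1. College of Information Engineering, Northwest A&F University, Yangling 712100; 2. Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture and Rural Affairs, Yangling 712100; 3. Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling 712100) Abstract
Objective To meet diverse needs, such as helping badminton coaches analyze strokes in singles match videos and letting users enjoy video collections of each stroke type, we propose a method for temporally locating and classifying the strokes of the ball-controlling player in extracted badminton video clips. Method On a badminton video clip, the player's racket arm is detected with a pose estimation method, the time domain of each stroke is located from the variation in the arm's swing amplitude, and meta-videos are generated from the localization results. A channel-spatial attention mechanism is introduced into the temporal segment network, which is trained to classify badminton strokes into four common types: forehand, backhand, overhead, and lob. Overhead strokes are further discriminated as clear or smash by image morphological processing. Result Experiments show that the intersection over union (IoU) of stroke temporal localization in badminton video clips is 82.6%, the area under curve (AUC) for each stroke class is above 0.98, and the average recall and average precision are 91.2% and 91.6%, respectively. The method can effectively locate and classify strokes in badminton video clips and recognizes badminton strokes well. Conclusion The proposed stroke recognition method for badminton video clips addresses both temporal localization and classification of strokes, makes stroke recognition more intelligent, and offers important practical value for sports video analysis.
Keywords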
Stroke recognition in badminton videos based on pose estimation and temporal segment networks analysis
Tao Shu1, Wang Meili1,2,3(1.College of Information Engineering, Northwest A&F University, Yangling 712100, China;2.Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture and Rural Affairs, Yangling 712100, China;3.Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling 712100, China) Abstract
Objective Video-based action recognition is an active topic in computer vision. Because video content is so varied, action recognition usually has to be tailored to a specific scene. If strokes can be accurately located and recognized in a badminton video, coaches can analyze players' strokes more effectively and viewers can enjoy meta-video collections of each stroke type. Because racket sports share similar motion features, a stroke recognition approach developed for badminton can also be transferred to tennis and table tennis. Action recognition in long videos first requires locating the time domain of each action, and badminton videos fall into this category. Current research on temporal action localization targets videos with clear switching boundaries between adjacent actions, where the foreground or background features of adjacent actions differ markedly, as in the 50Salads and Breakfast datasets. In a badminton video, however, there is no obvious foreground or background boundary between adjacent strokes, so existing long-video localization methods are not suitable for locating badminton strokes. In addition, most existing work on badminton stroke recognition classifies a static image of a stroke taken from a video, and stroke recognition on badminton meta-videos is lacking. We therefore propose a method for locating and classifying the strokes of the ball-controlling player in an extracted badminton video highlight. Method First, the regional multi-person pose estimation (RMPE) model is used to detect human poses in a badminton video highlight. The pose of the target player is selected by applying prediction-score and position constraints, which filter out the skeletons of irrelevant people.
For the detected pose of the target player, joint constraints are added to locate the player's arms. The racket-holding arm and the free arm are distinguished by the difference in their swing amplitudes, and strokes are localized in time from the variation of the racket arm's swing amplitude, which yields the stroke meta-videos. The swing amplitude of an arm in a frame is defined as a linear weighted sum of the squared moduli of the upper-arm and forearm swing vectors. Next, a dataset of badminton meta-videos is used to train the convolutional block attention module-temporal segment network (CBAM-TSN), which adds a convolutional block attention module to the temporal segment network (TSN), to predict the stroke in each meta-video. Because TSN inherits the structure of the two-stream convolutional neural network (CNN), both streams must be extracted from the meta-videos before training: the spatial stream (RGB frames) and the temporal stream (optical flow frames). CBAM-TSN predicts four common stroke types: forehand, backhand, overhead, and lob. Finally, overhead strokes are further classified as clear or smash by morphological processing. Clear meta-videos tend to show a continuous dynamic mask in the background area at the end of the stroke, whereas smash meta-videos do not. The shuttlecock mask in a meta-video is obtained by morphological image processing, and clear and smash are distinguished by position-related features of this mask. Result In a badminton video highlight, a segmentation is counted as correct if the meta-video produced by the stroke localization method and the manually extracted meta-video contain the same stroke.
Intersection over union (IoU) is used to evaluate stroke localization, and stroke classification is evaluated with ROC-AUC, recall, and precision. The experiments show that the IoU of stroke localization in badminton video highlights reaches 82.6%. The AUC of each of the four stroke types (forehand, backhand, overhead, and lob) predicted by CBAM-TSN exceeds 0.98, and the micro-AUC, macro-AUC, average recall, and average precision reach 0.9908, 0.9903, 93.5%, and 94.3%, respectively. Compared with three popular action recognition approaches on badminton stroke recognition, CBAM-TSN achieves the highest precision, micro-AUC, and macro-AUC. The final average recall and precision are 91.2% and 91.6%, respectively. The method can therefore effectively locate and classify the main player's strokes in a badminton video highlight. Conclusion We propose a novel method for recognizing badminton strokes in video highlights that combines stroke localization with stroke classification, making stroke recognition more automated and offering practical value for sports video analysis.
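The swing amplitude used for stroke localization can be illustrated with a minimal sketch. The abstract only states that the amplitude is a linear weighted sum of the squared moduli of the two arm-segment swing vectors; everything else here is an assumption made for illustration: the swing vector is taken as the frame-to-frame change of each segment vector, the weights `w_upper` and `w_lower` default to 0.5, and the function names are hypothetical rather than the authors' code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) joint position in pixels


def _sub(a: Point, b: Point) -> Point:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1])


def _sq_norm(v: Point) -> float:
    """Squared modulus of a 2D vector."""
    return v[0] ** 2 + v[1] ** 2


def swing_amplitude(
    shoulder: Sequence[Point],
    elbow: Sequence[Point],
    wrist: Sequence[Point],
    w_upper: float = 0.5,  # assumed weight, not taken from the paper
    w_lower: float = 0.5,  # assumed weight, not taken from the paper
) -> List[float]:
    """Swing amplitude of one arm for each pair of consecutive frames.

    Assumption: the swing vector of a segment is the change of that
    segment's vector (elbow - shoulder, wrist - elbow) between frames.
    """
    upper = [_sub(e, s) for e, s in zip(elbow, shoulder)]
    lower = [_sub(w, e) for w, e in zip(wrist, elbow)]
    amps = []
    for t in range(1, len(upper)):
        d_upper = _sub(upper[t], upper[t - 1])
        d_lower = _sub(lower[t], lower[t - 1])
        amps.append(w_upper * _sq_norm(d_upper) + w_lower * _sq_norm(d_lower))
    return amps


def racket_arm(left_amps: Sequence[float], right_amps: Sequence[float]) -> str:
    """Pick the arm with the larger total swing as the racket arm."""
    return "left" if sum(left_amps) > sum(right_amps) else "right"
```

Under these assumptions, frames where the racket arm's amplitude rises well above its resting level would mark candidate stroke intervals. How such a threshold is chosen is not described in the abstract.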
Keywords
pose estimation; meta video; badminton stroke localization; convolutional block attention module-temporal segment network (CBAM-TSN); morphological processing; badminton stroke recognition
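The IoU indicator reported for stroke localization can be sketched for temporal intervals as follows. This is a generic temporal IoU over (start, end) frame indices; the function name and the interval representation are assumptions for illustration, and the abstract does not specify how IoU values are aggregated over strokes.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in frames or seconds, start <= end


def temporal_iou(pred: Interval, truth: Interval) -> float:
    """Intersection over union of two time intervals."""
    p_start, p_end = pred
    t_start, t_end = truth
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted stroke spanning frames 10 to 30 against a manual annotation spanning frames 20 to 40 overlaps for 10 frames out of a 30-frame union, giving an IoU of about 0.33.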