Video sequence-based facial expression recognition using Transformer networks
Abstract
Objective Compared with static facial expression image recognition, the facial expression intensity varies considerably across the frames of a video sequence, and many frames contain only neutral expressions, yet existing models cannot assign appropriate weights to the individual frames of a sequence. To make full use of the spatio-temporal information in video sequences and the differing contributions of individual frames to video expression recognition, this paper proposes a Transformer-based video sequence expression recognition method. Method First, a video sequence is divided into short clips with a fixed number of frames, and a deep residual network learns high-level facial expression features from each frame of a clip, producing a fixed-dimensional spatial feature sequence for the clip. Then, a suitably designed long short-term memory network (LSTM) and a Transformer model further learn high-level temporal features and attention features, respectively, from this spatial feature sequence; the two are concatenated and fed into a fully connected layer to output the expression classification scores of the clip. Finally, the classification scores of all clips of a video are max-pooled to obtain the final expression classification of the video. Result Experiments on the public BAUM-1s (Bahcesehir University multimodal) and RML (Ryerson Multimedia Lab) video emotion datasets show that the proposed method achieves recognition accuracies of 60.72% and 75.44%, respectively, outperforming the compared methods. Conclusion With an end-to-end learning scheme, the proposed method can effectively improve the performance of video sequence expression recognition.
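To make the pipeline concrete, the following is a minimal PyTorch sketch of the clip-level model: a deep residual network extracts per-frame spatial features, an LSTM and a Transformer encoder learn temporal and frame-attention features from the resulting sequence, and the concatenated features are fed to a fully connected classifier. The backbone choice (ResNet-18), feature dimensions, numbers of heads and layers, the frame-pooling step, and the number of classes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ClipExpressionNet(nn.Module):
    """Per-clip model: ResNet spatial features -> LSTM (temporal) + Transformer (attention) -> FC scores.
    Hyper-parameters (feature dim, layers, heads, num_classes) are illustrative assumptions."""
    def __init__(self, num_classes=6, feat_dim=512, hidden_dim=512, num_heads=8, num_layers=1):
        super().__init__()
        backbone = models.resnet18(weights=None)                    # deep residual network for per-frame features
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the final FC, keep 512-d pooled features
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True) # temporal features over the frame sequence
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)  # frame attention features
        self.fc = nn.Linear(hidden_dim + feat_dim, num_classes)     # concatenated features -> expression scores

    def forward(self, clip):                                  # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                           # (batch*frames, 3, H, W)
        spatial = self.cnn(frames).flatten(1).view(b, t, -1)  # (batch, frames, feat_dim)
        temporal, _ = self.lstm(spatial)                      # (batch, frames, hidden_dim)
        attention = self.transformer(spatial)                 # (batch, frames, feat_dim)
        # Mean pooling over frames before concatenation is an assumption; the paper only states concatenation.
        fused = torch.cat([temporal.mean(dim=1), attention.mean(dim=1)], dim=-1)
        return self.fc(fused)                                 # clip-level expression classification scores
```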
Video sequence-based human facial expression recognition using Transformer networks
Chen Gang1,2, Zhang Shiqing1, Zhao Xiaoming1,2
(1. Institute of Intelligent Information Processing, Taizhou University, Taizhou 318000, China; 2. School of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, China)
Abstract
Objective Human facial expression is one of the key information carriers in interpersonal communication and cannot be ignored. Progress in facial expression recognition promotes the development of human-computer interaction. At present, facial expression recognition has been applied in human-computer interaction systems such as intelligent healthcare, interactive robots, and focus monitoring. Facial expression recognition can be divided into two categories: recognition based on static images and recognition based on dynamic images in video sequences. In the current "short video" era, video carries more facial expression information than static images. Compared with a static image, a video sequence is composed of multiple frames, and the facial expression intensity varies considerably from frame to frame, with many frames showing only neutral expressions. Therefore, video sequence-based facial expression recognition should exploit the spatial information of each frame, the temporal information of the sequence, and the importance of each frame to the expression of the whole video. Early hand-crafted features, such as Gabor representations and local binary patterns (LBP), give the trained model limited generalization ability. Deep learning has since produced a series of deep neural networks for extracting facial expression features, most notably the convolutional neural network (CNN) and the long short-term memory network (LSTM). The importance of each frame in a video sequence also needs to be taken into account for video expression recognition. To make full use of the spatio-temporal information in video sequences and the differing contributions of individual frames to video expression recognition, an end-to-end CNN + LSTM + Transformer video sequence expression recognition method is proposed.

Method First, a video sequence is divided into short clips with a fixed number of frames, and a deep residual network learns high-level facial expression features from each frame of a clip. Next, a suitably designed LSTM and Transformer model further learn high-level temporal features and attention features, respectively, from the spatial feature sequence of the clip; the two are concatenated and fed into a fully connected layer to output the expression classification scores of the clip. Finally, the classification scores of all clips of a video are max-pooled to obtain the final expression classification of the video. The method exploits both spatial and temporal features, and the Transformer learns frame-level attention features within each clip to cope with the varying expression intensity across frames, which improves the recognition rate. In addition, the cross-entropy loss function is used to train the recognition model in an end-to-end manner, which helps the model learn more effective facial expression features. For training the CNN + LSTM + Transformer model, the batch size is set to 4, the learning rate to 5×10⁻⁵, and the maximum number of epochs to 80.
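The end-to-end training described above (cross-entropy loss, batch size 4, learning rate 5×10⁻⁵, at most 80 epochs) could be driven by a loop like the following sketch; the Adam optimizer and the clip-level dataset object are assumptions made here for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, clip_dataset, device="cuda", epochs=80, batch_size=4, lr=5e-5):
    """End-to-end training with the hyper-parameters reported in the abstract.
    The Adam optimizer and clip_dataset (yielding (clip, label) pairs) are assumptions."""
    loader = DataLoader(clip_dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()                          # cross-entropy loss, as stated in the paper
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)    # optimizer choice is an assumption
    model.to(device).train()
    for epoch in range(epochs):                                # maximum of 80 training epochs
        for clips, labels in loader:                           # clips: (4, frames, 3, H, W)
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)             # clip-level scores vs. clip labels
            loss.backward()                                    # gradients flow through Transformer, LSTM and CNN
            optimizer.step()
```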
Result The frame attention features learned by the Transformer contribute more to recognition than the temporal features. Combining the CNN + LSTM + Transformer model, the method achieves accuracies of 60.72% and 75.44% on the BAUM-1s (Bahcesehir University multimodal) and RML (Ryerson Multimedia Lab) datasets, respectively. This shows that the three kinds of features learned by the CNN, LSTM and Transformer are complementary to a certain degree, and that combining them can effectively improve the performance of video expression recognition. Furthermore, the proposed method outperforms the compared methods on both the BAUM-1s and RML datasets.

Conclusion This work develops an end-to-end video sequence-based expression recognition method built on CNN + LSTM + Transformer. It integrates the CNN, LSTM and Transformer models to learn high-level spatial features, temporal features and frame attention features from videos. The experimental results on the BAUM-1s and RML datasets illustrate that the proposed method can effectively improve the performance of video sequence-based expression recognition.
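For completeness, video-level prediction as described in the Method (fixed-length clips whose classification scores are max-pooled) might look like the sketch below; the clip length of 16 frames and the shape of the preprocessed frame tensor are assumptions.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=16, device="cuda"):
    """Split a video into fixed-length clips, score each clip, and max-pool the scores.
    `frames` is a (num_frames, 3, H, W) tensor of preprocessed face images; clip_len=16 is an assumption.
    Videos shorter than clip_len are not handled in this sketch."""
    model.eval()
    clips = [frames[i:i + clip_len] for i in range(0, len(frames) - clip_len + 1, clip_len)]
    scores = torch.stack([model(c.unsqueeze(0).to(device)).squeeze(0) for c in clips])  # (num_clips, num_classes)
    video_score, _ = scores.max(dim=0)        # max pooling over all clips of the video
    return video_score.argmax().item()        # final expression label for the video
```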
Keywords
video sequence facial expression recognition; spatial-temporal dimension; deep residual network; long short-term memory network (LSTM); end-to-end; Transformer