Facial expression recognition fusing spatio-temporal features

Chen Tuo, Xing Shuai, Yang Wenwu, Jin Jianqiu (School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China)

Abstract
Objective Facial expression recognition is one of the core problems in computer vision. On the one hand, the generation of an expression corresponds to a continuous dynamic process of facial muscle motion; on the other hand, the peak-expression frame of that motion usually contains the complete information needed to recognize the expression. Most existing facial expression recognition algorithms are based either on expression video sequences or on a single peak-expression image. We therefore propose a deep neural network that fuses temporal and spatial features to analyze and understand the expression information in video sequences and improve recognition performance. Method The network contains two feature-extraction modules, which learn the static "spatial features" of the expression from the single peak-expression image and its dynamic "temporal features" from the video sequence, respectively. First, a triplet-based deep metric fusion technique is proposed: by using different thresholds (margins) in the triplet loss function, multiple expression feature representations are learned from the single peak-expression image and combined into a robust and more discriminative expression "spatial feature". Second, to exploit the prior knowledge of key facial components and accurately extract the motion characteristics of facial expressions in the time domain, a convolutional neural network based on facial landmark trajectories is proposed, which learns the dynamic "temporal features" of the expression by analyzing the landmark trajectories in the video sequence. Finally, a fine-tuning fusion strategy is proposed to achieve the optimal fusion of the temporal and spatial features. Result The method achieves recognition accuracies of 98.46%, 82.96%, and 87.12% on three widely used video-based facial expression datasets, CK+ (the extended Cohn-Kanade dataset), MMI (the MMI facial expression database), and Oulu-CASIA (the Oulu-CASIA NIR&VIS facial expression database), respectively, approaching or exceeding the best performance of comparable methods. Conclusion The proposed spatio-temporal facial expression recognition network robustly analyzes and understands the spatial and temporal expression information in video sequences and effectively improves facial expression recognition performance.
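As a minimal illustration of the overall two-stream design described above (assuming a PyTorch-style implementation; feature dimensions, module names, and the linear classification head are assumptions rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SpatioTemporalFER(nn.Module):
    """Illustrative two-stream fusion: a static "spatial feature" from the
    peak expression frame and a dynamic "temporal feature" from the landmark
    trajectories are concatenated and classified jointly."""
    def __init__(self, spatial_net, temporal_net,
                 spatial_dim=256, temporal_dim=256, num_classes=7):
        super().__init__()
        self.spatial_net = spatial_net    # e.g. a DMF-like sub-network (peak frame)
        self.temporal_net = temporal_net  # e.g. an LTCNN-like sub-network (trajectories)
        self.classifier = nn.Linear(spatial_dim + temporal_dim, num_classes)

    def forward(self, peak_frame, trajectory_map):
        spatial = self.spatial_net(peak_frame)        # static "spatial feature"
        temporal = self.temporal_net(trajectory_map)  # dynamic "temporal feature"
        return self.classifier(torch.cat([spatial, temporal], dim=1))
```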
Keywords
Spatio-temporal features based human facial expression recognition

Chen Tuo, Xing Shuai, Yang Wenwu, Jin Jianqiu(School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China)

Abstract
Objective Human facial expression recognition (FER) is one of the key problems in computer vision, with applications such as human-computer interaction, medical care, and intelligent driving. FER research faces two main challenges: expression feature extraction and classification. Traditional methods design facial expression features by hand, whereas deep learning based methods learn semantic expression features automatically. Deep learning based FER integrates feature extraction and expression recognition into a single training process and currently offers strong generalization ability and good recognition accuracy. Most existing FER algorithms work on either expression video sequences or a single peak-expression image. However, the generation of an expression corresponds to a continuous dynamic process of facial muscle motion, and the peak frame of that motion usually carries the complete information needed to identify the expression. We therefore propose a spatio-temporal-feature-based deep neural network to analyze and understand the expression information in video sequences and improve recognition performance. Method Our network learns the static "spatial feature" of the expression from the peak frame and its dynamic "temporal feature" from the video sequence, respectively. First, we present a deep metric fusion (DMF) sub-network based on triplet-loss learning. It consists of two sub-modules: a deep convolutional neural network (DCNN) module and an N-metric module. The DCNN module is a general convolutional neural network (CNN) that extracts a common detailed facial feature; it adopts the Visual Geometry Group 16-layer (VGG16)-face network structure, and the output of its final 4 096-dimensional fully connected layer serves as the base CNN feature. The N-metric module contains multiple fully connected branches that all share this CNN feature as input. Each branch consists of a fixed-dimension fully connected layer associated with a particular threshold (margin) and is supervised by the corresponding triplet loss, so that each branch learns a different expression feature embedding. The branch outputs are concatenated and then fused by two fully connected layers with 256 hidden units each, yielding a more robust and more discriminative expression "spatial feature".
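A hedged sketch of the DMF sub-network described above, assuming PyTorch and using the torchvision VGG16 as a stand-in for the VGG16-face model; the branch count, embedding size, and class count are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DMFNet(nn.Module):
    """Sketch of the deep metric fusion (DMF) idea: a shared CNN feature feeds
    N fully connected branches (one per triplet-loss margin); the branch
    embeddings are concatenated and fused by two 256-unit FC layers."""
    def __init__(self, num_branches=3, emb_dim=128, num_classes=7):
        super().__init__()
        backbone = vgg16(weights=None)  # stand-in for the VGG16-face model
        # Keep the backbone up to its first 4 096-dimensional FC layer.
        self.backbone = nn.Sequential(
            backbone.features, backbone.avgpool, nn.Flatten(),
            *list(backbone.classifier.children())[:2])
        self.branches = nn.ModuleList(
            [nn.Linear(4096, emb_dim) for _ in range(num_branches)])
        self.fuse = nn.Sequential(
            nn.Linear(num_branches * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, peak_frame):
        feat = self.backbone(peak_frame)           # shared 4 096-d CNN feature
        embs = [branch(feat) for branch in self.branches]
        # During training, each embedding in `embs` would be supervised by a
        # triplet loss with its own margin, e.g. nn.TripletMarginLoss(margin=m).
        fused = self.fuse(torch.cat(embs, dim=1))  # fused expression "spatial feature"
        return self.classifier(fused), embs
```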
Second, facial expressions are essentially driven by facial motion, so the changes over consecutive frames carry the dynamics of the overall expression. Existing methods extract such dynamic expression features from consecutive frames either with handcrafted descriptors or with deep learning; handcrafted features are limited in capturing temporal information from facial image sequences, while image-sequence-based deep networks make insufficient use of the prior knowledge of key facial components. Our landmark trajectory convolutional neural network (LTCNN, sketched after this abstract) therefore analyzes the facial landmark trajectories in the video sequence and learns the dynamic "temporal features" of the expression, extracting the motion characteristics of facial expressions in the time domain accurately. This sub-network consists of four convolutional layers and two fully connected layers, and its input is an image-like feature map constructed from the landmark trajectories of the facial expression in the video. Third, a fine-tuning based fusion strategy combines the features learned by the two sub-networks to achieve the optimal fusion of temporal and spatial features. We train the DMF and LTCNN sub-networks separately, combine them through feature fusion, and then fine-tune the whole model end to end; the hyper-parameters used to optimize the DMF sub-network are reused for the fine-tuning stage. Result The proposed FER algorithm is tested on three public facial expression databases: the extended Cohn-Kanade dataset (CK+), the MMI facial expression database (MMI), and the Oulu-CASIA NIR&VIS facial expression database (Oulu-CASIA). It achieves recognition accuracies of 98.46%, 82.96%, and 87.12% on CK+, MMI, and Oulu-CASIA, respectively. Conclusion Our deep network integrates temporal and spatial features to realize video-sequence-based FER. Its two sub-modules learn the "spatial features" of the expression at the peak frame and the "temporal features" of the expression motion, and an overall fine-tuning based fusion strategy achieves a better combination of the two. The proposed FER method has the potential to be developed further.
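A possible layout of the LTCNN sub-network, again as a hedged PyTorch sketch: the four-convolution plus two-fully-connected structure follows the abstract, but the channel counts, pooling scheme, and input resolution are assumptions:

```python
import torch.nn as nn

class LTCNN(nn.Module):
    """Illustrative landmark trajectory CNN: four convolutional layers and two
    fully connected layers applied to a feature map built from facial landmark
    trajectories."""
    def __init__(self, in_channels=1, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Sequential(
            nn.Linear(128, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU())

    def forward(self, trajectory_map):
        return self.fc(self.conv(trajectory_map))  # dynamic "temporal feature"
```

After the DMF and LTCNN sub-networks are trained separately, their output features could be concatenated as in the fusion sketch above and the whole model fine-tuned end to end, following the fine-tuning based fusion strategy described in the abstract.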
Keywords
