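Head-pose-robust semi-supervised facial expression recognition with dual consistency constraints
Wang Yujian1, He Jun2, Zhang Jianxun1, Sun Renhao2, Liu Xueliang3 (1. Chongqing University of Technology; 2. Institute of Dataspace; 3. Hefei University of Technology)
Abstract
【Objective】 Existing facial expression recognition (FER) methods focus on improving overall recognition accuracy, and their robustness to head pose has not been studied sufficiently. In practical applications, head poses vary widely and degrade recognition performance, so it is important to study the effect of head pose on FER and to improve model robustness in this respect. This paper therefore analyzes the effect of head pose on FER in depth and proposes a semi-supervised FER method that improves head-pose robustness using unlabeled non-frontal facial expression data. 【Method】 First, the widely used FER dataset AffectNet is re-partitioned by head pose to build the AffectNet-Yaw dataset, which supports accuracy testing at different angles and makes model comparisons fairer. Second, a semi-supervised FER method based on dual consistency constraints is proposed (dual-consistency semi-supervised learning for facial expression recognition, DCSSL). A spatial consistency module spatially constrains the consistency of class activations between a face image and its flipped version, so that the model attends more closely to the key facial expression regions during training. A semantic consistency module uses asymmetric data augmentation and a self-learning strategy to continually select high-quality non-frontal data for model optimization. Without any manual annotation of non-frontal expression data, the method learns directly from labeled frontal data and unlabeled non-frontal data. Finally, the cross-entropy loss, the spatial consistency constraint loss, and the semantic consistency constraint loss are optimized jointly to balance supervised and semi-supervised learning. 【Result】 Experiments show that head pose has a significant effect on FER in natural scenes; that the proposed AffectNet-Yaw has a more balanced head pose distribution, which supports a comprehensive evaluation of this effect; and that DCSSL, by combining the spatial and semantic consistency constraints, makes full use of unlabeled non-frontal expression data and significantly improves robustness under head pose variation, improving average FER accuracy by 5.40% and 17.01% over the fully supervised methods MA-NET and EfficientFace, respectively. 【Conclusion】 The proposed dual-consistency semi-supervised method makes full use of frontal and non-frontal data and significantly improves FER accuracy under head pose variation; the new dataset effectively supports a comprehensive evaluation of the effect of head pose on FER.
Keywords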
Semi-supervised facial expression recognition robust to head pose empowered by dual consistency constraints
WANG YUJIAN2, HE JUN1, ZHANG JIANXUN2, SUN RENHAO1, LIU XUELIANG3 (1. Institute of Dataspace; 2. Chongqing University of Technology; 3. Hefei University of Technology)
Abstract
【Objective】 Facial expression recognition (FER) has long been an active research area, with most work focused on improving overall recognition accuracy. The robustness of FER models to changes in head pose, however, has not been adequately explored. In real-world applications, faces are captured at various angles, and existing methods often fail to recognize expressions on faces with large pose variations. It is therefore important to quantify how strongly head pose affects FER models and to develop models that handle diverse poses. In this study, we first examine the impact of head pose on FER. Our experiments show that existing FER approaches degrade on faces with large head poses, which limits their practical applicability and motivates research on pose robustness. To address this problem, we propose a semi-supervised framework that uses unlabeled non-frontal facial expression samples to supplement labeled frontal faces, so that the model learns representations that are less sensitive to head pose. Incorporating unlabeled data exposes the model to a wider range of poses and thereby improves its robustness and accuracy.
Through experiments and analysis, we characterize the impact of head pose on FER and develop a model that recognizes facial expressions across diverse poses, a step toward more practical and reliable FER systems in real-world applications. 【Method】 To examine the impact of head pose on FER, we reorganize the AffectNet dataset using a deterministic resampling procedure. In this procedure, we sample, uniformly at random, the same number of faces from each expression category and each head pose interval to build a new, challenging FER dataset called AffectNet-Yaw, whose test samples are balanced along both the category axis and the head pose axis. AffectNet-Yaw enables a detailed investigation of how head pose affects the performance of an FER model. To improve robustness to head pose, we present a semi-supervised FER framework based on dual consistency constraints, called dual-consistency semi-supervised learning (DCSSL). On the one hand, the framework uses a spatial consistency module that forces the model to produce consistent class activation maps for each face and its horizontally flipped mirror during training, which steers the model toward the facial regions most relevant to FER. On the other hand, it uses a semantic consistency module that forces the model to extract consistent features from two augmentations of the same face that share the same semantics. Specifically, we apply two different data augmentations to each face, one weak and one strong. We then flip the weakly augmented face and obtain model predictions for it and for its flipped mirror. Only those unlabeled non-frontal faces for which the model makes the same prediction with high confidence are retained. Their predicted categories, together with their strongly augmented variants, form "data-label" pairs that are used for model training as pseudo-labeled positives.
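The pseudo-label selection step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `select_pseudo_labels`, the plain-list probability format, and the confidence threshold of 0.95 are all assumptions, since the abstract states only that predictions must agree "with high confidence".

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs_weak: Sequence[Sequence[float]],
    probs_flipped: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed value; the abstract gives no number
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) for unlabeled faces that pass the filter.

    probs_weak[i] holds class probabilities for the weakly augmented face i,
    probs_flipped[i] those for its horizontally flipped mirror.
    """
    selected = []
    for i, (p, q) in enumerate(zip(probs_weak, probs_flipped)):
        label_p = max(range(len(p)), key=lambda k: p[k])
        label_q = max(range(len(q)), key=lambda k: q[k])
        # Keep the face only if both views agree and both are confident.
        if label_p == label_q and min(p[label_p], q[label_q]) >= threshold:
            selected.append((i, label_p))
    return selected


if __name__ == "__main__":
    weak = [[0.97, 0.02, 0.01], [0.60, 0.30, 0.10], [0.02, 0.96, 0.02]]
    flipped = [[0.96, 0.03, 0.01], [0.70, 0.20, 0.10], [0.97, 0.02, 0.01]]
    print(select_pseudo_labels(weak, flipped))
```

Each retained index would then be paired with the strongly augmented variant of the same face to form a training pair; faces whose two views disagree or fall below the threshold are skipped for that iteration.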
These pseudo-labeled pairs increase data variation and benefit model optimization. Within the framework, we devise a joint optimization objective that integrates the cross-entropy loss, the spatial consistency constraint loss, and the semantic consistency constraint loss to balance supervised and semi-supervised learning. Thanks to joint training, the framework requires no manual labeling of non-frontal faces; it learns directly from labeled frontal faces and unlabeled non-frontal faces, which improves its robustness and generalization. 【Result】 Evaluating various fully supervised FER methods on both AffectNet and AffectNet-Yaw shows that head pose variability has a pronounced effect on FER in real-world scenarios and that FER models need to be made more robust to it. The results also confirm that AffectNet-Yaw is a rigorous and effective platform for studying how head pose influences FER model performance. Comparisons between baseline models and state-of-the-art (SOTA) methods on AffectNet and AffectNet-Yaw show that the DCSSL framework significantly improves the model's ability to handle head pose variations. Compared with the fully supervised methods MA-NET and EfficientFace, DCSSL improves average FER accuracy by 5.40% and 17.01%, respectively. In addition, we demonstrate the effectiveness of the approach by comparing it, on the two datasets, with models that have performed well in expression recognition over the last three years. The choice of loss weights also has a significant impact on model performance.
We conducted a series of parameter selection experiments, varying one weight at a time while holding the other fixed. The model achieves its best expression recognition performance on the AffectNet-Yaw test data when the weight of the spatial consistency constraint loss is set to 1 and that of the semantic consistency constraint loss is set to 5. These results demonstrate that the proposed DCSSL framework mitigates the detrimental effect of head pose variation on FER accuracy. By integrating the spatial and semantic consistency modules, the approach improves model robustness and generalizes across the diverse head poses encountered in real-world applications. This study contributes a challenging new dataset, AffectNet-Yaw, for FER research under realistic conditions, and a new method, DCSSL, for addressing head pose variation in FER. Both are relevant to the reliability of FER systems in practical settings where head pose variability is prevalent. 【Conclusion】 The proposed DCSSL framework efficiently exploits both frontal and non-frontal faces and improves the accuracy of FER under diverse head poses. The new AffectNet-Yaw dataset has a more balanced data distribution along both the category axis and the head pose axis, enabling a comprehensive study of the impact of head pose on FER. Both are valuable for building robust FER models.
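To make the joint objective concrete, the sketch below combines the three loss terms using the weights reported above (1 for spatial consistency, 5 for semantic consistency). It rests on assumptions not stated in the abstract: the terms are combined as a weighted sum, the spatial term is a mean squared difference between a class activation map and the re-flipped map of the mirrored face, and the names `spatial_consistency_loss` and `joint_loss` are hypothetical.

```python
from typing import Sequence

ActivationMap = Sequence[Sequence[float]]  # H x W map for one class


def spatial_consistency_loss(cam: ActivationMap, cam_of_mirror: ActivationMap) -> float:
    """Mean squared difference between a CAM and the CAM of the mirrored face.

    The mirrored face's map is flipped back horizontally so that both maps
    are spatially aligned. MSE is an assumed choice of distance.
    """
    total, count = 0.0, 0
    for row, row_mirror in zip(cam, cam_of_mirror):
        for a, b in zip(row, reversed(row_mirror)):
            total += (a - b) ** 2
            count += 1
    return total / count


def joint_loss(
    ce: float,
    spatial: float,
    semantic: float,
    w_spatial: float = 1.0,   # best weight reported on AffectNet-Yaw
    w_semantic: float = 5.0,  # best weight reported on AffectNet-Yaw
) -> float:
    """Weighted sum of supervised and consistency terms (assumed form)."""
    return ce + w_spatial * spatial + w_semantic * semantic


if __name__ == "__main__":
    cam = [[0.1, 0.9], [0.2, 0.8]]
    cam_of_mirror = [[0.9, 0.1], [0.8, 0.2]]  # exact mirror image of cam
    spatial = spatial_consistency_loss(cam, cam_of_mirror)
    print(spatial, joint_loss(ce=0.5, spatial=spatial, semantic=0.1))
```

In this sketch, a perfectly mirrored activation map incurs no spatial penalty, so only the cross-entropy and semantic terms contribute to the total.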
Keywords