Forged Face Detection via Global Consistency of Multi-level Features
Abstract
Objective With the rapid development of deepfake technology, forged face images have become increasingly difficult to identify, posing a potential security threat to people's daily lives and social stability. Although many current methods achieve satisfactory performance in intra-domain tests, they perform poorly when detecting unseen forgery types. Given that the forged and unforged regions of a manipulated face image carry inconsistent source features, we propose a face deepfake detection method based on the global consistency of multi-level features. Method A face structure destruction module is used to strengthen the model's attention to local details and subtle anomalies. A multi-level feature fusion module enables interactive learning among features from different levels of the backbone network, fully mining the forgery cues contained at each level. A global consistency module guides the model to better extract feature representations of forged regions, ultimately enabling accurate classification of face images. Result Experiments are conducted on two datasets. In intra-domain experiments, our method outperforms current state-of-the-art detection methods on all metrics, reaching an AUC (area under the curve) of 99.02% on the high-quality and 90.06% on the low-quality FaceForensics++ dataset. In generalization experiments, our method surpasses current mainstream forgery detection methods on multiple evaluation metrics. In addition, ablation experiments further verify the effectiveness of each module. Conclusion The proposed method detects deepfake faces accurately, generalizes well, and can serve as an effective countermeasure against current face forgery threats.
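The global consistency module described above compares a global feature vector with the feature vector at every spatial position. A minimal NumPy sketch of that idea follows; the function name, the use of average pooling for the global vector, and cosine similarity are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def consistency_map(feats):
    """Sketch of a global consistency map (details assumed, not the paper's exact design).

    feats: (C, H, W) feature map, e.g. the output of the fusion module.
    Returns an (H, W) map of cosine similarities between the global
    (average-pooled) feature vector and each local feature vector;
    inconsistent (likely forged) regions score low.
    """
    c, h, w = feats.shape
    local = feats.reshape(c, -1)              # (C, H*W) local feature vectors
    glob = local.mean(axis=1, keepdims=True)  # (C, 1) global feature vector
    # cosine similarity between the global vector and every location
    num = (local * glob).sum(axis=0)
    den = np.linalg.norm(local, axis=0) * np.linalg.norm(glob) + 1e-8
    return (num / den).reshape(h, w)
```

The resulting map can be multiplied element-wise with the fused feature map so that inconsistent regions are emphasized before classification.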
Multi-level features global consistency for human facial deepfake detection
Yang Shaocong, Wang Jian, Sun Yunlian, Tang Jinhui (School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China)
Abstract
Objective Human face images carry personal identity information used in scenarios such as communication, access control, and payment. However, advanced deep forgery technology can now fabricate facial information that is hard to distinguish from the real thing. Most existing deep learning methods generalize poorly to unseen forgeries. Our method focuses on the consistency of source features: in a deepfake, the source features of forged and unforged regions are inconsistent, whereas in a real image they are consistent.
Method First, a face structure destruction module is designed to reshuffle image patches. It forces the model to attend to local details and abnormal regions, and it restricts overfitting to face structure semantics, which are irrelevant to the deepfake detection task. Next, we extract shallow, medium, and deep features from the backbone network and develop a multi-level feature fusion module to guide their fusion. Specifically, shallower features provide more detailed forgery clues to deeper levels, while deeper features suppress irrelevant details in shallower features and extend the coverage of abnormal regions, so the network can better attend to real-or-fake semantics at all levels. The shallow, medium, and deep features are first refined by a channel attention module and then merged by a guided dual feature fusion module, which performs a guided fusion of shallow features into deep features and of deep features into shallow features; the feature maps output by the two fusion branches are added together. In this way, forgery-related information at each layer is mined more thoroughly. Third, we extract a global feature vector from the fused features. To obtain a consistency map, we calculate the similarity between the global feature vector and each local feature vector (i.e., the feature vector at each spatial position); inconsistent areas are highlighted in this map. We multiply the output of the multi-level feature fusion module by the consistency map, and the result is combined with the output of the backbone network and fed to the classifier for the final binary classification. To learn forged areas better, we label the forged area of each face image as follows: 1) align the forged face image with its corresponding real face image and calculate the difference between corresponding pixel values; 2) generate a difference image whose spatial size equals that of the real and fake face images; 3) convert the difference image to grayscale, linearly map each pixel value to [0, 1], and binarize the result with a threshold of 0.1, yielding the final forged-area label. Our main contributions are as follows: 1) a global consistency module that captures the inconsistent source features in forged images; 2) a multi-level feature guided fusion module that makes the network attend to forgery information while suppressing irrelevant background details; 3) a face structure destruction module that prevents the model from overfitting to face structure semantics when distinguishing fake faces from real ones. Our method performs well in both intra-dataset and cross-dataset tests, achieving highly competitive detection accuracy and generalization performance. In the experiments, we sample 30 equally spaced frames from each video in the training set and 100 frames from each video in the test set. For each image, we keep the largest detected face and resize it to 320 × 320 pixels. We use the Adam optimizer with a learning rate of 0.000 2 and a batch size of 16.
Result Our method is compared with 8 recent methods on two datasets (covering five forgery methods). On the FaceForensics++ (FF++) dataset, we obtain the best performance; on low-quality FF++, our area under the curve (AUC) exceeds the second-best result by 1.6%. In the generalization experiment across the four forgery methods in FF++, we outperform the baseline. In the cross-dataset generalization experiment (trained on FF++ and tested on Celeb-DF), we achieve the best AUC on both datasets. In addition, ablation experiments conducted on FF++ verify the effectiveness of each module.
Conclusion The proposed method detects deepfakes accurately, generalizes well, and has the potential to counter the threat of deep forgery.
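The forged-area label generation described in the Method section (pixel-wise difference, grayscale conversion, linear mapping to [0, 1], binarization at 0.1) can be sketched as follows; the function name and the channel-mean grayscale conversion are our assumptions for illustration.

```python
import numpy as np

def forged_area_label(fake_img, real_img, thresh=0.1):
    """Sketch of forged-area label generation (helper name and grayscale
    conversion assumed for illustration).

    fake_img, real_img: aligned uint8 images of identical shape (H, W, 3).
    Returns a binary (H, W) forged-area mask: 1 where the fake differs
    from the real image, 0 elsewhere.
    """
    # 1) pixel-wise difference between the aligned fake and real images
    diff = np.abs(fake_img.astype(np.float32) - real_img.astype(np.float32))
    # 2) the difference image keeps the same spatial size; collapse RGB to grayscale
    gray = diff.mean(axis=2)
    # 3) linearly map to [0, 1], then binarize with a threshold of 0.1
    rng = gray.max() - gray.min()
    norm = (gray - gray.min()) / rng if rng > 0 else np.zeros_like(gray)
    return (norm > thresh).astype(np.uint8)
```

Identical images yield an all-zero mask, so only genuinely manipulated regions contribute to the supervision signal.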
Keywords
face forgery detection; deep fakes; multi-level feature learning; global consistency; attention mechanism