Multivariate and Soft Blending Samples Driven Image-Text Alignment for Deepfake Detection

WANG Shiyu, FENG Caibo, LIU Chunxiao, JIN Yisheng (Zhejiang Gongshang University)

Abstract
Objective With the rapid development of facial image synthesis technology, deep-learning-based face forgery techniques pose a growing threat to social information security. However, because the data distributions of samples generated by different forgery methods differ substantially, existing face forgery detection methods suffer from low accuracy and poor generalization. To address these problems, we propose a new multivariate and soft blending samples driven image-text alignment method for face forgery detection, which fully exploits the multi-modal alignment of images and text to capture subtle traces of face forgery. Method Considering that traditional face forgery detection methods are trained only on forged images of a single forgery mode and struggle with complex forgery modes, we propose a multivariate and soft blending augmentation (MSBA) method, which strengthens the network's ability to capture cues of multiple forgery modes simultaneously and improves its detection of complex and unknown forgery modes. Because face forgery images vary widely in forgery mode and intensity, the detection performance of the network model tends to degrade; based on MSBA, we therefore design a multivariate forgery intensity estimation (MFIE) module that learns effectively from face forgery images of different modes and intensities, guiding the image encoder to extract more generalizable features and improving the detection accuracy of the overall framework. Results In in-domain experiments, compared with the best-performing existing method, our method improves the accuracy (ACC) and area under the curve (AUC) metrics by 3.32% and 4.08%, respectively. In cross-domain experiments, our method is tested against six representative methods on five datasets, improving the average AUC by 3.27%. Ablation results show that both the proposed MSBA method and the MFIE module contribute substantially to face forgery detection performance. Conclusion The CLIP-based network framework designed for face forgery detection greatly improves detection accuracy; the proposed MSBA method and MFIE module both contribute effectively, yielding performance that surpasses existing methods.
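As a concrete illustration of the image-text alignment idea underlying this framework, the following minimal sketch scores a face image against real/fake text prompts with a CLIP ViT-B/16 backbone. It assumes the OpenAI CLIP package is installed; the prompt wording is hypothetical, since the abstract does not specify the paper's actual prompt design.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 matches the 16x16 image patch size used in the experiments
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical prompts; the paper's actual text design is not given here
prompts = clip.tokenize(["a photo of a real face",
                         "a photo of a forged face"]).to(device)
image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T  # CLIP's standard logit scaling
    probs = logits.softmax(dim=-1)             # [p(real), p(fake)]

The real/fake decision then reduces to which text embedding the image embedding aligns with more closely.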
Keywords
Multivariate and Soft Blending Samples Driven Image-Text Alignment for Deepfake Detection

WANG Shiyu, FENG Caibo, LIU Chunxiao, JIN Yisheng (Zhejiang Gongshang University)

Abstract
Objective With the rapid development of facial image synthesis technology, from simple image editing techniques to complex generative adversarial networks, people can easily create highly realistic fake facial images and videos, which increasingly have a negative impact on social information security. Because the data distributions of samples generated by different forgery methods differ significantly, existing face forgery detection methods suffer from low accuracy and poor generalization. To address this challenge, we propose a method called multivariate and soft blending samples driven image-text alignment for deepfake detection, which fully utilizes the multi-modal alignment of images and text to capture subtle traces of face forgery. Method Considering that traditional face forgery detection methods are trained only on single-mode forged images and struggle with complex forgery modes, we introduce the multivariate and soft blending augmentation (MSBA) method. By randomly mixing forged images of different forgery modes with various weights, we generate multivariate and soft blending images. Our network model learns to estimate the blending weight of each forgery mode and the forgery intensity map from these images, enhancing its ability to capture multiple forgery clues simultaneously and thereby further improving its detection of complex and unknown forgery patterns. Because the forgery modes and intensities present in face forgery images are diverse, the ability of the network model to distinguish real from forged images can decline. To address this issue, we design a multivariate forgery intensity estimation (MFIE) module based on the MSBA method. This module learns effectively from face forgery images with varying modes and intensities, guiding the image encoder to extract more generalized features and improving the overall detection accuracy of our network framework. Our main contributions include: (1) We are the first to integrate the CLIP model into the face forgery detection task, proposing a multivariate and soft blending samples driven image-text alignment network framework for face forgery detection that leverages the multi-modal information alignment of images and text to significantly enhance detection accuracy. (2) To strengthen our network model's ability to recognize various forgery patterns, we introduce the multivariate and soft blending augmentation (MSBA) method, which synthesizes multivariate and soft blending images encompassing complex forgery patterns. Building upon MSBA, we further develop a multivariate forgery intensity estimation (MFIE) module that guides the network to deeply mine features related to forgery patterns and intensities within facial forgery images. Working in tandem, the MSBA method and MFIE module drive our backbone network to selectively extract targeted forgery cues from images that encompass a range of forgery patterns, enhancing the model's generalization and robustness. (3) Our experimental results demonstrate highly competitive performance in both in-domain and cross-domain tests across datasets including FaceForensics++ (FF++), Celeb-DF, DeepFake Detection Challenge (DFDC), DeepFake Detection Challenge Preview (DFDCP), DeepFake Detection (DFD), and DeeperForensics-1.0 (DFV1).
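To make the MSBA procedure concrete, the following sketch blends K forged versions of the same face with random soft weights and derives the two auxiliary regression targets named above. The Dirichlet weight sampling and the difference-based intensity map are assumptions for illustration; the paper's exact formulation is not reproduced in this abstract.

import torch

def msba(real: torch.Tensor, fakes: torch.Tensor, alpha: float = 1.0):
    """real: (C,H,W) pristine face; fakes: (K,C,H,W) the same face forged
    by K different forgery modes. Returns the blended image, the soft
    blending weights (one per forgery mode), and a per-pixel forgery
    intensity map, the latter two serving as auxiliary regression targets."""
    k = fakes.shape[0]
    # Random soft weights over the K forgery modes (assumed Dirichlet)
    weights = torch.distributions.Dirichlet(torch.full((k,), alpha)).sample()
    blended = (weights.view(k, 1, 1, 1) * fakes).sum(dim=0)
    # Assumed intensity map: per-pixel deviation from the pristine image
    intensity = (blended - real).abs().mean(dim=0, keepdim=True)  # (1,H,W)
    return blended, weights, intensity

During training, the network predicts the blending weights and intensity map alongside the real/fake label, which is what pushes the encoder to attend to several forgery modes at once.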
In our experiments, we extracted 16 frames from each video for the training set and 32 frames for the testing set, with all images resized to 224×224 and normalized to the range [0,1] before being fed into the network. We initialized the network using a pre-trained CLIP model with a 16×16 image patch size. For training, we employed the AdaN optimizer with an initial learning rate of 2e-5 and a batch size of 64. After 75 training epochs, we applied a cosine annealing strategy for 25 additional epochs, reducing the learning rate to 2e-7. Our method is implemented in the PyTorch framework and trained on a single NVIDIA GeForce RTX 3090 GPU. Following previous work, we primarily use the area under the ROC curve (AUC) and accuracy (ACC) metrics to evaluate the performance of the network. Results In in-domain experiments, our approach achieves a notable improvement over the best-performing existing methods: 3.32% in ACC and 4.08% in AUC. In cross-domain experiments, our method was tested against six existing methods at the image level across five datasets, yielding an average improvement of 3.27% in AUC. Ablation results indicate that both the proposed MSBA method and the MFIE module contribute positively to face forgery detection performance. Conclusion We designed a CLIP-based network framework for the face forgery detection task that significantly enhances the accuracy of detecting forged faces; the proposed MSBA method and MFIE module both play a crucial supporting role, and together these contributions yield performance gains that surpass existing methods. Because our method builds on a large-scale vision-language model, its parameter count and computational complexity are relatively high, which limits its response speed. In future work, we will consider reducing the computational overhead of the model while maintaining or further improving the accuracy and robustness of face forgery detection.
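For reference, a minimal sketch of the learning-rate schedule described above: 75 epochs at the initial rate followed by 25 epochs of cosine annealing down to 2e-7. The abstract names the AdaN optimizer, which core PyTorch does not provide, so AdamW is substituted here as a stand-in; `model` is a placeholder for the CLIP-based detector.

import torch
from torch.optim.lr_scheduler import SequentialLR, ConstantLR, CosineAnnealingLR

model = torch.nn.Linear(512, 2)  # placeholder for the CLIP-based detector
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        ConstantLR(optimizer, factor=1.0, total_iters=75),     # epochs 1-75: fixed 2e-5
        CosineAnnealingLR(optimizer, T_max=25, eta_min=2e-7),  # epochs 76-100: anneal to 2e-7
    ],
    milestones=[75],
)

for epoch in range(100):
    # ... one training pass over the 16-frames-per-video training set ...
    scheduler.step()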
Keywords
