Efficient deepfake detection combining CNN and Transformer
Li Ying1,2, Bian Shan1,2, Wang Chuntao1,2, Lu Wei3 (1. College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China; 2. Guangzhou Key Laboratory of Intelligent Agriculture, Guangzhou 510642, China; 3. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China) Abstract
Objective Deepfake video detection is currently a hot research problem in computer vision. Convolutional neural networks (CNNs) and the Vision Transformer (ViT) are both fundamental structures in deepfake detection models. Each has its advantages, yet both suffer from long training and testing times and from a marked drop in accuracy in cross-compression scenarios. Considering the respective strengths and weaknesses of these two model families, as well as the suitability of features from different domains for the detection task, we propose an efficient joint model combining a CNN with a Transformer. Method We design an EfficientNet-based spatial-domain feature extraction branch and a frequency-domain feature extraction branch to enrich the feature representation of a single branch. The branches are then connected to a Transformer encoder and a cross-attention structure to model feature correlations across global regions. To address the accuracy degradation of deepfake detection models in cross-compression and cross-dataset scenarios, we design an attention mechanism and its embedding scheme and combine them with a data augmentation strategy, improving the model's robustness across compression rates and datasets. Result Compared with nine other methods under cross-compression-rate settings on the four manipulation subsets of FaceForensics++, our method achieves detection accuracies of 90.35%, 71.79%, and 80.71% on Deepfakes, Face2Face, and NeuralTextures forged images, respectively, outperforming the compared algorithms. In cross-dataset experiments, our model likewise outperforms the other methods, and its training time on the same device is greatly reduced. Conclusion The proposed joint model combines the advantages of CNNs and the Vision Transformer, exploits the detection characteristics of features from different domains together with the attention and data augmentation mechanisms, and improves deepfake detection in cross-compression and cross-dataset settings, making the model more accurate and efficient.
Keywords
deepfake detection; convolutional neural network (CNN); Vision Transformer (ViT); spatial domain; frequency domain
CNN and Transformer-coordinated deepfake detection
Li Ying1,2, Bian Shan1,2, Wang Chuntao1,2, Lu Wei3 (1. College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China; 2. Guangzhou Key Laboratory of Intelligent Agriculture, Guangzhou 510642, China; 3. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China) Abstract
Objective Deepfake detection has recently become a hot research topic aimed at identifying fake videos synthesized by deep forgery techniques and spread on social networks such as WeChat, Instagram, and TikTok. Most existing methods extract forgery features with a convolutional neural network (CNN) and obtain the final classification score from a classifier built on those features. When facing low-quality or highly compressed forged videos, they improve detection performance by extracting deeper spatial-domain information. However, the forgery traces left in the spatial domain diminish as compression increases, and local features tend to become similar, which severely degrades performance. This motivates us to retain the frequency-domain information of forgery artifacts, which suffers less interference from JPEG compression, as an additional forensic clue. A CNN-based spatial-domain feature extractor captures facial artifacts by stacking convolutions, but its receptive field is limited: it models local information well while ignoring relationships among global pixels. The Transformer excels at long-range dependency modeling in natural language processing and computer vision tasks and is therefore employed to model relationships among image pixels, compensating for the CNN's deficiency in acquiring global information. However, the Transformer processes only sequential tokens, so it still needs to cooperate with a CNN in computer vision tasks.
Method First, we develop a novel joint detection model that leverages the advantages of both the CNN and the Transformer and enriches the feature representation with frequency-domain information. EfficientNet-b0 serves as the feature extractor. To expose more forensic features in the spatial feature extraction stage, an attention module is embedded in a shallow layer, and the deep features are multiplied by the activation map it produces. In the frequency-domain feature extraction stage, the discrete cosine transform (DCT) is adopted as the frequency-domain transform, and an adaptive component is added to the frequency-band decomposition to better learn frequency-domain features. During training, mixed-precision training is adopted to accelerate optimization while remaining memory-efficient. Then, to construct the joint model, the two feature extraction branches are linked to a modified Transformer structure: its encoder models inter-region feature correlations through global self-attention, and cross-attention is computed between the branches to enable information interaction between the dual-domain features. Furthermore, we design and implement a random data augmentation strategy that is coordinated with the attention mechanism to improve detection accuracy in cross-compression-rate and cross-dataset scenarios. (Illustrative sketches of these components follow the abstract.)
Result Our joint model is compared with nine state-of-the-art deepfake detection methods on two datasets, FaceForensics++ (FF++) and Celeb-DF. In the cross-compression-rate experiments on FF++, our method reaches detection accuracies of 90.35%, 71.79%, and 80.71% on Deepfakes, Face2Face, and NeuralTextures (NT) manipulated images, respectively. In the cross-dataset experiments, i.e., training on FF++ and testing on Celeb-DF, our model also outperforms the compared methods, and its training time on the same device is substantially reduced.
Conclusion The experiments demonstrate that the proposed joint model improves cross-dataset and cross-compression-rate detection accuracy. It combines the advantages of EfficientNet and the Transformer with the characteristics of different domain features, the attention mechanism, and the data augmentation strategy, making the model more accurate and efficient.
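The code sketches below are editorial illustrations in PyTorch, not the authors' released implementation. First, the spatial branch's attention embedding: a minimal sketch assuming a single-channel activation map predicted from a shallow EfficientNet stage and multiplied element-wise with a deeper feature map; the module structure and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowAttention(nn.Module):
    """Predicts a single-channel activation map from shallow CNN features."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, kernel_size=1),
            nn.Sigmoid(),  # activation map in [0, 1]
        )

    def forward(self, shallow_feat: torch.Tensor) -> torch.Tensor:
        return self.conv(shallow_feat)  # (B, 1, H, W)

# Re-weight deeper features with the (resized) activation map.
attn = ShallowAttention(in_channels=40)
shallow = torch.randn(2, 40, 56, 56)   # hypothetical early EfficientNet-b0 stage
deep = torch.randn(2, 112, 14, 14)     # hypothetical deeper stage
amap = F.interpolate(attn(shallow), size=deep.shape[-2:],
                     mode="bilinear", align_corners=False)
weighted_deep = deep * amap            # broadcast across channels
```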
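Next, the frequency branch. The abstract specifies a DCT with an adaptive addition to the band decomposition but gives no further detail, so this sketch assumes fixed band masks plus a learnable per-band offset; the band boundaries and input size are illustrative.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n, dtype=torch.float32)
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    basis[0] /= math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)

class AdaptiveBandDecomposition(nn.Module):
    """2-D DCT, fixed band masks, and a learnable per-band offset (illustrative)."""
    def __init__(self, size: int = 224, n_bands: int = 3):
        super().__init__()
        self.register_buffer("D", dct_matrix(size))
        idx = torch.arange(size)
        radius = idx[:, None] + idx[None, :]            # distance from the DC term
        bounds = torch.linspace(0, 2 * size, n_bands + 1)
        masks = [((radius >= bounds[i]) & (radius < bounds[i + 1])).float()
                 for i in range(n_bands)]
        self.register_buffer("masks", torch.stack(masks))
        # Adaptive part: a learnable offset on each fixed band mask.
        self.adaptive = nn.Parameter(torch.zeros(n_bands, size, size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = self.D @ x @ self.D.t()                  # 2-D DCT of (B, C, H, W)
        bands = []
        for mask, offset in zip(self.masks, self.adaptive):
            band = spec * (mask + torch.sigmoid(offset) - 0.5)
            bands.append(self.D.t() @ band @ self.D)    # back to the spatial domain
        return torch.cat(bands, dim=1)                  # (B, C * n_bands, H, W)

x = torch.randn(2, 3, 224, 224)
out = AdaptiveBandDecomposition(size=224, n_bands=3)(x)  # (2, 9, 224, 224)
```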
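The mixed-precision training mentioned in the Method section corresponds to PyTorch's standard automatic mixed precision utilities; a minimal training-loop sketch follows, with `model`, `loader`, and `criterion` as placeholders for the joint detector, the face-crop data loader, and the classification loss.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    scaler = GradScaler()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with autocast():                   # forward pass in mixed fp16/fp32
            logits = model(images)
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()      # scale loss to avoid fp16 underflow
        scaler.step(optimizer)             # unscale gradients, then step
        scaler.update()
```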
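For the encoder stage, a CNN feature map can be flattened into one token per spatial region and passed through a standard Transformer encoder so that global self-attention models inter-region correlations; the class token and all dimensions below are assumptions.

```python
import torch
import torch.nn as nn

# Flatten a CNN feature map into tokens (one per spatial region) and encode
# inter-region correlations with a standard Transformer encoder.
feat = torch.randn(2, 256, 14, 14)            # (B, C, H, W), sizes illustrative
tokens = feat.flatten(2).transpose(1, 2)      # (B, 196, 256)
cls_token = torch.zeros(2, 1, 256)            # learnable in a real model
tokens = torch.cat([cls_token, tokens], dim=1)

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
encoded = encoder(tokens)                     # global self-attention over regions
cls_out = encoded[:, 0]                       # (B, 256) summary for the classifier
```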
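The cross-attention interaction between the dual-domain branches can be sketched with `nn.MultiheadAttention`, each branch using the other's tokens as keys and values; dimensions are again illustrative.

```python
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    """Cross-attention between spatial and frequency token sequences."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.s2f = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor):
        # spatial, freq: (B, N, dim) tokens from the two branches.
        s_out, _ = self.s2f(query=spatial, key=freq, value=freq)
        f_out, _ = self.f2s(query=freq, key=spatial, value=spatial)
        # Residual connection and normalization keep both streams stable.
        return self.norm_s(spatial + s_out), self.norm_f(freq + f_out)

spatial_tokens = torch.randn(2, 196, 256)   # e.g., a flattened 14 x 14 feature map
freq_tokens = torch.randn(2, 196, 256)
s, f = DualBranchCrossAttention()(spatial_tokens, freq_tokens)
```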
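Finally, the random data augmentation strategy. The abstract does not enumerate the transforms, so this sketch assumes random JPEG re-compression, to expose the model to unseen compression rates, alongside common photometric augmentations.

```python
import io
import random
from PIL import Image
from torchvision import transforms

class RandomJpeg:
    """Re-encode a PIL image as JPEG at a random quality factor."""
    def __init__(self, quality=(30, 95), p=0.5):
        self.quality, self.p = quality, p

    def __call__(self, img: Image.Image) -> Image.Image:
        if random.random() > self.p:
            return img
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(*self.quality))
        buf.seek(0)
        return Image.open(buf).convert("RGB")

train_transform = transforms.Compose([
    RandomJpeg(quality=(30, 95), p=0.5),   # simulate unseen compression rates
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```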
Keywords
deepfake detection; convolutional neural network (CNN); Vision Transformer (ViT); spatial domain; frequency domain