Deepfake video detection with feature interaction amongst key frames

Zhu Kaiman1, Xu Wenbo1, Lu Wei1, Zhao Xianfeng2,3 (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China; 2. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100195, China; 3. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100195, China)

Abstract
Objective Deepfake is an emerging technique that uses deep learning to manipulate images and videos, and manipulation of face videos in particular poses a serious threat to society and individuals. At present, detection methods that exploit temporal or multi-frame information are still at an early stage of research, and existing work often overlooks how the way frames are extracted from a video affects both the effectiveness and the efficiency of detection. For face-swap manipulation videos, an efficient detection framework is proposed that extracts per-frame features from multiple key frames and lets the frames interact with one another.

Method A number of key frames are extracted directly from the video stream, avoiding inter-frame decoding (a minimal code sketch of this step follows the abstract). A convolutional neural network maps the face image of each frame into a unified feature space. Multiple layers of self-attention-based encoder units, together with linear and non-linear transformations, allow each frame feature to aggregate information from the other frames for learning and updating, and to expose the anomalies that manipulated frames exhibit in the feature space. An additional indicator aggregates the global information and makes the final detection decision.

Result The proposed framework achieves detection accuracies above 96.79% on all three face-swap datasets of FaceForensics++ and 99.61% on the Celeb-DF dataset. Comparative experiments on detection time also confirm that using key frames as samples improves detection efficiency and that the proposed framework is efficient.

Conclusion The proposed detection framework for face-swap manipulation videos reduces the computational cost and time of video-level detection by extracting key frames, maps the face image of each frame into a feature space with a convolutional neural network, and uses a self-attention-based inter-frame interaction learning mechanism so that the frame features can attend to one another and learn discriminative information, making the detection results more accurate and the overall detection process more efficient.
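The paper does not include code, but the key-frame extraction step described in Method can be illustrated with a minimal sketch. It assumes the PyAV library and decodes only intra-coded (key) frames, so the costly inter-frame decoding of P- and B-frames is skipped; the limit of 16 frames mirrors the sample size used in the experiments.

```python
# Minimal sketch of key-frame extraction, assuming the PyAV library ("pip install av").
# Only intra-coded (key) frames are decoded, so inter-frame decoding of P/B frames is skipped.
import av


def extract_key_frames(video_path, max_frames=16):
    """Return up to max_frames key frames of the video as RGB numpy arrays."""
    frames = []
    container = av.open(video_path)
    stream = container.streams.video[0]
    # Ask the decoder to skip every non-key frame.
    stream.codec_context.skip_frame = "NONKEY"
    for frame in container.decode(stream):
        frames.append(frame.to_ndarray(format="rgb24"))
        if len(frames) >= max_frames:
            break
    container.close()
    return frames
```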
Keywords
Deepfake video detection with feature interaction amongst key frames

Zhu Kaiman1, Xu Wenbo1, Lu Wei1, Zhao Xianfeng2,3(1.School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;2.State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100195, China;3.School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100195, China)

Abstract
Objective Image and video manipulation is becoming easier to perform and harder to distinguish with the development of deep learning. Deepfake is a kind of face manipulation technique that poses a great threat to social security and individual rights. Researchers have proposed various detection models and frameworks, which can be divided into three categories according to their inputs: frame level, clip level, and video level. Frame-level models focus on a single frame and ignore temporal information, which can lead to low confidence when judging a whole video. Clip-level models take a sequence of frames simultaneously, but the sequence is much shorter than the video itself, so a clip cannot represent a video well; clips are also fragmented and may adversely affect video-level detection, and the consecutive frames in a short clip differ little from one another, introducing redundant information that may degrade detection performance. Video-level methods take frames sampled at large intervals as input and capture more key features to represent the video, but existing methods ignore the impact of the sample extraction procedure and the expensive computation of decoding the video stream. To address this problem and provide a more efficient detection method for face-swap manipulation videos, a detection framework based on the interaction of key frame features is proposed.

Method The proposed detection framework consists of two parts: key frame extraction together with face region extraction, and the detection model. First, a number of key frames are extracted from the video stream and validated; decoding only key frames avoids inter-frame decoding and reduces computation time. Next, the multitask cascaded convolutional neural network (MTCNN) locates the face region in each extracted frame, and face images are cropped with a margin of 80 pixels; MTCNN is then applied again to these crops to obtain compact face images. The face images are mapped into a high-dimensional embedding space by Inception-ResNet-V1, a convolutional neural network initialized with parameters pre-trained on face recognition and fine-tuned end to end. Finally, the key frame features are fed into an interaction learning module composed of several self-attention-based encoders. In this module, each key frame feature can learn from every other key frame and update itself, and the distinctive abnormal features of manipulated images are extracted through linear and non-linear transformations. A global classification vector is prepended to the key frame features, is updated along with them, and makes the final decision.
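As a minimal sketch of the two-pass face cropping described in Method, assuming the facenet-pytorch implementation of MTCNN: the 80-pixel margin follows the text above, while the image sizes and the helper name crop_face are illustrative.

```python
# Sketch of the two-pass face cropping described in Method, assuming facenet-pytorch
# ("pip install facenet-pytorch"). The margin of 80 pixels follows the text above;
# the image sizes are illustrative, not the authors' exact settings.
from PIL import Image
from facenet_pytorch import MTCNN

# First pass: loose crop, keeping an 80-pixel margin around the detected face box.
loose_detector = MTCNN(image_size=260, margin=80, post_process=False)
# Second pass: re-detect on the loose crop to obtain a compact, aligned face image.
tight_detector = MTCNN(image_size=160, margin=0, post_process=False)


def crop_face(frame_rgb):
    """frame_rgb: HxWx3 uint8 array of one key frame; returns a 3x160x160 tensor or None."""
    image = Image.fromarray(frame_rgb)
    loose = loose_detector(image)  # (3, 260, 260) tensor, or None if no face is found
    if loose is None:
        return None
    loose_image = Image.fromarray(loose.permute(1, 2, 0).byte().numpy())
    return tight_detector(loose_image)  # compact crop fed to the embedding network
```

Running the detector twice trades a little extra computation for tighter, better-aligned face crops to feed to the embedding network.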
Result The detection framework is evaluated on five mainstream datasets: Deepfakes, FaceSwap, FaceShifter, DeepFakeDetection, and Celeb-DF; the first three are from FaceForensics++. With a small number of key frames, it achieves accuracies of 97.50%, 97.14%, 96.79%, 97.09%, and 98.64%, respectively. On Celeb-DF, 3D convolution models and an LSTM-based model are compared with the proposed model using 16 key frames as input, and a lightweight 3D model (L3D) for deepfake detection is tested as well. Because the sample size is smaller than in existing work, R3D, C3D, I3D, and L3D show poor detection performance, while the LSTM-based model achieves an accuracy of 98.06%; the proposed model performs much better, reaching 99.61%. When the input is changed to consecutive frames, the proposed model still performs well at 98.64%. The time cost of detection is also evaluated: the framework detects a video in an average of 3.17 s, less than most compared models and less than detection with consecutive frames as input. These results confirm that both the key frame extraction strategy and the proposed framework are efficient. A realistic scenario is further considered in which the number of key frames available in a test video varies. Slightly more frames than used in training can yield higher accuracy, since the detection model has learned the relations among frames and generalizes well, whereas fewer frames provide insufficient information and degrade performance. In general, the proposed model achieves good and stable detection performance when trained with 16 key frames.

Conclusion An efficient detection framework for face-swap manipulation videos is presented. It takes advantage of key frame extraction, skipping inter-frame decoding and saving time in the preprocessing step. Face region images are cropped from the valid key frames, and Inception-ResNet-V1 maps them into a standardized embedding space, followed by several layers of self-attention-based encoders and linear or non-linear transformations. More meaningful and discriminative information is captured because the frame features can learn from one another. Experiments on the Celeb-DF dataset demonstrate that the model outperforms sequential models and 3D convolutional neural networks, the time cost is reduced, and the efficiency of the proposed framework is improved.
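The detection model described in Method and Conclusion (per-frame embeddings that interact through self-attention encoders, with a prepended global classification vector making the video-level decision) can be sketched as follows, assuming PyTorch and the facenet-pytorch Inception-ResNet-V1; the number of encoder layers, attention heads, and the feed-forward width are illustrative rather than the authors' exact configuration.

```python
# Sketch of the detection model: per-frame embeddings from Inception-ResNet-V1 interact
# through self-attention encoder layers, and a prepended global classification token makes
# the video-level decision. Assumes PyTorch and facenet-pytorch; the layer count, head count
# and feed-forward width are illustrative, not the authors' exact configuration.
import torch
import torch.nn as nn
from facenet_pytorch import InceptionResnetV1


class KeyFrameInteractionDetector(nn.Module):
    def __init__(self, num_layers=4, num_heads=8, embed_dim=512):
        super().__init__()
        # Backbone pre-trained for face recognition, fine-tuned end to end (512-d embeddings).
        self.backbone = InceptionResnetV1(pretrained="vggface2")
        # Learnable global "indicator" token prepended to the key frame features.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=2048, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, 2)  # real vs. fake

    def forward(self, frames):
        # frames: (batch, num_key_frames, 3, 160, 160) face crops
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)  # (b, t, 512)
        cls = self.cls_token.expand(b, -1, -1)                      # (b, 1, 512)
        tokens = self.encoder(torch.cat([cls, feats], dim=1))       # inter-frame self-attention
        return self.head(tokens[:, 0])                              # decide from the indicator token
```

With 16 key frames per video, the input tensor has shape (batch, 16, 3, 160, 160), and the decision is read from the indicator token after the encoder layers, mirroring the description in Method.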
Keywords
