Research on diffusion model generated video datasets and detection benchmarks
Zheng Tianpeng, Chen Yanxiang, Wen Xinzhe, Li Yancheng, Wang Zhiyuan (School of Computer Science and Information Engineering, Hefei University of Technology) Abstract
Objective Diffusion models have achieved remarkable success in video generation, and current video generation diffusion models are simple to use, which also makes such videos easy to abuse. Existing video forensics datasets focus mostly on face forgery and lack coverage of general scenes, which limits research on generated video detection. With the development of video diffusion models, general-scene videos can now be generated, but existing generated video datasets are of a single type, small in scale, and some do not include real videos, making them unsuitable for the generated video detection task. To address these problems, this paper presents a multi-type, large-scale generated video dataset and detection benchmark covering both text-to-video (T2V) and image-to-video (I2V) generation methods. Method Existing T2V and I2V diffusion video generation methods are used to produce diverse, large-scale generated video data, which is combined with real videos collected from the Internet to form the final dataset. For T2V generation, prompt texts covering 15 categories are used to generate T2V videos with rich scenes; for I2V generation, a downloaded high-quality image dataset is used to generate high-quality I2V videos. To assess the quality of the generated videos, state-of-the-art generated video evaluation methods are applied, and video detection methods are used to perform generated video detection. Results A general-scene generated video dataset containing both T2V and I2V videos, the diffusion generated video dataset (DGVD), is created, and a quality estimation method covering T2V and I2V videos is proposed by combining the state-of-the-art generated video evaluation methods EvalCrafter and AIGCBench. The detection benchmark uses 4 image-level detection methods, CNNdet (CNN Detection), DIRE (Diffusion Reconstruction Error), WDFC (Wavelet Domain Forgery Clues), and DIF (Deep Image Fingerprint), and 6 video-level detection methods, I3D (Inflated 3D), X3D (Expand 3D), C2D, Slow, SlowFast, and MViT (Multiscale Vision Transformer). The image-level methods fail to detect unseen data effectively and generalize poorly, whereas the video-level methods perform better on videos generated by methods built on the same backbone network and show some generalization ability, but still cannot achieve good results on videos from other networks. Conclusion This paper builds a large-scale video dataset with rich generation categories and diverse scenes. The dataset and benchmark address the shortage of datasets and benchmarks for generated video detection in such scenarios and help advance the field of generated video detection. Code: https://github.com/ZenT4n/DVGD
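As a concrete illustration of the prompt-construction step described above, the following Python sketch combines entity words with a request template to produce the inputs that would be sent to ChatGPT. The entity list, template wording, and request format are hypothetical placeholders, not the paper's exact prompts.

```python
# Hypothetical sketch of the prompt-text construction step: entity words are
# combined with a template and sent to a chat model to obtain scene prompts.
# The entity list, template wording, and model usage are illustrative only.
entities = ["dog", "cat", "car", "mountain", "beach"]  # the paper uses 15 categories

TEMPLATE = (
    "Write one short, visually descriptive sentence for a video scene "
    "that features a {entity}."
)

def build_requests(entity_words):
    """Expand each entity word into a request string for the chat model."""
    return [TEMPLATE.format(entity=e) for e in entity_words]

if __name__ == "__main__":
    for request in build_requests(entities):
        # Each request would be sent to ChatGPT; the replies become the
        # T2V prompt texts (231 in total in the paper).
        print(request)
```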
Keywords
Research on diffusion model generated video datasets and detection benchmarks
Zheng Tianpeng, Chen Yanxiang, Wen Xinzhe, Li Yancheng, Wang Zhiyuan (School of Computer Science and Information Engineering, Hefei University of Technology) Abstract
Objective Diffusion models have shown remarkable success in video generation; with models such as OpenAI's Sora, a video can be generated from a simple text prompt or image. However, this convenience also raises concerns about the potential abuse of generated videos for deceptive purposes. While existing detection techniques primarily target face videos, there is a noticeable lack of datasets dedicated to detecting forged general-scene videos generated by diffusion models. Moreover, many existing datasets suffer from limitations such as a single conditional modality and insufficient data volume. To address these challenges, we propose a multi-conditional generated video dataset and a corresponding detection benchmark. Existing generated video detection methods are often trained on videos from a single conditional modality, such as text or image, which restricts their ability to detect a wide range of generated videos. For instance, detectors trained solely on videos generated by text-to-video (T2V) models may fail to identify videos generated by image-to-video (I2V) models. By introducing multi-conditional generated videos, we aim to provide a more comprehensive and robust dataset that encompasses both T2V and I2V generated videos. The dataset construction process involves collecting diverse videos generated under multiple conditions together with real videos downloaded from the Internet. Each generation method produces a substantial number of videos that can be used to train detection models. The diverse conditions and large number of generated videos ensure that the dataset captures the broad characteristics of diffusion-generated videos, thereby enhancing the effectiveness of detection models trained on it. Method A generated video dataset provides training data for detection, allowing a detector to recognize whether a video is AI-generated. Our dataset uses existing state-of-the-art diffusion video models to generate numerous videos, including videos generated by T2V models and by I2V models. One key to generating high-quality videos is the prompt text, and we use ChatGPT to generate the prompt texts. To obtain more general prompt texts, we define 15 entity words, such as dog and cat, and combine them with a template as the input to ChatGPT. In this way, we obtain 231 prompt texts and use them to generate T2V videos. Unlike T2V models, an I2V model takes an image as the input condition and animates its content. The videos generated by these T2V and I2V methods are combined with real videos obtained from the web to build the final dataset. The generation quality of the videos is evaluated using state-of-the-art generated video evaluation methods; we combine the metrics of EvalCrafter and AIGCBench to evaluate the generated videos. For the detection benchmark, we use advanced detection methods, comprising 4 image-level detectors and 6 video-level detectors, to evaluate the performance of existing detectors on our dataset under different experimental settings.
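To make the video-level detection setup more concrete, the sketch below builds a binary real-versus-generated classifier from a Kinetics-pretrained Slow backbone loaded through PyTorchVideo's torch hub, in the spirit of the Slow/SlowFast-style detectors in the benchmark. The hub model name (slow_r50), clip shape, and head replacement are assumptions for illustration rather than the paper's exact training configuration.

```python
# Minimal sketch of a video-level detector of the kind benchmarked here:
# a pretrained 3D CNN backbone with its classification head replaced by a
# binary real-vs-generated head. Model choice and clip shape are assumptions.
import torch
import torch.nn as nn

# Slow-pathway ResNet-50 from PyTorchVideo's torch hub (Kinetics-pretrained).
backbone = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)

# Replace the final projection layer with a 2-class (real / generated) head.
backbone.blocks[-1].proj = nn.Linear(backbone.blocks[-1].proj.in_features, 2)

# A dummy clip: batch x channels x frames x height x width.
clip = torch.randn(1, 3, 8, 256, 256)
logits = backbone(clip)                        # shape: (1, 2)
prob_generated = logits.softmax(-1)[0, 1].item()
print(f"probability the clip is generated: {prob_generated:.3f}")
```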
Result We introduce a general-scene generated video dataset, the diffusion generated video dataset (DGVD), and a detection benchmark constructed with multiple generated video detection methods. A generated video quality estimation method covering T2V and I2V is proposed by combining the current state-of-the-art evaluation methods EvalCrafter and AIGCBench. Generated video detection experiments were conducted with 4 image-level detectors, CNNdet (CNN Detection), DIRE (Diffusion Reconstruction Error), WDFC (Wavelet Domain Forgery Clues), and DIF (Deep Image Fingerprint), and 6 video-level detectors, I3D (Inflated 3D), X3D (Expand 3D), C2D, Slow, SlowFast, and MViT (Multiscale Vision Transformer). We set up two experiments: within-class detection and cross-class detection. For within-class detection, we train the detectors on the T2V training set and evaluate them on the T2V test set. For cross-class detection, we train the detectors on the same T2V training set but evaluate them on the I2V test set. The experimental results demonstrate that image-level detection methods are unable to effectively detect unseen data and exhibit poor generalization. Conversely, video-level detection methods perform better on videos generated by methods built on the same backbone network, but they still fail to generalize well to other classes. These results indicate that existing video detectors cannot identify the majority of videos generated by diffusion video models. Conclusion We introduce a novel dataset, the diffusion generated video dataset (DGVD), designed to cover a diverse array of categories and generation scenarios and to address the need for advances in generated video detection. By providing a comprehensive dataset and benchmark, we offer a more challenging environment for training and evaluating detection models. The dataset and benchmark not only highlight current gaps in generated video detection but also support further progress in the field. We hope they will drive significant strides toward enhancing the robustness and effectiveness of generated video detection systems, ultimately promoting innovation and advancement in the field. Code: https://github.com/ZenT4n/DVGD
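The within-class and cross-class settings can be summarized by the hypothetical evaluation loop below: a detector trained only on the T2V training split is scored on the T2V test split (within-class) and on the I2V test split (cross-class) using accuracy and AUC. The data loaders and the detector are placeholders, not artifacts released with the benchmark.

```python
# Hypothetical sketch of the two benchmark settings: within-class detection
# (train on T2V, test on T2V) and cross-class detection (train on T2V, test
# on I2V). Loader construction and the detector itself are placeholders.
import torch
from sklearn.metrics import accuracy_score, roc_auc_score

@torch.no_grad()
def evaluate(detector, loader, device="cpu"):
    """Run a trained detector over one split and report accuracy and AUC."""
    detector.eval()
    labels, scores = [], []
    for clips, targets in loader:                 # targets: 0 = real, 1 = generated
        probs = detector(clips.to(device)).softmax(-1)[:, 1]
        scores.extend(probs.cpu().tolist())
        labels.extend(targets.tolist())
    preds = [int(s > 0.5) for s in scores]
    return accuracy_score(labels, preds), roc_auc_score(labels, scores)

# detector = ...   # trained on the T2V training split only
# acc_w, auc_w = evaluate(detector, t2v_test_loader)   # within-class setting
# acc_c, auc_c = evaluate(detector, i2v_test_loader)   # cross-class setting
```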
Keywords