Research progress on speech deepfake and its detection techniques
Xu Yuxiong1,2,3, Li Bin1,2,3, Tan Shunquan1,2,4, Huang Jiwu1,2,3 (1. Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen 518060, China; 2. Shenzhen Key Laboratory of Media Security, Shenzhen 518060, China; 3. College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China; 4. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China)
Abstract
Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence-generated content (AIGC) technologies have driven remarkable improvements in the naturalness, fidelity, and diversity of forged speech, while also posing great challenges to speech deepfake detection. This paper comprehensively reviews the research progress on speech deepfakes and their detection techniques. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then introduces the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, existing detection techniques are categorized and analyzed in depth from the perspective of the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the data augmentation perspective, the influence of different augmentation methods, including noise addition, mask augmentation, channel augmentation, and compression augmentation, on detection performance is analyzed; from the feature extraction and optimization perspective, the strengths and weaknesses of handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based detection methods are compared; and from the learning mechanism perspective, training strategies such as self-supervised learning, adversarial training, and multi-task learning are discussed. Finally, the open challenges of speech deepfake detection are summarized and future research directions are outlined. The datasets and code collected in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
Keywords
Research progress on speech deepfake and its detection techniques
Xu Yuxiong1,2,3, Li Bin1,2,3, Tan Shunquan1,2,4, Huang Jiwu1,2,3 (1. Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen 518060, China; 2. Shenzhen Key Laboratory of Media Security, Shenzhen 518060, China; 3. College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China; 4. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China)
Abstract
Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content (AIGC) technologies have significantly advanced speech deepfake techniques, enhancing the naturalness, fidelity, and diversity of synthesized speech while posing great challenges to speech deepfake detection. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and detection. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares and analyzes previously published reviews in this field. Second, it provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes input text and generates speech that matches it by applying linguistic rules to the text description. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial networks, variational auto-encoders, and diffusion models. VC modifies acoustic attributes, such as emotion, accent, pronunciation, and speaker identity, to produce natural, human-like speech. Depending on the number of target speakers, VC algorithms can be categorized into single-target, multi-target, and arbitrary-target conversion. Third, this study introduces the datasets commonly used in speech deepfake detection, provides access links to the open-source ones, and describes the two evaluation metrics most widely used in the field: the equal error rate (EER) and the tandem detection cost function (t-DCF).
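As a concrete reference for the EER metric, the following is a minimal sketch of computing the equal error rate from a set of detection scores. It is illustrative only (the score distributions are synthetic, and official ASVspoof toolkits should be used when reporting results):

```python
import numpy as np

def equal_error_rate(bona_scores, spoof_scores):
    """EER: the operating point where the miss rate on bona fide speech
    equals the false-alarm rate on spoofed speech (higher score = bona fide)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    # Miss rate: bona fide trials scored below the threshold.
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    # False-alarm rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Synthetic, well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)    # bona fide trial scores
spoof = rng.normal(-2.0, 1.0, 1000)  # spoofed trial scores
eer = equal_error_rate(bona, spoof)
```

The t-DCF additionally weighs these two error rates by application-dependent costs and priors in tandem with a speaker verification system, so it cannot be computed from countermeasure scores alone.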
This study then analyzes and categorizes existing speech deepfake detection techniques in detail, comparing the pros and cons of different approaches in depth with a primary focus on data processing, feature extraction and optimization, and learning mechanisms. Notably, the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets are summarized in tabular form. Within this context, the primary focus is the generality of current detection techniques rather than defenses against specific forgery attack methods. Data augmentation applies a series of transformations to the original speech data, including noise addition, mask augmentation, channel augmentation, and compression augmentation, each aiming to simulate complex real-world acoustic environments more effectively. Noise addition, one of the most common methods, perturbs the speech signal with additive noise so that training data reflect realistic acoustic conditions as closely as possible. Mask augmentation masks regions of the time or frequency domain of speech, suppressing the model's reliance on local cues and improving the accuracy and robustness of detection techniques. Channel augmentation addresses the signal attenuation, data loss, and noise interference caused by changes in codecs and transmission channels. Compression augmentation addresses the degradation of speech quality during data compression; the main compression formats considered are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection methods can be divided into handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based methods.
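Two of the augmentation operations above can be sketched in a few lines of numpy. The SNR-controlled noise mixing and the SpecAugment-style time masking below are generic illustrations, not the exact recipes of any surveyed system:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture reaches the requested
    signal-to-noise ratio (in dB), then add it to `speech`."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Required noise power: P_speech / 10^(SNR/10).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

def time_mask(spec, max_width=8, rng=None):
    """Zero out a random span of time frames in a (freq, time) spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[1]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    out = spec.copy()
    out[:, start:start + width] = 0.0
    return out

# Example: mix a 440 Hz tone with white noise at 10 dB SNR.
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(0.0, 1.0, 16000)
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

Frequency masking is the same operation applied along axis 0; channel and compression augmentation instead re-encode the waveform through codecs such as MP3 or OGG.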
Handcrafted features are speech features extracted with the help of prior knowledge; they mainly include the constant-Q transform (CQT), linear frequency cepstral coefficients (LFCCs), and Mel-spectrograms. By contrast, hybrid feature-based forgery detection methods exploit the domain knowledge embedded in handcrafted features while mining richer speech representations through deep networks. End-to-end forgery detection methods learn feature representations and classifiers directly from raw speech signals, eliminating handcrafted feature extraction and allowing the model to discover discriminative features from the input automatically. These detection techniques can be trained on a single feature; alternatively, feature-level fusion can combine multiple features, whether identical or different, using techniques such as weighted aggregation and feature concatenation. By fusing features, detection techniques capture richer speech information and thus improve performance. Regarding learning mechanisms, this study explores the impact of different training strategies on forgery detection, especially self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from speech data to train models; fine-tuning a self-supervised pretrained model can effectively distinguish real from forged speech. Adversarial training enhances the robustness and generalization of the model by adding adversarial examples to the training data.
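The weighted-concatenation form of feature-level fusion mentioned above can be sketched as follows; the feature names, dimensions, and weights are illustrative placeholders, not values from any surveyed method:

```python
import numpy as np

def fuse_features(features, weights=None):
    """Feature-level fusion: scale each utterance-level feature vector
    by a weight and concatenate the results into one representation."""
    if weights is None:
        weights = np.ones(len(features))
    return np.concatenate([w * f for w, f in zip(weights, features)])

# Hypothetical per-utterance features (dimensions chosen for illustration).
lfcc = np.random.randn(60)  # e.g., averaged LFCC frames
cqt = np.random.randn(84)   # e.g., averaged CQT bins
fused = fuse_features([lfcc, cqt], weights=[0.7, 0.3])
```

In practice the weights are often learned jointly with the classifier rather than fixed, and fusion can also happen at the score level instead of the feature level.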
In contrast to plain binary classification, multi-task learning-based forgery detection shares the underlying feature representations across several speech-related tasks, capturing more comprehensive and useful speech information. This approach improves detection performance while making effective use of the speech training data. Although speech deepfake detection techniques achieve excellent performance on some datasets, their performance remains unsatisfactory on speech from real-world scenarios. Analysis of existing research suggests the following main directions for future work: building diversified speech deepfake datasets; studying adversarial examples and data augmentation methods to enhance the robustness of detection techniques; developing generalized speech deepfake detection techniques; and exploring interpretable speech deepfake detection techniques. The relevant datasets and code mentioned can be accessed at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
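The shared-representation idea behind multi-task learning can be made concrete with a minimal numpy forward pass; the layer sizes and the auxiliary attack-type head are illustrative assumptions, not taken from any cited system:

```python
import numpy as np

rng = np.random.default_rng(0)

# A shared encoder whose representation feeds both task heads:
# this sharing is the core mechanism of multi-task learning.
W_shared = rng.normal(0.0, 0.1, (40, 64))  # 40-dim input -> 64-dim shared
W_spoof = rng.normal(0.0, 0.1, (64, 2))    # main head: bona fide vs. spoof
W_attack = rng.normal(0.0, 0.1, (64, 5))   # auxiliary head: attack type

def forward(x):
    h = np.maximum(0.0, x @ W_shared)  # shared representation (ReLU)
    return h @ W_spoof, h @ W_attack   # logits for both tasks

x = rng.normal(0.0, 1.0, (8, 40))  # a batch of 8 utterance-level features
spoof_logits, attack_logits = forward(x)
```

During training, the losses of both heads are summed (often with task weights) and back-propagated through the shared encoder, so the auxiliary task regularizes the representation used for detection.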
Keywords
speech deepfake; speech deepfake detection; speech synthesis (SS); voice conversion (VC); artificial intelligence-generated content (AIGC); self-supervised learning; adversarial training