Image-imperceptible backdoor attacks
Abstract
Objective Image backdoor attacks are a classic form of adversarial attack: a backdoor attack makes the attacked deep model behave well under normal conditions, yet produce malicious results once the hidden backdoor is activated by a predefined trigger. Existing backdoor attacks have begun to assign clean labels to poisoned samples or to hide the trigger inside the poisoned data so as to evade human inspection, but it is difficult for these methods to possess both security properties at the same time under visual supervision, and their triggers can easily be detected by statistical analysis. We therefore propose an imperceptible and effective image backdoor attack. Method First, the image backdoor trigger is concealed by information hiding, so that a correctly labeled poisoned image sample (label imperceptibility) looks almost identical to the corresponding clean image sample (image imperceptibility). Second, a brand-new backdoor attack paradigm is designed, in which the source class of the poisoned images is also the target class. The proposed attack is not only visually imperceptible but also able to resist classic backdoor defenses (statistical imperceptibility). Result To verify the effectiveness and imperceptibility of the method, comparative experiments against three other methods are conducted on the ImageNet, MNIST, and CIFAR-10 datasets. On all three datasets, the classification accuracy on original clean samples drops by less than 1%, the classification accuracy on poisoned samples exceeds 94%, and the poisoned images have the best visual quality. In addition, we verify that any image into which the proposed trigger is injected at test time can launch an effective backdoor attack. Conclusion The proposed backdoor attack possesses clean labels, an imperceptible trigger, and statistical imperceptibility, and is therefore harder for humans to notice and for statistical methods to detect.
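The abstract does not specify which information-hiding scheme conceals the trigger, so the following is only a minimal sketch, assuming a least-significant-bit (LSB) style embedding of a small trigger image into a clean cover image; the function names and the 2-bit budget are illustrative assumptions rather than the paper's actual algorithm.

```python
import numpy as np

def embed_trigger_lsb(cover: np.ndarray, trigger: np.ndarray, bits: int = 2) -> np.ndarray:
    """Hide a small trigger image in the `bits` least-significant bits of a cover image.

    cover  : uint8 array of shape (H, W, C) -- the clean image.
    trigger: uint8 array of shape (h, w, C) with h <= H and w <= W.
    Returns a poisoned image that looks almost identical to the cover.
    """
    poisoned = cover.copy()
    h, w = trigger.shape[:2]
    # Quantize the trigger to `bits` bits so it fits into the cover's LSBs.
    trigger_q = (trigger >> (8 - bits)).astype(np.uint8)
    # Clear the LSBs of the top-left region, then write the quantized trigger into them.
    mask = np.uint8(0xFF ^ ((1 << bits) - 1))        # e.g. 0b11111100 for bits=2
    poisoned[:h, :w] = (poisoned[:h, :w] & mask) | trigger_q
    return poisoned


def extract_trigger_lsb(stego: np.ndarray, trigger_shape, bits: int = 2) -> np.ndarray:
    """Recover the quantized trigger from a poisoned image; shown only to illustrate
    that the hidden trigger is still present even though it is invisible."""
    h, w = trigger_shape[:2]
    return (stego[:h, :w] & ((1 << bits) - 1)) << (8 - bits)
```

Because only the low-order bits of a small region change, the poisoned image is visually indistinguishable from its cover, which is the image-imperceptibility property the method aims at; the label of the poisoned sample is never touched.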
Keywords
Image-imperceptible backdoor attacks
Zhu Shuwen, Luo Ge, Wei Ping, Li Sheng, Zhang Xinpeng, Qian Zhenxing (School of Computer Science, Fudan University, Shanghai 200433, China)
Abstract
Objective Backdoor attacks are a classic form of adversarial attack against deep models: the attacked model behaves well on benign inputs, but produces malicious outputs once the hidden backdoor is activated by a predefined trigger. A backdoor attack injects predesigned triggers (e.g., specific patterns such as a square patch, noise, stripes, or warping) into a portion of the training data. To keep the attack effective yet unnoticed, existing backdoor attacks either assign clean labels to the poisoned samples or hide the triggers in the poisoned data to evade human inspection. However, it is still difficult for these methods to possess both security properties at the same time under visual supervision, and their triggers can easily be revealed by statistical analysis. To resolve this problem, we develop an imperceptible and effective backdoor attack that is imperceptible to human inspection, to filtering, and to statistical detectors.

Method To generate a poisoned sample, a smaller image serving as the trigger is embedded into a clean image by information hiding, and the poisoned samples are mixed with clean samples to form the final training data. Because the trigger is hidden naturally, the correctly labeled poisoned sample (label imperceptibility) looks almost identical to the corresponding clean sample (image imperceptibility), and it can also withstand state-of-the-art statistical analysis (statistical imperceptibility). We further design a one-to-oneself attack paradigm, in which the source class selected for poisoning is the target class itself. Different from the previous all-to-one and all-to-all paradigms, only a portion of the images of the target class are selected as samples to be poisoned. Since these images carry the correct label of the target class, they remain inconspicuous under human inspection. In contrast, the classical all-to-one and all-to-all paradigms rely on mismatched or wrong labels, and the target class cannot be its own source: mislabeled input-label pairs (e.g., an image of a bird labeled as a cat) may arouse suspicion during human inspection and can be used to reveal the attack, and once such pairs are filtered out, the remaining (mostly clean) samples invalidate the attack. With the same poisoned data, we can also launch a quick attack on a pre-trained model by fine-tuning. Our attack keeps the model's accuracy consistent and is imperceptible in terms of label, image, and statistics.
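To make the one-to-oneself paradigm concrete, here is a hedged sketch of how such a clean-label training set could be built: only images that already belong to the target class are candidates, a fraction of them receives the hidden trigger, and no label is ever modified. The `embed_fn` hook and the default 7% poisoning rate (which echoes the proportion mentioned in the results, though whether it is measured over the target class or the whole training set is an assumption here) are illustrative, not the paper's exact procedure.

```python
import random

def poison_one_to_oneself(images, labels, target_class, trigger,
                          embed_fn, poison_rate=0.07, seed=0):
    """Clean-label, one-to-oneself poisoning of a training set.

    images : list of uint8 arrays (H, W, C)
    labels : list of int class indices
    Only samples whose label equals `target_class` are candidates for poisoning;
    a `poison_rate` fraction of them gets the hidden trigger, and every label
    (including those of poisoned samples) is left untouched.
    """
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == target_class]
    chosen = set(rng.sample(candidates, int(len(candidates) * poison_rate)))

    poisoned_images = [
        embed_fn(x, trigger) if i in chosen else x   # hide the trigger in selected images
        for i, x in enumerate(images)
    ]
    return poisoned_images, list(labels), chosen     # labels are returned unchanged
```

At inference time the attacker would embed the same trigger into an arbitrary input (e.g., `embed_fn(test_image, trigger)`) to steer the backdoored model toward the target class; the experiments reported below indicate that trigger-injected samples of any class activate the backdoor.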
Result To verify the effectiveness and invisibility of the proposed method, we compare it with three popular methods on the ImageNet, MNIST, and CIFAR-10 datasets. For the one-to-oneself attack, poisoning only a small proportion (7%) of the original clean samples is enough to backdoor a model that still achieves high accuracy on ImageNet, MNIST, and CIFAR-10. On all three datasets the backdoor stays inactive when clean samples are tested, and the accuracy on clean samples drops by less than 1% compared with the clean model. Note that some backdoor attacks change the label of the poisoned image to the target label, and such mislabeled input-label pairs are easily detected in practice; we do not modify the label of the trigger-injected images, so every input-label pair in our training set remains correctly matched. For the classical all-to-one attack, the proposed method classifies clean samples with comparable accuracy and achieves comparable attack success rates (more than 99%) on poisoned samples, while, unlike BadNets, its trigger is invisible to human visual inspection. The embedded trigger is imperceptible, and the poisoned image is natural and hard to distinguish from the original clean image. We also quantify invisibility with the learned perceptual image patch similarity (LPIPS), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) metrics. Compared with the three baselines, the mean perceptual distance between our poisoned images and the original images is almost zero, with a near-zero LPIPS value. Our method also achieves the highest SSIM values on all three datasets, indicating that our poisoned samples are the most similar to their benign counterparts. Meanwhile, our attack achieves the highest PSNR values (more than 43 dB on ImageNet, MNIST, and CIFAR-10); on MNIST the PSNR reaches 52.4 dB.

Conclusion An imperceptible backdoor attack is proposed in which the poisoned image keeps its valid label and carries an invisible trigger. The trigger is embedded invisibly by data hiding, so the poisoned images are close to the original clean ones. The user remains unaware of any abnormality throughout the whole process, while other attackers cannot exploit the trigger. In addition, a new attack paradigm, the one-to-oneself attack, is designed for clean-label backdoor attacks: the original label stays unchanged when the selected images are poisoned with the trigger. Under this new paradigm, most existing defenses become invalid, because they rest on the assumption that poisoned samples carry changed labels. Overall, the proposed backdoor attack is imperceptible with respect to label, image, and statistics.
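For reference, the three invisibility metrics reported above can be computed with widely used open-source tools. The snippet below is a sketch assuming uint8 RGB inputs and an AlexNet-based LPIPS model, since the abstract does not state the exact evaluation configuration (grayscale MNIST images would first need to be replicated to three channels for LPIPS).

```python
import numpy as np
import torch
import lpips                                            # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Build the LPIPS model once; the AlexNet backbone is a common default.
_lpips_model = lpips.LPIPS(net='alex')

def imperceptibility_metrics(clean: np.ndarray, poisoned: np.ndarray):
    """Return (PSNR, SSIM, LPIPS) between a clean uint8 RGB image and its poisoned version."""
    psnr = peak_signal_noise_ratio(clean, poisoned, data_range=255)
    ssim = structural_similarity(clean, poisoned, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1] with shape (N, 3, H, W).
    def to_tensor(x):
        return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0

    with torch.no_grad():
        lp = _lpips_model(to_tensor(clean), to_tensor(poisoned)).item()
    return psnr, ssim, lp
```

Higher PSNR and SSIM together with a near-zero LPIPS, as reported in the results, indicate that a poisoned image is nearly indistinguishable from its clean counterpart.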
Keywords