Multimodal zero-shot defect detection in inspection images based on dual experts

吴华, 贾栋豪, 张婷婷, 白晓静, 孙笠, 蒲梦杨(华北电力大学)

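Abstract
Objective Defect detection in electrical power equipment inspection images plays an important role in improving the safety of power transmission and the reliability of grid operation. However, because the corresponding training datasets are costly to construct, traditional supervised learning methods are difficult to adapt to defect detection in power equipment inspection images. In addition, inspection images usually contain complex and diverse backgrounds, which seriously interfere with defect detection. Method Based on a visual language model combined with textual prompts, a zero-shot defect detection model for power equipment inspection images is proposed. The model contains multiple dual-expert modules: after textual and visual features are obtained from the visual language model, they are processed by the dual-expert modules and fused to produce pixel-level defect detection results. A power equipment inspection image dataset with pixel-level mask annotations is also constructed to evaluate the model comprehensively. Result On the test dataset constructed in this paper, the model is compared with SAA+ (segment any anomaly +), AnomalyGPT, WinCLIP (window-based CLIP), PaDiM (patch distribution modeling), and PatchCore. For pixel-level defect segmentation, AUROC (area under the receiver operating characteristic curve) improves by 18.1% on average and F1-max (F1 score at optimal threshold) improves by 26.1% on average; for image-level defect classification, AUROC improves by 20.2% on average and AP (average precision) improves by 10.0% on average. For each type of power equipment in the dataset, the model achieves the best pixel-level defect segmentation results. Ablation experiments also show that the dual-expert module significantly improves defect detection accuracy. Conclusion By working in a zero-shot manner, the proposed model avoids the high cost of constructing power equipment inspection image datasets. The proposed dual-expert module also reduces interference from complex background regions in inspection images.
Keywords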
Multimodal zero-shot anomaly detection using dual-experts for electrical power equipment inspection images

Wu Hua, Jia Donghao, Zhang Tingting, Bai Xiaojing, Sun Li, Pu Mengyang(North China Electric Power University)

Abstract
Objective Anomaly detection in electrical power equipment inspection images plays an important role in improving the safety of power transmission and the reliability of grid operation. Traditional anomaly detection methods mostly rely on supervised learning and depend heavily on high-quality datasets. However, power equipment in inspection images is typically in normal working condition and abnormal devices are rare, which results in a severe imbalance between normal and abnormal samples. Moreover, constructing datasets of electrical power equipment inspection images involves complex and costly steps such as image acquisition, image screening, and pixel-level segmentation mask annotation. These factors make it difficult for traditional supervised learning methods to adapt to anomaly detection in power equipment inspection images. Furthermore, the working environments of power equipment vary greatly, so inspection images have diverse backgrounds such as forests, power towers, snow-capped mountains, and grasslands. Inspection images are also usually captured by unmanned aerial vehicles (UAVs), which makes it difficult to control factors such as weather, location, and time during capture. As a result, the illumination and viewing angle of the same type of power equipment differ considerably across images, which seriously interferes with the model's accurate identification of defect areas. Method Multimodal large models based on the Transformer framework are pre-trained on massive data and possess powerful zero-shot generalization capability. Visual language models (VLMs), a kind of multimodal large model, can understand image content based on textual prompts.
To avoid the problems caused by constructing inspection image datasets for training, we propose a zero-shot anomaly detection model for electrical power equipment inspection images that is based on a VLM and combined with textual prompts. The model uses textual descriptions of normal and abnormal conditions as prompts and obtains textual features by passing these prompts through the text encoder of the VLM. Meanwhile, the image to be inspected is passed through the VLM's visual encoder, and visual features are taken from multiple intermediate layers to capture multi-scale visual information. These visual features and the textual features are then processed by multiple dual-expert modules. In each module, two experts independently process the visual features and combine them with the textual features to produce their own decisions, which are then integrated into a joint decision. This allows the dual-expert module to mitigate the effects of varying backgrounds, illumination, and viewing angles in inspection images and thus focus on defect areas. To incorporate more contextual and local detail information from the image, the multiple joint decisions are fused to obtain the anomaly detection result. The dual-expert modules are pre-trained on public industrial anomaly detection datasets with the VLM's visual and text encoders frozen. Power inspection image datasets with pixel-level segmentation mask annotations are currently lacking. To evaluate our model comprehensively, we constructed an anomaly detection test dataset whose images have diverse backgrounds, lighting, and viewpoints. Result We compared our method with SAA+ (segment any anomaly +), AnomalyGPT, WinCLIP (window-based CLIP), PaDiM (patch distribution modeling), and PatchCore on the electrical power equipment inspection image dataset we constructed.
For pixel-level anomaly segmentation, AUROC (area under the receiver operating characteristic curve) improved by 18.1% on average and F1-max (F1 score at optimal threshold) improved by 26.1% on average. For image-level anomaly classification, AUROC improved by 20.2% on average and AP (average precision) improved by 10.0% on average. Specifically, our model achieved the best pixel-level anomaly segmentation results for the various types of electrical power equipment. In the pixel-level segmentation of the various insulator categories, our model improved F1-max by at least 15%, and it also performed well on other electrical power equipment. In terms of AUROC, our method performed best on most power equipment. Notably, all methods achieved poor image-level anomaly classification AUROC on line clamps. This is because the backgrounds of line clamp images consist mostly of objects such as pylons and wire rods, whose textures or colours are similar to those of the line clamp, making it difficult for a model to understand the semantic content of the whole image. However, our model uses multi-layer features and dual-expert modules to reduce interference from background objects that resemble the foreground, and it achieves relatively good performance on line clamp images. We also conducted an ablation study. With a single expert, the model produced widespread false positives. In contrast, the dual-expert module enables the two experts to cooperate, focusing on defects and avoiding attention to irrelevant areas, which significantly improves anomaly detection accuracy.
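The two headline metrics above can be illustrated with a short sketch. This is a generic NumPy illustration of AUROC and F1-max, not the authors' evaluation code; the function names `auroc` and `f1_max` are hypothetical, and the choice to sweep the threshold over every distinct score is an assumption about how "F1 score at optimal threshold" is computed. For pixel-level evaluation the inputs would be the flattened anomaly map and the flattened ground-truth mask; for image-level evaluation, one score and one label per image.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) statistic, with tie handling."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    # Average 1-based rank of each distinct score value (ties share a rank).
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    avg_rank = last_rank - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def f1_max(scores, labels):
    """Best F1 over all thresholds placed at distinct score values."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = labels[order]
    # Counts when the top-k scored items are predicted abnormal.
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)
    fn = labels.sum() - tp
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Tied scores cannot be separated, so only cut where the score changes.
    is_cut = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    return float(f1[is_cut].max())
```

Both functions assume that the labels contain at least one normal and one abnormal sample; otherwise AUROC is undefined.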
Conclusion Our model uses both normal and abnormal text prompts to achieve zero-shot anomaly detection in electrical power equipment inspection images, thus avoiding both the imbalance between normal and abnormal samples in electrical inspection image datasets and the high cost of constructing such datasets. By using multi-layer features from the image encoder, the model incorporates a wider range of contextual and local detail information. In the dual-expert module, the joint decision between the two experts focuses effectively on defect areas, thereby reducing interference from background areas. In addition, to meet the need to evaluate both pixel-level anomaly segmentation and image-level anomaly classification for power equipment anomaly detection models in outdoor working scenarios, we constructed an electrical power equipment inspection image dataset and evaluated various models on it. Our model outperforms other VLM-based zero-shot anomaly detection methods in both anomaly segmentation and anomaly classification. Furthermore, the ablation study demonstrated the effectiveness of the dual-expert module in focusing on defect areas and mitigating background interference.
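The data flow of the dual-expert design can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The abstract does not specify the expert architecture, how the two decisions are integrated, or how the joint decisions are fused, so the linear experts, the averaging of the two decisions, the mean fusion across layers, and all names (`expert_decision`, `dual_expert_module`, `detect`, `temperature`) are assumptions made for illustration. Random arrays stand in for the frozen VLM's patch and text features.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def expert_decision(patch_feats, text_feats, weight, temperature=0.07):
    """One expert: project patch features, then compare them with text features.

    patch_feats: (N, D_v) patch tokens from one visual encoder layer.
    text_feats:  (2, D_t) rows are the [normal, abnormal] prompt features.
    weight:      (D_v, D_t) hypothetical linear expert (an assumption).
    Returns the per-patch probability of "abnormal", shape (N,).
    """
    projected = l2_normalize(patch_feats @ weight)
    logits = projected @ l2_normalize(text_feats).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]


def dual_expert_module(patch_feats, text_feats, weight_a, weight_b):
    """Joint decision of two independent experts (plain averaging is assumed)."""
    decision_a = expert_decision(patch_feats, text_feats, weight_a)
    decision_b = expert_decision(patch_feats, text_feats, weight_b)
    return 0.5 * (decision_a + decision_b)


def detect(layer_feats, text_feats, expert_weights, grid):
    """Fuse the joint decisions of all layers into a pixel-level anomaly map."""
    joint = [
        dual_expert_module(feats, text_feats, w_a, w_b)
        for feats, (w_a, w_b) in zip(layer_feats, expert_weights)
    ]
    anomaly_map = np.mean(joint, axis=0).reshape(grid)  # mean fusion (assumed)
    image_score = float(anomaly_map.max())  # image-level score (assumed)
    return anomaly_map, image_score


# Usage with random stand-ins for the frozen encoder outputs.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 layers, 4x4 patches
text = rng.normal(size=(2, 4))
weights = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 4))) for _ in range(3)]
anomaly_map, image_score = detect(layers, text, weights, grid=(4, 4))
```

In a real system the patch-level map would be upsampled to the image resolution, and only the expert weights would be trained while the encoders stay frozen, as the abstract describes.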
Keywords
