Normalizing-flow-based multimodal multi-scale defect detection in industrial scenes
Qu Haicheng, Lin Junjie (Liaoning Technical University) Abstract
Objective Industrial defect detection is a crucial link in modern industrial quality control. To address two challenges in multimodal industrial inspection scenarios, capturing defects of varying shapes and sizes that are barely perceptible in RGB images, and reducing the interference that noise in single-modality raw feature spaces causes in multimodal information interaction, this paper proposes a multimodal multi-scale defect detection method based on normalizing flows. Method First, a Vision Transformer and a Point Transformer extract features from blocks 1, 3, and 11 of the RGB image and 3D point cloud modalities to build a feature pyramid, preserving the spatial information of low-level features to support defect localization and improving robustness to defects of different shapes and sizes. Second, to simplify multimodal interaction, a point feature alignment algorithm aligns the 3D point cloud features to the plane of the RGB image, and unsupervised multimodal feature fusion is achieved by constructing a contrastive learning matrix, promoting information exchange between the modalities. In addition, a pretext task extends the information bottleneck mechanism to the unsupervised setting, reducing noise interference while preserving as much of the original information as possible to obtain a fuller, stronger multimodal representation. Finally, a multi-scale normalizing flow structure captures feature information at different scales and realizes interaction between features of different scales. Result The proposed method is evaluated on the MVTec-3D AD dataset; it reaches 93.3% Detection AUCROC, 96.1% Segmentation AUPRO, and 98.8% Segmentation AUCROC, outperforming most existing multimodal defect detection methods. Conclusion The proposed method detects defects of varying shapes and sizes, including those with low perceptibility in RGB images, reduces the influence of noise in the raw feature space on the multimodal representation, improves generalization to defects of different shapes and sizes, and better meets the requirements of modern industry for defect detection.
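The point feature alignment step described above can be sketched as a pinhole projection that scatters per-point 3D features onto the RGB pixel grid. This is a minimal illustrative simplification, not the paper's implementation; the function name, the intrinsics matrix `K`, and the last-point-wins scatter rule are all assumptions:

```python
import numpy as np

def align_points_to_plane(points, feats, K, hw):
    """Project 3D points through pinhole intrinsics K and scatter their
    features onto an H x W grid aligned with the RGB image."""
    H, W = hw
    grid = np.zeros((H, W, feats.shape[1]))
    uvw = (K @ points.T).T                       # (N, 3) homogeneous pixel coords
    uv = np.rint(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    for (u, v), f in zip(uv, feats):
        if 0 <= v < H and 0 <= u < W:
            grid[v, u] = f                       # last point wins; a real system would average
    return grid

# one point on the optical axis lands at the principal point (2, 2)
K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
aligned = align_points_to_plane(np.array([[0.0, 0.0, 1.0]]),
                                np.array([[5.0, 6.0]]), K, (4, 4))
```

In the actual method a learned Point Transformer feature would take the place of the toy two-channel vector, and the aligned grid is what enters the fusion with RGB features.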
Multimodal multi-scale anomaly detection in industrial scenes via normalizing flows
Qu Haicheng, Lin Junjie (Liaoning Technical University) Abstract
Objective Defect detection stands as a cornerstone of modern industrial quality control. As industries advance, the array of defect types becomes increasingly diverse. Some defects present formidable challenges because they are scarcely perceptible in individual RGB images, which necessitates complementary modalities to aid detection. Consequently, conventional deep learning methods that rely solely on single-modality data have proven inadequate for the dynamic demands of contemporary industrial environments. To address the challenges inherent in multimodal defect detection scenarios prevalent in modern industries, where defects vary significantly in shape and size and often exhibit low perceptibility within individual modalities, this paper suppresses the noise interference inherent in single-modality feature spaces, harnesses the synergies between multimodal information, and introduces a multimodal multi-scale defect detection framework grounded in normalizing flows. Method The proposed method comprises four main components: feature extraction, unsupervised feature fusion, an information bottleneck mechanism, and a multi-scale normalizing flow. Firstly, in the feature extraction stage, we note that features at different levels contain varying degrees of spatial and semantic information: low-level features carry more spatial information, whereas high-level features convey richer semantic information. Given the emphasis on spatial detail in pixel-level defect localization, we utilize a Vision Transformer and a Point Transformer to extract features from RGB images and 3D point clouds, taking blocks 1, 3, and 11 to obtain multimodal representations at different levels, which are then fused and structured into a feature pyramid.
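The block-selection step above can be sketched as follows. This is a hedged toy illustration: the random arrays stand in for real Vision Transformer and Point Transformer outputs, and the function name and channel counts are assumptions, not the paper's code:

```python
import numpy as np

def build_pyramid(rgb_feats, pc_feats, selected=(1, 3, 11)):
    """Concatenate RGB and point-cloud features channel-wise at each
    selected transformer block, ordered low-level to high-level."""
    return [np.concatenate([rgb_feats[i], pc_feats[i]], axis=-1)
            for i in selected]

# toy stand-ins: 12 blocks of 14x14 token maps (8 RGB + 4 point channels)
rgb = {i: np.random.rand(14, 14, 8) for i in range(1, 13)}
pc = {i: np.random.rand(14, 14, 4) for i in range(1, 13)}
pyramid = build_pyramid(rgb, pc)   # 3 levels, each of shape (14, 14, 12)
```

The early block (1) contributes the spatial detail needed for pixel-level localization, while the late block (11) contributes semantics; keeping all three levels is what gives the pyramid its robustness to defect size.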
This approach not only preserves spatial information from low-level features to aid defect localization but also enhances the model's robustness to defects of varying shapes and sizes. Secondly, in the unsupervised feature fusion stage, to streamline multimodal interaction, we employ the point feature alignment technique to align 3D point cloud features with the RGB image plane. We then achieve unsupervised multimodal feature fusion by constructing a contrastive learning matrix, which facilitates interaction between the modalities. Moreover, in the information bottleneck stage, a pretext task is designed to extend the information bottleneck mechanism to unsupervised settings, yielding a more comprehensive and robust multimodal representation by minimizing noise interference within single-modality raw feature spaces while preserving the original information as much as possible. Lastly, the multi-scale normalizing flow uses parallel flows to capture feature information at different scales; fusing these flows realizes interaction between features at various scales. Additionally, an alternative anomaly scoring rule is employed: the average of the top K values in the anomaly score map replaces the traditional mean or maximum, yielding the final defect detection results. Result The proposed method is evaluated on the MVTec-3D AD dataset, which comprises 10 categories of industrial products with 2656 training samples and 1137 testing samples; each category is further segmented into subclasses by the nature of the defects. Thorough experimental validation produces compelling results that highlight the method's performance.
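The top-K scoring rule mentioned above can be written in a few lines; the toy score map below is an assumption for illustration, not data from the paper:

```python
import numpy as np

def topk_score(score_map, k):
    """Image-level anomaly score: mean of the k highest pixel scores,
    a middle ground between the max (sensitive to a single noisy pixel)
    and the mean (diluted by the defect-free background)."""
    flat = np.sort(score_map.ravel())[::-1]   # pixel scores, descending
    return float(flat[:k].mean())

# a 16x16 map with a 2x2 defect of score 1.0 in a 0.0 background
smap = np.zeros((16, 16))
smap[3:5, 3:5] = 1.0
print(topk_score(smap, k=4))   # → 1.0 (exactly the 4 defect pixels)
print(topk_score(smap, k=8))   # → 0.5 (4 defect pixels + 4 background pixels)
```

With the plain mean, the same defect would score only 4/256 ≈ 0.016, illustrating why averaging the top K values keeps small defects detectable.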
We achieved a Detection AUCROC of 93.3%, a Segmentation AUPRO of 96.1%, and a Segmentation AUCROC of 98.8%. These metrics not only reflect the method's effectiveness but also signify its advance over the majority of existing multimodal defect detection methods. Moreover, we visualize selected samples, comparing detection using RGB images alone against RGB images combined with 3D point clouds. The latter combination unveils defects that remain elusive when relying solely on RGB imagery, firmly establishing the advantage of integrating data from both modalities, as posited in our hypothesis. The ablation studies provide additional insight into the efficacy of our method. Introducing the information bottleneck yielded improvements across all three metrics: 1.4% in Detection AUCROC, 2.1% in Segmentation AUPRO, and 3.5% in Segmentation AUCROC. Moreover, integrating the multi-scale normalizing flow further enhanced performance, with gains of 2.5%, 3.6%, and 1.6% on the respective metrics. These findings indicate the substantial contributions of both the information bottleneck and the multi-scale normalizing flow to the overall performance of our defect detection framework. Conclusion The main contributions of this paper are as follows: we employ unsupervised feature fusion to encourage information exchange between modalities; to mitigate the impact of noise in the single-modality raw feature space on multimodal interaction, we incorporate an information bottleneck within the feature fusion module; and we utilize multimodal representations at different levels to construct feature pyramids, addressing the poor performance of previous flow-based methods on defects of varying scales.
The proposed method demonstrates promising detection performance across defects of diverse shapes and sizes, including those with low perceptibility on RGB images. By mitigating the impact of noise within the original feature space on multimodal representation, our approach not only improves the robustness of the method but also enhances its ability to generalize to defects of varying characteristics. This effectively aligns with the stringent demands of modern industry for accurate and reliable defect detection methodologies.
Keywords
multimodal and multi-scale industrial scene anomaly detection; unsupervised feature fusion; pretext task; normalizing flow