A Class Activation Map Interpretability Method Based on a Multi-Layer Hybrid Attention Mechanism
Zhang Jian1, Zhang Yiran1, Shao Jiang2, Wang Zicong1 (1. Wuhan Digital Engineering Institute; 2. Northwestern Polytechnical University) Abstract
Objective The widespread use of deep convolutional neural networks in visual tasks has drawn attention to their decision mechanisms, since their complexity and opacity make them black-box models. Class activation maps have been shown to effectively improve the interpretability of image classification and thus the understanding of decision mechanisms, but when highlighting target regions, existing methods often suffer from blurred boundaries, overly large extents, and insufficient fine granularity. To address these limitations, a class activation map method with a multi-layer hybrid attention mechanism (spatial attention-based multi-layer fusion for high-quality class activation maps, SAMLCAM) is proposed. Method Previous class activation map methods consider only channel-level weights and ignore spatial position information, which degrades object localization performance. The proposed SAMLCAM introduces a hybrid attention mechanism that combines channel attention and spatial attention, strengthening object localization while suppressing uninformative positions. Building on this localization result, and exploiting the characteristics of the network's multiple convolutional layers, SAMLCAM improves the way multi-layer feature maps are fused through a multi-layer weighted fusion mechanism, alleviating the overly large boundary effects and insufficient granularity of class activation maps and thereby enhancing their visual interpretability. Result The proposed method is evaluated on the ILSVRC 2012 and MS COCO 2017 datasets, both widely used benchmarks for computer vision models, across multiple convolutional network models to be explained, with ablation studies, qualitative evaluation, and quantitative evaluation. The ablation studies verify the effectiveness of each module; the qualitative evaluation visually demonstrates the improvement in interpretability; and the quantitative results show that SAMLCAM improves the Loc1 and Loc5 localization metrics by more than 7% over the lowest-scoring baselines, and the energy-based localization metric by more than 9.85%. Because the improved method reduces the contextual background around the target region, it has a slight negative effect on confidence scores; nevertheless, on the confidence metric SAMLCAM stays within 2% of the other methods and maintains high performance.
Keywords
Spatial Attention-based Multi-layer Fusion Method for High-Quality Class Activation Maps
Zhang Jian1, Zhang Yiran1, Shao Jiang2, Wang Zicong1 (1. Wuhan Digital Engineering Institute; 2. Northwestern Polytechnical University) Abstract
Objective The success of Deep Convolutional Neural Networks (DCNNs) in image classification, object detection, and semantic segmentation has revolutionized the field of artificial intelligence. These models have demonstrated exceptional accuracy and have been deployed in various real-world applications. However, a major drawback of DCNNs is their lack of interpretability, often referred to as the "black-box" problem. When a DCNN makes a prediction, it is challenging to understand how and why it arrived at that decision. This lack of transparency hinders our ability to trust and rely on the model's outputs, especially in critical domains such as healthcare, autonomous driving, and finance. For instance, in medical diagnosis, it is crucial for healthcare professionals to comprehend the reasoning behind a model's diagnosis to make informed decisions about patient care. Explainable Artificial Intelligence (XAI) aims to address this issue by providing human-interpretable explanations for the decisions made by complex machine learning models. XAI seeks to bridge the gap between model performance and model interpretability, allowing users to understand the inner workings of the model and have confidence in its outputs. Researchers have been actively developing techniques and methods to enhance the interpretability of deep learning models. One approach is to generate visual explanations through techniques like CAMs, Grad-CAM, and Smooth Grad-CAM. These methods provide heatmaps or attention maps that highlight the areas of an input image that influenced the model's decision the most. By visualizing this information, users can gain insights into the features and patterns the model focuses on when making predictions. Experimental evidence has shown that class activation map methods can effectively enhance the interpretability of image classification. 
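The gradient-weighted heatmap computation that underlies Grad-CAM and its variants can be sketched in a few lines. This is an illustrative NumPy version of the standard Grad-CAM formulation, not code from the paper; the toy feature maps and gradients are synthetic.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Minimal Grad-CAM sketch: weight each feature map by its
    global-average-pooled gradient, sum over channels, apply ReLU.

    feature_maps: (C, H, W) activations of the last conv layer
    gradients:    (C, H, W) gradients of the class score w.r.t. them
    """
    # Channel weights: global average pooling over the gradients
    alphas = gradients.mean(axis=(1, 2))                       # shape (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(axis=0), 0)
    # Normalize to [0, 1] for visualization as a heatmap
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 4 channels on a 7x7 feature map (synthetic data)
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 7, 7))
dYdA = rng.standard_normal((4, 7, 7))
heatmap = grad_cam(A, dYdA)
print(heatmap.shape)  # (7, 7)
```

In practice the heatmap is then upsampled to the input resolution and overlaid on the image to show which regions drove the prediction.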
However, existing methods only provide rough range explanations and suffer from the issues of excessively large boundary effects and insufficient granularity.
To tackle these problems, a Spatial Attention-based Multi-layer Fusion Method for High-Quality Class Activation Maps (SAMLCAM) is proposed. It combines channel attention and spatial attention mechanisms on top of Grad-CAM. SAMLCAM achieves more effective object localization and enhances visual interpretability by addressing, through multi-layer fusion, the issues of excessively large activation map boundaries and lack of fine granularity. Method Current class activation map methods consider only channel weights, while the beneficial spatial position information that contributes to target localization is often overlooked. In our paper, a hybrid attention mechanism combining channel attention and spatial attention is proposed to enhance the interpretability of target localization. The spatial attention mechanism focuses on the spatial relationships among different regions of the feature maps. By assigning higher weights to regions that are more likely to contain the target object, SAMLCAM improves the precision of object localization while reducing false positives. This attention mechanism allows the model to allocate more attention to discriminative features, leading to improved object localization. One key improvement of SAMLCAM lies in its multi-layer attention mechanism. Previous methods often suffer from boundary effects, where the activation maps have excessively large boundaries that can include irrelevant regions. SAMLCAM addresses this issue by refining the attention maps at multiple layers of the network. It does not rely only on the results of the final convolutional layer but also attends to shallower layers. This enriches the reference information, yielding a more comprehensive understanding of the semantic information of the target object while reducing unnecessary background information.
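A hybrid channel-plus-spatial weighting of this kind can be sketched as follows. This is an illustrative NumPy sketch of the general idea, not the paper's exact formulation: the channel branch reuses gradient-based weights as in Grad-CAM, and the spatial branch here is simply a softmax over the channel-averaged activation map.

```python
import numpy as np

def hybrid_attention_cam(feature_maps, gradients):
    """Sketch of a hybrid channel + spatial attention CAM.

    Channel branch: gradient-based channel weights (as in Grad-CAM).
    Spatial branch: softmax over the mean activation across channels,
    re-weighting each spatial position before the ReLU.
    """
    # Channel attention: global-average-pooled gradients, shape (C,)
    channel_w = gradients.mean(axis=(1, 2))
    # Spatial attention: softmax over the channel-averaged map, shape (H, W)
    spatial = feature_maps.mean(axis=0)
    spatial = np.exp(spatial - spatial.max())
    spatial = spatial / spatial.sum()
    # Combine: channel-weighted sum, modulated position-wise by spatial attention
    cam = (channel_w[:, None, None] * feature_maps).sum(axis=0)
    cam = np.maximum(cam * spatial, 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example on synthetic data: 8 channels, 7x7 spatial resolution
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 7, 7))
G = rng.standard_normal((8, 7, 7))
cam = hybrid_attention_cam(A, G)
```

The spatial term concentrates the map on high-activation positions, which is the intuition behind suppressing uninformative background locations.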
This multi-layer attention mechanism helps to gradually refine the boundaries and improve localization accuracy by reducing the influence of irrelevant regions. Moreover, SAMLCAM tackles the problem of insufficient granularity in class activation maps. In some cases, the activation maps generated by previous methods lack fine detail, making it difficult to precisely identify the object of interest. SAMLCAM overcomes this limitation by leveraging the multi-layer attention mechanism to capture more detailed information in the activation maps, producing high-quality class activation maps with enhanced visual interpretability. The ILSVRC 2012 dataset is a large-scale image classification dataset of over a million labeled images in 1,000 categories, widely used for benchmarking computer vision models. The proposed SAMLCAM method is evaluated on five backbone convolutional network models using the ILSVRC 2012 validation dataset and compared with five state-of-the-art saliency methods, namely Grad-CAM, Grad-CAM++, XGrad-CAM, Score-CAM, and LayerCAM. The results show that SAMLCAM outperforms the lowest-performing methods by more than 7% on both the Loc1 and Loc5 metrics, and by more than 9.85% on the energy-based localization metric. It should be noted that the improved method reduces the contextual background area surrounding the target region, which negatively affects the confidence metric.
However, in terms of the credibility metric, SAMLCAM maintains a gap of no more than 2% from the other methods. In addition, a series of comparative experiments visually demonstrates the effectiveness of the fusion algorithm. Conclusion The SAMLCAM method presents a novel approach to enhancing the interpretability of deep convolutional neural network models. By incorporating channel attention and spatial attention mechanisms, it improves object localization and overcomes the limitations of previous methods, such as excessive boundary effects and lack of fine granularity in class activation maps. The evaluation results on the ILSVRC 2012 and MS COCO 2017 datasets highlight the performance improvement of SAMLCAM over other methods in terms of localization metrics and energy-based localization metrics. The proposed method advances the field of visual deep learning and offers valuable insights into understanding and improving the interpretability of black-box models.
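The multi-layer weighted fusion described in the Method section, where maps from several convolutional layers are normalized, brought to a common resolution, and combined with per-layer weights, can be sketched as follows. This is an illustrative NumPy sketch under assumed nearest-neighbor upsampling and hand-picked layer weights, not the paper's exact fusion rule.

```python
import numpy as np

def upsample_nearest(cam, size):
    """Nearest-neighbor upsampling of a 2-D map to (size, size)."""
    h, w = cam.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return cam[np.ix_(rows, cols)]

def fuse_multilayer(cams, weights, size=14):
    """Weighted fusion of CAMs from several conv layers: each map is
    ReLU-ed, normalized, upsampled to a common resolution, and summed
    with a per-layer weight (weights here are illustrative)."""
    fused = np.zeros((size, size))
    for cam, w in zip(cams, weights):
        cam = np.maximum(cam, 0)
        if cam.max() > 0:
            cam = cam / cam.max()          # per-layer normalization
        fused += w * upsample_nearest(cam, size)
    if fused.max() > 0:
        fused = fused / fused.max()
    return fused

# Toy example: a fine shallow-layer map and a coarse deep-layer map
rng = np.random.default_rng(2)
shallow = rng.standard_normal((14, 14))    # higher-resolution shallow layer
deep = rng.standard_normal((7, 7))         # coarse, semantically stronger layer
fused = fuse_multilayer([shallow, deep], weights=[0.3, 0.7], size=14)
```

Giving the deep layer a larger weight preserves its semantic localization while the shallow layer contributes finer boundary detail, which is the intuition behind the multi-layer fusion.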
Keywords