Vision Transformer for fusing infrared and visible images in groups
Abstract
Objective To address the limitations of hand-crafted feature extraction and feature fusion in traditional infrared and visible image fusion methods, as well as the inability of convolutional neural network (CNN) based methods to effectively extract global contextual information from images and their insufficient fusion during feature fusion, this paper proposes an end-to-end unsupervised image fusion network based on a vision Transformer and a grouped progressive fusion strategy.
Method First, a multi-head transposed attention module, which computes self-attention along the channel dimension, is combined with a channel attention module to form the vision Transformer; the multi-head transposed attention avoids the quadratic growth of the self-attention cost with the number of pixels, while the channel attention strengthens salient features. Second, the CNN and the designed vision Transformer are connected in parallel to form a local-global feature extraction module, which extracts the local detail information and the global contextual information of the source images, so that the extracted features are both general and global. In addition, to avoid information loss during fusion, the features are fused by grouping them and constructing a progressive residual structure. Finally, the fused features are decoded to obtain the final fused image.
Result Experiments compare the proposed method with six methods on the TNO and RoadScene datasets. Subjectively, the proposed method effectively fuses the complementary information of the infrared and visible images and produces high-quality fused images. In objective quantitative terms, on the TNO dataset the proposed method improves over the second-best method by 30.90%, 0.58%, 11.72%, and 11.82% in normalized mutual information, nonlinear correlation information entropy, average gradient, and standard deviation, respectively; on the RoadScene dataset, it still achieves the best results on these four metrics.
Conclusion Because of the complexity of effective feature extraction and fusion and the interference of noise during fusion, existing fusion methods all have certain limitations or produce unsatisfactory fusion quality. In contrast, the proposed vision-Transformer-based image fusion method achieves a substantial improvement in fusion quality: it effectively highlights infrared salient targets, preserves the background information and detailed textures of the source images in the fused image, and is also superior in contrast and definition.
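The channel-dimension self-attention described above can be made concrete with a short sketch. The following PyTorch code is only an illustration of the general idea, not the authors' implementation; the layer names, head count, and convolution sizes are assumptions. It shows why attending over channels keeps the attention map at a fixed size per head instead of growing quadratically with the number of pixels.

```python
# A minimal sketch (assumed details) of multi-head transposed attention:
# self-attention is computed across channels, so the attention map per head is
# (C/heads) x (C/heads) rather than (HW) x (HW).
import torch
import torch.nn as nn

class MultiHeadTransposedAttention(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        # 1x1 conv produces query/key/value; depth-wise 3x3 mixes local context
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # reshape to (batch, heads, channels-per-head, pixels)
        def split_heads(t):
            return t.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = map(split_heads, (q, k, v))
        q = nn.functional.normalize(q, dim=-1)
        k = nn.functional.normalize(k, dim=-1)
        # channel-to-channel attention map, independent of image resolution
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)

if __name__ == "__main__":
    x = torch.randn(1, 64, 128, 128)                      # an encoder feature map
    print(MultiHeadTransposedAttention()(x).shape)        # torch.Size([1, 64, 128, 128])
```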
Keywords
Vision transformer for fusing infrared and visible images in groups
Sun Xuhui, Guan Zheng, Wang Xue (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)
Abstract
Objective Image fusion is one of the key branches of information fusion, and infrared and visible image fusion (IVF) has developed rapidly. An image captured by a visible-light sensor records reflected light; it is rich in texture detail and matches the observation pattern of the human eye. The fused image can integrate this rich detail information with thermal radiation information, so it is essential for applications such as object tracking, video surveillance, and autonomous driving. To overcome the limitations of hand-crafted feature extraction and feature fusion in traditional infrared and visible image fusion methods, as well as the inability of convolutional neural network based (CNN-based) methods to effectively extract global contextual information and their inadequate fusion during feature fusion, we develop a visual-Transformer-based end-to-end unsupervised fusion network with a group-layered fusion strategy.
Method First, a channel-attention-based Transformer is designed, which computes self-attention in the channel dimension and uses channel attention in series as the feed-forward network of the Transformer to further enhance the features. Then, to extract features from the source images, the Transformer module and a CNN are used in parallel to form a local-global feature extraction module. The extracted features inherit the generality of CNN features, avoiding hand-crafted extraction rules, while the global nature of the features extracted by the Transformer compensates for the shortcomings of convolutional operations. In addition, to alleviate the loss of feature information, we design a layer-grouped fusion module that fuses the extracted local-global features by grouping the features of the multiple sources in the channel dimension, first fusing the features of the corresponding groups, and then fusing the features of different groups through a hierarchical residual structure.
Result Experiments are conducted on the publicly available TNO and RoadScene datasets in comparison with six popular methods, including both traditional and deep learning-based methods. Qualitative and quantitative evaluations are used to assess effectiveness. The qualitative analysis focuses on the clarity and contrast of the images as perceived by the human eye: the proposed method preserves the complementary information in the infrared and visible images more effectively and maximizes the useful information, and the fused images have better contrast, definition, and visual effects. A quantitative comparison is carried out using six metrics. On the TNO dataset, the proposed method achieves the best results in normalized mutual information (NMI), nonlinear correlation information entropy (QNCIE), average gradient (AG), and standard deviation (SD), improving by 30.90%, 0.58%, 11.72%, and 11.82% over the second-best method. On the RoadScene dataset, it achieves the best results in NMI, QNCIE, AG, SD, and visual fidelity, improving by 32.74%, 0.64%, 24.53%, 31.40%, and 31.73% over the second-best method.
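To make the group-layered fusion idea easier to follow, the sketch below groups the infrared and visible feature maps along the channel dimension, fuses corresponding groups, and passes each fused group on to the next through a residual connection. This is a minimal illustration under assumed details (group count, per-group fusion block); it is not the released implementation.

```python
# A minimal sketch (assumed details) of grouped progressive fusion with a
# hierarchical residual structure across channel groups.
import torch
import torch.nn as nn

class GroupedProgressiveFusion(nn.Module):
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        g = channels // groups
        # one small fusion block per group: takes [ir_group, vis_group, previous result]
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3 * g, g, kernel_size=3, padding=1),
                          nn.LeakyReLU(0.1, inplace=True))
            for _ in range(groups)
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        ir_groups = feat_ir.chunk(self.groups, dim=1)
        vis_groups = feat_vis.chunk(self.groups, dim=1)
        outputs, previous = [], torch.zeros_like(ir_groups[0])
        for i in range(self.groups):
            # fuse the i-th infrared/visible groups together with the result of the
            # previous group, so information flows progressively between groups
            fused = self.fuse[i](torch.cat([ir_groups[i], vis_groups[i], previous], dim=1))
            previous = fused + previous           # residual accumulation
            outputs.append(previous)
        return torch.cat(outputs, dim=1)          # same channel count as each input

if __name__ == "__main__":
    ir = torch.randn(1, 64, 128, 128)
    vis = torch.randn(1, 64, 128, 128)
    print(GroupedProgressiveFusion()(ir, vis).shape)  # torch.Size([1, 64, 128, 128])
```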
Conclusion Due to the complexity of effective feature extraction and fusion, as well as the interference of noise in the fusion process, existing fusion methods face challenges in fusion quality. In contrast, the proposed visual-Transformer-based method has clear strengths: 1) infrared salient targets are highlighted effectively, 2) the background information and detailed textures of the source images are retained in the fused image, and 3) contrast and definition are also improved. Future research can focus on designing more general and efficient image fusion algorithms beyond the fusion of infrared and visible images.
Keywords