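Joint self-attention Transformer for multispectral and hyperspectral image fusion
Abstract
Objective Fusing a hyperspectral image with a multispectral image yields a spectral image with both high spatial resolution and high spectral resolution, improving spectral image quality. Existing deep-learning-based fusion methods perform well but do not jointly explore the long-range spectral and spatial dependencies in multi-source image features. To exploit the spectral correlation and spatial similarity of the images effectively, we propose a joint self-attention Transformer network for multispectral and hyperspectral image fusion super-resolution. Method First, a joint self-attention module extracts spectral correlation features from the hyperspectral image through a spectral attention mechanism and spatial similarity features from the multispectral image through a spatial attention mechanism; the resulting joint similarity features guide the fusion of the hyperspectral and multispectral images. The fused features are then fed into a deep residual Transformer network based on shifted windows, which explores their long-range dependencies and learns deep fusion priors. Finally, a convolutional layer maps the features to a hyperspectral image with high spatial resolution. Result Experiments at different sampling scale factors on the CAVE and Harvard spectral datasets show that the proposed method outperforms the compared methods in both quantitative metrics and visual quality. Compared with the second-best method, EDBIN (enhanced deep blind iterative network), the proposed method improves the peak signal-to-noise ratio by 0.5 dB on the CAVE dataset and by 0.6 dB on the Harvard dataset. Conclusion The proposed method fuses spectral and spatial information more effectively and significantly improves the quality of fused, super-resolved hyperspectral images.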
Keywords
Joint self-attention Transformer for multispectral and hyperspectral image fusion
Li Miaoyu, Fu Ying (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100089, China)
Abstract
Objective Hyperspectral images (HSIs) contain rich spectral information and have an advantage over multispectral images (MSIs) in accurately distinguishing different types of materials. HSIs have therefore been widely used in many computer vision tasks, including vegetation detection, face recognition, and feature segmentation. However, owing to limitations in hardware and the acquisition environment, there is an inevitable trade-off between spatial resolution and spectral resolution. HSIs of real scenes therefore often have low spatial resolution, which degrades the performance of subsequent vision tasks. Fusing a low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) of the same scene through an HSI super-resolution algorithm can effectively improve the spatial resolution of the HSI. Existing HSI fusion algorithms can be roughly classified into traditional-model-based and deep-learning-based methods. Traditional-model-based fusion methods employ various handcrafted shallow priors (e.g., matrix/tensor factorization, total variation, and low rank) to exploit the intrinsic statistics of the observed spectral images. However, these methods generalize poorly to complex real scenes and spend considerable time iteratively optimizing the designed priors. Deep-learning-based fusion methods, in contrast, learn prior knowledge automatically from large-scale datasets. Although they often achieve better fusion results than traditional-model-based methods, they do not jointly explore the inner self-similarity of multi-source spectral images, in which the LR-HSI is highly correlated along the spectral dimension and the HR-MSI exhibits spatial similarity in textures and edges. In addition, the weights of these convolution-based networks are learned during training but fixed during testing, which limits the adaptability of the networks. To effectively exploit the inner spatial and spectral similarity of spectral images, we propose an MSI and HSI fusion network with a joint self-attention fusion module and a Transformer.
Method Because the LR-HSI carries reliable information in the spectral dimension, the critical task of an HSI fusion method is to fill in the missing texture details in the spatial dimension without losing discriminative spectral information. Given the LR-HSI and its matching HR-MSI, the proposed method fuses the two spectral images to obtain the desired HR-HSI in three steps. First, the similarity information of the LR-HSI and HR-MSI is extracted by the joint self-attention module. Specifically, the spectral similarity features of the LR-HSI are extracted by the channel attention module, and the spatial similarity features of the HR-MSI are extracted by the spatial attention module. The obtained similarity features are then used to guide the fusion process. Second, to obtain a deep representation and explore the long-range dependencies of the fusion features, the preliminary fusion features are fed into the deep Transformer network, which comprises a shifted-window attention module, LayerNorm, and a multilayer perceptron. A convolution layer and a skip connection are also included in the proposed Transformer fusion network to further enhance model flexibility. Third, the fusion features from the Transformer are mapped to the desired high-resolution HSI. The overall network is implemented in the PyTorch framework and trained in an end-to-end manner. To generate training data, the training images are cropped to a size of 96 × 96 × 31, yielding approximately 8,000 training patches; each patch is smoothed with a Gaussian blur kernel and spatially down-sampled to obtain the LR-HSI. The HR-MSIs are generated with the spectral response function of a Nikon D700 camera.
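The training-data generation described above (Gaussian blur, spatial down-sampling, and spectral projection) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the kernel size `ksize`, the standard deviation `sigma`, and the helper names are hypothetical, and a random, row-normalized matrix stands in for the Nikon D700 spectral response function, whose actual values are not given here.

```python
import numpy as np

def gaussian_kernel_1d(size: int, sigma: float) -> np.ndarray:
    """Return a normalized 1D Gaussian kernel of odd length `size`."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_separable(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Blur an (H, W, C) cube band by band with a separable kernel."""
    pad = len(kernel) // 2
    # Edge padding keeps the output the same spatial size as the input.
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    conv = lambda v: np.convolve(v, kernel, mode="valid")
    tmp = np.apply_along_axis(conv, 0, padded)   # filter along rows
    return np.apply_along_axis(conv, 1, tmp)     # filter along columns

def simulate_pair(hr_hsi, srf, scale=8, ksize=7, sigma=2.0):
    """Simulate (LR-HSI, HR-MSI) from an HR-HSI patch of shape (H, W, bands).

    ksize and sigma are assumed values, not taken from the paper.
    """
    blurred = blur_separable(hr_hsi, gaussian_kernel_1d(ksize, sigma))
    lr_hsi = blurred[::scale, ::scale, :]        # spatial down-sampling
    hr_msi = hr_hsi @ srf.T                      # (H, W, 31) -> (H, W, 3)
    return lr_hsi, hr_msi

rng = np.random.default_rng(0)
patch = rng.random((96, 96, 31))                 # stand-in for a 96 x 96 x 31 patch
srf = rng.random((3, 31))                        # placeholder response curves
srf /= srf.sum(axis=1, keepdims=True)            # each MSI band is a weighted average
lr_hsi, hr_msi = simulate_pair(patch, srf, scale=8)
print(lr_hsi.shape, hr_msi.shape)                # (12, 12, 31) (96, 96, 3)
```

With a scale factor of 8, a 96 × 96 patch gives a 12 × 12 LR-HSI, while the HR-MSI keeps the full spatial size and has three bands.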
Result We compare our method with seven state-of-the-art fusion methods, including one traditional-model-based method and six deep-learning-based methods. The peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), erreur relative globale adimensionnelle de synthèse (ERGAS), and spectral angle mapper (SAM) are used as quantitative metrics to evaluate the performance of these fusion methods. To verify the effectiveness of the proposed model, we perform experiments on two widely used HSI datasets, namely, the CAVE and Harvard datasets. For the CAVE dataset, the first 20 images are selected for training, and the last 12 images are used for testing. Similarly, for the Harvard dataset, the first 30 images are selected for training, and the last 20 images are used for testing. Experimental results under different scale factors show that the proposed method achieves better fusion results than the other state-of-the-art methods in terms of both quantitative metrics and visual quality. Under a scale factor of 8, the PSNR, SAM, and ERGAS of the proposed method are improved by 0.5 dB, 0.13, and 0.2, respectively, compared with EDBIN, the second-best method on the CAVE dataset. Under a scale factor of 16, the PSNR of the proposed method is at least 0.4 dB higher than that of the other methods on the Harvard dataset. The visual results show that the proposed method outperforms the other methods in recovering both fine-grained spatial textures and spectral details. The ablation study also shows that the Transformer fusion network significantly improves the fusion process.
Conclusion In this paper, we propose a Transformer-based MSI and HSI fusion network with a joint self-attention fusion module, which effectively uses the spectral similarity of the LR-HSI and the spatial similarity of the HR-MSI to guide the fusion process through a 2D attention mechanism. The preliminary fusion results pass through the residual Transformer network to obtain a deep feature representation and to reconstruct the desired HR-HSI. Qualitative and quantitative experiments show that the proposed method achieves better spectral fidelity and spatial resolution than state-of-the-art HSI fusion methods.
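Three of the quantitative metrics used in the evaluation (PSNR, SAM, and ERGAS) can be sketched as below. This is a minimal sketch assuming images scaled to [0, 1], band-wise averaging for PSNR, and SAM reported in degrees; papers differ in these conventions, so the values are not guaranteed to match the reported numbers exactly. The function names and the `eps` guards are illustrative choices.

```python
import numpy as np

def psnr(ref: np.ndarray, est: np.ndarray, data_range: float = 1.0) -> float:
    """Mean per-band PSNR in dB for (H, W, bands) cubes."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))       # per-band MSE
    mse = np.maximum(mse, 1e-12)                       # avoid division by zero
    return float(np.mean(10.0 * np.log10(data_range ** 2 / mse)))

def sam(ref: np.ndarray, est: np.ndarray, eps: float = 1e-12) -> float:
    """Mean spectral angle in degrees between per-pixel spectra."""
    dot = np.sum(ref * est, axis=2)
    norm = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))

def ergas(ref: np.ndarray, est: np.ndarray, scale: int) -> float:
    """ERGAS: 100 / scale * sqrt(mean over bands of MSE_b / mean_b^2)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2           # squared band means
    return float(100.0 / scale * np.sqrt(np.mean(mse / mean_sq)))

rng = np.random.default_rng(0)
reference = rng.uniform(0.1, 1.0, size=(32, 32, 31))   # synthetic stand-in cube
estimate = reference + rng.normal(0.0, 0.01, size=reference.shape)
print(psnr(reference, estimate), sam(reference, estimate), ergas(reference, estimate, 8))
```

Higher PSNR is better, whereas lower SAM and ERGAS indicate better spectral fidelity and lower overall relative error.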
Keywords
super-resolution; hyperspectral images; multispectral images; joint self-attention; Transformer; fusion method