A deep embedded Transformer network for spatial-spectral unmixing of hyperspectral images
You Xueer1, Su Yuanchao1, Jiang Mengying2, Li Pengfei1, Liu Dongsheng3, Bai Jinying3 (1. College of Geomatics, Xi'an University of Science and Technology, Xi'an 710054, China; 2. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; 3. PIESAT Information Technology Co., Ltd., Beijing 100195, China)
Abstract
Objective Deep learning-based unmixing methods outperform traditional methods in information mining and generalization, but they focus mainly on spectral information, and their use of spatial information remains limited to shallow operations such as filtering and convolution. As a result, unmixing networks must stack many layers, which easily loses part of the image information and degrades unmixing accuracy. Transformer networks are widely used in hyperspectral image processing owing to their powerful feature-representation capability, but applying them directly to unmixing tends to lose local image details. This paper proposes an improved unmixing method based on the Transformer network. Method Building on the Transformer-in-Transformer (TNT) architecture, we propose a deep embedded Transformer network (DETN). An inner-outer embedding strategy shares local and global spatial information within the encoder, which not only preserves the spatial details of the hyperspectral image but also requires only a few convolution operations in the encoder, substantially improving learning efficiency; the nested encoder block is sketched below. In the decoder, a single convolution restores the data structure so that endmembers and abundances can be generated, and a final Softmax layer guarantees the physical meaning of the abundances. Result Comparative experiments are conducted on a simulated dataset and real hyperspectral datasets. On the simulated dataset at 50 dB, DETN attains the best mean spectral angle distance and root mean square error, 0.0386 and 0.0045, respectively; on the real Samson and Jasper Ridge datasets it attains the best mean spectral angle distances of 0.1194 and 0.1027. Conclusion The experimental results verify the effectiveness and advantages of DETN, which can provide new technical support and a theoretical reference for deep unmixing.
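To make the inner-outer sharing strategy concrete, the following is a minimal PyTorch sketch of one nested encoder block; the module name NestedBlock, the token dimensions, and the single linear sharing layer are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class NestedBlock(nn.Module):
    """One nested encoder block in the spirit of TNT/DETN: an inner
    Transformer refines sub-patch (local) tokens, whose summary is
    folded into the patch (global) tokens before the outer Transformer
    runs. All dimensions are illustrative assumptions."""

    def __init__(self, inner_dim=24, outer_dim=96, n_sub=16, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(
            d_model=inner_dim, nhead=heads,
            dim_feedforward=4 * inner_dim, batch_first=True)
        # shares the flattened sub-patch tokens of a patch with its patch token
        self.share = nn.Linear(n_sub * inner_dim, outer_dim)
        self.outer = nn.TransformerEncoderLayer(
            d_model=outer_dim, nhead=heads,
            dim_feedforward=4 * outer_dim, batch_first=True)

    def forward(self, sub_tokens, patch_tokens):
        # sub_tokens:   (B * n_patches, n_sub, inner_dim) local sequences
        # patch_tokens: (B, n_patches, outer_dim)         global sequence
        b, n_patches, _ = patch_tokens.shape
        sub_tokens = self.inner(sub_tokens)              # local self-attention
        local = self.share(sub_tokens.reshape(b, n_patches, -1))
        patch_tokens = self.outer(patch_tokens + local)  # inject local detail
        return sub_tokens, patch_tokens
```

Because the only coupling between the two levels is one linear projection, the local detail carried by the sub-patch tokens reaches every outer self-attention step at little extra convolutional cost.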
Keywords
remote sensing image processing; hyperspectral remote sensing; hyperspectral unmixing; deep learning; Transformer network
Deep embedded Transformer network with spatial-spectral information for unmixing of hyperspectral remote sensing images
You Xueer1, Su Yuanchao1, Jiang Mengying2, Li Pengfei1, Liu Dongsheng3, Bai Jinying3 (1. College of Geomatics, Xi'an University of Science and Technology, Xi'an 710054, China; 2. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; 3. PIESAT Information Technology Co., Ltd., Beijing 100195, China)
Abstract
Objective In hyperspectral remote sensing, mixed pixels are common because natural land surfaces are complex and the spatial resolution of imaging instruments is limited. A mixed pixel is one whose spectrum blends the signatures of several land-cover materials, which hinders the application of hyperspectral images in fields such as target detection, image classification, and environmental monitoring. The decomposition of mixed pixels (unmixing) is therefore a central concern in hyperspectral image processing. Built on a spectral mixing model at the sub-pixel level, spectral unmixing seeks to overcome the limited spatial resolution by extracting the pure spectral signatures (endmembers) of the land-cover classes together with their proportions (abundances) within each pixel. The rise of deep learning has brought many advanced modeling theories and architectural tools to mixed-pixel decomposition and has spawned many deep learning-based unmixing methods. Although these methods surpass traditional ones in information mining and generalization, deep networks often need many stacked layers to achieve good learning outcomes, so training may damage the internal structure of the data, losing important information in the hyperspectral data and reducing unmixing accuracy. In addition, most existing deep learning-based unmixing methods focus only on spectral information, while the exploitation of spatial information is still limited to shallow operations such as filtering and convolution. In recent years, autoencoders have been a research hotspot in deep learning, and many variant networks based on them have emerged. The Transformer is a deep learning network with an autoencoder-like structure that has attracted considerable attention in natural language processing, computer vision, and time-series analysis owing to its powerful feature-representation capability. As a network built mainly on self-attention, the Transformer can better explore the underlying relationships among features and more comprehensively aggregate the spectral and spatial correlations of pixels, which strengthens abundance learning and improves unmixing accuracy. Although Transformer networks have recently been used to design unmixing methods, applying an unsupervised Transformer directly to obtain features can lose many local details and makes it difficult to exploit the Transformer's long-range dependencies effectively.

Method To address these limitations, this study proposes a deep embedded Transformer network (DETN) based on the Transformer-in-Transformer architecture. The network adopts an autoencoder framework with two main parts: node embedding (NE) and blind signal separation. In the first part, the input hyperspectral image is uniformly divided twice, and the resulting image patches are mapped into sub-patch sequences and patch sequences through linear transformations, as sketched below.
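As a rough illustration of this two-level division, the sketch below unfolds the image into patches and each patch into sub-patches before the linear projections; the patch and sub-patch sizes, the band count, and the module name NodeEmbedding are assumed values, not those of the paper.

```python
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Illustrative two-level tokenization: the image is split into
    patches, every patch into sub-patches, and both levels are linearly
    projected into token sequences. Sizes are assumptions."""

    def __init__(self, bands=156, patch=8, sub=2, inner_dim=24, outer_dim=96):
        super().__init__()
        self.bands, self.patch = bands, patch
        self.split_patch = nn.Unfold(kernel_size=patch, stride=patch)
        self.split_sub = nn.Unfold(kernel_size=sub, stride=sub)
        self.to_patch = nn.Linear(bands * patch * patch, outer_dim)
        self.to_sub = nn.Linear(bands * sub * sub, inner_dim)

    def forward(self, x):  # x: (B, bands, H, W), H and W divisible by patch
        patches = self.split_patch(x).transpose(1, 2)  # (B, n_patches, bands*p*p)
        patch_seq = self.to_patch(patches)             # global token sequence
        # second division: cut every patch into sub-patches
        p = patches.reshape(-1, self.bands, self.patch, self.patch)
        subs = self.split_sub(p).transpose(1, 2)       # (B*n_patches, n_sub, bands*s*s)
        sub_seq = self.to_sub(subs)                    # local token sequences
        return sub_seq, patch_seq
```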
Then, the sub-patch sequences are processed by an inner Transformer to capture pixel spectral information and local spatial correlations, which are aggregated into the patch sequences so that parameters and information are shared. Finally, with the local detail retained, the patch sequences are processed by an outer Transformer, which outputs pixel spectral information and global spatial correlations enriched with local information. In the second part, the NE output is first reconstructed into an abundance map and smoothed with a single 2D convolution layer to suppress noise; a Softmax layer then guarantees the physical meaning of the abundances. Finally, another single 2D convolution layer reconstructs the image, and the endmembers are optimized and estimated within this convolution layer (a sketch of this stage is given after the abstract).

Result To evaluate the proposed method, experiments are conducted on simulated datasets and several real hyperspectral datasets, including the Samson dataset, the Jasper Ridge dataset, and part of a real hyperspectral farmland scene in Nanchang City, Jiangxi Province, acquired by the Gaofen-5 satellite and provided by Beijing Shengshi Huayao Technology Co., Ltd. In addition, ZY1E satellite data provided by Beijing Shengshi Huayao Technology Co., Ltd. are used to obtain partial hyperspectral data of the Port of Marseille, France, for comparative experiments among the different methods. The experimental results are quantitatively analyzed with the spectral angle distance (SAD) and the root mean square error (RMSE). The proposed DETN is compared with several representative unmixing algorithms: fully constrained least squares (FCLS), deep autoencoder networks for hyperspectral unmixing (DAEN), the autoencoder network for hyperspectral unmixing with adaptive abundance smoothing (AAS), the untied denoising autoencoder with sparsity (uDAS), hyperspectral unmixing using deep image prior (UnDIP), and hyperspectral unmixing using a Transformer network (DeepTrans-HSU). The results demonstrate that the proposed method outperforms the compared methods in SAD, RMSE, and other evaluation metrics.

Conclusion The proposed method effectively captures and preserves the spectral information of pixels at both local and global levels, as well as the spatial correlations among pixels, and thus extracts endmembers that match the ground-truth spectral signatures accurately. Moreover, it produces smooth abundance maps with high spatial consistency, even in regions with hidden details in the image. These findings validate that DETN provides new technical support and a theoretical reference for addressing the challenges posed by mixed pixels in hyperspectral unmixing.
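A minimal sketch of the blind-separation stage and of the two evaluation metrics follows, assuming the encoder output has already been reshaped into a spatial feature map; the class name Separation, the layer sizes, and the endmember count are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Separation(nn.Module):
    """Sketch of the blind-separation stage, assuming the encoder output
    has been reshaped into a (B, feat_dim, H, W) feature map and that
    there are n_end endmembers; names and sizes are assumptions."""

    def __init__(self, feat_dim=96, n_end=4, bands=156):
        super().__init__()
        self.smooth = nn.Conv2d(feat_dim, n_end, kernel_size=3, padding=1)
        self.recon = nn.Conv2d(n_end, bands, kernel_size=1, bias=False)

    def forward(self, feats):
        # Softmax over the endmember axis keeps abundances non-negative
        # and sum-to-one, preserving their physical meaning.
        abund = F.softmax(self.smooth(feats), dim=1)
        return self.recon(abund), abund    # reconstruction, abundance maps

    def endmembers(self):
        # each column of the 1x1 convolution weight is one endmember spectrum
        return self.recon.weight.squeeze(-1).squeeze(-1)  # (bands, n_end)

def sad(est, ref):
    """Spectral angle distance between two 1-D spectra."""
    cos = torch.dot(est, ref) / (est.norm() * ref.norm())
    return torch.acos(cos.clamp(-1.0, 1.0))

def rmse(est, ref):
    """Root mean square error between estimated and reference abundances."""
    return torch.sqrt(torch.mean((est - ref) ** 2))
```

Because the 1 x 1 reconstruction convolution is linear, its weight matrix plays the role of the endmember matrix in the linear mixing model, so training the decoder simultaneously estimates the endmembers.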
Keywords
remote sensing image processing; hyperspectral remote sensing; hyperspectral unmixing; deep learning; Transformer network