Spatial-spectral model distillation network for hyperspectral scene classification
Abstract
Objective Existing scene classification methods mainly target images with high spatial resolution, but such images carry very limited spectral information, and existing convolutional neural network (CNN) based methods neglect long-range contextual information because of the locality of the convolution operation. To address these problems, a spatial-spectral model distillation network for hyperspectral scene classification (SSMD) is proposed. Method A spatial-spectral attention-based vision Transformer (SSViT) is adopted to probe the spectral information of different categories and finely classify land covers by exploiting the differences among their spectral signatures. Knowledge distillation is then used to transfer the long-range dependency information captured by the teacher model SSViT to the student model VGG16 (Visual Geometry Group 16). The two models work in concert: the spectral and global information extracted by the teacher is fused with the local information extracted by the student, further improving the student's classification performance while keeping the time cost low. Result Experiments compare the proposed method with 10 classification methods (five traditional CNN classifiers and five recent scene classification methods) on three datasets. Considering both time cost and classification accuracy, the proposed method leads by varying margins on the different datasets: compared with the second-best model, classification accuracy on the OHID-SC (Orbita hyperspectral image scene classification dataset), OHS-SC (Orbita hyperspectral scene classification dataset), and HSRS-SC (hyperspectral remote sensing dataset for scene classification) datasets improves by 13.1%, 2.9%, and 0.74%, respectively. Comparative experiments on the OHID-SC dataset also show that the proposed algorithm effectively improves hyperspectral scene classification accuracy. Conclusion The proposed SSMD network not only makes effective use of the target spectral information of hyperspectral data but also explores the feature relationships between the global and local levels; it combines the advantages of traditional and deep learning models and yields more accurate classification results.
Xue Jie1, Huang Hong1, Pu Chunyu2, Yang Yinming1, Li Yuan1, Liu Yingxu1
(1. Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University, Chongqing 210046, China; 2. National Key Laboratory of Electromagnetic Space Security, Chengdu 610036, China)
Abstract
Objective In recent years, the development of remote sensing technology has enabled the acquisition of abundant remote sensing images and large datasets. Scene classification, a key task in remote sensing research, aims to distinguish and classify images with similar scene features by assigning a fixed semantic label to each scene image. Various scene classification methods have been proposed, including handcrafted feature-based and deep learning-based methods. However, handcrafted feature-based methods are limited in describing scene semantic information because they place high demands on the feature descriptors. Deep learning-based methods, by contrast, have shown powerful feature extraction capabilities and have been widely applied to the scene classification of remote sensing images. Nevertheless, current scene classification methods mainly focus on remote sensing images with high spatial resolution, which are mostly three-channel images with limited spectral information. This limitation often leads to confusion and misclassification among categories that look alike in geometric structure, texture, and color. Therefore, integrating spectral information to improve the accuracy of scene classification has become an important research direction. Existing methods still have shortcomings, however. Convolutional operations are translation invariant and sensitive to local information, which makes it difficult for them to capture long-range contextual information. Although Transformer-based methods can extract long-range dependency information, they have limited capability in learning local information. Moreover, combining convolutional neural networks (CNNs) with Transformer methods incurs high computational complexity, which hinders the balance between inference efficiency and classification accuracy. This study proposes a hyperspectral scene classification method called the spatial-spectral model distillation (SSMD) network to address these issues. Method In this study, we utilize spectral information to improve the accuracy of scene classification and overcome the limitations of existing methods. First, we propose a spatial-spectral vision Transformer (SSViT) built on a joint spatial-spectral self-attention mechanism to fully exploit the spectral information of hyperspectral images. SSViT integrates spectral information into the Transformer architecture and extracts richer features by exploring the intrinsic relationships between pixels and between spectral bands. Through the joint spatial-spectral mechanism, SSViT leverages the spectral information of different categories to identify the differences between them, which enables fine-grained classification of land cover and improves the accuracy of scene classification. Second, we introduce knowledge distillation to further enhance classification performance. In a teacher-student framework, SSViT serves as the teacher model, and a pretrained model, Visual Geometry Group 16 (VGG16), serves as the student model to capture the contextual information of complex scenes. The teacher model extracts spectral information and global features among samples, while the student model focuses on capturing local features. The student model learns and mimics the prior knowledge of the teacher model, which improves its discriminative ability. The joint training of the teacher and student models enables comprehensive extraction of land cover features and improves the accuracy of scene classification.
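To make the token construction described in the next paragraph concrete, the following is a minimal PyTorch sketch of how a hyperspectral cube could be turned into the 64 spatial tokens and 32 spectral tokens that feed the joint spatial-spectral self-attention. The embedding size, the input image size, and the use of an additive (rather than concatenated) position embedding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialSpectralTokenizer(nn.Module):
    """Builds 64 spatial tokens + 32 spectral tokens from a hyperspectral cube."""

    def __init__(self, bands=32, img_size=64, grid=8, embed_dim=256):
        super().__init__()
        self.grid = grid
        patch = img_size // grid                        # 8x8 grid -> 64 patches
        # One token per spatial patch: flatten bands * patch * patch values.
        self.spatial_proj = nn.Linear(bands * patch * patch, embed_dim)
        # One token per spectral band: flatten the img_size * img_size band image.
        self.spectral_proj = nn.Linear(img_size * img_size, embed_dim)
        # Learnable classification token used for the teacher's final prediction.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Position information for [cls] + 64 spatial + 32 spectral tokens
        # (added here for simplicity; the paper concatenates a position vector).
        self.pos_embed = nn.Parameter(
            torch.zeros(1, 1 + grid * grid + bands, embed_dim))

    def forward(self, x):                               # x: (B, bands, H, W)
        b, c, h, w = x.shape
        p = h // self.grid
        # Regroup pixels by patch: (B, 64, bands * p * p).
        patches = (x.reshape(b, c, self.grid, p, self.grid, p)
                     .permute(0, 2, 4, 1, 3, 5)
                     .reshape(b, self.grid * self.grid, c * p * p))
        spatial_tok = self.spatial_proj(patches)                    # (B, 64, D)
        spectral_tok = self.spectral_proj(x.reshape(b, c, h * w))   # (B, 32, D)
        cls = self.cls_token.expand(b, -1, -1)                      # (B, 1, D)
        tokens = torch.cat([cls, spatial_tok, spectral_tok], dim=1)
        return tokens + self.pos_embed       # input to the Transformer encoder
```

The resulting (B, 97, embed_dim) sequence can then be fed to a standard multi-head self-attention encoder, so that attention weights are computed jointly over spatial patches and spectral bands.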
Specifically, the image is divided into 64 patches in the spatial dimension and 32 bands in the spectral dimension, and each patch and each band is regarded as a token. Every token is flattened into a row vector and mapped to a fixed dimension through a linear layer. A learnable classification vector is concatenated with the embedded tokens and later used for the teacher model's final prediction, and a position vector is generated and concatenated with these tokens to form the input to the Transformer. The multi-head attention mechanism outputs encoded representations that aggregate information from different subspaces to model global contextual information, which improves the representation capacity and learning effectiveness of the model. Finally, features are integrated through a multilayer perceptron and a classification layer to produce the classification result. The knowledge distillation process consists of two stages. The first stage optimizes the teacher and student models by minimizing a loss function weighted by distillation coefficients; in the second stage, the student model is further fine-tuned with the standard loss function. This scheme uses the supervision of a high-performing complex model to train a simpler model toward higher accuracy and better classification performance: the complex model is referred to as the teacher, and the simpler model as the student. This training mode provides the student model with more informative targets, allowing it to directly learn the generalization ability of the teacher model. Result We compare our model with 10 models, including five traditional CNN classification methods and five recent scene classification methods, on three public datasets: OHID-SC (Orbita hyperspectral image scene classification dataset), OHS-SC (Orbita hyperspectral scene classification dataset), and HSRS-SC (hyperspectral remote sensing dataset for scene classification). The quantitative evaluation metrics include overall accuracy, standard deviation, and the confusion matrix, and the confusion matrices on the three datasets are provided to clearly display the classification results of the proposed algorithm. Experimental results show that our model outperforms all compared methods on the three datasets, improving classification accuracy on OHID-SC, OHS-SC, and HSRS-SC by 13.1%, 2.9%, and 0.74%, respectively, over the second-best model. Comparative experiments on the OHID-SC dataset further show that the proposed algorithm effectively improves the classification accuracy of hyperspectral scenes. Conclusion The proposed SSMD network not only effectively utilizes the target spectral information of hyperspectral data but also explores the feature relationships between the global and local levels; it combines the advantages of traditional and deep learning models and produces more accurate classification results.
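As a concrete illustration of the two-stage distillation scheme described above, the following PyTorch sketch implements a standard temperature-softened distillation loss. The temperature T, the distillation coefficient alpha, and the choice of KL divergence for the soft-target term are common knowledge-distillation conventions assumed here, not values reported by the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Stage 1: soft-target term (teacher supervision) + hard-label term."""
    # Soften both distributions with temperature T so the student can learn
    # the teacher's inter-class similarity structure.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)              # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Stage 2: the student (e.g., VGG16) is further fine-tuned with the plain
# hard-label loss alone: F.cross_entropy(student_logits, labels).
```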
Keywords
hyperspectral scene classification; convolutional neural network (CNN); Transformer; spatial-spectral joint self-attention mechanism; knowledge distillation (KD)