Few-shot remote sensing image scene classification under self-supervised learning
Abstract
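Objective Convolutional neural networks (CNNs) are widely used in remote sensing scene image classification, but the lack of training data remains a problem that cannot be ignored. Few-shot remote sensing scene classification means that a model completes the remote sensing scene image classification task after training on only a small number of samples. Although existing meta-learning-based few-shot remote sensing scene image classification methods remove the dependence on large-scale training data, their generalization ability is still weak. To address this problem, this paper proposes a few-shot remote sensing scene image classification method based on self-supervised learning to improve the generalization ability of the model. Method The proposed method has two stages. First, a teacher network is trained with meta-learning until convergence. Then, the dual student networks and the teacher network make predictions on the same input, and the predictions of the teacher network guide the training of the dual student networks through a distillation loss. In addition, before the image features enter the classifier, self-supervised contrastive learning measures the distance between samples of the same class and their class center, so that the model learns clearer inter-class boundaries. The two self-supervised mechanisms enable the model to learn richer inter-class relationships, thereby improving its generalization ability. Result Experiments are conducted on three datasets: NWPU-RESISC45 (Northwestern Polytechnical University-remote sensing image scene classification), AID (aerial image dataset), and UCMerced LandUse (UC Merced land use dataset). Under the 5-way 1-shot setting, the accuracy of the proposed method reaches 72.72%±0.15%, 68.62%±0.76%, and 68.21%±0.65% on the three datasets, which is 4.43%, 1.93%, and 0.68% higher than the Relation Net* model, respectively. The improvement is maintained as the number of available labels increases: under the 5-way 5-shot setting, the accuracy of the proposed method is 3.89%, 2.99%, and 1.25% higher than Relation Net*, respectively. Conclusion The proposed method enables the model to learn richer intra-class and inter-class relationships and effectively improves the generalization ability of few-shot remote sensing scene image classification models.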
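As a rough illustration of how a teacher's predictions can guide two student networks through a distillation loss, the sketch below computes a KL divergence between temperature-softened teacher outputs and each student's outputs. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the simple averaging over the two students are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_student_distillation_loss(teacher_logits, student_a_logits,
                                   student_b_logits, temperature=4.0):
    """Average KL divergence from the teacher's soft labels to each student.

    Assumption: both students are weighted equally and share one temperature;
    the source does not specify either detail.
    """
    soft_labels = softmax(teacher_logits, temperature)
    losses = [
        kl_divergence(soft_labels, softmax(student_logits, temperature))
        for student_logits in (student_a_logits, student_b_logits)
    ]
    return sum(losses) / len(losses)
```

In this sketch the loss is zero when both students reproduce the teacher's distribution and grows as either student diverges from it, which is the property the distillation stage relies on.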
Keywords
Self-supervised learning based few-shot remote sensing scene image classification
Zhang Rui, Yang Yixin, Li Yang, Wang Jiabao, Miao Zhuang, Li Hang, Wang Ziqi (Command and Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)
Abstract
Objective Convolutional neural networks (CNNs) have been widely used in remote sensing scene image classification, but data-driven models suffer from overfitting and low robustness when training data are scarce. Training a model for remote sensing scene image classification with only a few labeled samples therefore remains challenging, and an effective algorithm that can adapt to small-scale data is required. Few-shot learning can be used to improve the generalization ability of a model. Current meta-learning-based few-shot remote sensing scene image classification methods reduce the dependence on large amounts of training data, but their robustness is still limited. Remote sensing scene samples also exhibit small inter-class variation and large intra-class variation, which may further lower the robustness of few-shot learning. We propose a novel self-supervised learning framework for few-shot remote sensing scene image classification, which improves the generalization ability of the model by learning rich intra-class relationships. Method Our self-supervised learning framework is composed of three modules: data preprocessing, feature extraction, and loss function. 1) The data preprocessing module resizes and normalizes all inputs and constructs the support set and the query set for few-shot learning. The support set contains a small number of labeled images, whereas the query set contains unlabeled samples. A few-shot learning method attempts to classify the query samples using the support set drawn from the same group. The data preprocessing module can construct a large number of such support sets and query sets. 2) The feature extraction module extracts features from the inputs, producing the support features and the query features. The student branch used in knowledge distillation consists of dual feature extraction networks.
The teacher feature extraction module is based on ResNet-50, and the dual student module consists of two Conv-64 networks. 3) The loss function module produces three losses: the few-shot loss, the knowledge distillation loss, and the self-supervised contrastive loss. The few-shot loss, produced by metric-based meta-learning, uses the ground-truth labels to update the parameters of the student networks. The knowledge distillation loss is based on the Kullback-Leibler (KL) divergence, which measures the difference between the probability distributions predicted by the dual student networks and the soft labels predicted by the teacher network. Knowledge distillation follows a two-stage training process. First, the teacher network is trained with metric-based meta-learning. Then, the student networks and the teacher network are fed the same data, and the output of the teacher network guides the learning of the student networks through the knowledge distillation loss. Additionally, the self-supervised contrastive loss is calculated by measuring the distance between class centers. We use the self-supervised contrastive loss to perform an instance discrimination pretext task by reducing the distances between samples of the same class and enlarging the distances between samples of different classes. The two self-supervised mechanisms enable the model to learn richer inter-class relationships, which improves its generalization ability. Result Our method is evaluated on the Northwestern Polytechnical University-remote sensing image scene classification (NWPU-RESISC45) dataset, the aerial image dataset (AID), and the UC Merced land use dataset (UCMerced LandUse). The 5-way 1-shot and 5-way 5-shot tasks are carried out on each dataset. Our method is compared with five other methods, and our baseline is Relation Net*, a metric-based meta-learning method.
For the 5-way 1-shot task, our method achieves 72.72%±0.15%, 68.62%±0.76%, and 68.21%±0.65% on the three datasets, respectively, which is 4.43%, 1.93%, and 0.68% higher than Relation Net*. For the 5-way 5-shot task, our results are 3.89%, 2.99%, and 1.25% higher than Relation Net*, respectively. Confusion matrices are also visualized on AID and UCMerced LandUse; they show that our self-supervised method reduces misclassifications among classes that are hard to distinguish. Conclusion We develop a self-supervised method to address the low robustness caused by data scarcity. It consists of a dual-student knowledge distillation mechanism and a self-supervised contrastive learning mechanism. Dual-student knowledge distillation uses the soft labels of the teacher network as supervision for the student networks, which improves the robustness of few-shot learning through richer inter-class and intra-class relationships. The self-supervised contrastive learning method evaluates the similarity between different class centers in the representation space, which helps the model learn better class centers. The results demonstrate the feasibility of self-supervised distillation and contrastive learning. Future work should further integrate self-supervised transfer learning tasks with few-shot remote sensing scene image classification.
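To make the class-center idea more concrete, the following sketch computes one possible contrastive loss over class centers: each feature is pulled toward the center of its own class and pushed away from the other centers through a softmax over cosine similarities. This is an illustrative formulation only. The abstract does not give the exact loss, so the cosine similarity, the temperature value, and all function names here are assumptions.

```python
import math


def class_centers(features, labels):
    """Mean feature vector of each class (the class center)."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def center_contrastive_loss(features, labels, temperature=0.1):
    """Cross-entropy over similarities to class centers, averaged over samples.

    Assumption: similarity is cosine similarity scaled by a temperature;
    the source only states that distances to class centers are measured.
    """
    centers = class_centers(features, labels)
    total = 0.0
    for vec, label in zip(features, labels):
        sims = {c: cosine(vec, center) / temperature
                for c, center in centers.items()}
        peak = max(sims.values())  # log-sum-exp with max subtraction for stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += log_norm - sims[label]
    return total / len(features)
```

With this formulation the loss is small when samples cluster tightly around well-separated centers and approaches the logarithm of the number of classes when the centers coincide, so minimizing it encourages compact classes with clear boundaries.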
Keywords
few-shot learning; remote sensing scene classification; self-supervised learning; distillation learning; contrastive learning