Semantic segmentation of high-resolution remote sensing images with a multi-source feature adaptive fusion network
Zhang Wenkai1,2, Liu Wenjie1,2,3,4, Sun Xian1,2, Xu Guangluan1,2, Fu Kun1,2 (1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; 2. Key Laboratory of Network Information System Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. University of Chinese Academy of Sciences, Beijing 100190, China; 4. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China) Abstract
Objective In the semantic segmentation of high-resolution remote sensing images, it is difficult to distinguish regions with similar spectral features (such as lawns and trees, or roads and buildings) using visible images alone, and introducing elevation information can significantly improve the classification results. However, the feature distributions of visible images and elevation data differ considerably, and simple fusion by concatenation or addition cannot effectively handle the noise produced when the two modalities are fused, leading to poor fusion results. How to fuse multi-modal features effectively has therefore become a key problem in remote sensing semantic segmentation. To address this problem, this paper proposes a multi-source feature adaptive fusion model. Method The modal features are fused dynamically according to the target category and context information of each pixel, which weakens the influence of fusion noise and makes effective use of the complementary information of multi-modal data. The model consists of three main parts: a dual encoder extracts the features of the spectral and elevation modalities; a modality adaptive fusion module processes the multi-modal features jointly and dynamically uses elevation information to enhance the spectral features according to the target category and context information of each pixel, so that the network can select feature information from a specific modality for particular object categories or spatial locations; and a global context aggregation module models the global context from both spatial and channel perspectives to obtain richer feature representations. Result The experimental results are evaluated both qualitatively and quantitatively. Qualitatively, the segmentation results obtained by the proposed algorithm are more refined. Quantitatively, the model is evaluated on the ISPRS (International Society for Photogrammetry and Remote Sensing) Vaihingen and GID (Gaofen Image Dataset) datasets, achieving overall accuracies of 90.77% and 82.1%, respectively. Compared with algorithms such as DeepLab V3+ and PSPNet (pyramid scene parsing network), the proposed algorithm is clearly superior. Conclusion The experimental results show that the proposed multi-source feature adaptive fusion network can fuse modal features effectively, model global contextual relationships more efficiently, and can be widely applied in the remote sensing field.
Keywords
semantic segmentation; remote sensing images; multi-modal data; modality adaptation fusion; global context aggregation
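The modality adaptive fusion idea described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is an assumption for illustration only, not the authors' published implementation: the module name ModalAdaptiveFusion, the 1 × 1 convolution gate, and the residual-style injection of elevation features are all hypothetical, but the sketch shows how elevation features can be weighted per pixel and per channel before enhancing the spectral features.

```python
# Illustrative sketch (assumed design, not the paper's exact implementation):
# a gate predicted from both modalities decides, per location and channel,
# how much elevation information is injected into the spectral features.
import torch
import torch.nn as nn


class ModalAdaptiveFusion(nn.Module):
    """Fuse spectral (RGB) and elevation (DSM) feature maps with a learned gate."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel, per-channel gate from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, spectral: torch.Tensor, elevation: torch.Tensor) -> torch.Tensor:
        # g in [0, 1]: small values suppress noisy elevation responses,
        # large values let the elevation cue reinforce the spectral feature.
        g = self.gate(torch.cat([spectral, elevation], dim=1))
        return spectral + g * elevation


if __name__ == "__main__":
    rgb_feat = torch.randn(2, 256, 32, 32)   # features from the spectral encoder
    dsm_feat = torch.randn(2, 256, 32, 32)   # features from the elevation encoder
    fused = ModalAdaptiveFusion(256)(rgb_feat, dsm_feat)
    print(fused.shape)  # torch.Size([2, 256, 32, 32])
```

A gate close to 0 suppresses elevation noise at that position, while a gate close to 1 lets the elevation cue dominate, which matches the abstract's goal of selecting modality-specific information for particular object categories and spatial locations.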
Multi-source features adaptation fusion network for semantic segmentation in high-resolution remote sensing images
Zhang Wenkai1,2, Liu Wenjie1,2,3,4, Sun Xian1,2, Xu Guangluan1,2, Fu Kun1,2(1.Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;2.The Key Laboratory of Network Information System Technology (NIST), Chinese Academy of Sciences, Beijing 100190, China;3.University of Chinese Academy of Sciences, Beijing 100190, China;4.School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China) Abstract
Objective In the semantic segmentation of high-resolution remote sensing images, it is difficult to distinguish regions with similar spectral features (such as lawns and trees, or roads and buildings) using visible images alone. Most existing neural-network-based methods focus on extracting spectral and contextual features through a single encoder-decoder network, while geometric features are often not fully exploited. The introduction of elevation information can significantly improve the classification results. However, the feature distributions of visible images and elevation data are quite different. Simply cascading the multi-modal feature streams fails to exploit the complementary information of multi-modal data in the early, intermediate, and late stages of the network. Simple fusion methods based on concatenation or addition cannot effectively suppress the noise generated by multi-modal fusion, which degrades the results. In addition, high-resolution remote sensing images usually cover a large area, and the target objects vary greatly in size and are unevenly distributed, so long-range relationships must be modeled to extract contextual features. Method We propose a multi-source features adaptation fusion network (MSFAFNet). To dynamically recalibrate the scene-contextual feature maps, a modal adaptive fusion block explicitly models the correlations between the two modal feature maps. To reduce the influence of fusion noise and effectively exploit the complementary information of multi-modal data, the modal features are fused dynamically according to the target categories and context information of pixels. Meanwhile, a global context aggregation module improves the feature representation ability of the fully convolutional network by modeling long-range relationships between pixels. The model consists of three parts: 1) a dual encoder extracts the features of the spectral modality and the elevation modality; 2) a modality adaptation fusion block processes the multi-modal features jointly and dynamically uses elevation information to enhance the spectral features; 3) a global context aggregation module models the global context from the spatial and channel perspectives. Result Our efficient unimodal segmentation architecture (EUSA) is evaluated on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Gaofen Image Dataset (GID) validation sets, achieving overall accuracies of 90.64% and 82.1%, respectively. Specifically, EUSA improves the overall accuracy and mean intersection over union by 1.55% and 3.05%, respectively, over the baseline on the ISPRS Vaihingen test set, while introducing only a small number of additional parameters and little extra computation. The proposed modal adaptive block increases the overall accuracy and mean intersection over union by 1.32% and 2.33%, respectively, on the ISPRS Vaihingen test set. MSFAFNet achieves superior performance on the ISPRS Vaihingen test set, reaching an overall accuracy of 90.77%. Conclusion Our experimental results show that the efficient single-modality segmentation framework EUSA can model long-range contextual relationships between pixels. To improve the segmentation of regions in shadow or with similar textures, we propose MSFAFNet, which extracts more effective features from elevation information.
Keywords
semantic segmentation; remote sensing images; multi-modal data; modality adaptation fusion; global context aggregation
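For the global context aggregation module, the abstract states only that global context is modeled from the spatial and channel perspectives. The sketch below is therefore a hedged illustration combining a position self-attention branch with a squeeze-and-excitation style channel branch; the class name GlobalContextAggregation, the reduction ratio, and the residual combination are assumptions rather than the paper's exact design.

```python
# Rough sketch (assumed design): global context from two perspectives,
# self-attention over spatial positions plus channel reweighting.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContextAggregation(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Spatial branch: query/key/value projections for position self-attention.
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        # Channel branch: global pooling followed by a bottleneck MLP.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Spatial (position) attention: every pixel attends to all others.
        q = self.q(x).flatten(2).transpose(1, 2)                 # (b, hw, c/r)
        k = self.k(x).flatten(2)                                 # (b, c/r, hw)
        v = self.v(x).flatten(2).transpose(1, 2)                 # (b, hw, c)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)   # (b, hw, hw)
        spatial = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Channel attention: reweight channels from a global descriptor.
        weight = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x + spatial + x * weight


if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)
    out = GlobalContextAggregation(256)(feat)
    print(out.shape)  # torch.Size([2, 256, 32, 32])
```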