An Anchor-Free Oriented Object Detection Method Combining Supervised Attention

Yu Lingxiao, Hao Jie, Zuo Liang (Nanjing University of Aeronautics and Astronautics)

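Abstract
Objective: Multi-scale objects with arbitrary orientations and complex background information make object detection in remote sensing images more challenging than general object detection tasks. Although some existing detection models have achieved satisfactory results, they are mainly anchor-based, and their detection performance depends heavily on the design of the predefined anchor boxes. Building on Faster RCNN, this paper combines the anchor-free detection idea with a supervised mask attention technique and proposes a new two-stage anchor-free detection model. Method: First, we build a supervised mask attention module on the feature pyramid extracted by the backbone network. Through an attention mechanism and mask supervision, it guides the detection network to focus on object regions and reduces interference from background noise, thereby improving the quality of object features. Second, by combining the regression idea of FCOS with a center-point offset technique, we design a keypoint-based anchor-free oriented region proposal network, and we adopt a dynamically adjusted soft-label strategy in the training stage to assign labels reasonably, thereby improving detection accuracy. Results: We conducted extensive experiments on two public remote sensing datasets, DOTA (Dataset for Object Detection in Aerial Images) and HRSC (High-Resolution Ship Collections). The mean average precision (mAP) reached 76.36% and 90.51%, respectively, surpassing most oriented detection models and demonstrating the advancement and effectiveness of the proposed method. Conclusion: By combining an anchor-free design of the region proposal network, a supervision method, and an attention mechanism, the proposed detection model can effectively adapt to oriented object detection in complex remote sensing images.
Keywords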
An anchor-free oriented object detection method that combines supervised attention

Yu Lingxiao, Hao Jie, Zuo Liang (Nanjing University of Aeronautics and Astronautics)

Abstract
Objective: With the rise of convolutional networks in recent years, deep learning-based object detection algorithms for remote sensing images have become highly effective. Compared with object detection in natural scenes, object detection in remote sensing images faces more challenges, including arbitrary orientation, dense arrangement, and multi-scale distribution. Traditional annotations use horizontal bounding boxes aligned with the coordinate axes to represent object locations. However, using horizontal boxes as anchor boxes or proposals for detecting oriented objects has significant drawbacks. Object detection in remote sensing images therefore usually employs oriented bounding boxes to describe objects accurately. Most existing oriented object detection algorithms, however, identify candidate regions through predefined anchor boxes. Although predefined anchor boxes help these algorithms recognize objects of different scales and shapes to some extent, they also have significant drawbacks. First, because the anchor boxes are predefined, detection accuracy degrades when object dimensions deviate from the predefined ones, and missed or false detections may even result. In addition, the number, scale, and aspect ratio of the anchor boxes must be determined by experience or parameter tuning, and this experience-dependent design may limit the model's ability to generalize to different scenarios or datasets. Second, to achieve a high recall rate, anchor-based object detection models usually define a large number of anchor boxes to cover objects of various sizes and aspect ratios, which significantly increases computational complexity and further exacerbates the imbalance between positive and negative samples.
Additionally, remote sensing images usually show highly complex scenes, such as dense urban buildings, which introduce a large amount of distracting information. This makes it difficult for traditional feature extraction networks built on classic backbone networks and feature pyramid networks to extract and highlight the salient features of objects accurately. To address these challenges, this paper proposes a novel high-precision oriented object detection model for remote sensing images that combines a supervised mask attention module (SMAM) with an anchor-free oriented region proposal network (AFORPN). Method: We propose a two-stage detector based on Faster RCNN for oriented object detection in remote sensing images. The model consists of four components: the feature extraction backbone network, the SMAM, the AFORPN, and the region-based convolutional neural network (RCNN). The main contributions are as follows. To focus on object regions, suppress interference from background noise, and extract fine features, we construct an SMAM on the extracted feature pyramid. The SMAM consists of three parts: multi-scale feature fusion, spatial attention, and supervised mask enhancement. First, for the multi-scale fusion of the feature pyramid, we adopt sub-pixel convolution, which upsamples the feature maps through learned convolution and channel rearrangement so that all levels share the same spatial scale; this retains more object information than nearest-neighbor upsampling. A channel attention mechanism then learns a weight coefficient for each channel to fuse the feature maps of different levels adaptively, improving the model's representation of the input features.
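The channel rearrangement step of sub-pixel convolution can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common pixel-shuffle channel ordering, omits the learned convolution that precedes the rearrangement, and uses plain nested lists instead of tensors. The function name `pixel_shuffle` and the upscale factor `r` are illustrative choices.

```python
def pixel_shuffle(feature, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed ordering (hypothetical, matching the common pixel-shuffle
    convention): input channel c*r*r + i*r + j fills the output
    positions (h*r + i, w*r + j) of output channel c.
    """
    channels_in = len(feature)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(feature[0]), len(feature[0][0])
    channels_out = channels_in // (r * r)
    out = [
        [[0.0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = feature[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become one 2x2 channel: resolution doubles,
# and every output value comes from a distinct input channel.
coarse = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
upsampled = pixel_shuffle(coarse, 2)
```

Because each output pixel is drawn from its own channel, the preceding convolution can learn a different value for every sub-pixel position, whereas nearest-neighbor upsampling only copies existing values.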
Next, to process the fused feature maps, we employ a self-attention mechanism in the spatial attention module, so that each pixel in the feature map takes into account the information from all other pixels and a global dependency is established. This helps the model better understand the image background and semantic correlations, enhancing its ability to comprehend the surrounding environment and suppress the blending effects induced by fusion. Finally, in the supervised mask enhancement module, we use feedback from a mask loss to guide the model to learn semantic features and object contour information, which significantly improves the accuracy of classification and localization. To avoid complex and redundant anchor box designs, we propose an AFORPN, which consists of a localization branch and a classification branch. In the localization branch, we adopt the keypoint regression method of FCOS and introduce the gliding vertex technique to generate oriented bounding boxes. Predicting midpoint offsets alleviates the sensitivity to angle variations, so that the bounding boxes conform better to object shapes and keypoint regression improves. In the training stage, we propose a new label assignment criterion based on the spatial alignment between samples and objects and on the regression performance of the samples. The weight given to regression performance is gradually increased through dynamic adjustment, which yields accurate label assignment and mitigates the inconsistency between classification and regression that static heuristic assignment rules may cause.
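The dynamically adjusted label assignment can be pictured with a small sketch. The abstract does not give the actual criterion, so everything here is an assumption: the soft label is taken to be a convex combination of a spatial alignment score and a regression quality score (for example, the IoU between the predicted and ground-truth boxes), and the regression weight is assumed to grow linearly over training. The names `regression_weight`, `soft_label`, `w_start`, and `w_end` are hypothetical.

```python
def regression_weight(step, total_steps, w_start=0.2, w_end=0.8):
    """Linearly increase the regression weight during training.

    The linear schedule and the endpoints 0.2 and 0.8 are assumptions
    made for illustration; the abstract does not specify them.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress


def soft_label(alignment, regression_quality, step, total_steps):
    """Blend spatial alignment and regression quality into a soft label.

    Both inputs are assumed to lie in [0, 1]. Early in training the
    label follows spatial alignment; later it follows regression quality.
    """
    w = regression_weight(step, total_steps)
    return (1.0 - w) * alignment + w * regression_quality


# A well-centered sample whose box regresses poorly: its soft label
# is high early on and shrinks as regression quality gains weight.
early = soft_label(alignment=1.0, regression_quality=0.0, step=0, total_steps=100)
late = soft_label(alignment=1.0, regression_quality=0.0, step=100, total_steps=100)
```

Under these assumptions, samples are first ranked mostly by where they fall relative to the object and are re-ranked by how well they localize it once the regression branch becomes reliable, which is one way a classification score can be kept consistent with localization quality.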
Results: To demonstrate the effectiveness of our method, we performed an ablation analysis on the DOTA dataset. The experimental results show that the proposed SMAM and AFORPN improve the mean average precision (mAP) by 1.52% and 2.65%, respectively, over the baseline. To compare our approach with the state of the art, we evaluated it on DOTA and HRSC2016 against other oriented detection algorithms. Without special processing, we achieved an mAP of 75.36% on the DOTA dataset, which is higher than that of most oriented object detection models. On the HRSC dataset, our model achieved an mAP of 90.51%. Conclusion: Extensive experimental results demonstrate that the proposed SMAM improves the quality of feature maps, while the proposed AFORPN generates high-quality region proposals, further enhancing the detection of oriented objects. The proposed oriented detection model, which combines SMAM with AFORPN, exhibits promising detection capabilities and can effectively adapt to complex oriented object detection scenarios in remote sensing images.
Keywords
