Object detection network with dynamic selection of infrared and visible image features
Abstract
Objective Object detection algorithms based on the fusion of visible and infrared dual-modal images are an effective means of solving detection tasks in complex scenes. However, the feature fusion process in existing dual-modal detection algorithms suffers from two major problems. First, the fusion strategies are relatively simple: element-wise addition or concatenation of features leads to poor fusion results. Second, the algorithm structure contains only a feature fusion step and lacks a feature selection step, so useful features cannot be exploited efficiently. To address these problems, this paper proposes a visible-infrared image fusion object detection algorithm based on dynamic feature selection. Method The algorithm contains two novel modules: a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the multi-source image feature maps several times, enriching the feature representation. The dynamic selection layer is embedded in the neck network and applies three attention mechanisms to enhance the multi-scale feature maps and screen useful features. Result The algorithm is validated on three public datasets: FLIR, LLVIP (visible-infrared paired dataset for low-light vision), and VEDAI (vehicle detection in aerial imagery), and compared with several feature fusion strategies in terms of mean average precision (mAP). Relative to the baseline model, the mAP50 score improves by 1.3%, 0.6%, and 3.9%, the mAP75 score by 4.6%, 2.6%, and 7.5%, and the mAP score by 3.2%, 2.1%, and 3.1% on the respective datasets. Ablation experiments on the relevant structures further verify the effectiveness of the proposed algorithm. Conclusion The proposed visible-infrared image fusion object detection algorithm based on dynamic feature selection effectively fuses the feature information of the two image modalities and improves object detection performance.
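To make the dynamic fusion layer described above concrete, the following is a minimal PyTorch sketch of one plausible Transformer-based cross-attention fusion block. It is an illustrative reconstruction under stated assumptions, not the authors' code: the module name, the head count, and the choice of bidirectional cross-attention are inferred only from the description above.

```python
# Hypothetical sketch of a Transformer-based dynamic fusion layer.
# Visible and infrared feature maps of equal shape are flattened into
# token sequences and fused with cross-attention between the modalities.
import torch
import torch.nn as nn

class DynamicFusionLayer(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions: visible tokens attend to
        # infrared tokens, and infrared tokens attend to visible tokens.
        self.vis_to_ir = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ir_to_vis = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(channels)
        self.norm_ir = nn.LayerNorm(channels)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, C, H, W) feature maps from the two backbone branches.
        b, c, h, w = vis.shape
        vis_seq = vis.flatten(2).transpose(1, 2)  # (B, H*W, C)
        ir_seq = ir.flatten(2).transpose(1, 2)    # (B, H*W, C)
        # Each modality queries the other, so the fused tokens carry
        # complementary information from both branches.
        vis_fused, _ = self.vis_to_ir(vis_seq, ir_seq, ir_seq)
        ir_fused, _ = self.ir_to_vis(ir_seq, vis_seq, vis_seq)
        fused = self.norm_vis(vis_seq + vis_fused) + self.norm_ir(ir_seq + ir_fused)
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Example: fuse one pyramid level of 256-channel feature maps.
layer = DynamicFusionLayer(channels=256)
vis = torch.randn(1, 256, 40, 40)
ir = torch.randn(1, 256, 40, 40)
out = layer(vis, ir)  # (1, 256, 40, 40)
```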
Keywords
Infrared-visible image object detection algorithm using dynamic feature selection
Xu Ke1, Liu Xinpu1, Wang Hanyun2, Wan Jianwei1, Guo Yulan1 (1. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410005, China; 2. School of Surveying and Mapping, Information Engineering University, Zhengzhou 450001, China)
Abstract
Objective In recent years, object detection algorithms that fuse visible and infrared dual-modal images have attracted considerable attention as an effective approach to detection tasks in complex scenes. An object detection pipeline can be roughly divided into three stages. The first stage is feature extraction, which extracts features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are passed to the detection head, which outputs the detection results. Dual-modal detection algorithms follow the same pipeline to localize and classify objects; the difference is that traditional object detection operates on single-modal visible images, whereas dual-modal detection operates on paired visible and infrared images. A dual-modal detection algorithm aims to exploit the information in both modalities simultaneously, merging them to obtain more comprehensive and accurate target information and thereby improving the accuracy and robustness of detection. Traditional fusion methods comprise pixel-level fusion and feature-level fusion. Pixel-level fusion applies a straightforward weighted overlay to the two images, which enhances the contrast and edge information of the targets, whereas feature-level fusion extracts features from the infrared and visible images separately and combines them to strengthen the representation of the targets. However, the feature fusion process of existing dual-modal detection algorithms faces two major issues. First, the fusion strategies are relatively simple: element-wise addition or concatenation of features yields unsatisfactory fusion results and limits the performance of subsequent detection. Second, the algorithm structure focuses solely on feature fusion and neglects the crucial feature selection step, so valuable features are used inefficiently. Method In this study, we introduce a visible and infrared image fusion object detection algorithm that employs dynamic feature selection to address the two issues above. Overall, we enhance the conventional YOLOv5 detector by modifying its backbone, neck, and detection head. We adopt CSPDarkNet53 as the backbone, with identical structures for the visible and infrared branches. The algorithm incorporates two novel modules: a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the multi-source image feature maps several times, enriching the feature representation. The dynamic selection layer is embedded in the neck network and applies three attention mechanisms (scale, spatial, and channel) to enhance the multi-scale feature maps and screen useful features; these mechanisms are implemented with SENet and deformable convolutions. Following standard practice in object detection, we use the YOLOv5 detection head to generate the detection results.
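The three attention mechanisms of the dynamic selection layer can be sketched as follows. This is a hypothetical PyTorch reconstruction under stated assumptions, not the authors' code: the scale branch is modeled as a learned global gate, the spatial branch uses torchvision's DeformConv2d with learned offsets, and the channel branch is an SENet-style squeeze-and-excitation block; the exact arrangement in the paper may differ.

```python
# Hypothetical sketch of a dynamic selection layer applying scale, spatial,
# and channel attention to one level of the neck's multi-scale features.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DynamicSelectionLayer(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Scale attention: a global scalar gate deciding how much this
        # pyramid level contributes.
        self.scale_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Hardsigmoid(),
        )
        # Spatial attention: deformable convolution with learned offsets
        # (2 offsets per position for each of the 3x3 kernel taps = 18).
        self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        # Channel attention: SENet-style squeeze-and-excitation.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.scale_gate(x)       # scale-aware reweighting
        x = self.dcn(x, self.offset(x))  # spatially adaptive sampling
        return x * self.se(x)            # channel-wise feature screening

# Example: enhance a 512-channel neck feature map.
block = DynamicSelectionLayer(channels=512)
feat = torch.randn(1, 512, 20, 20)
out = block(feat)  # (1, 512, 20, 20)
```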
The loss function used for training is the sum of the bounding box regression loss, the classification loss, and the confidence loss, implemented with the generalized intersection over union, cross entropy, and squared-error functions, respectively (a code sketch follows the abstract). Result In this study, we validate the proposed algorithm experimentally on three publicly available datasets: FLIR, the visible-infrared paired dataset for low-light vision (LLVIP), and vehicle detection in aerial imagery (VEDAI). We use the mean average precision (mAP) for evaluation. Compared with the baseline model, which fuses features by element-wise addition, our algorithm improves the mAP50 score by 1.3%, 0.6%, and 3.9% and the mAP75 score by 4.6%, 2.6%, and 7.5% on the respective datasets. It also improves the mAP score by 3.2%, 2.1%, and 3.1%, effectively reducing missed detections and false alarms. Moreover, we conduct ablation experiments on the two proposed modules, the dynamic fusion layer and the dynamic selection layer. The complete model, which incorporates both layers, achieves the best performance on all three test datasets, validating the effectiveness of the proposed design. We also compare the model size and computational efficiency of state-of-the-art algorithms; the experiments show that our algorithm significantly improves detection performance with only a slight increase in parameters and computation. Furthermore, to reveal the mechanism of the dynamic fusion layer, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from the visible and infrared images. Conclusion In this study, we propose a visible and infrared image fusion object detection algorithm based on a dynamic feature selection strategy. The algorithm incorporates two novel modules: a dynamic fusion layer and a dynamic selection layer. Extensive experiments demonstrate that the algorithm effectively integrates feature information from the visible and infrared image modalities and enhances object detection performance. However, the proposed algorithm slightly increases computational complexity and requires the input visible and infrared images to be pre-registered, which limits some application scenarios. Lightweight fusion modules and algorithms capable of processing unregistered dual-modal images will be the focus of future research on multimodal fusion object detection.
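For completeness, below is a minimal sketch of the composite training loss described in the Method section. It assumes matched prediction-target pairs and illustrative loss weights; torchvision's generalized_box_iou_loss (available in recent torchvision releases) stands in for the GIoU term. None of the weights or tensor layouts are values from the paper.

```python
# Hypothetical sketch of the composite detection loss: GIoU box regression
# + cross-entropy classification + squared-error confidence.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(pred_boxes, gt_boxes, pred_logits, gt_labels,
                   pred_conf, gt_conf, w_box=1.0, w_cls=1.0, w_conf=1.0):
    # pred_boxes, gt_boxes: (N, 4) in (x1, y1, x2, y2) format, matched pairs.
    box_loss = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    # pred_logits: (N, num_classes); gt_labels: (N,) class indices.
    cls_loss = F.cross_entropy(pred_logits, gt_labels)
    # pred_conf, gt_conf: (N,) predicted confidences and 0/1 targets.
    conf_loss = F.mse_loss(pred_conf, gt_conf)
    # Weighted sum of the three terms, as described in the abstract.
    return w_box * box_loss + w_cls * cls_loss + w_conf * conf_loss
```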
Keywords