Real-time RGB-D salient object detection via complementary feature interaction and fusion

Ye Xinyue, Zhu Lei, Wang Wenwu, Fu Yun (School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China)

Abstract
Objective By fusing color, depth, and spatial information, salient object detection schemes that exploit both the RGB and depth modalities of RGB-D data usually achieve more accurate predictions than single-modality methods, and deep learning has further advanced the field of RGB-D salient object detection. However, existing deep RGB-D salient object detection networks tend to ignore modality specificity and usually fuse multimodal features only through simple element-wise addition, multiplication, or feature concatenation, so the information interaction between RGB images and depth images lacks a principled explanation. To explore the importance of the complementary information in the two modalities and a more effective way for them to interact, we analyze the gating property of the rectified linear unit (ReLU) in conventional convolutional networks, design a new complementary information interaction mechanism between RGB and depth features on this basis, and apply it to RGB-D salient object detection for the first time. Method First, based on this mechanism, a complementary information interaction module is proposed that uses the "redundant" features of each modality to assist the other. The module is then inserted stage-wise into two lightweight backbone networks that extract RGB and depth features, respectively, and carry out the interaction between them. Its core function is based on a modified ReLU and has a simple structure. At the top layer of the network, a cross-modal feature fusion module is further designed to extract the global semantic information of the fused features. This feature is fed to every scale of the backbone and aggregated with multi-scale features through a neighborhood-scale feature enhancement module. Finally, three supervision strategies, namely depth recovery supervision, edge supervision, and deep supervision, are adopted to supervise the optimization of the proposed model effectively. Result Quantitative and qualitative results on four widely used public datasets, NJU2K (Nanjing University 2K), NLPR (national laboratory of pattern recognition), STERE (stereo dataset), and SIP (salient person), show that, evaluated with three mainstream measures, Max F-measure, MAE (mean absolute error), and Max E-measure, the proposed salient object detection model achieves better performance than competing methods together with a remarkable inference speed advantage (373.8 frames/s). Conclusion This work demonstrates that the two modalities provide complementary information for RGB-D salient object detection; the proposed model offers good performance and highly efficient inference, and thus has good practical value.
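To make the modified-ReLU interaction concrete, the following is a minimal PyTorch sketch of one plausible reading of the mechanism: the negative pre-activations that a standard ReLU would zero out in one stream (its "redundant" features) are adapted and injected into the other stream. The module name, the 1×1 adapters, and the exact routing are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class ComplementaryInteraction(nn.Module):
    # Hypothetical sketch of the modified-ReLU interaction described
    # above: the responses a standard ReLU would discard in one
    # modality are transformed and handed to the other modality.
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs adapting the discarded responses (assumed design).
        self.depth_to_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.rgb_to_depth = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb, depth):
        # Positive parts: what a standard ReLU keeps.
        rgb_kept, depth_kept = torch.relu(rgb), torch.relu(depth)
        # Negative parts, min(x, 0): what a standard ReLU discards.
        rgb_dropped = -torch.relu(-rgb)
        depth_dropped = -torch.relu(-depth)
        # Each stream is augmented with the other's discarded part.
        rgb_out = rgb_kept + torch.relu(self.depth_to_rgb(depth_dropped))
        depth_out = depth_kept + torch.relu(self.rgb_to_depth(rgb_dropped))
        return rgb_out, depth_out

# Quick shape check.
m = ComplementaryInteraction(64)
r, d = m(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
assert r.shape == d.shape == (1, 64, 56, 56)
```

Because the module only rearranges ReLU outputs and adds two 1×1 convolutions, it keeps the structural simplicity the abstract attributes to the mechanism.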
Keywords
RGB-D salient object detection algorithm based on complementary information interaction

Ye Xinyue, Zhu Lei, Wang Wenwu, Fu Yun (School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China)

Abstract
Objective By fusing color, depth, and spatial information, RGB-D salient object detection typically achieves more accurate predictions than methods using a single modality, and the rise of deep learning has further propelled the development of this field. However, existing RGB-D deep networks for salient object detection often overlook the specificity of the two modalities: they typically rely on simple fusion operations, such as element-wise addition, multiplication, or feature concatenation, to combine multimodal features. These simple fusion techniques lack a reasonable explanation of the interaction between RGB and depth images; they neither effectively exploit the complementary information between the RGB and depth modalities nor their potential correlations. Therefore, more effective ways of enabling the information interaction between RGB images and depth images must be found to obtain more accurate detection results. To explore the importance of the complementary information in the two modalities and a more effective interaction scheme, we analyze the gating behavior of the rectified linear unit (ReLU) in conventional convolutional networks and, on this basis, propose a new complementary information interaction mechanism between RGB and depth features. This mechanism is applied to RGB-D salient object detection for the first time. Method First, on the basis of this mechanism, a complementary information interaction module is proposed that uses the "redundant" features of each modality to assist the other. The module is then inserted stage-wise into two lightweight backbone networks that extract RGB and depth features, respectively, and implement the interaction between them. Its core function is based on a modified ReLU and has a simple structure. At the top layer of the network, a cross-modal feature fusion module is designed to extract the global semantic information of the fused features. This feature is passed to each scale of the backbone network and aggregated with multiscale features via a neighborhood-scale feature enhancement module. In this manner, the model captures not only local, scale-aware features but also global semantic information, thereby improving the accuracy and robustness of salient object detection.
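A minimal sketch of how the top-layer cross-modal fusion and the neighborhood-scale aggregation could look, assuming global average pooling for the global semantics and bilinear resampling across neighboring scales; the class names and operator choices are illustrative assumptions rather than the paper's exact modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    # Hypothetical: fuse top-level RGB and depth features, then squeeze
    # them into a global semantic descriptor shared by every scale.
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, rgb_top, depth_top):
        fused = self.fuse(torch.cat([rgb_top, depth_top], dim=1))
        return F.adaptive_avg_pool2d(fused, 1)   # (B, C, 1, 1)

class NeighborScaleEnhance(nn.Module):
    # Hypothetical: aggregate a scale with its finer/coarser neighbors
    # and modulate the result with the broadcast global semantics.
    def __init__(self, channels: int):
        super().__init__()
        self.merge = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, finer, current, coarser, global_sem):
        size = current.shape[-2:]
        finer = F.interpolate(finer, size=size, mode='bilinear',
                              align_corners=False)
        coarser = F.interpolate(coarser, size=size, mode='bilinear',
                                align_corners=False)
        merged = self.merge(torch.cat([finer, current, coarser], dim=1))
        # Gate the merged features with the global semantic vector.
        return merged * torch.sigmoid(global_sem)
```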
Meanwhile, three supervision strategies are adopted to supervise the optimization of the model effectively. First, depth recovery supervision constrains the accuracy of the recovered depth information to ensure the reliability of the depth features. Second, edge supervision guides the model to capture the boundary information of salient objects and improves localization accuracy. Finally, deep supervision further improves performance by enforcing consistency between the fused features and the ground-truth saliency map. Result Quantitative and qualitative experiments on four widely used public datasets, Nanjing University 2K (NJU2K), national laboratory of pattern recognition (NLPR), stereo dataset (STERE), and salient person (SIP), show that the proposed salient object detection model has remarkable advantages on three main evaluation measures: Max F-measure, mean absolute error (MAE), and Max E-measure. The model performs particularly well on the SIP dataset, where it achieves the best results. In addition, the model runs at a remarkable 373.8 frames/s with only 10.8 M parameters. Compared with the other six methods, the proposed complementary information interaction module brings a clear improvement in salient object detection. By exploiting the complementary information of RGB and depth features and through the design of the cross-modal feature fusion module, the model better captures the global semantic information of salient objects and improves the accuracy and robustness of detection. Conclusion The proposed salient object detection model builds on the design of the complementary information interaction module, lightweight backbone networks, and the cross-modal feature fusion module. It makes full use of the complementary information of RGB and depth features and achieves remarkable performance gains through the optimized network structure and supervision strategies. Compared with other methods, the model shows better accuracy, robustness, and computational efficiency. This work is crucial to deepening the understanding of the importance of multimodal data fusion in RGB-D data and to promoting research and application in the field of salient object detection.
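One plausible way to combine the three supervision signals named above is sketched below; the unit loss weights, the binary cross-entropy and L1 loss choices, and the argument names are assumptions, not the paper's exact objective.

```python
import torch.nn.functional as F

def total_loss(pred_sal, side_outs, pred_edge, pred_depth,
               gt_sal, gt_edge, gt_depth):
    # Deep supervision: final map and every side output vs ground truth.
    l_sal = F.binary_cross_entropy_with_logits(pred_sal, gt_sal)
    for s in side_outs:
        s = F.interpolate(s, size=gt_sal.shape[-2:], mode='bilinear',
                          align_corners=False)
        l_sal = l_sal + F.binary_cross_entropy_with_logits(s, gt_sal)
    # Edge supervision: predicted boundary map vs object boundaries.
    l_edge = F.binary_cross_entropy_with_logits(pred_edge, gt_edge)
    # Depth recovery supervision: reconstructed vs input depth map.
    l_depth = F.l1_loss(pred_depth, gt_depth)
    # Unit weights are placeholders; the paper's weighting is unknown.
    return l_sal + l_edge + l_depth
```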
Keywords
