A Multi-level Perceptual Conditional Random Field Model for Monocular Depth Estimation
Jia Di, Song Huilun, Zhao Chen, Xu Chi (Liaoning Technical University)
Abstract
Objective Estimating scene depth from a single image has become a research hotspot in computer vision. Existing methods often regress depth by increasing network complexity, which raises training cost and time complexity. To address this, a multi-level perceptual conditional random field model for monocular depth estimation is proposed.
Method First, an adaptive hybrid pyramid feature fusion strategy captures short-range and long-range dependencies between positions in the image, effectively aggregating global and local contextual information and enabling efficient information transfer (a sketch of this idea is given below). Second, a conditional random field decoding mechanism is introduced to finely capture spatial dependencies between pixels. Third, a dynamic scaling attention mechanism strengthens the perception of dependencies between image regions, and a bias learning unit keeps the network from falling into extreme values, ensuring model stability. Finally, to improve the interaction between feature modalities, a hierarchical perception adapter expands the feature mapping dimension to enhance spatial and channel interaction, improving the model's feature learning ability.
Result Ablation experiments on the NYU Depth v2 dataset show that the proposed network significantly improves all performance metrics: compared with previous state-of-the-art methods, the Absolute Relative Error (Abs Rel) falls below 0.1, a 6.8% reduction, and the Root Mean Square Error (RMSE) drops by 13.1%. To verify practicality in real road environments, comparative experiments on the KITTI dataset show that these metrics outperform current mainstream methods, with RMSE reduced by 53% and threshold accuracy (δ) approaching 100%. In addition, strong generalization is verified on the MatterPort3D dataset. Visualization results show that the method estimates depth in difficult regions of complex scenes more accurately.
Conclusion The multi-level feature extractor and hybrid pyramid feature fusion strategy optimize information transfer between the encoder and decoder, and pixel-level outputs obtained through fully connected decoding effectively improve monocular depth estimation accuracy.
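The adaptive hybrid pyramid feature fusion strategy described in the Method combines parallel branches with different receptive fields and weights them per position. Below is a minimal PyTorch-style sketch of that idea; the module name, branch choices (local, dilated, and global), and gating design are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of an adaptive multi-kernel pyramid fusion block.
# Illustrative only: names and branch choices are assumptions.
import torch
import torch.nn as nn

class HybridPyramidFusion(nn.Module):
    """Fuse branches with different receptive fields, weighting them
    adaptively so each position trades local detail against context."""
    def __init__(self, channels: int):
        super().__init__()
        # Short-range branch: small kernel, local detail.
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        # Long-range branch: dilated kernel, enlarged receptive field.
        self.context = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        # Global branch: image-level statistics.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.global_proj = nn.Conv2d(channels, channels, 1)
        # Adaptive per-position weights over the three branches.
        self.gate = nn.Sequential(
            nn.Conv2d(3 * channels, 3, kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        l = self.local(x)
        c = self.context(x)
        g = self.global_proj(self.global_pool(x)).expand_as(x)
        w = self.gate(torch.cat([l, c, g], dim=1))  # (B, 3, H, W)
        return w[:, 0:1] * l + w[:, 1:2] * c + w[:, 2:3] * g
```

Mixing kernel shapes this way captures the short- and long-distance dependencies the abstract refers to while keeping the block cheap enough to place at every fusion point between the encoder and decoder.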
Keywords: conditional random field; hybrid pyramid feature fusion; dynamic scaling attention; hierarchical awareness adapter
A Multi-level Perceptual Conditional Random Field Model for Monocular Depth Estimation
Jia Di, Song Huilun, Zhao Chen, Xu Chi (Liaoning Technical University)
Abstract
Objective Predicting scene depth from a single RGB image is a complex and challenging problem. Accurate depth estimation is essential in many computer vision applications, including 3D reconstruction, autonomous driving, and robot navigation, yet recovering depth from a two-dimensional image is inherently difficult because of its ambiguity and the absence of explicit depth cues. Modern approaches build intricate neural networks that regress depth maps directly, relying on deep learning and large quantities of labeled data to learn the relationships between RGB pixels and their associated depth values. Although such methods have shown promising results, they frequently suffer from computational inefficiency, overfitting, and poor generalization. This work proposes a multi-level perceptual conditional random field model built on the Swin Transformer.
Method First, an adaptive hybrid pyramid feature fusion approach forms a fundamental component of the architecture. It is designed to capture the dependencies among spatial positions, covering both short-range and long-range linkages. By combining fusion branches with different kernel shapes, it gathers both global and local contextual information, providing a thorough understanding of the input; this consolidation guarantees smooth information flow within the model and substantially boosts the discriminative power of the feature representations, so the model recognizes complex patterns and structures more reliably. Second, the decoder incorporates dynamic scaling attention, which strengthens the model's capacity to capture dependency relationships among regions of the input image: the attention mechanism concentrates on the most relevant areas while suppressing irrelevant or noisy responses, improving efficiency and robustness against distortions and noise. A dedicated update-initialization mechanism, realized as a bias learning unit, identifies and adjusts the most appropriate parameters for the task; it avoids the limitations of plain linear projection and keeps the network away from extreme values, yielding smoother and more stable training. Finally, a hierarchical perception adapter handles the interplay among feature modalities. Acting as an intermediary between feature representations, it expands the feature mapping dimension and enables richer interaction across spatial positions and channels. This interaction markedly improves the model's feature learning ability, which matters most when multiple sources of information must be combined, as in image recognition, object detection, or semantic segmentation. Minimal sketches of the attention and adapter modules follow.
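To make the decoder-side components concrete, the following are minimal, hedged PyTorch sketches. For dynamic scaling attention, a learned per-head temperature (clamped for stability, in the spirit of the bias learning unit's guard against extreme values) replaces the fixed 1/sqrt(d) factor; all names and the exact parameterization are assumptions, not the paper's modules.

```python
# Hedged sketch of dynamic scaling attention: scaled dot-product
# attention with a learned, clamped per-head temperature and a learned
# logit bias. Names and parameterization are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicScalingAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head log-temperature, initialized to the
        # conventional 1/sqrt(head_dim) scaling.
        init = math.log(1.0 / math.sqrt(self.head_dim))
        self.log_scale = nn.Parameter(torch.full((num_heads, 1, 1), init))
        # Learned additive bias on the attention logits; together with
        # the clamp below it keeps the softmax away from extreme values.
        self.bias = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def heads(t):
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)
        scale = self.log_scale.clamp(max=math.log(100.0)).exp()
        attn = F.softmax((q @ k.transpose(-2, -1)) * scale + self.bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The hierarchical perception adapter is sketched here as an expand-interact-project bottleneck: the channel dimension is widened, spatial interaction is modelled with a depthwise convolution and channel interaction with a pointwise one, then the result is projected back through a residual connection. Again, this is an assumption about the design, not the published module.

```python
# Hedged sketch of a hierarchical perception adapter.
import torch
import torch.nn as nn

class HierarchicalPerceptionAdapter(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.up = nn.Conv2d(channels, hidden, 1)       # expand mapping dim
        self.spatial = nn.Conv2d(hidden, hidden, 3,
                                 padding=1, groups=hidden)  # spatial interaction
        self.channel = nn.Conv2d(hidden, hidden, 1)    # channel interaction
        self.down = nn.Conv2d(hidden, channels, 1)     # project back
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.up(x))
        y = self.act(self.spatial(y))
        y = self.act(self.channel(y))
        return x + self.down(y)  # residual keeps the adapter lightweight
```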
Result Comparative tests and ablation studies were conducted on the NYU Depth v2 dataset to assess the performance of the proposed network. The results show a notable improvement in all performance metrics, confirming the superiority of the approach. The method outperformed previous advanced methods by a margin of 6.8% on the Absolute Relative Error (Abs Rel) indicator, reaching a low error of 0.088, which demonstrates the precision of the network in estimating depth from a single RGB image. It also achieved a Root Mean Squared Error (RMSE) of 0.0316, a 13% improvement, showing the model's ability to handle intricate scenes and generate precise depth maps. Furthermore, the approach yielded a 5% improvement in both the Root Mean Square Logarithmic Error (log RMSE) and the Squared Relative Error (Sq Rel), highlighting the model's robustness on pixels with extreme depth values. Accuracy also improved substantially: the threshold indicator, which quantifies the proportion of estimated depths within a fixed ratio of the ground truth, rose by 10%, and the most demanding threshold criterion scored 99.8%, nearing the optimal level. To confirm effectiveness in real-world conditions, the model was pre-trained on the KITTI dataset for 50 epochs; with its realistic urban scenes, KITTI is a challenging benchmark for generalization. Compared with existing advanced depth estimation methods, the proposed approach improved on all assessment criteria on KITTI: RMSE improved by 53%, and the threshold indicators (δ) reached about 100% accuracy, demonstrating robustness to the complexity and variation of street scenes. In addition, strong generalization was verified on the MatterPort3D dataset, with significant improvements on all indicators. The metric definitions are summarized in the sketch below.
Conclusion This paper presents a multi-level feature extractor that substantially improves on the Swin Transformer design. By reducing the semantic gap between the encoder and decoder, it enables more accurate and seamless information transfer. A hybrid pyramid feature fusion methodology, central to the design, extracts and integrates features at multiple scales, capturing contextual information at both local and global levels; this ensures the decoder receives rich, relevant feature representations while bridging the semantic gap, raising both output quality and overall network efficiency. Moreover, the approach includes fully connected decoding, which markedly improves the precision of monocular depth estimation and produces more accurate depth maps than conventional techniques.
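For reference, the quantities reported above (Abs Rel, Sq Rel, RMSE, log RMSE, and the threshold accuracies δ) are the standard monocular-depth evaluation measures. A minimal NumPy implementation over arrays of positive valid-depth pixels:

```python
# Standard monocular depth estimation metrics; pred and gt are
# positive depth arrays of equal shape (invalid pixels masked out).
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    err = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(err) / gt),
        "sq_rel": np.mean(err ** 2 / gt),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "log_rmse": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        # delta_i: fraction of pixels with max(pred/gt, gt/pred) < 1.25**i
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```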
Keywords
Conditional random field; Hybrid pyramid feature fusion; Dynamic scaling attention; Hierarchical awareness adapter