A review of monocular visual-inertial simultaneous localization and mapping (SLAM) methods
(1. State Key Laboratory of CAD&CG, Zhejiang University; 2. SenseTime Research)
Abstract
Monocular visual-inertial simultaneous localization and mapping (VI-SLAM) has attracted widespread attention for its low hardware cost and its freedom from any need to instrument the environment. It has made great progress over the past decade or so, and many excellent methods and systems have emerged. Owing to the complexity of real-world scenes, however, different methods inevitably have their own limitations. Although some work has surveyed and evaluated VI-SLAM methods, most of it covers only classic VI-SLAM methods and no longer adequately reflects the latest state of VI-SLAM technology. This paper first explains the basic principles of monocular VI-SLAM and then classifies and analyzes monocular VI-SLAM methods. To compare the strengths and weaknesses of different methods comprehensively, three public datasets are specifically selected to quantitatively evaluate representative monocular VI-SLAM methods along multiple dimensions, systematically analyzing the performance of each class of methods in practical scenarios, especially augmented reality (AR) applications. The experimental results show that optimization-based methods and methods combining filtering with optimization generally have an advantage in tracking accuracy and robustness over filtering-based methods; direct/semi-direct methods achieve high accuracy under global-shutter capture but are easily affected by rolling shutter and illumination changes, and in particular accumulate error quickly in large scenes; and incorporating deep learning can improve robustness in extreme conditions. Finally, the development trends of SLAM are discussed and prospected with respect to three research hotspots: the combination of deep learning with V-SLAM/VI-SLAM, multi-sensor fusion, and end-cloud collaboration.
Keywords
A review of monocular visual-inertial SLAM
Zhang Guofeng1,2, Huang Gan1,2, Xie Weijian1,2, Chen Danpeng1,2,3, Wang Nan3, Liu Haomin3, Bao Hujun1,2 (1. Zhejiang University; 2. State Key Lab of CAD&CG; 3. SenseTime Research)
Abstract
Monocular visual-inertial simultaneous localization and mapping (VI-SLAM) is an important research topic in computer vision and robotics. It aims to estimate the pose (i.e., the position and orientation) of a device in real time using a monocular camera together with an inertial sensor, while constructing a map of the environment. With the rapid development of fields such as augmented/virtual reality (AR/VR), robotics, and autonomous driving, monocular VI-SLAM has received widespread attention owing to its low hardware cost and its freedom from any external environment setup. Over the past decade or so, it has made significant progress and spawned many excellent methods and systems. However, because of the complexity of real-world scenarios, different methods have their respective limitations. Although some work has reviewed and evaluated VI-SLAM methods, most of it focuses only on classic VI-SLAM methods and cannot fully reflect the latest state of VI-SLAM technology. By optimization type, VI-SLAM methods can be divided into filtering-based and optimization-based methods: filtering-based methods use filters to fuse observations from the visual and inertial sensors, continuously updating the device's state for localization and mapping, whereas optimization-based methods jointly re-estimate the states by solving a nonlinear least-squares problem. In addition, depending on whether visual data association (i.e., feature matching) is performed as an explicit, separate step, existing methods can be divided into indirect (feature-based) methods and direct methods, the latter operating directly on image intensities. Furthermore, with the development and widespread application of deep learning, researchers have begun to incorporate learned components into VI-SLAM to enhance robustness under extreme conditions or to perform dense reconstruction. This paper first elaborates the basic principles of monocular VI-SLAM and then categorizes and analyzes existing methods, covering filtering-based, optimization-based, feature-based, direct, and deep-learning-based approaches.
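As a concrete reference for these two formulations, a tightly coupled, optimization-based estimator in the style of VINS-Mono solves for the sliding-window states $\mathcal{X}$ (poses, velocities, IMU biases, and feature depths) by minimizing a marginalization prior plus inertial and visual residuals; the notation below is a representative sketch rather than the exact objective of any single system:

$$ \min_{\mathcal{X}} \left\{ \left\| \mathbf{r}_p - \mathbf{H}_p \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| \mathbf{r}_{\mathcal{B}}\!\left(\hat{\mathbf{z}}_{b_k b_{k+1}}, \mathcal{X}\right) \right\|_{\mathbf{P}_{b_k b_{k+1}}}^{2} + \sum_{(l,j) \in \mathcal{C}} \rho\!\left( \left\| \mathbf{r}_{\mathcal{C}}\!\left(\hat{\mathbf{z}}_l^{c_j}, \mathcal{X}\right) \right\|_{\mathbf{P}_l^{c_j}}^{2} \right) \right\} $$

Here $\mathbf{r}_{\mathcal{B}}$ denotes the IMU preintegration residuals between consecutive frames $b_k$ and $b_{k+1}$, $\mathbf{r}_{\mathcal{C}}$ the reprojection residuals of feature $l$ observed in camera frame $c_j$ under a robust kernel $\rho$, and the first term the prior retained from marginalized states. Filtering-based methods such as MSCKF maintain essentially the same state but update it recursively with an extended Kalman filter instead of re-solving this nonlinear least-squares problem over the whole window.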
However, most existing datasets and benchmarks primarily target applications such as autonomous driving and drones, and mainly evaluate pose accuracy; datasets specifically designed for AR are relatively scarce. To compare the advantages and disadvantages of different methods more comprehensively, we select three public datasets to quantitatively evaluate representative monocular VI-SLAM methods along multiple dimensions: the widely used EuRoC dataset, the ZJU-SenseTime dataset suited to mobile AR applications, and the LSFB dataset aimed at large-scale AR scenarios. Additionally, to broaden the variety of data types and evaluation dimensions, we supplement the ZJU-SenseTime dataset with a more challenging set of sequences, called sequences C, designed to evaluate the robustness of algorithms under extreme conditions such as pure rotation, planar motion, lighting changes, and dynamic scenes. Specifically, sequences C comprise eight sequences, labeled C0–C7. In C0, the handheld device moves around a room while performing multiple pure rotations. In C1, the device is mounted on a stabilized gimbal and moved freely. In C2, the device undergoes planar motion at a constant height. In C3, the lights are switched on and off during recording. In C4, the device looks down at the floor while moving. C5 captures an exterior wall with large parallax and minimal co-visibility. C6 views a monitor with only slight device movement while the on-screen content changes. C7 is a long-distance recording.

On the EuRoC dataset, both filtering-based and optimization-based VI-SLAM methods achieve good accuracy. MSCKF, an early filtering-based system, shows lower accuracy and fails to complete some sequences. Later methods such as OpenVINS and RNIN-VIO improve accuracy by adding new features and by introducing deep-learning-based components, respectively. OKVIS, an early optimization-based system, completes all sequences but with lower accuracy. Subsequent methods such as VINS-Mono, RD-VIO, and ORB-SLAM3 introduce substantial improvements in initialization, robustness, and overall accuracy. The direct and semi-direct methods DM-VIO and SVO-Pro, extended from DSO and SVO respectively, improve accuracy markedly through techniques such as delayed marginalization and efficient use of texture information. The deep-learning-based Adaptive VIO achieves high accuracy by continuously updating itself through online learning, demonstrating adaptability to new scenarios.

On the ZJU-SenseTime dataset, the relative standings of the methods are largely similar to those on EuRoC. The main difference is that the accuracy of the direct method DM-VIO drops significantly with a rolling-shutter camera, whereas the semi-direct method SVO-Pro fares slightly better. Feature-based methods show no significant drop in accuracy, but the smaller field of view (FoV) of phone cameras reduces the robustness of ORB-SLAM3, Kimera, and MSCKF: ORB-SLAM3 attains high tracking accuracy but lower completeness, while Kimera and MSCKF show increased tracking errors. HybVIO, RNIN-VIO, and RD-VIO achieve the highest accuracy, with HybVIO slightly outperforming the other two. The deep-learning-based Adaptive VIO suffers a significant drop in accuracy and struggles to complete sequences B and C, indicating generalization and robustness issues in complex scenarios.

On the LSFB dataset, the comparison results are consistent with those on the small-scale datasets: the most accurate methods in small scenes, such as RNIN-VIO, HybVIO, and RD-VIO, remain highly accurate in large scenes, and RNIN-VIO in particular shows an even larger accuracy advantage. In large-scale scenes, many feature points are distant and exhibit little parallax, so methods that rely heavily on visual constraints accumulate error more easily. The neural-inertial-network-based RNIN-VIO makes better use of IMU observations, reducing its dependence on visual data. VINS-Mono also shows clear advantages in large scenes: its sliding-window optimization admits small-parallax feature points early, which helps control error accumulation. In contrast, ORB-SLAM3, which relies on a local map, requires sufficient parallax before adding feature points to the local map; in distant environments this can leave the visual constraints too weak, causing error accumulation and even tracking loss.
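Accuracy in such evaluations is conventionally reported as absolute trajectory error (ATE) after aligning the estimated trajectory to ground truth. The sketch below assumes ATE RMSE with Umeyama similarity alignment, the metric commonly used on EuRoC-style benchmarks (scale alignment matters for monocular systems before inertial scale is recovered); all function and variable names are illustrative.

```python
import numpy as np

def umeyama_alignment(est, gt, with_scale=True):
    """est, gt: (N, 3) corresponding positions. Returns (s, R, t) minimizing
    sum_i || gt_i - (s * R @ est_i + t) ||^2 (Umeyama, 1991)."""
    mu_est, mu_gt = est.mean(axis=0), gt.mean(axis=0)
    x, y = est - mu_est, gt - mu_gt
    cov = y.T @ x / est.shape[0]                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # keep R a proper rotation
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_x = (x ** 2).sum() / est.shape[0]         # mean squared centered norm
    s = float(np.trace(np.diag(D) @ S) / var_x) if with_scale else 1.0
    t = mu_gt - s * R @ mu_est
    return s, R, t

def ate_rmse(est, gt):
    """Root-mean-square position error after similarity alignment."""
    s, R, t = umeyama_alignment(est, gt)
    residual = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))

if __name__ == "__main__":
    # Toy check: a rotated, scaled, shifted, noisy copy of the ground truth
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(200, 3))
    Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    est = 0.5 * gt @ Rz.T + np.array([1.0, -2.0, 0.5])
    print(ate_rmse(est + 0.01 * rng.normal(size=est.shape), gt))  # near noise level
```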
Overall, the experimental results show that optimization-based methods and methods combining filtering with optimization generally outperform purely filtering-based methods in tracking accuracy and robustness. Direct and semi-direct methods perform well with a global-shutter camera but are prone to error accumulation, especially in large scenes, when affected by rolling shutter and lighting changes. Combining deep learning can improve robustness in extreme situations. Finally, the development trends of SLAM are discussed and prospected around three research hotspots: combining deep learning with V-SLAM/VI-SLAM, multi-sensor fusion, and end-cloud collaboration.
Keywords