A large-scale indoor panoramic visual localization dataset
(1. SenseTime; 2. State Key Laboratory of CAD&CG, Zhejiang University) Abstract
Objective Visual localization is one of the key technologies in computer vision and is widely used in autonomous driving, mobile robotics, augmented reality, and other fields. Existing indoor visual localization datasets do not fully reflect the challenges encountered in real applications, such as repetitive textures, symmetric structures, and similar scenes, and they lack metrics that expose the problems visual localization faces in practice. To address these issues, this paper presents a large-scale indoor visual localization benchmark dataset built with a panoramic camera. Method We select four visual localization scenes that are representative of real applications and densely capture them with a panoramic camera in separate sessions, obtaining indoor panoramic data at different times. We design a panoramic mapping algorithm for large-scale scenes that reconstructs the captured data efficiently and accurately, together with a scale recovery algorithm based on architectural computer-aided design (CAD) drawings that restores the true scale of the reconstruction. The accuracy of the proposed large-scale indoor visual localization dataset is analyzed quantitatively and qualitatively through laser measurements and rendering comparisons. In addition, we design a new evaluation metric for visual localization algorithms, the registration rate versus mislocalization rate curve, and combine it with commonly used metrics to comprehensively evaluate and analyze current visual localization algorithms. Result The proposed large-scale indoor visual localization dataset covers a total area of more than 20,000 square meters. The evaluation shows that current state-of-the-art methods still have considerable room for improvement on the proposed dataset. The registration rate versus mislocalization rate curves reveal that current visual localization algorithms cannot effectively avoid mislocalization: under the constraint of a low mislocalization rate, the registration rate of the state-of-the-art algorithms is below 50% in several scenes. Conclusion The proposed indoor visual localization dataset and metric enable a more comprehensive evaluation of visual localization algorithms, help researchers compare and improve algorithms, and promote the development of visual localization in practical indoor scenarios. The dataset is available at https://github.com/zju3dv/PanoIndoor.
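The registration rate versus mislocalization rate curve is only summarized in the abstract; the sketch below gives one illustrative reading of such a metric, not the paper's reference implementation. It assumes each query yields a pose estimate with a confidence score (e.g., a RANSAC inlier count) and that a pose counts as correct when its error is within a tolerance; sweeping an acceptance threshold over the scores then trades registration rate against mislocalization rate. All names and threshold values are hypothetical.

```python
# Illustrative sketch of a registration rate vs. mislocalization rate curve.
# Assumption: each query has a confidence score and a ground-truth pose error.
import numpy as np

def registration_curve(scores, pose_errors, err_thresh=0.5):
    """scores: per-query confidence; pose_errors: per-query pose error (meters).
    Accepting queries in decreasing confidence order simulates sweeping an
    acceptance threshold from strict to loose."""
    order = np.argsort(-np.asarray(scores))        # accept highest-confidence first
    correct = np.asarray(pose_errors)[order] <= err_thresh
    n = len(scores)
    registered_rate = np.cumsum(correct) / n       # accepted AND correct / all queries
    mislocalized_rate = np.cumsum(~correct) / n    # accepted but wrong / all queries
    return mislocalized_rate, registered_rate

# Example: best registration rate while keeping mislocalization at (near) zero.
scores = [120, 95, 80, 40, 30, 10]
errors = [0.1, 0.2, 3.0, 0.3, 5.0, 0.4]
mis, reg = registration_curve(scores, errors)
mask = mis <= 0.01
print(reg[mask].max() if mask.any() else 0.0)
```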
Keywords
An indoor visual localization benchmark dataset for large-scale and complex scenes
Yu Hailin, Liu Jiarun1,2, Ye Zhichao3, Chen Xinyu1,2, Zhan Ruohao1,2, ShenTu Yichun3, Lu Zhongyun3, Zhang Guofeng1,2 (1, 2. State Key Laboratory of CAD&CG, Zhejiang University; 3. SenseTime) Abstract
Objective Visual localization aims to solve the 6-degree-of-freedom (DoF) pose of a query image based on an offline reconstructed map. It plays an important role in various fields, including autonomous driving, mobile robotics, and augmented reality. In the past decade or so, researchers have made significant progress in the accuracy and efficiency of visual localization. However, due to the complexity of real-world scenes, many challenges remain in practical visual localization tasks, especially illumination changes, seasonal changes, repetitive textures, symmetric structures, and similar scenes. To address illumination and seasonal changes under long-term conditions, a series of datasets and accompanying evaluation benchmarks have been proposed to help researchers compare and improve visual localization algorithms. With the rapid development of deep learning, learned image features have become far more robust than traditional hand-crafted ones. Researchers have proposed many keypoint detection, feature description, and feature matching methods based on deep neural networks. Some of these features and matching methods have already been used in visual localization and show very promising results in long-term visual localization tasks. However, there is still a lack of visual localization datasets specifically aimed at large-scale and complex indoor scenes. Existing indoor visual localization datasets have limited scene sizes or exhibit relatively mild challenges: the area of most scenes in these datasets ranges from a few square meters to several thousand square meters, and many larger and more complex indoor scenes often encountered in daily life, such as underground parking lots, dining halls, and office buildings, are not included. These scenes often exhibit more severe repetitive textures and symmetric structures. Therefore, to promote research on visual localization in large-scale and complex indoor scenes, we propose a new large-scale indoor visual localization dataset covering multiple scenarios that feature repetitive textures, symmetric structures, and similar scenes. Method In this study, we selected four commonly encountered indoor scenes in everyday life: an underground parking lot, a dining hall, a teaching building, and an office building. We used an Insta360 OneR panoramic camera to densely capture these scenes. The size of the collected scenes ranges from 1,500 square meters to 9,000 square meters. To achieve accurate reconstruction of these scenes, we propose a panorama-based Structure-from-Motion (SfM) system. This system leverages the wide field of view offered by panoramas to address the challenge of building large-scale SfM models of indoor scenes. Unlike existing methods that rely on complex and costly 3D scanning equipment or extensive manual annotation, the proposed method can accurately reconstruct challenging large-scale indoor scenes using only panoramic cameras. To restore the true scale of the reconstruction, we adopt an interactive approach that aligns the dense reconstruction to architectural computer-aided design (CAD) drawings. We quantitatively and qualitatively analyze the accuracy of the proposed large-scale indoor visual localization dataset through laser measurements and rendering comparisons.
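For context on the localization task itself, the final pose-solving step that the query side of a structure-based system typically relies on can be sketched as follows. This is a generic PnP + RANSAC sketch using OpenCV, not the system proposed in the paper; the 2D-3D correspondences, the intrinsics, and all parameter values are assumed inputs.

```python
# Minimal sketch of the standard final step of structure-based visual
# localization: recover a query's 6-DoF pose from 2D-3D matches via
# PnP + RANSAC. (pts2d, pts3d) are assumed to come from feature matching
# between the query image and the SfM map; K is the camera matrix.
import cv2
import numpy as np

def solve_pose(pts2d, pts3d, K):
    """pts2d: Nx2 keypoints in the query image; pts3d: Nx3 map points."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d, np.float64),
        np.asarray(pts2d, np.float64),
        K, None,                        # assume undistorted images
        reprojectionError=4.0,          # pixel tolerance for inliers
        iterationsCount=1000,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 matrix
    # World-to-camera pose; the inlier count often serves as a confidence score.
    return R, tvec, 0 if inliers is None else len(inliers)
```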
Furthermore, to create a suitable database or reference model for evaluating current state-of-the-art visual localization methods, we converted the reconstructed panoramic images into perspective images. Each panorama was divided into 6 perspective images taken at 60-degree intervals along the yaw direction, each with a field of view of 60 degrees. Result The scale error of all scene reconstructions in the proposed indoor visual localization dataset is within 1%. On the four proposed indoor scenes, we evaluated multiple state-of-the-art methods, including four advanced image retrieval methods and eight visual localization algorithms. Through rigorous quantitative and qualitative analysis, we show that significant room for improvement remains for current state-of-the-art methods in large-scale and complex indoor scenarios. For instance, we observed that the state-of-the-art feature matching methods SuperGlue and LoFTR underperformed even the basic nearest neighbor (NN) matching approach on certain scenes of the proposed dataset. In addition, both PixLoc, based on end-to-end training, and ACE, based on scene coordinate regression, showed significant performance degradation on the proposed dataset. Furthermore, we design a new visual localization evaluation metric, the registration rate versus mislocalization rate curve, which effectively reflects the pressing problems that arise when current methods are applied in practice, providing a new benchmark for visual localization methods from the perspective of practical applications. The new benchmark strongly suggests the need for new acceptance criteria in these scenarios to ensure more reliable and accurate localization decisions. Conclusion The proposed large-scale and complex indoor visual localization dataset exhibits distinct characteristics compared with existing indoor datasets. On the one hand, it poses greater challenges than existing indoor datasets in terms of repetitive textures, symmetric structures, and similar scenes. On the other hand, it contains a wider range of scenarios and thus supports a more comprehensive evaluation of visual localization algorithms. In addition, the benchmarks provided in this paper can be used by researchers to compare and improve visual localization algorithms, helping to promote the development of visual localization in practical indoor scenarios. The dataset is available at https://github.com/zju3dv/PanoIndoor.
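The panorama-to-perspective conversion described above (6 views at 60-degree yaw steps, each with a 60-degree field of view) can be sketched as follows, assuming equirectangular panoramas as input; the output resolution and interpolation choices are assumptions, since the source does not specify them.

```python
# Sketch: slice an equirectangular panorama into 6 perspective views taken
# at 60-degree yaw intervals, each with a 60-degree horizontal field of view.
import cv2
import numpy as np

def pano_to_views(pano, size=800, fov_deg=60.0):
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)    # pinhole focal length
    u, v = np.meshgrid(np.arange(size), np.arange(size))
    # Rays in the virtual camera frame (z forward, x right, y down).
    rays = np.stack([(u - size / 2) / f, (v - size / 2) / f,
                     np.ones((size, size))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    views = []
    for k in range(6):                                   # 6 yaw steps of 60 degrees
        yaw = np.radians(60.0 * k)
        R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                      [0, 1, 0],
                      [-np.sin(yaw), 0, np.cos(yaw)]])
        d = rays @ R.T                                   # rotate rays into pano frame
        lon = np.arctan2(d[..., 0], d[..., 2])           # longitude in [-pi, pi]
        lat = np.arcsin(np.clip(d[..., 1], -1, 1))       # latitude in [-pi/2, pi/2]
        map_x = ((lon / np.pi + 1) / 2 * W).astype(np.float32)
        map_y = ((lat / np.pi + 0.5) * H).astype(np.float32)
        views.append(cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR))
    return views
```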
Keywords