A multi-view stereo reconstruction network with multi-scale cost volume information sharing
Abstract
Objective Multi-view stereo reconstruction is an important part of 3D vision. Compared with traditional methods, deep-learning-based methods greatly reduce the time required for reconstruction and also improve its completeness. However, existing methods extract features of only moderate quality and exhibit weak correlation between cost volumes, so there is still room to improve the reconstruction results. To address these problems, this paper proposes a multi-view stereo reconstruction network with dual U-Net feature extraction and multi-scale cost volume information sharing. Method To obtain more complete and accurate feature information from the input images, a dual U-Net feature extraction module is designed, which outputs features at three scales arranged in a coarse-to-fine cascade. In the cost volume regularization stage, a pre-processing module for multi-scale cost volume information sharing is designed: the information inside the small-scale cost volume is separated and passed to the next-level cost volume for fusion, and depth maps are estimated from coarse to fine, which substantially improves reconstruction accuracy and completeness. Result In experiments on the DTU (Technical University of Denmark) dataset, the three main metrics of accuracy error, completeness error, and overall error improve by about 16.2%, 6.5%, and 11.5%, respectively, compared with CasMVSNet; the improvement over other deep-learning-based methods is even larger, and several secondary metrics also improve to varying degrees. Conclusion The proposed multi-view stereo reconstruction network with dual U-Net feature extraction and multi-scale cost volume information sharing is effective in both the feature extraction and cost volume regularization stages, and it improves reconstruction accuracy over the original model and other methods, verifying the effectiveness of the approach.
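As a concrete illustration of the information-sharing step described above, the following is a minimal PyTorch sketch in which information separated from a coarse (small-scale) cost volume is up-sampled and fused into the next finer cost volume before regularization. The module name, the use of 3D convolutions for separation and fusion, and the concatenation-based fusion are illustrative assumptions, not the paper's exact design.

    # Hypothetical sketch of multi-scale cost volume information sharing:
    # information separated from the coarse cost volume is up-sampled and
    # fused into the next finer cost volume before regularization.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CostVolumeShare(nn.Module):
        def __init__(self, c_coarse, c_fine):
            super().__init__()
            # 3D convolution that separates/compresses coarse-volume information
            self.separate = nn.Conv3d(c_coarse, c_fine, 3, padding=1)
            # 3D convolution that fuses the shared information into the fine volume
            self.fuse = nn.Conv3d(2 * c_fine, c_fine, 3, padding=1)

        def forward(self, coarse_vol, fine_vol):
            # coarse_vol: (B, C1, D1, H1, W1); fine_vol: (B, C2, D2, H2, W2)
            shared = self.separate(coarse_vol)
            # Up-sample along depth and spatial dimensions to the fine scale
            shared = F.interpolate(shared, size=fine_vol.shape[2:],
                                   mode="trilinear", align_corners=False)
            return self.fuse(torch.cat([shared, fine_vol], dim=1))

A module like this would sit between cost volume construction and 3D regularization at each level of the coarse-to-fine cascade, so that each finer depth estimate can draw on evidence already aggregated at the coarser scale.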
Keywords
3D reconstruction; deep learning; multi-view stereo network; dual U-Net network; feature extraction; cost volume; information sharing
Multi-view stereo reconstruction network based on multi-scale cost volume information sharing
Liu Wanjun, Wang Junkai, Qu Haicheng (School of Software, Liaoning Technical University, Huludao 125105, China)
Abstract
Objective Multi-view stereo (MVS) reconstruction aims to recover a 3D model of a scene from a set of images captured from multiple viewpoints with known camera parameters. It can reconstruct both small- and large-scale indoor and outdoor scenes, and it underpins the 3D reconstruction technology on which emerging virtual reality applications rely. Traditional MVS methods mainly use hand-crafted similarity metrics and regularization to compute dense correspondences across a scene, and they can be broadly divided into four categories: point-cloud-based, voxel-based, deformable-polygon-mesh-based, and depth-map-based algorithms. These methods achieve good results in ideal Lambertian scenes without weakly textured regions, but they often fail to produce satisfactory reconstructions under texture scarcity, texture repetition, or illumination changes. Recent deep learning techniques in computer vision have advanced the reconstruction pipeline: learning-based approaches can exploit global semantic information, such as priors on highlights and reflections, to obtain more robust matching, and deep learning has accordingly been applied on top of the traditional methods above. In general, learning-based MVS inherits the stereo geometry of stereo matching, handles the occlusion problem more effectively, and achieves clear gains in accuracy and generalization. However, existing methods still extract features of only moderate quality and exhibit poor correlation between cost volumes. We therefore propose a multi-view stereo network with dual U-Net feature extraction and multi-scale cost volume information sharing. Method Our improvements focus on feature extraction and on pre-processing before cost volume regularization. First, a dual U-Net module is designed for feature extraction. Each input image has a resolution of 512×640 pixels; through convolution and ReLU layers, the 3-channel image is expanded to 8 and then 32 channels, and max pooling followed by convolution produces feature maps at 1, 1/4, and 1/16 of the original image size. In the up-sampling stage, multi-scale feature maps are concatenated along the channel dimension to fuse richer features. Further convolution and up-sampling yield a 32-channel feature map at the original resolution, which is fed through the U-Net a second time to finally obtain three sets of feature maps at different scales. This dual U-Net feature extraction module preserves more detailed features through down-sampling (pooling layers reduce the spatial dimensions), up-sampling (restoring object detail and spatial dimensions), and skip connections (recovering target detail), which makes the subsequent depth estimation more accurate and complete. Second, the initially constructed cost volumes at different scales have no connection to one another and rely only on the up-sampling in the feature extraction module, so the information in each layer's cost volume cannot be transferred. We therefore design a multi-scale cost volume information sharing module in the pre-regularization stage, which separates the information from the cost volume generated at each layer and fuses it into the cost volume of the next layer; fusing the small-scale cost volume information into the next layer's cost volume improves the quality of the estimated depth map.
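The following is a minimal PyTorch sketch of a dual U-Net feature extractor along the lines described above. The abstract only fixes the 3 -> 8 -> 32 channel expansion, the 512×640 input, and the 1, 1/4, 1/16 output scales; the class names, kernel sizes, channel widths inside the U-Net, and the exact fusion points are illustrative assumptions.

    # Hypothetical sketch of the dual U-Net feature extraction module.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_relu(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                             nn.ReLU(inplace=True))

    class UNetBranch(nn.Module):
        """One U-Net pass: two 2x down-samplings, then skip-connected decoding."""
        def __init__(self, c_in, c_base=32):
            super().__init__()
            self.enc0 = conv_relu(c_in, c_base)
            self.enc1 = conv_relu(c_base, c_base)
            self.enc2 = conv_relu(c_base, c_base)
            self.dec1 = conv_relu(2 * c_base, c_base)
            self.dec0 = conv_relu(2 * c_base, c_base)

        def forward(self, x):
            f0 = self.enc0(x)                     # full resolution
            f1 = self.enc1(F.max_pool2d(f0, 2))   # 1/4 of the original area
            f2 = self.enc2(F.max_pool2d(f1, 2))   # 1/16 of the original area
            u1 = F.interpolate(f2, scale_factor=2.0,
                               mode="bilinear", align_corners=False)
            d1 = self.dec1(torch.cat([u1, f1], dim=1))  # skip connection at 1/4
            u0 = F.interpolate(d1, scale_factor=2.0,
                               mode="bilinear", align_corners=False)
            d0 = self.dec0(torch.cat([u0, f0], dim=1))  # skip connection at full size
            return d0, d1, f2                     # fine-to-coarse feature pyramid

    class DualUNetFeature(nn.Module):
        """Two chained U-Net passes; the second emits the three-scale features."""
        def __init__(self):
            super().__init__()
            self.stem = nn.Sequential(conv_relu(3, 8), conv_relu(8, 32))  # 3 -> 8 -> 32
            self.unet1 = UNetBranch(32)
            self.unet2 = UNetBranch(32)

        def forward(self, img):                      # img: (B, 3, 512, 640)
            full, _, _ = self.unet1(self.stem(img))  # first pass, full-res output
            return self.unet2(full)                  # second pass, three scales

For example, DualUNetFeature()(torch.randn(1, 3, 512, 640)) returns three feature maps whose spatial sizes form the coarse-to-fine cascade that the per-scale cost volumes are built from.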
Result The Technical University of Denmark (DTU) dataset used in this experiment is an indoor dataset captured and processed specifically for MVS, so the intrinsic and extrinsic camera parameters of each view are directly available. It contains 128 objects or scenes, with 79 training scenes, 18 validation scenes, and 22 test scenes used in the experiments. Training is conducted on Ubuntu 20.04 with an Intel Core i9-10920X CPU and an NVIDIA GeForce RTX 3090 GPU. Three main evaluation metrics are used: the mean distance from the reconstructed point cloud to the ground-truth point cloud, called accuracy (Acc); the mean distance from the ground-truth point cloud to the reconstructed point cloud, called completeness (Comp); and the mean of accuracy and completeness, called overall (Overall). Several secondary metrics are also reported, including the mean absolute depth error and the absolute error and accuracy at thresholds of 2 mm and 4 mm. The experimental results show that the three main metrics, Acc, Comp, and Overall, improve by about 16.2%, 6.5%, and 11.5%, respectively, compared with the original method. Conclusion Our reconstruction network is built as a multi-view stereo network with dual U-Net feature extraction and multi-scale cost volume information sharing. It brings clear gains in both the feature extraction and cost volume regularization stages, and the improved reconstruction accuracy demonstrates the method's practical potential.
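The three main DTU metrics can be computed with nearest-neighbour point-cloud distances, as in the minimal NumPy/SciPy sketch below. It assumes point clouds given as (N, 3) arrays; the official DTU evaluation additionally applies visibility masking and outlier filtering, which are omitted here.

    # Simplified sketch of the Acc / Comp / Overall metrics on point clouds.
    import numpy as np
    from scipy.spatial import cKDTree

    def mean_nn_distance(src, dst):
        """Mean distance from each point in src to its nearest neighbour in dst."""
        dist, _ = cKDTree(dst).query(src, k=1)
        return dist.mean()

    def dtu_metrics(recon, gt):
        acc = mean_nn_distance(recon, gt)     # accuracy: reconstruction -> ground truth
        comp = mean_nn_distance(gt, recon)    # completeness: ground truth -> reconstruction
        return acc, comp, (acc + comp) / 2.0  # overall: mean of the two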
Keywords
3D reconstruction; deep learning; multi-view stereo network; dual U-Net network; feature extraction; cost volume; information sharing