Multi-scale point cloud completion with an embedded Transformer structure
Abstract
Objective Most current deep learning algorithms for point cloud completion adopt an autoencoder structure. However, the multilayer perceptron (MLP) networks commonly used at the encoder tend to focus only on the overall shape of the point cloud and struggle to extract the detailed features of an object effectively, which degrades the completion of missing structures. An accurate local feature extraction algorithm is therefore needed for the point cloud completion task. Method To address this problem, this paper proposes a multi-scale point cloud completion algorithm with embedded attention modules. The network adopts an encoder-decoder structure: the feature embedding layer and Transformer layer at the encoder extract and fuse feature information from the incomplete point cloud at three different resolutions, which is then fed into a fully connected decoder that outputs the missing point cloud stage by stage. Finally, an attention-based discriminator is added at the decoder, drawing on the idea of generative adversarial networks (GAN) to optimize the completion performance. Result Using the Chamfer distance (CD) as the evaluation metric, the proposed algorithm was compared experimentally with four related methods on two datasets. On the ShapeNet dataset, the category-averaged CD of our algorithm is 3.73% lower than that of the second-best model, PF-Net (point fractal network); on the ModelNet10 dataset, it is 12.75% lower than that of PF-Net. Visualized completion results of the different algorithms show that the proposed algorithm completes detailed structures more precisely and generalizes strongly to unusual samples within a category. Conclusion The proposed multi-scale point cloud completion algorithm based on the Transformer structure better extracts local feature information of the incomplete point cloud, yielding more accurate completion results.
Keywords
three-dimensional point cloud; point cloud completion; autoencoder; attention mechanism; generative adversarial networks (GAN)
Multi-scale Transformer based point cloud completion network
Liu Xinpu1, Ma Yanxin2, Xu Ke1, Wan Jianwei1, Guo Yulan1
(1. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410005, China; 2. College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410005, China)
Abstract
Objective Three-dimensional vision analysis is a key topic in computer vision research. The point cloud representation preserves the original geometric information in 3D space without discretization. Unfortunately, scanned 3D point clouds are often incomplete due to occlusion, limited sensor resolution, and narrow viewing angles. Hence, a shape completion process is required for downstream 3D computer vision applications. Most deep-learning-based point cloud completion algorithms adopt an encoder-decoder structure and use a multilayer perceptron (MLP) to extract point cloud features at the encoder. However, MLP networks tend to focus on the overall shape of the point cloud and struggle to extract the local structural features of an object effectively. In addition, MLPs do not generalize well to new objects, so it is difficult to complete the shapes of objects with few training samples. Designing an efficient and accurate local structural feature extraction algorithm for point cloud completion is therefore a challenging problem. Method A multi-scale Transformer-based point cloud completion network (MSTCN) is proposed. The entire network adopts an encoder-decoder structure composed of a multi-scale feature extractor, a pyramid point generator, and a Transformer-based discriminator. The encoder of MSTCN extracts and aggregates feature information from incomplete point clouds at three different resolutions through the Transformer module, feeds it into a decoder built from fully connected networks, and then gradually generates the missing point cloud as output. A feature embedding layer (FEL) and an attention layer are embedded in the encoder. The former improves the encoder's ability to extract local structural features of the point cloud via sampling and neighborhood grouping; the latter captures the correlations among points using an improved self-attention module.
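The sampling step of the feature embedding layer described above relies on farthest point sampling to produce the multi-resolution inputs. The following is a minimal NumPy sketch of that standard algorithm, not the paper's implementation; the function name and array layout are our own assumptions for illustration:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedily select n_samples indices from an (N, 3) point set.

    Each iteration picks the point farthest from the set already chosen,
    so the subset covers the shape as evenly as possible.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)  # distance to nearest selected point so far
    selected[0] = 0                # arbitrary seed point
    for i in range(1, n_samples):
        # squared distance from every point to the most recently selected point
        d = np.sum((points - points[selected[i - 1]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        selected[i] = int(np.argmax(min_dist))
    return selected
```

Applying this repeatedly with decreasing `n_samples` yields the three resolutions the encoder operates on; neighborhood grouping then gathers the k nearest original points around each sampled point.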
As for the decoder, the pyramid point generator mainly consists of fully connected layers and reshape operations. Overall, the network operates in parallel on point clouds at three different resolutions, which are generated by farthest point sampling. Accordingly, point cloud completion is divided into three stages to achieve coarse-to-fine processing in the pyramid point generator. Following the idea of generative adversarial networks (GAN), MSTCN adds a Transformer-based discriminator at the back end of the decoder, so that the discriminator and the generator promote each other during joint training and the completion performance of the network is optimized. The loss function of MSTCN consists of two parts: a generation loss and an adversarial loss. The generation loss is the weighted sum of the Chamfer distances (CD) between the generated point clouds and their ground truths at the three scales, and the adversarial loss is the sum of the cross-entropy losses of the generated point cloud and its ground truth as judged by the Transformer-based discriminator. Result MSTCN was compared with recent methods on the ShapeNet and ModelNet10 datasets. On the ShapeNet dataset, all 16 categories were used for training; the category-averaged CD value of MSTCN was 3.73% lower than that of the second-best model. Specifically, the CD values for cap, car, chair, earphone, lamp, pistol, and table are better than those of the point fractal network (PF-Net). On the ModelNet10 dataset, the category-averaged CD value of MSTCN was 12.75% lower than that of the second-best model. Specifically, the CD values for bathtub, chair, desk, dresser, monitor, night-stand, sofa, table, and toilet are better than those of PF-Net. According to the visualization results for six categories (airplane, cap, chair, earphone, motorbike, and table), MSTCN can accurately complete special structures and generalize to unusual samples within a category.
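The Chamfer distance that both the generation loss and the evaluation metric are built on can be sketched as follows. This is an illustrative NumPy version of the standard symmetric formulation; the paper's exact weighting and any squared-versus-unsquared convention are not specified here:

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    For each point, find the squared distance to its nearest neighbor in
    the other set; average within each direction and sum both directions.
    """
    # pairwise squared distances, shape (N, M)
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

In a multi-scale loss of the kind described above, this term would be evaluated at each of the three resolutions and combined as a weighted sum.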
Ablation studies were also conducted on the ShapeNet dataset. The full MSTCN network performs better than three ablated variants: MSTCN without the feature embedding layer, without the attention layer, and without the discriminator, respectively. This shows that the feature embedding layer makes the model more capable of extracting local structural information from point clouds, that the attention layer lets the model selectively attend to the local structure of the input point cloud during completion, and that the discriminator further improves the completion quality. Meanwhile, three groups of completion sub-models for different missing ratios were trained on the ShapeNet dataset to verify the robustness of MSTCN to input point clouds with different missing ratios; the chair category was chosen and its completion results were visualized. MSTCN maintains a good completion effect even as the number of input points gradually decreases: the completion results at 25% and 50% missing ratios have similar CD values, and even when the missing ratio reaches 75%, the CD value for the chair category remains at a low level of 2.074/2.456. The entire chair shape can be identified and completed from the incomplete chair legs alone. This verifies that MSTCN is robust to input point clouds with different missing ratios. Conclusion A multi-scale Transformer-based point cloud completion network (MSTCN) has been presented. MSTCN better extracts local feature information from the incomplete point cloud, making the point cloud completion results more accurate. Current point cloud completion algorithms have achieved good results in completing single objects.
Future research can focus on the completion of large-scale scenes, because incomplete point clouds in scenes exhibit a variety of missing patterns, such as view missing, spherical missing, and occlusion missing; completing large scenes is both more challenging and more practical. On the other hand, point clouds of real scanned scenes have no ground-truth point clouds for reference, so unsupervised completion algorithms may be preferable to supervised ones.
Keywords
three-dimensional point cloud; point cloud completion; autoencoder; attention mechanism; generative adversarial networks (GAN)