Co-segmentation of 3D shape clusters using an implicit decoder
Abstract
Objective To establish correspondences between semantically related parts of 3D shapes and to achieve automatic shape segmentation, an unsupervised 3D shape cluster co-segmentation network based on an implicit decoder (IM-decoder) is proposed. Method First, the 3D point cloud shape is voxelized, and a CNN encoder (convolutional neural network encoder) extracts features of the voxelized point cloud shape and maps the shape information into the feature space. An attention module then aggregates the features of adjacent points of the 3D shape; the aggregated features, together with the 3D point coordinates, are fed into the IM-decoder to enhance the spatial awareness of the model, which outputs the inside/outside state of each sampled point relative to the shape parts. Finally, max pooling aggregates the implicit fields generated by the decoder to obtain the co-segmentation result. Result Experiments show that the proposed algorithm achieves an mIoU (mean intersection-over-union) of 62.1% on the ShapeNet Part dataset, which is 22.5% and 18.9% higher than the two currently known types of unsupervised 3D point cloud shape segmentation methods, a substantial improvement in segmentation performance. Compared with two supervised methods, its mIoU is 19.3% and 20.2% lower, respectively, although it achieves better segmentation on shapes with fewer parts. Using the mean squared error rather than the cross-entropy function as the reconstruction loss yields higher segmentation accuracy, raising mIoU by 26.3%. Conclusion Compared with current mainstream unsupervised segmentation algorithms, the proposed unsupervised 3D shape cluster co-segmentation method based on the implicit decoder achieves higher segmentation accuracy.
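For reference, the mIoU figures quoted above can be computed as in the following Python sketch. The averaging convention on ShapeNet Part differs between papers (per part, per shape, or per category), so the per-shape, part-averaged protocol and the function names here are assumptions for illustration only.

import numpy as np

def shape_iou(pred_labels, gt_labels, num_parts):
    """Per-shape IoU averaged over parts (assumed protocol, not necessarily the paper's)."""
    ious = []
    for part in range(num_parts):
        pred_mask = pred_labels == part
        gt_mask = gt_labels == part
        union = np.logical_or(pred_mask, gt_mask).sum()
        if union == 0:
            # Part absent from both prediction and ground truth: count as perfect.
            ious.append(1.0)
        else:
            inter = np.logical_and(pred_mask, gt_mask).sum()
            ious.append(inter / union)
    return float(np.mean(ious))

def dataset_miou(all_preds, all_gts, num_parts):
    """mIoU: average of the per-shape IoU over the whole shape set."""
    return float(np.mean([shape_iou(p, g, num_parts) for p, g in zip(all_preds, all_gts)]))

# Toy example: 6 points of a 2-part shape.
pred = np.array([0, 0, 1, 1, 1, 0])
gt   = np.array([0, 0, 1, 1, 0, 0])
print(dataset_miou([pred], [gt], num_parts=2))  # ≈ 0.708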
Keywords
Co-segmentation of 3D shape clusters based on implicit decoder
Yang Jun1, Zhang Minmin2
(1. Faculty of Geomatics, Lanzhou Jiaotong University, Lanzhou 730070, China; 2. School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
Abstract
Objective 3D shape segmentation is an important task, without which many 3D data processing applications cannot accomplish their work. It has become a hot research topic in areas such as digital geometry processing and modeling, and it plays a crucial role in tasks such as 3D printing, 3D shape retrieval, and medical organ segmentation. Recent years have witnessed the continuous development of 3D data acquisition equipment such as laser scanners, RGB-D cameras, and stereo cameras, which has led to the wide use of 3D point cloud data in 3D shape segmentation tasks. According to how the 3D point cloud shape is represented, researchers divide deep-learning-based 3D point cloud segmentation methods into three categories: 1) volumetric-based methods, 2) view-based methods, and 3) point-based methods. Volumetric-based methods first take voxels in 3D space as the definition domain for 3D convolution, extend the convolutional neural network (CNN) to 3D space for feature learning, and finally aggregate the acquired features to segment the point cloud shape. View-based methods project the input 3D shape into multiple 2D image views, feed the stack of images into a 2D CNN to extract the shape features, and then refine the segmentation results by further processing these features through a view pooling layer and the CNN. Because the points of an input cloud are disordered and irregularly dispersed, point-based methods design a dedicated input layer so that the 3D point cloud can be fed directly into the network for training, which improves the segmentation performance on 3D point cloud shapes. However, because typical point cloud data lack topology and surface information and labeling large data sets is difficult, such networks cannot achieve efficient co-segmentation of shape clusters through component reconstruction. Considering that humans recognize objects by their parts, and that in view-based methods occlusion, illumination, and projection angle make the segmentation unstable, voxelization of the point cloud data is adopted in this paper. Moreover, most existing deep learning methods for 3D shape segmentation rely on a supervisory mechanism and cannot effectively exploit the latent connections between shapes, which makes automatic 3D shape segmentation difficult to realize. Therefore, an unsupervised 3D shape cluster co-segmentation network based on the implicit decoder (IM-decoder) is proposed in this paper to establish the correspondence between semantically related components and to segment 3D shapes automatically.

Method The unsupervised 3D shape cluster co-segmentation method based on the implicit decoder mainly consists of three operations: encoding, feature aggregation, and decoding. The first task of the encoding stage is to extract the features of the input 3D shape accurately. The encoder network designed in this paper is based on a conventional CNN, which can only process regular 3D data, so voxelization is first carried out on all the points representing the shape in 3D point cloud form. The Hierarchical Surface Prediction method is then used to improve the quality of the reconstructed 3D shape.
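For illustration, the voxelization step just described can be realized as a simple binary occupancy grid; the sketch below is a minimal version under assumed settings (a 64³ grid and unit-cube normalization) and does not reproduce the Hierarchical Surface Prediction refinement.

import numpy as np

def voxelize(points, resolution=64):
    """Convert an (N, 3) point cloud into a binary occupancy grid of shape
    (resolution, resolution, resolution), after normalizing it into the unit cube."""
    pts = np.asarray(points, dtype=np.float32)
    mins = pts.min(axis=0)
    extent = (pts.max(axis=0) - mins).max() + 1e-8
    pts = (pts - mins) / extent                      # normalize to [0, 1)^3
    idx = np.clip((pts * resolution).astype(np.int64), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0      # mark occupied voxels
    return grid

# Example: voxelize a random point cloud before feeding it to the CNN encoder.
cloud = np.random.rand(2048, 3)
occupancy = voxelize(cloud, resolution=64)
print(occupancy.shape, occupancy.sum())              # (64, 64, 64), number of occupied voxels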
Finally, the features of the voxelized points are extracted by the CNN encoder, and the shape information is mapped into the feature space. The feature aggregation operation further improves the quality of the extracted features through an attention module, which aggregates the features of adjacent points in the 3D shape. During the decoding stage, the aggregated features and the 3D coordinates of the points are fed into the IM-decoder to enhance the spatial perception of the shape, and the decoder outputs the inside/outside state of each sampled point relative to the shape components. The final co-segmentation is obtained by a max pooling operation that aggregates the implicit fields generated by the decoder.

Result Ablation and comparative experiments are conducted on the ShapeNet Part dataset with intersection over union (IoU) and mean intersection over union (mIoU) as evaluation criteria. The experiments show that the proposed algorithm reaches an mIoU of 62.1% on the ShapeNet Part dataset. Compared with the two currently known types of unsupervised 3D point cloud shape segmentation methods, its mIoU is 22.5% and 18.9% higher, respectively, a substantial improvement in segmentation performance. Compared with two supervised methods, the mIoU of the proposed algorithm is 19.3% and 20.2% lower, respectively, but the method achieves a better segmentation effect on shapes with fewer parts. Moreover, using the mean squared error instead of the cross-entropy function as the reconstruction loss yields higher segmentation accuracy, improving mIoU by 26.3%. The ablation experiment shows that the attention module designed in this paper improves the segmentation accuracy of the network by automatically selecting important features for each shape type.

Conclusion The experimental results show that the 3D shape cluster co-segmentation method based on the implicit decoder achieves high segmentation accuracy. On the one hand, the method uses a CNN encoder to extract the features of the 3D shape and designs an attention module that automatically selects important features, which further improves feature quality. On the other hand, the implicit decoder constructed by our method performs collaborative analysis on the joint feature vector composed of the selected features and the 3D coordinates of the points, and the implicit field obtained by fine-tuning the reconstruction loss function effectively improves segmentation accuracy.
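To make the decoding and max pooling steps concrete, the following PyTorch sketch shows a branched implicit decoder in the spirit described above: it maps a shape feature and sampled 3D coordinates to per-branch inside/outside values, takes the maximum over branches to obtain both the reconstructed occupancy and the part label of each query point, and is trained with a mean squared error reconstruction loss, mirroring the choice reported in the results. The class name, layer widths, and branch count are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class BranchedImplicitDecoder(nn.Module):
    """Maps a (shape feature, 3D point) pair to K per-branch inside/outside values.
    Each branch tends to specialize to one part; max pooling over the branches yields
    the reconstructed occupancy and the part label of the query point."""
    def __init__(self, feat_dim=256, num_branches=8, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, num_branches), nn.Sigmoid(),  # per-branch occupancy in [0, 1]
        )

    def forward(self, feat, xyz):
        # feat: (B, feat_dim) shape code from the encoder; xyz: (B, P, 3) query points.
        feat = feat.unsqueeze(1).expand(-1, xyz.size(1), -1)
        branch_fields = self.mlp(torch.cat([feat, xyz], dim=-1))   # (B, P, K) implicit fields
        occupancy, part_label = branch_fields.max(dim=-1)          # max pooling over branches
        return branch_fields, occupancy, part_label

decoder = BranchedImplicitDecoder()
feat = torch.randn(4, 256)           # encoder output for 4 shapes
xyz = torch.rand(4, 1024, 3) - 0.5   # sampled query points
fields, occ, labels = decoder(feat, xyz)
# Reconstruction loss: mean squared error between pooled occupancy and ground truth.
gt_occ = (xyz.norm(dim=-1) < 0.4).float()   # toy ground-truth occupancy
loss = nn.functional.mse_loss(occ, gt_occ)
print(fields.shape, labels.shape, loss.item())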
Keywords