Published: 2024-11-01
GDC2024 Consistency Constraint Guided Network for Zero-shot 3D Model Classification

Yan Hao, Bai Jing, Zheng Hu (North Minzu University)

Abstract
Objective: Since its introduction, the zero-shot 3D model classification task has faced a shortage of large-scale datasets and high-quality semantic information. To address this, existing methods introduce large-scale pre-trained models from the 2D image domain, which carry rich datasets and semantic information; built on contrastive language-image pre-training, these methods achieve reasonable zero-shot classification results. However, they capture 3D information incompletely and cannot fully exploit knowledge from the 3D domain. To address this problem, this paper proposes a consistency constraint guided network for zero-shot 3D model classification. Method: On the one hand, while preserving all 2D knowledge from the pre-trained network, the network learns features of the 3D data through view consistency, supplementing the 2D view features with 3D information at the view level. On the other hand, a mask consistency constraint guides the network to strengthen its holistic learning of 3D models through self-supervision, improving generalization. In addition, a non-mutual-exclusion loss guided by a homogeneity consistency constraint is proposed to ensure a correct learning direction and generalizable capability when training on small-scale datasets. Result: In zero-shot classification on the ZS3D (zero-shot for 3D) dataset, ModelNet10, and Shrec2015 (shape retrieval 2015), the method achieves accuracies of 70.1%, 57.8%, and 12.2%, respectively, improvements of 9.2%, 22.8%, and 2.3% over the current best methods. On the three ScanObjectNN subsets OBJ_ONLY (object only), OBJ_BG (object and background), and PB_T50_RS (object augmented rot scale), it also attains competitive accuracies of 32.4%, 28.9%, and 19.3%, respectively. Conclusion: Compared with methods that rely entirely on the capability of pre-trained models, the proposed method fully exploits the language-image pre-trained network while introducing knowledge from the 3D model domain and improving the network's generalization, yielding more accurate zero-shot classification results.
Keywords
GDC2024 Consistency Constraint Guided Network for Zero-shot 3D Model Classification

Yan Hao, Bai Jing, Zheng Hu (North Minzu University)

Abstract
Objective: Deep learning has achieved impressive results in 3D model classification. However, most existing classification methods rely on supervised learning, limiting their recognition ability to categories seen during training. With advances in computer-aided design and LiDAR sensor technologies, an increasing number of novel 3D model classes are being encountered. This presents a new challenge: how to effectively identify model classes that were not encountered during training. Zero-shot learning has been proposed to address this challenge, but it faces a significant limitation: the shortage of large-scale datasets with high-quality semantic information. To overcome this issue, existing methods introduce large-scale pre-trained models from the 2D image domain that carry rich data and semantic information, such as the Contrastive Language-Image Pre-Training (CLIP) network. While these methods project 3D models into 2D space to meet the input requirement of the CLIP visual encoder and achieve reasonable results, they do not fully capture the 3D information in the datasets and fail to leverage knowledge from the 3D domain. A straightforward way to overcome this limitation is to adopt the learning strategy of multi-view convolutional neural networks (MVCNN): fine-tune the CLIP visual encoder and optimize its parameters on a 3D model dataset, aiming to combine the advantages of 2D data annotation with the inherent characteristics of 3D models. However, this strategy is not effective for CLIP: the fine-tuned network tends to overfit the training set, and a large amount of 2D knowledge is gradually forgotten during tuning. To address these problems, this paper proposes a Consistency Constraint Guided Network (CCG-Net) for zero-shot 3D model classification.

Method: CCG-Net aims to leverage the strengths of both the 2D and 3D domains while mitigating overfitting and knowledge forgetting. CCG-Net consists of a fixed part and a dynamic part. The fixed part employs a frozen CLIP model to exploit cross-modal information learned from large-scale 2D visual and semantic data; stopping backpropagation through this part forces the network to focus on preserving 2D information. The dynamic part is a learnable encoder that extracts global 3D model features and emphasizes the acquisition of 3D knowledge; a view consistency constraint is applied in the dynamic part to guide the extraction of 3D features. This design ensures that the 2D knowledge of the pre-trained model is fully preserved while new information is learned from the 3D data. The information from the two modalities is then fused into 3D model features, which are subsequently used for classification. To enhance feature extraction for 3D data and improve the robustness of the 3D encoding process, a mask consistency constraint is proposed. This constraint guides the network to strengthen its learning of the 3D model through self-supervision: different masking schemes are applied to obtain a diverse set of masked features, and the consistency between these features is then constrained. By keeping the masked features consistent, the network better learns and integrates the essential characteristics of the masked data, ultimately enhancing the robustness and accuracy of the model.
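To make the fixed/dynamic split and the mask consistency idea concrete, the following PyTorch sketch is a minimal illustration, not the authors' implementation: a plain linear layer stands in for the frozen CLIP visual encoder, and CCGNetSketch, mask_consistency_loss, and zero_shot_classify are hypothetical names with illustrative dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CCGNetSketch(nn.Module):
    """Fixed part (frozen 2D view encoder) plus dynamic part (learnable
    cross-view aggregation). The linear view encoder is a stand-in for the
    frozen CLIP visual encoder; module choices and sizes are assumptions."""

    def __init__(self, feat_dim=512, num_heads=4):
        super().__init__()
        # Fixed part: stand-in for the frozen CLIP visual encoder.
        self.view_encoder = nn.Linear(3 * 224 * 224, feat_dim)
        for p in self.view_encoder.parameters():
            p.requires_grad = False  # stop backpropagation: preserve 2D knowledge
        # Dynamic part: learnable aggregation into a global 3D feature.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, views):
        # views: (B, V, 3, 224, 224) rendered views of each 3D model
        with torch.no_grad():                          # fixed part, no gradients
            f2d = self.view_encoder(views.flatten(2))  # (B, V, D) view features
        f3d, _ = self.attn(f2d, f2d, f2d)              # dynamic part: fuse views
        f3d = self.proj(f3d.mean(dim=1))               # (B, D) global 3D feature
        return F.normalize(f3d, dim=-1)


def mask_consistency_loss(model, views, mask_ratio=0.5):
    """Self-supervised mask consistency: two random subsets (maskings) of the
    same model's views should yield consistent global features."""
    V = views.shape[1]
    keep = max(1, int(V * (1 - mask_ratio)))
    z1 = model(views[:, torch.randperm(V)[:keep]])
    z2 = model(views[:, torch.randperm(V)[:keep]])
    return (1 - F.cosine_similarity(z1, z2, dim=-1)).mean()


def zero_shot_classify(model, views, text_embeds):
    """Zero-shot prediction: cosine similarity between the fused 3D feature
    and normalized CLIP text embeddings of the unseen category names."""
    return (model(views) @ text_embeds.t()).argmax(dim=-1)
```

At inference time, zero-shot prediction reduces to a cosine-similarity lookup between the fused 3D feature and the text embeddings of the unseen category names, which is why both feature vectors are L2-normalized.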
Additionally, the pre-trained network employs a mutual exclusion loss, which assumes a mutually exclusive relationship among the labels to be classified. This assumption is unsuitable for a zero-shot task tuned on a small-scale dataset. To address this issue, a non-mutual-exclusion loss guided by a homogeneity consistency constraint is proposed (see the sketch below); it ensures a correct learning direction and generalizable learning when training on a small-scale dataset.

Result: The three consistency constraint schemes work together within the network to optimize its parameters while avoiding overfitting during fine-tuning on 3D data. This approach enhances the reliability and generalization of feature extraction, ultimately leading to improved zero-shot classification performance. Quantitatively, on the ZS3D dataset, our method achieves 70.1% classification accuracy, a 9.2% improvement over the current best result of DFG-ZS3D (Discriminative Feature-Guided Zero-Shot Learning of 3D Model Classification). It also demonstrates improvements on the dataset proposed by Cheraghian, achieving classification accuracies of 57.8%, 19.9%, and 12.2% on the ModelNet10, McGill, and Shrec 2015 subsets, respectively, corresponding to improvements of 22.8%, 3.3%, and 2.3% over the state-of-the-art methods. The ScanObjectNN dataset, composed of 3D models obtained from real-world scans rather than synthetic data, further tests the effectiveness of CCG-Net. On this dataset, CCG-Net attains the highest performance across its three subsets, with classification accuracies of 32.4%, 28.9%, and 19.3% on the OBJ_ONLY (object only), OBJ_BG (object and background), and PB_T50_RS (object augmented rot scale) subsets, respectively. The performance improvement on real-world data further validates the generalization capability of the proposed method. Ablation experiments confirm the effectiveness of the three consistency constraints. Finally, a qualitative analysis of the confusion matrix shows that the network avoids overfitting to a certain extent, while also revealing shortcomings in its ability to extract discriminative features, which points to a direction for future research.

Conclusion: Compared with methods that rely solely on pre-trained models, the proposed approach leverages the strengths of the language-image pre-trained network while incorporating knowledge from the 3D model domain through the view consistency constraint. By designing self-supervised enhancement under the mask consistency constraint and refining the loss function with the homogeneity consistency constraint, the method improves the network's robustness and generalization. As a result, it achieves more accurate zero-shot 3D model classification.
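The non-mutual-exclusion loss mentioned in the Method section can be pictured as replacing softmax cross-entropy, which forces candidate classes to compete, with independent per-class sigmoid scoring. The sketch below is a hedged reading of that idea; the paper's exact homogeneity-consistency formulation may differ, and non_mutual_exclusion_loss is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def non_mutual_exclusion_loss(feat, text_embeds, labels, temperature=0.07):
    """Score each candidate class independently with a sigmoid instead of a
    softmax, so classes are not forced into mutual exclusion.
    feat: (B, D) normalized 3D features; text_embeds: (C, D) normalized text
    features; labels: (B,) class indices. An assumption, not the paper's
    exact formulation."""
    logits = feat @ text_embeds.t() / temperature  # (B, C) similarity scores
    targets = F.one_hot(labels, num_classes=text_embeds.size(0)).float()
    # Per-class binary cross-entropy: no cross-class normalization, so a high
    # score for one class does not force down the scores of related classes.
    return F.binary_cross_entropy_with_logits(logits, targets)
```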
Keywords
