  • Published: 2025-06-10
Enhanced attention-based joint semantic-instance segmentation of point clouds

Hao Wen, Zuo Zhanbin, Lu Hansen, Liang Wei, Jin Haiyan, Shi Zhenghao (Xi'an University of Technology)

Abstract
Objective To address the difficulty that existing feature fusion strategies have in fully exploiting the correlation between semantic and instance features, a joint semantic-instance segmentation network based on enhanced attention is proposed. Method The network first constructs an attention-based semantic feature extraction module to effectively capture local contextual information between points. An encoder-decoder module is then used to obtain the initial semantic features and initial instance features of the point cloud; the encoder-decoder comprises an attention-pooling-based PointNet++ set abstraction layer, PointConv encoding and decoding layers, and a PointNet++ feature propagation layer. Next, an enhanced attention module is designed: using a dual attention mechanism, it adaptively learns the similarity between central features and neighboring features, dynamically determines attention weights, sums the resulting dual attention weights, and multiplies the sum with the initial semantic features to obtain enhanced semantic features. This enhanced attention module is embedded into the semantic branch of the joint segmentation module, effectively fusing semantic and instance features and improving the accuracy of joint semantic-instance segmentation. Result Compared with the best results of the comparison methods, the proposed method improves the mean intersection over union for semantic segmentation and the mean weighted coverage for instance segmentation by 4.2% and 1.2%, respectively, on the S3DIS dataset, and by 3.2% and 2.8%, respectively, on the ScanNet dataset. Conclusion Experimental results show that the proposed network effectively fuses the extracted semantic and instance features, and its semantic and instance segmentation accuracy is clearly superior to that of existing joint segmentation methods.
Keywords
Enhanced attention-based joint semantic-instance segmentation network for point clouds

Hao Wen, Zuo Zhanbin, Lu Hansen, Liang Wei, Jin Haiyan, Shi Zhenghao (Xi'an University of Technology)

Abstract
Objective With the rapid development of 3D sensing technologies such as LiDAR and depth cameras, large-scale 3D point clouds have become a crucial data source for various applications, including autonomous driving, robotic navigation, augmented reality, and urban scene reconstruction. Compared with 2D images, point clouds provide accurate spatial geometry and capture a complete view of the environment without perspective distortion, and they are robust to lighting changes and texture variations. Point cloud segmentation is a fundamental component of scene parsing and understanding, and can be categorized into three types: semantic segmentation, instance segmentation, and joint semantic-instance segmentation. Semantic segmentation divides the 3D scene into informative regions and assigns each region to a specific class. Instance segmentation classifies objects at the point level and further distinguishes between instances belonging to the same semantic category. In recent years, researchers have attempted to perform both tasks jointly, aiming to produce more consistent and informative scene-level interpretations. Joint semantic-instance segmentation exploits the intrinsic correlation between semantic and instance segmentation, enabling the two tasks to benefit from each other. In the context of 3D point clouds, this joint approach significantly improves a system's ability to understand complex environments and offers strong technical support for the development of intelligent systems. Consequently, it has become a highly active area of research in recent years. However, existing joint semantic-instance segmentation methods typically rely on simple feature fusion strategies, which makes it difficult to fully exploit the potential correlation between semantic segmentation and instance segmentation.
To address the issue that existing feature fusion strategies struggle to fully exploit the correlation between semantic and instance information, we propose an enhanced attention-based joint semantic-instance segmentation network. Method EAJS-Net (enhanced attention-based joint semantic-instance segmentation network) introduces a semantic feature extraction module based on an attention mechanism, which focuses on the neighborhood region of each point and dynamically adjusts the attention weights to emphasize key information, thereby enhancing the extraction of semantic features between points. In addition, an enhanced attention-based semantic/instance feature fusion module is designed to adaptively learn the similarity between central features and adjacent features, reinforcing important characteristics and thoroughly exploring the correlation between instance segmentation and semantic segmentation to improve segmentation accuracy. EAJS-Net integrates PointNet++ and PointConv as its backbone and comprises three main components: a point feature enhancement module, an encoder-decoder module, and an enhanced attention-based joint segmentation module. The input to EAJS-Net is N×9-dimensional point cloud data, where N is the number of points and the 9 dimensions consist of coordinates (XYZ), color information (RGB), and normalized coordinates. A semantic feature extraction module built on the attention mechanism effectively captures local contextual information between points. These enhanced features are then passed to the encoding layer, which consists of four encoding modules: one attention-pooling-based set abstraction layer from PointNet++ and three feature encoding layers from PointConv. The corresponding decoding layer includes four decoding modules: three deep feature decoding layers from PointConv and one feature propagation layer from PointNet++.
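As a minimal NumPy sketch of the attention-pooling idea described above (replacing PointNet++'s max pooling over each point's neighborhood with a learned weighted sum), the snippet below scores each neighbor, normalizes the scores with a softmax, and aggregates. The single projection matrix `w_score` is a hypothetical stand-in for the paper's learned scoring network; the real module's architecture is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(neighbor_feats, w_score):
    """Attention pooling over each point's k neighbors.

    neighbor_feats: (N, k, C) features of the k neighbors of each of N centers.
    w_score:        (C, 1) score projection (stand-in for the learned scorer).
    Returns (N, C): attention-weighted sum instead of a channel-wise max pool.
    """
    scores = neighbor_feats @ w_score            # (N, k, 1) per-neighbor score
    alpha = softmax(scores, axis=1)              # normalize over the k neighbors
    return (alpha * neighbor_feats).sum(axis=1)  # (N, C) aggregated feature

N, k, C = 1024, 16, 64
feats = rng.standard_normal((N, k, C))
w = rng.standard_normal((C, 1))
pooled = attention_pool(feats, w)
print(pooled.shape)  # (1024, 64)
```

Because the weights are softmax-normalized per neighborhood, each output feature is a convex combination of its neighbors' features, so the aggregation can emphasize informative neighbors rather than keeping only the channel-wise maximum.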
Using the attention-pooling-based set abstraction layer of PointNet++, the network extracts spatial geometric relationships among features. The initial semantic and instance features of the point cloud are obtained through the encoding and decoding layers. An enhanced attention module is designed to adaptively learn the similarity between central and neighboring features using a dual attention mechanism, dynamically determining attention weights. The dual attention weights are summed and multiplied with the initial semantic features to obtain enhanced semantic features. This enhanced attention module is embedded into the semantic branch of the joint segmentation module, effectively integrating semantic and instance features to improve the accuracy of joint semantic-instance segmentation. The encoded features are upsampled by two parallel decoder branches to generate an instance feature matrix and a semantic feature matrix, which serve as inputs to the joint segmentation module. The semantic and instance branches are integrated through the enhanced attention module, and the final output comprises instance embeddings and semantic predictions. Result The proposed network is validated on the Stanford 3D indoor semantics dataset (S3DIS) and the ScanNet V2 dataset to verify its point cloud segmentation performance. We report 6-fold cross-validation results for EAJS-Net and compare it with state-of-the-art (SOTA) methods. On the S3DIS dataset, EAJS-Net achieves a mean intersection over union (mIoU) of 65.9%, an overall accuracy (oAcc) of 89.1%, and a mean accuracy (mAcc) of 76% for semantic segmentation. Compared with JSNet++, these three metrics (mIoU, oAcc, and mAcc) improve by 3.5%, 0.4%, and 3.2%, respectively. For instance segmentation, EAJS-Net reaches a weighted coverage rate of 61.1%.
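The enhanced attention step above (two attention weight maps, summed, then multiplied with the initial semantic features) can be sketched as follows. This is an illustrative interpretation, not the paper's exact formulation: here the dual weights are taken as scaled dot-product self-similarities computed from the semantic and instance branches respectively, so the assumed inputs are per-point semantic and instance feature matrices of matching shape.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhanced_attention(sem_feats, ins_feats):
    """Dual-attention fusion of instance cues into the semantic branch.

    sem_feats, ins_feats: (N, C) per-point features from the two branches.
    Returns (N, C): semantic features re-weighted by the summed dual
    attention maps (one from each branch).
    """
    scale = np.sqrt(sem_feats.shape[1])
    a_sem = softmax(sem_feats @ sem_feats.T / scale, axis=1)  # (N, N)
    a_ins = softmax(ins_feats @ ins_feats.T / scale, axis=1)  # (N, N)
    attn = a_sem + a_ins              # sum of the dual attention weights
    return attn @ sem_feats           # multiply with initial semantic features

N, C = 256, 32
sem = rng.standard_normal((N, C))
ins = rng.standard_normal((N, C))
enhanced = enhanced_attention(sem, ins)
print(enhanced.shape)  # (256, 32)
```

Because one of the two weight maps is computed from the instance features, instance-level affinity directly modulates the semantic features, which is the fusion effect the joint segmentation module relies on.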
Compared with JSNet++, EAJS-Net improves mean weighted coverage (mWCov), mean coverage (mCov), and mean recall (mRec) by 4.1%, 4.6%, and 1.2%, respectively. On the ScanNet dataset, EAJS-Net improves the mIoU for semantic segmentation by 3.2% and the weighted coverage rate for instance segmentation by 2.8% compared with JSNet. We also present visual comparisons between EAJS-Net and other SOTA methods; the results show that EAJS-Net still obtains better segmentation results in complex indoor scenes. In addition, ablation experiments are conducted to validate the effectiveness of the individual modules. The enhanced attention-based joint segmentation module in EAJS-Net better captures key features by dynamically adjusting the weights of the various features, effectively aggregating instance features into the semantic feature space and thereby facilitating the semantic segmentation task. Conclusion To address the difficulty that existing feature fusion strategies have in fully exploiting semantic-instance correlations, this paper proposes EAJS-Net, a joint semantic-instance segmentation network based on an enhanced attention mechanism. Contextual features among points are extracted by a novel attention-based semantic feature extraction module, and a novel enhanced attention module is proposed to aggregate instance features into the semantic feature space. This effective feature fusion strategy improves the performance of joint semantic-instance segmentation. Experimental results demonstrate that EAJS-Net effectively integrates semantic and instance features, significantly enhancing the accuracy of both semantic and instance segmentation compared with SOTA methods.
Keywords
