Multi-scale dynamic visual network for surgical robot scene segmentation
Liu Min, Qin Dunxuan, Han Yubin, Chen Xiang, Wang Yaonan (College of Electrical and Information Engineering, Hunan University) Abstract
Objective Robot-assisted laparoscopic surgery refers to surgery in which clinicians complete the operation with the help of an endoscopic surgical robot. However, endoscopic surgery is performed inside closed human body cavities, and the features of the segmentation targets are complex and variable, placing high demands on surgeons' skills. To assist surgeons in performing endoscopic surgery, this paper proposes a high-precision endoscopic surgical scene segmentation method and builds a split-type endoscopic surgical robot to validate it. Method First, this paper proposes the multi-scale dynamic visual network (MDVNet), which adopts an encoder-decoder structure. In the encoder, the dynamic large kernel attention module (DLKA) extracts multi-scale features of different segmentation targets through multi-scale large kernel attention and fuses them adaptively through a dynamic selection mechanism. In the decoder, the low-rank matrix decomposition module (LMD) guides the fusion of feature maps of different resolutions and effectively filters out noise in the feature maps, while the boundary guided module (BGM) guides the model to learn the boundary features of the surgical scene. Finally, this paper presents a split-type endoscopic surgical robot built on the Lap Game laparoscopic simulator; the segmentation results of the network model are overlaid on the robot's field of view to assist surgeons during endoscopic surgery. Result MDVNet achieves state-of-the-art results on three surgical scene datasets, with mean intersection over union of 51.19%, 71.28%, and 52.47%, respectively. Conclusion This paper proposes MDVNet, a multi-scale dynamic visual network for endoscopic surgical scene segmentation, and validates the proposed method on the constructed split-type endoscopic surgical robot. Code is available at https://github.com/YubinHan73/MDVNet.
A multi-scale dynamic visual network for surgical robot scene segmentation
Liu Min, Qin Dunxuan, Han Yubin, Chen Xiang, Wang Yaonan (College of Electrical and Information Engineering, Hunan University, Changsha) Abstract
Objective Robot-assisted endoscopic surgery refers to surgery performed with the help of intelligent endoscopic surgical robots, which can effectively reduce trauma, shorten the recovery period, and improve surgical success rates. Endoscopic surgical scene segmentation refers to the use of deep learning techniques to accurately segment the entire surgical scene, where the targets include anatomical areas and instruments. However, endoscopic surgery is completed in a closed human body cavity, and the whole procedure is accompanied by frequent cutting, traction, and other surgical operations, which makes the features of the segmentation targets complex and variable. To advance robot-assisted surgery, it is crucial to develop high-precision surgical scene segmentation algorithms that can assist surgeons in performing operations. In this paper, we propose an innovative surgical scene segmentation network named Multi-scale Dynamic Visual Network (MDVNet), which aims to address three major challenges in endoscopic surgical scene segmentation: target size variation, complex intraoperative noise, and indistinguishable boundaries. Method MDVNet adopts an encoder-decoder structure. In the encoder, the dynamic large kernel attention (DLKA) module extracts multi-scale features of the surgical scene. The DLKA module consists of multiple parallel branches, each equipped with a large kernel convolution of a different size (7, 11, and 21 in this paper), allowing the network to capture both fine details and wider contextual features of targets with different sizes. In addition, its dynamic selection mechanism adaptively fuses the branch features to meet the needs of segmentation targets of different sizes in the endoscopic surgical scene. This newly designed module directly addresses the problem of target size variation, a major obstacle for previous surgical scene segmentation methods.
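The dynamic selection mechanism can be illustrated in isolation. Below is a minimal NumPy sketch, not the paper's implementation: random arrays stand in for the outputs of the 7/11/21 large-kernel branches, and per-channel softmax weights obtained from global average pooling fuse them adaptively. All names here are illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_select_fuse(branches):
    """Adaptively fuse multi-scale branch features.

    branches: list of (C, H, W) arrays, one per large-kernel branch.
    Returns a (C, H, W) feature map: a per-channel convex combination
    of the branches, weighted by softmax over pooled descriptors.
    """
    stacked = np.stack(branches)          # (B, C, H, W)
    desc = stacked.mean(axis=(2, 3))      # global average pooling -> (B, C)
    weights = softmax(desc, axis=0)       # selection weights across branches
    return (weights[:, :, None, None] * stacked).sum(axis=0)

# Hypothetical demo: three random maps stand in for the 7/11/21 branches.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
fused = dynamic_select_fuse(branches)
```

Because the weights are a softmax, the fused map is a convex combination of the branch outputs per channel; larger pooled responses let the corresponding scale dominate.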
In the decoder, this paper further proposes two key modules, the low-rank matrix decomposition module (LMD) and the boundary guided module (BGM), to address the challenges of complex intraoperative noise and indistinguishable boundaries in endoscopic images. The core idea of LMD is to separate noise components from useful feature information in the feature map through low-rank matrix decomposition. In endoscopic surgical scenes, noise arises in surgical images from motion blur, blood splashes, and water mist on the tissue surface caused by surgical operations, and such noise reduces the segmentation accuracy of the network. LMD decomposes the feature map into a low-rank matrix and a sparse matrix through non-negative matrix factorization, where the low-rank matrix contains the main feature information of the image while the sparse matrix contains the noise and outliers. Through this process, LMD effectively removes the noise and provides high-quality feature maps for the subsequent segmentation task. In surgical scenes, the boundaries between different tissues and instruments are highly indistinguishable due to contact, occlusion, or similar texture features. To address this problem, BGM uses boundary-sensitive Laplacian convolutions and normal convolutions to compute the boundary maps of the ground truth and of the highest-resolution feature maps, respectively. In addition, BGM uses a combination of the cross-entropy loss and the Dice loss to guide the network to learn boundary features, which helps the network pay more attention to boundary regions during training and thus improves its ability to recognize boundaries. To apply the proposed MDVNet to actual surgical scenarios and verify its effectiveness, we constructed a split-type laparoscopic surgical robot platform.
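The idea behind LMD can be sketched with plain non-negative matrix factorization. The following NumPy toy example is an assumption-laden illustration, not the module itself: a non-negative feature map (flattened to channels x positions) with sparse noise spikes is approximated by a low-rank product via multiplicative updates, and the low-rank reconstruction serves as the denoised features while the residual carries the noise and outliers.

```python
import numpy as np

def nmf_lowrank(X, rank=4, iters=200, eps=1e-8):
    """Rank-`rank` non-negative factorization X ~= W @ H.

    X: non-negative matrix, e.g. post-ReLU features flattened to
    (channels, H*W). Returns the low-rank reconstruction W @ H;
    the residual X - W @ H is treated as noise/outliers.
    """
    rng = np.random.default_rng(0)
    C, N = X.shape
    W = rng.random((C, rank)) + eps
    H = rng.random((rank, N)) + eps
    for _ in range(iters):
        # Standard multiplicative updates (Lee & Seung style);
        # eps keeps the denominators strictly positive.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W @ H

# Hypothetical demo: a rank-2 "clean" feature matrix plus sparse spikes.
rng = np.random.default_rng(1)
clean = rng.random((16, 2)) @ rng.random((2, 64))
noise = (rng.random(clean.shape) < 0.05) * 2.0   # ~5% sparse outliers
denoised = nmf_lowrank(clean + noise, rank=2)
```

Because the sparse spikes are not low-rank, they are largely absorbed into the residual, so the reconstruction lands closer to the clean signal than the noisy input does.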
The platform was built around the practical operational needs of the surgical process, integrating the advanced Lap Game endoscopic simulator, Franka robotic arms, and a high-precision endoscopic imaging system. Users can manipulate the robotic arm equipped with surgical instruments through a control handle and complete endoscopic surgical operations such as cutting, dissecting, and suturing in the endoscopic simulator to simulate the surgical process. The segmentation results of the network are displayed on the user console to assist the surgeon in performing endoscopic surgery. Result To fully validate the effectiveness and potential of the proposed MDVNet, we conducted a comprehensive comparative analysis against other advanced surgical scene segmentation methods on three different surgical scene datasets: the robotic surgical scene dataset (Endovis2018), the cataract surgical scene dataset (CaDIS), and the minimally invasive laparoscopic surgery dataset (MILS). The experimental results show that MDVNet achieves the best segmentation results on all three datasets, with mean intersection over union (mIoU) of 51.19% on Endovis2018, 71.28% on CaDIS (Task III), and 52.47% on MILS. The visualization results on the three datasets also show that MDVNet can effectively segment multiple targets such as surgical instruments and anatomical areas in surgical scenes. Moreover, we conducted a series of ablation experiments on the Endovis2018 dataset with the three modules DLKA, LMD, and BGM. The results demonstrate that the modules in MDVNet are complementary and combine to produce a positive gain for the whole method. Finally, the proposed MDVNet is deployed on the laparoscopic surgical robot, and the segmentation results of the network are superimposed on the original surgical images to assist the surgeon in performing laparoscopic surgery.
Conclusion To address the three major challenges of endoscopic surgical scene segmentation, namely target size variation, complex intraoperative noise, and indistinguishable boundaries, this paper proposes an innovative surgical scene segmentation network named Multi-scale Dynamic Visual Network (MDVNet). MDVNet is composed of three modules: DLKA, LMD, and BGM. In the encoder, DLKA extracts the multi-scale features of different segmentation targets through multi-scale large kernel attention and performs adaptive feature fusion through a dynamic selection mechanism, which effectively reduces misidentification caused by target size variation. In the decoder, LMD first filters out the noise in the feature maps to obtain high-quality feature maps. BGM then guides the model to learn the boundary features of the surgical scene by computing the loss between the boundary maps of the feature maps and those of the ground truth. MDVNet achieves state-of-the-art (SOTA) results on three different surgical scene datasets: Endovis2018, CaDIS, and MILS. Code is available at https://github.com/YubinHan73/MDVNet.
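The boundary supervision summarized above can be illustrated with a small NumPy sketch. This is a hedged toy example, not BGM itself: a symmetric Laplacian kernel extracts a one-pixel boundary band from a binary mask, and simple Dice and binary cross-entropy losses compare a predicted boundary map against it. All function names are illustrative.

```python
import numpy as np

# 3x3 Laplacian kernel: responds only where the mask value changes.
LAPLACIAN = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=float)

def conv2d_same(x, k):
    """Naive 'same' 2-D filtering with zero padding.

    The Laplacian kernel is symmetric, so correlation == convolution here.
    """
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def boundary_map(mask):
    """Binary boundary band of a mask: nonzero Laplacian response."""
    return (np.abs(conv2d_same(mask.astype(float), LAPLACIAN)) > 0).astype(float)

def dice_loss(pred, gt, eps=1e-6):
    inter = (pred * gt).sum()
    return 1.0 - (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def bce_loss(pred, gt, eps=1e-6):
    pred = np.clip(pred, eps, 1 - eps)
    return -(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean()

# Hypothetical demo: a filled square; its boundary is the ring of edge
# pixels, interior and far-away pixels give zero Laplacian response.
mask = np.zeros((9, 9))
mask[2:7, 2:7] = 1
b = boundary_map(mask)
boundary_loss = bce_loss(b, b) + dice_loss(b, b)  # perfect prediction -> ~0
```

A combined loss of this shape (cross-entropy plus Dice) is what drives the network to attend to boundary regions during training.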
Keywords
endoscopic surgical robots; semantic segmentation; large kernel convolution; low-rank matrix decomposition; boundary segmentation