Segmentation of abdominal CT and cardiac MR images using multi-scale visual attention
Abstract
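Objective Medical image segmentation is an important step in computer-aided diagnosis and surgical planning, but because human organs are structurally complex and tissue boundaries are blurred, segmentation quality still needs to improve. The vision Transformer (ViT) has succeeded in computer vision and is therefore favored by medical image segmentation researchers. However, ViT-based medical image segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images, and the computational overhead that ViT requires is considerable. Method To address these problems, we propose MSVA-TransUNet, a U-shaped network built on multi-scale visual attention (MSVA) with Transformer as the backbone. MSVA is an attention mechanism implemented with multiple strip convolutions: one pair of strip convolutions approximates one large-kernel convolution, and different strip-convolution pairs approximate different large-kernel convolutions, so image information is captured at different scales. Result Experiments on abdominal multi-organ segmentation and cardiac segmentation datasets show that, compared with the baseline model, the average Dice of the proposed network improves by 3.74% and 1.58%, respectively. Its floating-point operations are 1/278 of those of the multi-head attention mechanism, and the network has 15.31 M parameters, 1/6.88 of those of TransUNet. Conclusion The proposed network is comparable to the relatively advanced networks TransUNet and SwinUNet. Replacing multi-head attention with multi-scale visual attention reduces computational overhead while retaining an advantage in segmentation performance. The code is open source at: https://github.com/BeautySilly/VA-TransUNet.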
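The strip-convolution idea in the Method part can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names, the kernel values, and the kernel sizes 7, 11, and 21 are assumptions chosen only for illustration. The sketch shows two things: applying a 1 x k strip followed by a k x 1 strip covers the same k x k receptive field as a full kernel, and it matches a full kernel exactly only when that kernel is separable (the outer product of the two strips), which is why a strip pair is described as an approximation in general; and the pair needs 2k weights per channel instead of k².

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def strip_pair_weights(k):
    """Weights per channel for a 1 x k strip plus a k x 1 strip."""
    return 2 * k


def full_kernel_weights(k):
    """Weights per channel for a single k x k kernel."""
    return k * k


# Hypothetical strip values; their outer product is the k x k kernel they emulate.
row = [1, 2, 1]    # used as a 1 x 3 kernel
col = [1, 0, -1]   # used as a 3 x 1 kernel
outer = [[c * r for r in row] for c in col]

# Small deterministic test image (values are arbitrary).
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

# Two cheap strip passes versus one pass with the equivalent 3 x 3 kernel.
two_step = conv2d_valid(conv2d_valid(image, [row]), [[c] for c in col])
one_step = conv2d_valid(image, outer)
strips_match_full_kernel = two_step == one_step

# Illustrative kernel sizes only; the abstract does not state the sizes used.
weight_table = {k: (strip_pair_weights(k), full_kernel_weights(k))
                for k in (7, 11, 21)}
```

The gap between 2k and k² widens as k grows, which is consistent with the abstract's point that larger receptive fields can be reached at a much lower cost with strip pairs than with full large kernels.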
Keywords
Segmentation of abdominal CT and cardiac MR images with multi-scale visual attention
Jiang Ting 1,2, Li Xiaoning 1,3
(1. College of Computer Science, Sichuan Normal University, Chengdu 610101, China; 2. College of Intelligent Science and Technology, Geely University, Chengdu 641423, China; 3. Visual Computing and Virtual Reality Key Laboratory of Sichuan Province, Chengdu 610066, China)
Abstract
Objective Medical image segmentation is an important step in computer-aided diagnosis and surgical planning. However, because human organs have complex and diverse structures, blurred tissue edges, and widely varying sizes, segmentation quality still needs to improve, and more accurate segmentation would help doctors plan treatment and give advice more effectively. Recently, deep-learning-based methods have become a major focus of medical image segmentation research. The Transformer, which has achieved great success in natural language processing, has also flourished in computer vision in the form of the vision Transformer (ViT), and it is therefore favored by medical image segmentation researchers. However, current ViT-based medical image segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images and the spatial relationships within them. Moreover, the quadratic computational complexity of the multi-head self-attention (MHSA) mechanism of ViT increases the required computational overhead.
Method To address these problems, this paper proposes MSVA-TransUNet, a U-shaped network with Transformer as the backbone, based on multi-scale visual attention (MSVA), an attention mechanism implemented with multiple strip convolutions. Its structure is similar to that of the multi-head attention mechanism, but it uses convolutional operations to obtain long-range dependencies. First, the network uses convolution kernels of different sizes to extract image features at different scales: a pair of strip convolutions approximates one large-kernel convolution, and strip-convolution pairs of different sizes approximate different large-kernel convolutions. The convolutions capture local information, while the large kernels also learn long-range dependencies in the image. Second, strip convolution is a lightweight convolution that markedly reduces the number of parameters and floating-point operations of the network, because the cost of convolution is much smaller than the cost imposed by the quadratic complexity of multi-head attention. Further, the network avoids converting the image into a 1D sequence for input to the vision Transformer and makes full use of the 2D structure of the image to learn its features. Finally, the first patch embedding in the encoding stage is replaced with a convolution stem. This avoids converting a low channel count directly to a high channel count, which runs counter to the typical structure of convolutional neural networks (CNNs); the patch embeddings elsewhere in the network are retained.
Result Experimental results on the abdominal multi-organ segmentation dataset (mainly containing eight organs) and the heart segmentation dataset (comprising three parts of the heart) show that the segmentation accuracy of the proposed network improves on that of the baseline model. The average Dice on the abdominal multi-organ segmentation dataset improves by 3.74%, and the average Dice on the heart segmentation dataset improves by 1.58%. The floating-point operations and number of parameters are reduced compared with the MHSA mechanism and with large-kernel convolution. The floating-point operations of the MSVA mechanism are 1/278 of those of the MHSA mechanism, and the network has 15.31 M parameters, 1/6.88 of those of TransUNet.
Conclusion Experimental results show that the proposed network is comparable to, or even exceeds, current state-of-the-art networks. The multi-scale visual attention mechanism is used instead of the multi-head self-attention mechanism; it can also capture long-range relationships and extract long-range image features. Segmentation performance improves while computational overhead falls, so the proposed network exhibits clear advantages. However, because some organs are small and located in positions that are hard to segment, the network does not learn their features well enough, and its segmentation accuracy on these organs still needs to improve. We will continue to study how to improve segmentation performance on these organs. The code of this paper will be open source soon: https://github.com/BeautySilly/VA-TransUNet.
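The results above are reported as average Dice. As a reminder of what that metric measures, here is a minimal sketch of a per-label Dice score, 2|P ∩ T| / (|P| + |T|), averaged over labels. The function names, the toy label maps, and the averaging convention (a plain mean over foreground labels, with an empty-versus-empty comparison scored as 1) are assumptions for illustration; the abstract does not specify the exact evaluation protocol.

```python
def dice_score(pred, target, label):
    """Dice overlap for one label over two flat label lists of equal length."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Assumed convention: if the label is absent from both maps, score it as 1.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred, target, labels):
    """Unweighted mean of per-label Dice scores."""
    return sum(dice_score(pred, target, lab) for lab in labels) / len(labels)


# Hypothetical flattened label maps: 0 = background, 1 and 2 = two structures.
pred = [0, 1, 1, 2, 2, 0, 1, 2]
target = [0, 1, 1, 1, 2, 0, 2, 2]

dice_label_1 = dice_score(pred, target, 1)
average = mean_dice(pred, target, labels=[1, 2])
```

In this toy case each structure has three predicted and three reference pixels with two in common, so both per-label scores and their mean are 2/3.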
Keywords
medical image segmentation; visual attention; Transformer; attention mechanism; abdominal multi-organ segmentation; cardiac segmentation