Benign-malignant classification of breast tumors in DCE-MRI by fusing local and global features
Zhao Xiaoming1,2, Liao Yuehui1, Zhang Shiqing2, Fang Jiangxiong2, He Xiaxia2, Wang Guoyu2, Lu Hongsheng2 (1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China; 2. Affiliated Hospital of Taizhou University (Taizhou Central Hospital), Taizhou 318000, China) Abstract
Objective Computer-aided detection and classification of breast tumors in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) suffer from problems such as low accuracy and a lack of usable datasets. Method To address these problems, a breast DCE-MRI image dataset is built, and a local-global cross attention fusion network (LG-CAFN) that fuses a convolutional neural network (CNN) for local feature learning with a vision Transformer (ViT) for global feature learning is proposed for the automatic diagnosis of breast tumor DCE-MRI images, improving the accuracy and efficiency of breast cancer diagnosis. The network uses a cross-attention mechanism to effectively fuse the local image features extracted by the CNN branch with the global image features extracted by the ViT branch, thereby obtaining more discriminative image features for the benign-malignant classification of breast tumor DCE-MRI images. Result Two groups of experiments involving different kinds of breast DCE-MRI sequences were set up on the breast cancer DCE-MRI dataset, comparing LG-CAFN with VGG16 (Visual Geometry Group 16-layer network), the deep residual network (ResNet), SENet (squeeze-and-excitation network), ViT, and Swin-S (Swin Transformer-small); ablation experiments and comparisons with other methods were also conducted. In the two groups of experiments, LG-CAFN achieved the highest accuracies of 88.20% and 83.93% on the benign-malignant classification task, with areas under the receiver operating characteristic (ROC) curve (AUC) of 0.9154 and 0.8826, respectively, outperforming all other methods and coming closest to 1. Conclusion The proposed LG-CAFN method has excellent local-global feature learning ability and can effectively improve the benign-malignant classification performance on breast tumor DCE-MRI images.
Keywords
breast tumor; dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI); vision Transformer (ViT); convolutional neural network (CNN); attention fusion
Method for classifying benign and malignant breast tumors in DCE-MRI by incorporating local and global features
Zhao Xiaoming1,2, Liao Yuehui1, Zhang Shiqing2, Fang Jiangxiong2, He Xiaxia2, Wang Guoyu2, Lu Hongsheng2 (1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China; 2. Taizhou Central Hospital, Taizhou University, Taizhou 318000, China) Abstract
Objective Among women in the United States, breast cancer (BC) is the most frequently detected type of cancer, except for nonmelanoma skin cancer, and it is the second-highest cause of cancer-related deaths in women, following lung cancer. Breast cancer cases have been on the rise in the past few years, but the number of deaths caused by breast cancer has remained steady or decreased, an outcome likely due to improved early detection techniques and more effective treatment options. Magnetic resonance imaging (MRI), especially dynamic contrast-enhanced (DCE) MRI, has shown promising results in screening women at high risk of breast cancer and in staging newly diagnosed patients. As a result, DCE-MRI is increasingly recognized as a valuable adjunct diagnostic tool for the timely detection of breast cancer. With the development of artificial intelligence, many deep learning models based on convolutional neural networks (CNNs), such as VGG and ResNet, have been widely used in medical image analysis. These models automatically extract deep features from images, eliminating the need for hand-crafted feature extraction and saving much time and effort. However, CNNs cannot capture global information, which is very useful for the diagnosis of breast tumors in medical images. To acquire global information, the vision Transformer (ViT) has been proposed and has achieved excellent results in computer vision tasks. ViT uses a convolution operation to split the entire input image into many small patches, which are then processed in parallel by multihead self-attention layers to capture global information across different regions of the image. However, ViT inevitably loses local information while capturing global information. Accordingly, several studies have combined CNN and ViT to obtain more comprehensive feature representations and better performance in breast tumor diagnosis tasks. Method Based on these observations, a novel cross-attention fusion network built on CNN and ViT is proposed, which simultaneously extracts local detail information with the CNN branch and global information with the ViT branch; a nonlocal block then fuses this information to classify breast tumor DCE-MR images. The model mainly contains three parts: local CNN and global ViT branches, a feature coupling unit (FCU), and cross-attention fusion. The CNN subnetwork uses SENet to capture local information, and the ViT subnetwork captures global information. The feature maps extracted by the two branches usually have different dimensions, so an FCU is adopted to eliminate the feature dimension misalignment between them. Finally, the nonlocal block computes the correspondences between the two inputs. The first two stages (stage-1 and stage-2) of SENet50 are adopted as the local CNN subnetwork, and a 7-layer ViT (ViT-7) as the global subnetwork. Each stage in SENet50 is composed of several residual blocks and SE blocks. Each residual block contains a 1 × 1 convolution layer, a 3 × 3 convolution layer, and a 1 × 1 convolution layer; each SE block contains a global average pooling layer, two fully connected (FC) layers, and a sigmoid activation function. The number of residual blocks and SE blocks is set to 3 in stage-1 and 4 in stage-2. A minimal sketch of one such SE-augmented residual block is given below.
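The following PyTorch sketch illustrates one SE-augmented residual block as described above: a 1 × 1 → 3 × 3 → 1 × 1 bottleneck followed by a squeeze-and-excitation gate. It is an illustration under stated assumptions, not the authors' code; names such as SEBottleneck and the reduction ratio of 16 are assumptions.

```python
# Minimal sketch of an SE-augmented residual block; illustrative, not the
# authors' implementation. `reduction=16` is an assumed default.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pooling, two FC layers, sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # excitation: rescale each channel


class SEBottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck with an SE gate and a residual add."""

    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.se(self.body(x)))
```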
The 7-layer ViT contains seven encoder layers, each of which includes two LayerNorm layers, a multihead self-attention module, and a simple multilayer perceptron (MLP) block. The FCU contains a 1 × 1 convolution, a BatchNorm layer, and nearest neighbor interpolation. The nonlocal block consists of four 1 × 1 convolutions and a softmax function (both are sketched after this abstract). Result The model is compared with five other deep learning models, namely VGG16, ResNet50, SENet50, ViT, and Swin-S (Swin Transformer-small), and two sets of experiments using different breast tumor DCE-MRI sequences are conducted to evaluate its robustness and generalization. The quantitative evaluation metrics are accuracy and the area under the receiver operating characteristic (ROC) curve (AUC). Compared with VGG16 and ResNet50 in the two sets of experiments, the accuracy increases by 3.7% and 3.6% and the AUC increases by 0.045 and 0.035 on average, respectively. Compared with SENet50 and ViT-7, the accuracy increases by 3.2% and 1.1%, and the AUC increases by 0.035 and 0.025 on average, respectively. Compared with Swin-S, the accuracy increases by 3.0% and 2.6%, and the AUC increases by 0.05 and 0.03. In addition, class activation maps of the learned feature representations are generated to improve the interpretability of the models. Furthermore, a series of ablation experiments is conducted to verify the effectiveness of the proposed method. Specifically, other fusion methods, namely feature-level and decision-level fusion, are compared with the cross-attention fusion module. Compared with feature-level fusion in the two sets of experiments, the accuracy increases by 1.6% and 1.3%, and the AUC increases by 0.03 and 0.02; compared with decision-level fusion, the accuracy increases by 0.7% and 1.8%, and the AUC increases by 0.02 and 0.04. Finally, comparative experiments with three recent methods, RegNet, ConvNeXt, and MobileViT, are also performed. The experimental results fully demonstrate the effectiveness of the method in the breast tumor DCE-MR image classification task. Conclusion In this paper, a novel cross-attention fusion network based on a local CNN and a global ViT (LG-CAFN) is proposed for the benign and malignant classification of breast tumors in DCE-MR images. Extensive experiments demonstrate its superior performance compared with several state-of-the-art methods. Although LG-CAFN is applied here only to breast DCE-MR images, the approach can easily be transferred to other medical image diagnostic tasks; in future work, it will be extended to modalities such as breast ultrasound and breast CT images. In addition, automatic segmentation of breast DCE-MR images will be explored to analyze them more comprehensively and to help radiologists make more accurate diagnoses.
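Below is a hedged PyTorch sketch of the FCU and the nonlocal-style cross-attention fusion. The abstract specifies only the layer inventory (a 1 × 1 convolution, BatchNorm, and nearest neighbor interpolation for the FCU; four 1 × 1 convolutions and a softmax for the nonlocal block), so the tensor shapes, the channel halving, and the query/key/value role assignment here are illustrative assumptions.

```python
# Hedged sketch of the FCU and nonlocal cross-attention fusion; shapes and
# the query/key/value assignment are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FCU(nn.Module):
    """Align one branch's feature map with the other's channels and spatial size."""

    def __init__(self, in_ch: int, out_ch: int, out_size: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # align channel dims
        self.norm = nn.BatchNorm2d(out_ch)
        self.out_size = out_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(self.proj(x))
        return F.interpolate(x, size=self.out_size, mode="nearest")  # align H x W


class CrossAttentionFusion(nn.Module):
    """Nonlocal block over two inputs: queries from one branch, keys/values from the other."""

    def __init__(self, ch: int):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)  # query transform
        self.phi = nn.Conv2d(ch, ch // 2, 1)    # key transform
        self.g = nn.Conv2d(ch, ch // 2, 1)      # value transform
        self.out = nn.Conv2d(ch // 2, ch, 1)    # output projection (4th 1x1 conv)

    def forward(self, x_local: torch.Tensor, x_global: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x_local.shape  # both inputs: B x C x H x W (FCU-aligned)
        q = self.theta(x_local).flatten(2).transpose(1, 2)  # B x HW x C/2
        k = self.phi(x_global).flatten(2)                   # B x C/2 x HW
        v = self.g(x_global).flatten(2).transpose(1, 2)     # B x HW x C/2
        attn = torch.softmax(q @ k, dim=-1)  # local-global correspondences
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x_local + self.out(y)  # residual fusion


if __name__ == "__main__":
    cnn_feat = torch.randn(2, 512, 28, 28)  # local branch output (assumed shape)
    vit_feat = torch.randn(2, 384, 14, 14)  # global branch tokens as a map (assumed)
    aligned = FCU(384, 512, 28)(vit_feat)   # match the CNN map's channels and size
    fused = CrossAttentionFusion(512)(cnn_feat, aligned)
    print(fused.shape)  # torch.Size([2, 512, 28, 28])
```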
Keywords
breast tumor; dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI); vision Transformer (ViT); convolutional neural network (CNN); attention fusion