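Joint diagnosis and segmentation of COVID-19 pneumonia with a multi-head attention mechanism
Li Jinxing, Sun Jun, Li Chao, Bilal Ahmad (Jiangnan University, Wuxi 214122) Abstract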
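Objective The COVID-19 pandemic has swept the world. To diagnose pneumonia patients quickly and identify the infected regions of their lungs, many detection networks have been proposed, but most existing networks handle only one task, either diagnosis or segmentation. This paper proposes a joint diagnosis and segmentation network that incorporates a multi-head attention mechanism and can simultaneously perform pneumonia diagnosis (classification) on chest X-ray images and segmentation of COVID-19 infected regions. Method The network consists of three parts: a dual-path embedding layer extracts the shallow intuitive features and the deep abstract features of a chest X-ray through two different image embedding schemes; a Transformer module jointly considers the extracted shallow intuitive and deep abstract features; and a segmentation decoder enlarges the feature map to output the segmented region. To support joint training, a hybrid loss function is used to dynamically balance the training of classification and segmentation. The classification loss is defined as the sum of a classification contrastive loss and a cross-entropy loss; the segmentation loss is a binary cross-entropy loss. Result Experiments on data merged from six public datasets show that the proposed network achieves 95.37% accuracy, 96.28% recall, a 95.95% F1 score, and a 93.88% kappa coefficient, and its diagnostic classification performance exceeds that of mainstream networks such as ResNet50, VGG16 (Visual Geometry Group), and Inception_v3. For COVID-19 lesion segmentation, compared with the popular U-Net and its improved variants, it achieves the highest accuracy (95.96%), excellent sensitivity (78.89%), the best Dice coefficient (76.68%), and the best AUC (area under ROC curve) (98.55%). In terms of efficiency, it outputs one diagnosis and segmentation result every 0.56 s. Conclusion The joint network model adopts the Transformer architecture, attends to global features through self-attention, and jointly considers deep abstract features and shallow high-level features through cross-attention, achieving excellent classification and segmentation performance.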
Keywords
An MHA-based integrated diagnosis and segmentation method for COVID-19
Li Jinxing, Sun Jun, Li Chao, Bilal Ahmad(Jiangnan University, Wuxi 214122, China) Abstract
Objective To contain the COVID-19 (corona virus disease 2019) pandemic, the first priority is to identify and isolate infectious patients in time. The traditional PCR (polymerase chain reaction) screening method is costly and time-consuming. Deep learning networks based on AI (artificial intelligence) have recently been applied to medical imaging for COVID-19 diagnosis and pathological lung segmentation. However, most current networks are restricted by experimental datasets with a limited number of chest X-ray (CXR) images, and they focus on a single task, either diagnosis or segmentation. Moreover, most networks are based on the convolutional neural network (CNN). The convolution operation of a CNN extracts local features from neighboring pixels and is limited in explicitly modeling long-range dependencies. We therefore develop a vision transformer network (ViTNet), in which the multi-head attention (MHA) mechanism models long-range dependencies between pixels. Method We build a novel transformer network, ViTNet, for both diagnosis and segmentation. ViTNet is composed of three parts: a dual-path feature embedding, a transformer module, and a feature decoder for segmentation. 1) The dual-path feature embedding embeds the input CXR in two ways. One uses a 2D convolution whose stride equals the convolution kernel size, which divides a CXR into multiple patches and builds an input vector for each patch. The other uses a pre-trained ResNet34 as a backbone to extract deep features of the CXR. 2) The transformer module is composed of six encoders and one cross-attention module. The vector sequence generated by the 2D convolution is the input of the transformer encoders.
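The first embedding path can be illustrated with a small sketch. A 2D convolution whose stride equals its kernel size is equivalent to cutting the image into non-overlapping patches and applying one shared linear projection to each flattened patch. The code below is a pure-Python illustration of that equivalence, not the authors' implementation; the function names and toy weights are hypothetical, and the patch size of 16 in the usage note is an assumption, since the patch size is not stated here.

```python
def split_into_patches(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch blocks.

    Each block is flattened row by row, which mirrors what a convolution
    with stride == kernel size sees at each output position.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[top + dy][left + dx]
                     for dy in range(patch)
                     for dx in range(patch)]
            patches.append(block)
    return patches


def embed_patches(patches, weights, bias):
    """Project every flattened patch with the same weights and bias.

    weights has one row per output channel; each row has patch * patch entries.
    """
    return [[sum(w * x for w, x in zip(row, p)) + b
             for row, b in zip(weights, bias)]
            for p in patches]


def num_patches(size, patch):
    """Number of patch vectors produced for a square size x size image."""
    return (size // patch) ** 2
```

Under the assumed patch size of 16, a 448×448 input would give `num_patches(448, 16)`, that is 28 × 28 = 784 patch vectors for the transformer encoders.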
Because the encoder inputs are extracted directly from image pixels, they can be regarded as the shallow and intuitive features of the CXR. The six encoders are arranged sequentially and transform the shallow features into high-level global features. The cross-attention module takes the outputs of the backbone and of the transformer encoders as inputs, so the network can combine the deep abstract features with the encoded shallow features, absorbing global information from the encoded shallow features and local information from the deep abstract features. 3) The feature decoder for segmentation doubles the size of the feature map and outputs the segmentation results. Our network must handle two tasks, classification and segmentation, simultaneously. A hybrid loss function is therefore employed for training, which balances the training effort between classification and segmentation. The classification loss is the sum of a contrastive loss and a multi-class cross-entropy loss. The segmentation loss is a binary cross-entropy loss. In addition, a new five-class CXR dataset is compiled. It contains 2 951 CXRs of COVID-19, 16 964 CXRs of healthy subjects, 6 103 CXRs of bacterial pneumonia, 5 725 CXRs of viral pneumonia, and 6 723 CXRs of opaque lung. All COVID-19 CXRs in this dataset are labeled with masks of the COVID-19 infected lung regions. In training, the input images are resized to 448×448 pixels, the learning rate is initially set to 2×10⁻⁴ and decreased gradually in a self-adaptive manner, the total number of iterations is 200, and optimization with Adam is conducted on four Tesla K80 GPU devices.
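The structure of the hybrid loss can be sketched as follows. The abstract only states that the classification loss is a contrastive loss plus a multi-class cross-entropy loss and that the segmentation loss is a binary cross-entropy loss; it does not give the contrastive term or the balancing rule. In this sketch the contrastive value is therefore passed in as a number, and `seg_weight` is a hypothetical fixed weight standing in for the balancing scheme. All function names are assumptions.

```python
import math


def cross_entropy(probs, label):
    """Multi-class cross-entropy for one sample, given class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def binary_cross_entropy(pred, target):
    """Mean binary cross-entropy over the pixels of a flattened mask."""
    eps = 1e-12
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def hybrid_loss(ce, contrastive, seg_bce, seg_weight=1.0):
    """Classification loss (contrastive + CE) plus weighted segmentation loss.

    seg_weight is an assumed fixed weight; the source describes a balancing
    scheme but does not specify it.
    """
    classification = contrastive + ce
    return classification + seg_weight * seg_bce
```

Since only the COVID-19 CXRs carry infection masks, one plausible use would be to set `seg_weight` to zero for samples without a mask, though that is also an assumption rather than something stated in the text.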
Result In the classification experiments, we compare ViTNet with a general transformer network and five popular CNN deep-learning models, i.e., ResNet18, ResNet50, VGG16 (Visual Geometry Group), Inception_v3, and the deep layer aggregation network (DLAN), in terms of overall prediction accuracy, recall, F1 score, and kappa coefficient. Our model performs best with 95.37% accuracy, followed by Inception_v3 and DLAN with 95.17% and 94.40% accuracy, respectively, while VGG16 reaches 94.19% accuracy. Our model also outperforms the other networks in recall, F1 score, and kappa coefficient. In the segmentation experiments, ViTNet is compared with four commonly used segmentation networks: the pyramid scene parsing network (PSPNet), U-Net, U-Net+, and the context encoder network (CE-Net). The evaluation indicators are accuracy, sensitivity, specificity, Dice coefficient, and area under the ROC (receiver operating characteristic) curve (AUC). The experimental results show that our model achieves the best accuracy and AUC, and its sensitivity is second best, inferior only to that of U-Net+. Specifically, our model achieves 95.96% accuracy, 78.89% sensitivity, 97.97% specificity, 98.55% AUC, and a Dice coefficient of 76.68%. In terms of efficiency, our model processes one CXR in 0.56 s. In addition, we show the segmentation results of six COVID-19 CXR images produced by all the segmentation networks; as illustrated in Fig.5, our model gives the best segmentation. A limitation of our model is that it may incorrectly classify COVID-19 cases as healthy, which is unacceptable. The PCR test for COVID-19 is probably more trustworthy than the deep-learning method, but its results typically take 1 or 2 days to come back.
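The pixel-wise segmentation indicators named above (accuracy, sensitivity, specificity, and Dice coefficient) all follow from the four confusion-matrix counts of a binary mask. The sketch below uses the standard definitions on flattened 0/1 masks; the function names are hypothetical, and AUC is left out because it needs continuous scores rather than thresholded masks.

```python
def confusion_counts(pred, target):
    """Count true/false positives and negatives for flattened 0/1 masks."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, target):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred, target):
    """Accuracy, sensitivity, specificity and Dice from two binary masks."""
    tp, fp, tn, fn = confusion_counts(pred, target)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),    # recall on infected pixels
        "specificity": _ratio(tn, tn + fp),    # recall on background pixels
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Because infected regions usually cover a small share of a CXR, accuracy and specificity are dominated by background pixels, which is consistent with the reported sensitivity and Dice values being much lower than the accuracy.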
Conclusion A novel ViTNet method is developed, which simultaneously achieves automatic diagnosis on CXRs and segmentation of lung regions infected by COVID-19. ViTNet is superior in diagnosis performance and demonstrates promising segmentation ability.
Keywords
corona virus disease 2019 (COVID-19); automatic diagnosis; lung region segmentation; multi-head attention mechanism; hybrid loss