Automatic acquisition of the standard fetal four-chamber cardiac ultrasound view by fusing inter-frame temporal relationships
Xu Guangzhu1,2, Wu Mengqi1,2, Qian Yifan1,2, Wang Yang3, Liu Rong3, Zhou Jun3, Lei Bangjun1,2 (1. Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering (China Three Gorges University), Yichang 443002, China; 2. College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China; 3. Ultrasound Department, Yichang Central People's Hospital, Yichang 443003, China) Abstract
Objective: When sonographers manually scan and capture fetal cardiac view images, the best moment for acquiring a cardiac view is often missed because of frequent manual pause and screenshot operations. When a deep visual object detection or classification network alone is used to acquire views automatically, a high false detection rate results because the network cannot be guaranteed to focus on the fine-grained features of the relatively small cardiac region within the view image; in addition, the optimal imaging moments of different cardiac anatomical parts are often not synchronized. To address these problems, an automatic acquisition algorithm for standard four-chamber (4CH) view images is proposed that combines an object detection network with a classification network and fuses the temporal relationships between key frames. Method: First, an object detection network is trained on a self-built fetal cardiac ultrasound view dataset to locate the four-chamber region and the descending aorta region quickly and accurately. Then, when a descending aorta region is detected in the video frames within a certain time window, the candidate regions containing the four-chamber target are extracted and fed into a classification network trained on a self-built standard four-chamber region image set to further identify standard four-chamber regions. Finally, the reliable descending aorta region is determined through the temporal relationship, and the detection confidence of the reliable descending aorta and the classification outputs of the four-chamber regions in the view images of the same time window are combined by weighting to compute the score of a standard four-chamber view image. Result: The YOLOv5x (you only look once version 5 extra large) and Darknet53 models trained on the dataset constructed in this study achieve 94.0% mAP@0.5 and 61.1% mAP@[.5:.95], as well as a recall@0.5-0.95 of 69.5%, on the detection of the four-chamber and descending aorta regions, and a TOP-1 accuracy of 92.4% on the four-chamber standardness classification task. After the detection and classification modules are combined, the false detection rate of the system for four-chamber regions is reduced by 29.38%. Conclusion: The strategy of combining object detection and classification networks, together with inter-frame temporal information, can effectively reconcile the conflict between false detections and missed detections while substantially reducing the false detection rate. In addition, besides automatically acquiring standard four-chamber view images, the proposed algorithm can also recommend the best view, which gives it good practical value.
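As a minimal sketch of the weighted scoring rule described in the Method above (the additive form, the symbols, and the weight $\alpha$ are our illustrative assumptions rather than the paper's exact formulation): $S_t = \alpha\, c_{\mathrm{DAO}} + (1-\alpha)\, p_{\mathrm{cls}}\!\left(R_t^{\mathrm{4CH}}\right)$, where $c_{\mathrm{DAO}}$ denotes the detection confidence of the reliable descending aorta found in the current time window, $p_{\mathrm{cls}}(R_t^{\mathrm{4CH}})$ denotes the classification network's output for the four-chamber candidate region of frame $t$, and the frame with the highest score $S_t$ within the window is reported as the best standard four-chamber view.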
Keywords
deep learning; convolutional neural network (CNN); object detection; image classification; frame sequential relationships
Automatic capture for standard fetal cardiac four-chamber ultrasound view by fusing frame sequential relationships
Xu Guangzhu1,2, Wu Mengqi1,2, Qian Yifan1,2, Wang Yang3, Liu Rong3, Zhou Jun3, Lei Bangjun1,2 (1. Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang 443002, China; 2. College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China; 3. Ultrasound Department, Yichang Central People's Hospital, Yichang 443003, China) Abstract
Objective: Optimal scan planes are often missed because of frequent manual pause and screenshot operations and random fetal movements when sonographers scan the fetal heart region, which hampers efficient screening. When deep neural networks designed for visual object detection or classification are adapted to automatically capture fetal cardiac ultrasound scan planes, they usually end up with a high false detection rate. One possible reason is that they cannot be guaranteed to focus on the fine-grained features of the relatively small cardiac region. Moreover, the optimal scanning moments of different cardiac parts are usually asynchronous, so object detection networks tend to miss numerous potential scan planes if they rely on counting the cardiac parts coexisting at a single moment. To solve these problems, our study focuses on the most critical fetal cardiac ultrasound scan plane, namely the four-chamber (4CH) scan plane, and proposes an automatic four-chamber scan plane extraction algorithm that combines object detection and classification networks while considering the temporal relationships of key video frames. Method: To address the lack of public datasets of fetal four-chamber echocardiographic images, 512 echocardiographic videos of 14- to 28-week-old fetuses were collected from our partners. Each video was recorded by experienced sonographers with mainstream ultrasound equipment. Most of the videos consist of continuous scan views from the gastric vesicle to the heart and then to the three vessels. When labeling the standard four-chamber planes, to ensure that the detection model learns sufficient information about the standard four-chamber scan plane, the standard four-chamber plane dataset used in the subsequent experiments was manually screened from the frames of videos Nos. 1-100 and Nos. 144-512 so that every image contains positive sample targets. In addition, the four-chamber heart region and the descending aorta (DAO) region in each image were labeled. These standard four-chamber scan planes were then divided into training, validation, and test sets at a ratio of 5:2:3 and used for the subsequent training and evaluation of the detection model on the standard four-chamber scan plane image set. During the training of the detection and classification models, the YOLOv5x network was first trained with the labeled four-chamber scan plane image dataset. The trained detection model was then used to evaluate the previously unlabeled video frames (regarded as non-standard four-chamber planes) under an appropriate threshold setting, and the falsely detected images were collected as the negative dataset for training the classification model. Lastly, the four-chamber regions were cropped according to the position coordinates of the manually labeled (standard) and YOLOv5x-misdetected (non-standard) four-chamber regions and used to train the Darknet53 classification model. During inference, the trained detection model was first used to rapidly and accurately locate the four-chamber and descending aorta regions.
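The negative-sample construction described above can be sketched as follows. This is a minimal illustration under our own assumptions: the detector wrapper `detect`, the directory layout, the "4CH" label string, and the 0.5 confidence threshold are hypothetical placeholders, not the authors' code.

```python
# Hypothetical sketch of the negative-sample mining step: four-chamber (4CH)
# detections fired by the trained detector on frames known to contain no
# standard 4CH plane are cropped and kept as "non-standard" classifier samples.
from pathlib import Path

import cv2  # OpenCV, used here for frame I/O and cropping

CONF_THRESHOLD = 0.5  # assumed detection confidence threshold


def mine_negatives(frames_dir: str, out_dir: str, detect) -> int:
    """detect(image) -> list of (label, confidence, (x1, y1, x2, y2)),
    e.g., a wrapper around the trained YOLOv5x model (assumed interface)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for frame_path in sorted(Path(frames_dir).glob("*.png")):
        image = cv2.imread(str(frame_path))
        for label, conf, (x1, y1, x2, y2) in detect(image):
            # Any confident 4CH hit on these frames is a false detection,
            # so the cropped region becomes a negative training sample.
            if label == "4CH" and conf >= CONF_THRESHOLD:
                cv2.imwrite(str(out / f"neg_{saved:06d}.png"), image[y1:y2, x1:x2])
                saved += 1
    return saved
```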
Thereafter, when a descending aorta region was detected in a video frame within a certain time window, the candidate regions containing the four-chamber objects were extracted and sent to the classification model, which was well trained on the self-built standard four-chamber region dataset, to further identify the standard four-chamber regions. Lastly, the reliable descending aorta region was determined through the temporal relationship, and the score of a standard four-chamber scan plane was calculated as a weighted sum of the detection confidence of the reliable descending aorta and the classification outputs of the four-chamber regions of the frames in the same time window. Result: Given that several standard four-chamber scan planes exist in any fetal cardiac ultrasound video and that this research mainly studies their optimal automatic extraction, we focus on the false detection rate when analyzing the performance of the YOLOv5x (detection) and Darknet53 (classification) modules before and after their combination. The objective is to achieve a low false detection rate while keeping the missed detection rate acceptable. Experimental results show that as the detection confidence threshold increases (0.3-0.9), the false detection rate of YOLOv5x gradually decreases (from 36.25% to 11.20%), but the missed detection rate continuously increases (from 0.31% to 27.17%). This result indicates that a low false detection rate cannot be ensured merely by adjusting the detection confidence threshold of YOLOv5x, because the missed detection rate rises with the threshold; thus, whether each frame contains a standard four-chamber region cannot be determined by threshold adjustment alone. When the detection confidence threshold is set to 0.3 and the Darknet53 classification module is added, the system's missed detection rate increases by 19.72%, but the false detection rate decreases by 35.18%. When the detection confidence threshold is 0.4-0.6, combining the Darknet53 classification module still leaves a few missed detections for the entire system, but its false detection rate is significantly lower than when only the YOLOv5x detection module is used. Moreover, when the confidence threshold is 0.5, the overall error rate of the system reaches its lowest level of 21.06%, and the false detection rate decreases from 30.25% to 0.87% (a decrease of 29.38%). When the detection confidence threshold is 0.7-0.9, combining the Darknet53 classification module can further reduce the false detection rate of the system, but the missed detection rate increases with the threshold (from 20.96% to 40.22%). Therefore, to ensure a low false detection rate while keeping the missed detection rate as low as possible, a confidence threshold of 0.5 and an intersection-over-union threshold of 0.5 are adopted in this study. Although the experimental data show that the missed detection rate is nearly 21% in the best case, the false detection rate is the key index for the practical problem addressed in this research, and the proposed algorithm reduces it to under 1%. In real application scenarios, an effective four-chamber video frame usually appears multiple times, so a low false detection rate combined with a relatively high missed detection rate can still meet actual needs.
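The inference procedure described above can be sketched as follows, again under our own assumptions: the window length, the weight on the descending aorta confidence, and the `detect`/`classify_4ch` interfaces stand in for the trained YOLOv5x and Darknet53 models and are illustrative, not the authors' implementation.

```python
# Hypothetical inference sketch: within a sliding time window, a frame's
# four-chamber candidate is scored only when a reliable descending aorta
# (DAO) detection exists in the same window; the final score is a weighted
# combination of the DAO confidence and the classifier output.
from collections import deque

WINDOW = 30          # assumed time-window length in frames
ALPHA = 0.4          # assumed weight on the DAO detection confidence
CONF_THRESHOLD = 0.5 # assumed detection confidence threshold


def best_plane_in_window(frames, detect, classify_4ch):
    """detect(frame)      -> list of (label, conf, (x1, y1, x2, y2))
       classify_4ch(crop) -> probability that the crop is a standard 4CH region"""
    window = deque(maxlen=WINDOW)   # DAO confidences of recent frames
    best = (None, -1.0)             # (best frame, best score)
    for frame in frames:
        dets = detect(frame)
        dao_conf = max((c for lbl, c, _ in dets if lbl == "DAO"), default=0.0)
        window.append(dao_conf)
        reliable_dao = max(window)  # simple "reliable DAO" rule within the window
        if reliable_dao < CONF_THRESHOLD:
            continue                # no reliable DAO in the window -> skip frame
        for lbl, conf, (x1, y1, x2, y2) in dets:
            if lbl != "4CH" or conf < CONF_THRESHOLD:
                continue
            p_std = classify_4ch(frame[y1:y2, x1:x2])
            score = ALPHA * reliable_dao + (1 - ALPHA) * p_std
            if score > best[1]:
                best = (frame, score)
    return best
```

Taking the maximum DAO confidence within the deque is one simple way to realize the "reliable descending aorta" check; the paper's actual temporal rule may differ.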
Conclusion: Experimental results show that combining the object detection and classification networks with the inter-frame sequential information can effectively reconcile the contradiction between false detection and missed detection and significantly reduce the false detection rate. In addition, the proposed algorithm can automatically extract standard four-chamber planes and also recommend the best one, which gives it good practical application value.
Keywords
deep learning; convolutional neural network (CNN); object detection; image classification; frame sequential relationships