Semantic fine-tuning and cross-modal retrieval-augmented Chinese medical report generation
(1. School of Computer Science and Technology, Shandong University of Finance and Economics; 2. The First Affiliated Hospital of Shandong First Medical University; 3. School of Software, Shandong University) Abstract
Objective Medical report generation aims to produce accurate diagnostic findings from medical images, thereby reducing the workload of physicians and improving clinical efficiency. However, Chinese medical report generation still has limitations in accurately understanding medical images and describing reports in standardized language, and it suffers from hallucination. To address these problems, this paper proposes a Chinese medical report generation model based on semantic fine-tuning and cross-modal retrieval augmentation (semantic fine-tuning and cross-modal retrieval-augmented Chinese medical report generation, FRCM). Method Building on the large multimodal model LLaVA, we adapt and fine-tune its visual encoder and large language model for the medical domain, and propose a collaborative training strategy that combines general and domain-specific data: general data improves the model's understanding of complex instructions, while domain-specific data equips the model with medical image-text alignment and professional Chinese medical report generation capabilities. In the inference phase, a novel cross-modal retrieval-augmented strategy uses guiding knowledge to effectively mitigate the model's hallucination problem, further improving the accuracy and robustness of the generated medical reports. Result On the Chinese MIMIC-CXR dataset, compared with the XrayGLM and XrayPULSE models, FRCM achieves improvements of 10.4%, 10.1%, 9.7%, 9.1%, 6.6%, 9.4%, and 38.4% on seven metrics: BLEU-1, BLEU-2, BLEU-3, and BLEU-4 (bilingual evaluation understudy-ngram, BLEU-n), the longest-common-subsequence recall metric (recall-oriented understudy for gisting evaluation-longest common subsequence, ROUGE-L), the metric for evaluation of translation with explicit ORdering (METEOR), and the consensus-based image description evaluation metric (CIDEr). Compared with models fine-tuned on LLaVA and Qwen-VL, FRCM improves the scores on five metrics, BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr, by 4.1%, 3.1%, 3.3%, 3.6%, and 25.1%, respectively. Ablation results show that the training method and key components used in FRCM effectively improve model performance. Two case studies further demonstrate that the Chinese medical reports generated by FRCM surpass those of other models in accuracy and informativeness. Conclusion By designing training and inference strategies for large multimodal models, this paper combines the advantages of semantic fine-tuning and retrieval augmentation to generate more detailed and accurate Chinese medical reports.
Keywords
Semantic fine-tuning and cross-modal retrieval-augmented Chinese medical report generation
Li Hengtai1, Liu Hui1, Chen Gongguan1, Yan Zishen1, Sheng Yurui2, Zhang Caiming3 (1. School of Computer Science and Technology, Shandong University of Finance and Economics; 2. The First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital; 3. School of Software, Shandong University) Abstract
Objective The task of generating medical reports involves producing accurate and comprehensive examination results based on findings observed in medical images. This technology can alleviate the burden on radiologists, reduce diagnostic errors due to lack of experience, and expedite clinical workflows. Medical report generation is similar to image captioning, but it presents two unique challenges: long text generation and the imbalance in medical data distribution. Current approaches tend to train a task-specific model for medical report generation from scratch using limited publicly available data. Because these models have insufficient ability to fuse visual and textual features and to generate rich information, their performance is often suboptimal. Large multimodal models (LMMs), composed of visual encoders and large language models (LLMs), possess the ability to recognize images and generate high-quality text with rich knowledge, making them particularly suitable for image-based text generation tasks. Their emergence provides a novel solution for the medical report generation task. However, LMMs are still in the early stages in the field of Chinese medical report generation, especially in accurately understanding medical images and normatively describing medical reports. Moreover, these models have inherent hallucination issues, where the generated responses appear logical but are actually incorrect or unfounded. To address the above problems, this paper proposes a Chinese medical report generation model based on semantic fine-tuning and cross-modal retrieval augmentation (FRCM). Method Based on the LMM framework of LLaVA, this paper fine-tunes and adapts the visual encoder and LLM for the medical domain. It proposes a collaborative training strategy using general data and domain-specific data, and introduces a novel cross-modal retrieval-augmented strategy during the inference phase.
The paper translates the largest dataset in the medical report generation domain, MIMIC-CXR, into Chinese and uses it as in-domain data for research on Chinese medical report generation. First, considering the characteristics of medical images and Chinese medical reports, the corresponding modules of LLaVA are replaced with a medical visual encoder trained on a large amount of medical images and a medical LLM with strong Chinese processing capabilities, allowing the model to better handle data in the medical field. Second, a two-phase training strategy using both general and domain-specific data is employed. In the first training phase, only the projection layer is trained. The domain-specific data enables the model to achieve medical image-text alignment, while the general data enhances the model's generalization capability. In the second training phase, the parameters of the projection layer are further updated, and a low-rank adaptation method is used to fine-tune the LLM. The domain-specific data provides the model with the ability to generate professional Chinese medical reports, and the general data improves the model's understanding of complex instructions. Throughout the training process, medical images are encoded by the visual encoder into global feature vectors and local feature vectors. The local feature vectors are projected into visual embeddings with the same dimensions as the LLM's embedding space. Medical reports and instructions are tokenized into text embeddings by the LLM's tokenizer and input into the LLM along with the visual embeddings for training. Finally, to further alleviate the hallucination problem of the model, a cross-modal retrieval-augmented strategy is proposed. A cross-modal similar report retrieval module is designed. During inference, the global feature vectors obtained from the visual encoder are layer-normalized and input into the similar report retrieval module to perform cross-modal retrieval from image to report.
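The encode-project-concatenate flow described above can be sketched as follows. This is a minimal illustration only: the module names, pooling choice, and dimensions are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
NUM_PATCHES, VIS_DIM, LLM_DIM = 256, 1024, 4096

def encode_image(image_patches):
    """Stand-in for the medical visual encoder: returns one global
    feature vector plus per-patch local feature vectors."""
    local = image_patches                 # (NUM_PATCHES, VIS_DIM)
    global_feat = local.mean(axis=0)      # (VIS_DIM,), pooled summary
    return global_feat, local

# Projection layer (trained in the first phase): maps local visual
# features into the LLM's embedding space.
W_proj = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.01

def project(local_feats):
    return local_feats @ W_proj           # (NUM_PATCHES, LLM_DIM)

image = rng.standard_normal((NUM_PATCHES, VIS_DIM))
global_feat, local_feats = encode_image(image)
visual_embeds = project(local_feats)

# Text embeddings from the LLM tokenizer are concatenated with the
# visual embeddings before being fed to the LLM for training.
text_embeds = rng.standard_normal((32, LLM_DIM))
llm_input = np.concatenate([visual_embeds, text_embeds], axis=0)
```

The global feature vector is not consumed by the LLM here; it is reserved for the cross-modal report retrieval module used at inference time.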
The retrieved similar reports are then used as additional knowledge input to the LLM, thereby reducing hallucinations and improving the accuracy and robustness of the model in generating medical reports. Result On the Chinese MIMIC-CXR dataset, compared to the LMMs XrayGLM and XrayPULSE for Chinese medical report generation, FRCM achieved improvements of 10.4%, 10.1%, 9.7%, 9.1%, 6.6%, 9.4%, and 38.4% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr scores, respectively. Compared to models fine-tuned on LLaVA and Qwen-VL, FRCM achieved score improvements of 4.1%, 3.1%, 3.3%, 3.6%, and 25.1% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr, respectively. In ablation experiments, both data ablation and module ablation were conducted. Data ablation demonstrated that adding diverse general data during training enhances the model's ability to follow complex instructions, thereby improving the quality of generated medical reports by better utilizing additional knowledge. Module ablation revealed that the key components used in FRCM significantly enhance its performance. Furthermore, two case studies demonstrated that the Chinese medical reports generated by FRCM are superior to those produced by other models in terms of accuracy and information richness. Conclusion This paper proposes FRCM, aimed at generating Chinese medical reports from medical images. Unlike traditional medical report generation methods, this study leverages LMM techniques to effectively address the challenges of long text generation and imbalanced medical data in the task of medical report generation. However, LMMs are typically pre-trained on extensive general data and have limitations in recognizing medical images and generating specialized medical reports. Based on the LLaVA model framework, this paper utilizes a medical visual encoder and a medical LLM, fine-tuning them semantically.
To further mitigate the inherent hallucination problem of LMMs, we designed a similar report retrieval module. This module provides additional knowledge during the inference stage to assist the model in generating more accurate reports. Experimental results show that FRCM performs satisfactorily in the task of Chinese medical report generation.
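The retrieval-augmented inference step can be sketched as below. The index contents, similarity measure (cosine similarity), and top-k size are illustrative assumptions; the paper specifies only that the layer-normalized global image feature drives a cross-modal image-to-report retrieval whose results are fed to the LLM as guiding knowledge.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, NUM_REPORTS, TOP_K = 512, 1000, 3  # illustrative sizes (assumptions)

def layer_norm(x, eps=1e-5):
    """Normalize the global image feature before retrieval."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Hypothetical retrieval index: embeddings of reference reports assumed
# to live in the same space as the normalized global image feature.
report_embeds = rng.standard_normal((NUM_REPORTS, DIM))
report_texts = [f"report_{i}" for i in range(NUM_REPORTS)]

def retrieve_similar_reports(global_feat, k=TOP_K):
    q = layer_norm(global_feat)
    # Cosine similarity between the image query and every report.
    sims = (report_embeds @ q) / (
        np.linalg.norm(report_embeds, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return [report_texts[i] for i in top]

global_feat = rng.standard_normal(DIM)
guides = retrieve_similar_reports(global_feat)
# The retrieved reports are injected into the prompt as guiding
# knowledge alongside the visual embeddings and the instruction.
prompt = "Reference reports:\n" + "\n".join(guides)
```

In practice the report embeddings would be precomputed offline from the training corpus, so retrieval at inference time reduces to one normalized dot product per candidate report.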
Keywords