Celadon cross-modal knowledge graph construction guided by a vision-language model
Xiao Gang1, Fang Jingwen1, Zhang Hao1, Liu Ying2, Zhou Xiaofeng3, Xu Jun1 (1. Zhejiang University of Technology; 2. Zhejiang Longquan Celadon Museum; 3. China Celadon College, Lishui University) Abstract
Objective Celadon is a dazzling pearl among the cultural treasures of the Chinese nation and a cultural messenger in exchanges between China and other countries. In the context of cultural digitization, constructing a celadon cross-modal knowledge graph is one of the key technologies for promoting the protection and inheritance of celadon culture. In this process, matching the same entity across different modalities is crucial, which involves aligning the different modal features of equivalent entities. Therefore, to maximize the matching degree between celadon images and texts, this paper proposes a cross-modal entity alignment method that maps multiple image features based on a vision-language pre-training (VLP) model. Method First, local features describing contour, texture, and color are extracted from celadon images. Then, a gated multi-fusion unit is introduced to dynamically fuse the multiple image features. Further, a multi-layer fully connected network learns to map the fused features into an appropriate intermediate representation space, guiding the text encoder to generate text features that better match the image features. Finally, the model is trained and optimized with the InfoNCE (information noise contrastive estimation) loss function. Result On the ChinaWare dataset, the proposed method is compared experimentally with the latest benchmark methods CN-CLIP (contrastive vision-language pretraining in Chinese), CoOp (context optimization), CoCoOp (conditional context optimization), and Pic2Word (mapping pictures to words). On the cross-modal alignment task, the proposed method improves the MR (mean recall) metric over these methods by 3.2% and 5.6%, respectively, in the best cases. Conclusion The proposed cross-modal entity alignment method can fully exploit effective intermediate representations of image features to reconstruct text features without changing the parameters of the VLP model, improving the cross-modal recognition accuracy of celadon details. Finally, a celadon cross-modal knowledge graph containing 8,949 nodes and 18,211 relationships was successfully constructed with the proposed method.
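As a concrete illustration of the gated multi-fusion step described above, the following is a minimal PyTorch sketch, assuming the original image feature and the contour, texture, and color features have already been projected to a shared dimension; the module name, gating design, and layer sizes are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical sketch of a gated multi-fusion unit: one scalar gate per
# feature vector, computed from the concatenated inputs, then a weighted sum.
import torch
import torch.nn as nn


class GatedMultiFusion(nn.Module):
    """Fuse several image feature vectors with learned, input-dependent gates."""

    def __init__(self, dim: int, num_features: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_features, num_features),
            nn.Softmax(dim=-1),
        )

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) tensors, e.g. [original, contour, texture, color]
        stacked = torch.stack(feats, dim=1)               # (batch, num_features, dim)
        weights = self.gate(torch.cat(feats, dim=-1))     # (batch, num_features)
        fused = (weights.unsqueeze(-1) * stacked).sum(1)  # (batch, dim)
        return fused


# Example: fuse an original VLP image feature with three local features.
fuser = GatedMultiFusion(dim=512, num_features=4)
feats = [torch.randn(8, 512) for _ in range(4)]
fused = fuser(feats)  # (8, 512)
```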
Keywords
Celadon cross-modal knowledge graph construction by vision-language model
Xiao Gang1, Fang Jingwen1, Zhang Hao1, Liu Ying2, Zhou Xiaofeng3, Xu Jun1 (1. Zhejiang University of Technology; 2. Zhejiang Longquan Celadon Museum; 3. China Celadon College, Lishui University) Abstract
Objective Celadon is not only a dazzling pearl among the cultural treasures of the Chinese nation but also a cultural messenger in exchanges between China and other countries. It carries rich historical and cultural connotations and demonstrates outstanding artistic value. Its elegant shapes and moist glazes make it an outstanding representative of traditional Chinese craft aesthetics. The production of celadon embodies the wisdom and creativity of ancient craftsmen and is an important carrier for the inheritance of excellent traditional Chinese culture. In the context of cultural digitization, constructing a cross-modal knowledge graph of celadon is one of the key technologies for promoting the protection and inheritance of celadon culture. In this process, matching the same entities across different modalities is crucial, which involves aligning the different modal features of equivalent entities. However, the inherent structural differences between cross-modal data pose challenges to the alignment task. Traditional methods that rely on manually annotated data can ensure the accuracy of alignment to some extent, but they suffer from low efficiency and high cost. In addition, coarse-grained annotated data can hardly meet the requirements for fine-grained concept and entity recognition when constructing a cross-modal knowledge graph. At present, vision-language pre-training (VLP) models can effectively capture cross-modal semantic associations by learning rich cross-modal representations from large-scale unlabeled image-text pair data. The strong cross-modal understanding ability of VLP models can provide precise semantic associations and fine-grained entity recognition for aligning entities of different modalities during graph construction. Therefore, to maximize the matching degree between celadon images and text, we propose a cross-modal entity alignment method based on a VLP model that maps multiple image features. Method The proposed cross-modal entity alignment method initializes both the image and text encoders with a publicly available VLP model, and the encoder parameters remain frozen during training. The method consists of four main parts. First, based on the visual characteristics of celadon images, local features in terms of contour, texture, and color are extracted. Then, a gated multi-fusion unit is introduced to adaptively assign weights to the original image feature and the extracted local image features, generating reliable fused features. Further, a multi-layer fully connected mapper is designed to learn, through multiple layers of nonlinear transformations, a mapping of the fused features into an appropriate intermediate representation space, guiding the text encoder to generate text features that match the image features more closely. Finally, the model is trained and optimized using the information noise contrastive estimation (InfoNCE) loss function; that is, the cosine similarity between cross-modal features is computed to maximize the similarity of positive sample pairs and the dissimilarity of negative sample pairs, thereby better establishing the connection between image features and text features.
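To make the mapping and training steps concrete, below is a minimal PyTorch sketch of a multi-layer fully connected mapper and a symmetric InfoNCE loss over cosine similarities, intended for use with frozen encoders; the names Mapper and info_nce, the hidden sizes, and the temperature value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fused image features -> intermediate representation (Mapper),
# trained with a symmetric InfoNCE contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Mapper(nn.Module):
    """Multi-layer fully connected network mapping fused image features into an
    intermediate representation that guides the (frozen) text encoder."""

    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.net(fused)


def info_nce(img_feat: torch.Tensor, txt_feat: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over cosine similarities of matched image-text pairs."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```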
Result We experimentally compared our method with four of the latest benchmark methods, namely contrastive vision-language pretraining in Chinese (CN-CLIP), context optimization (CoOp), conditional context optimization (CoCoOp), and mapping pictures to words (Pic2Word). The quantitative evaluation metrics are recall rates, including R@1, R@5, R@10, and mean recall (MR). The experiments were conducted on the ChinaWare dataset, and all methods were trained on this dataset. A data table compares the performance of each method on the recall metrics. In terms of the MR metric, our method outperformed zero-shot CN-CLIP ViT-B/16 by 3.2% in the text-to-image alignment task and by 7.5% in the image-to-text alignment task. CoOp focuses only on text features, and our method outperformed it by 11.4% and 12.1%, respectively. CoCoOp additionally considers image features on the basis of CoOp, and our method outperformed it by 8.4% and 9.5%, respectively. Pic2Word likewise focuses on the original image features and does not fully utilize other local image features to improve model performance, and our method outperformed it by 5.8% and 5.6%, respectively. Conclusion The cross-modal entity alignment method proposed in this paper can fully exploit effective intermediate representations of image features to reconstruct text features without changing the parameters of the VLP model, thereby improving the cross-modal recognition accuracy of celadon details. Experimental results show that the proposed method outperforms several state-of-the-art methods and improves alignment performance. Ultimately, we successfully constructed a celadon cross-modal knowledge graph with 8,949 nodes and 18,211 relationships by applying ontology modeling, data mining, and the cross-modal entity alignment method proposed in this paper.
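For reference, the sketch below shows how the reported retrieval metrics (R@1, R@5, R@10, and their mean, MR) can be computed from L2-normalized image and text features of matched pairs; this is a generic implementation of standard recall@K, assumed here for illustration rather than taken from the authors' evaluation code.

```python
# Generic recall@K / mean recall computation for image-to-text retrieval,
# assuming row i of img_feat matches row i of txt_feat.
import torch


def recall_metrics(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                   ks=(1, 5, 10)) -> dict:
    sims = img_feat @ txt_feat.t()                      # (n, n) similarity matrix
    ranks = sims.argsort(dim=-1, descending=True)       # text indices sorted per image
    gt = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    # Rank position of the correct text for each image (0 = ranked first).
    pos = (ranks == gt).nonzero()[:, 1].float()
    metrics = {f"R@{k}": (pos < k).float().mean().item() * 100 for k in ks}
    metrics["MR"] = sum(metrics.values()) / len(ks)     # mean of the recall values
    return metrics
```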
Keywords