Scene graph generation algorithm for extracting global semantic information
Duan Jingwen1, Min Weidong2,3, Yang Ziyuan1, Zhang Yu1, Chen Xinhao1, Yang Shengbao1 (1. School of Information Engineering, Nanchang University, Nanchang 330031, China; 2. School of Software, Nanchang University, Nanchang 330047, China; 3. Jiangxi Key Laboratory of Smart City, Nanchang 330047, China) Abstract
Objective Scene graphs describe images concisely and in a structured way. Existing scene graph generation methods focus on the visual features of the image and overlook the rich semantic information in the dataset. Moreover, affected by the long-tailed distribution of the dataset, most methods cannot reason well about low-frequency triplets and instead tend to predict high-frequency ones. In addition, most existing methods use the same network structure to infer both object and relationship categories, which lacks specificity. To address these problems, this paper proposes a scene graph generation algorithm that extracts global semantic information. Method The network consists of four modules: semantic encoding, feature encoding, object inference, and relationship reasoning. The semantic encoding module extracts semantic information from the image region descriptions and computes global statistical knowledge, fusing them into robust global semantic information that assists the inference of uncommon triplets (an illustrative sketch of this fusion is given after the abstract). The feature encoding module extracts the visual features of the image. The object inference and relationship reasoning modules adopt different feature fusion methods and learn features with a gated graph neural network and gated recurrent units, respectively; on this basis, object and relationship categories are inferred with the aid of the global statistical knowledge. Finally, a parser constructs the scene graph, yielding a structured description of the image. Result On the public Visual Genome dataset, the method is compared with 10 other methods on three tasks: relationship classification, scene graph element classification, and scene graph generation. Under the settings that restrict and do not restrict each object pair to a single relationship, the mean recall reaches 44.2% and 55.3%, respectively. In the visualization experiments, compared with the second-best method, our method strengthens the inference of uncommon relationship categories while also improving the inference of object categories and common relationships. Conclusion The proposed algorithm improves the inference of uncommon triplets while retaining good inference of common triplets, and generates scene graphs effectively.
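The paper's code is not reproduced here; as a rough, hypothetical illustration of the semantic encoding module described above, the sketch below fuses per-class word embeddings (a stand-in for the pretrained Word2Vec vectors) with global statistical knowledge, modeled as a class co-occurrence matrix, through a single graph-convolution layer. The class names, dimensions, and the use of co-occurrence counts as the graph adjacency are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEncoder(nn.Module):
    """Hypothetical sketch: fuse class word embeddings (semantic information)
    with global statistical knowledge (a co-occurrence matrix) via one GCN layer."""
    def __init__(self, num_classes=200, embed_dim=300):  # 150 objects + 50 predicates
        super().__init__()
        # Stand-in for pretrained Word2Vec vectors of the class names.
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.gcn_weight = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, cooccurrence):
        # cooccurrence: (num_classes, num_classes) counts from the training set.
        adj = cooccurrence + torch.eye(cooccurrence.size(0))  # add self-loops
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        adj_norm = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
        h = self.embed.weight                          # (C, D) semantic information
        return F.relu(adj_norm @ self.gcn_weight(h))   # (C, D) global semantic information
```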
Keywords
scene graph; global semantic information; target inference; relationship reasoning; image interpretation
Global semantic information extraction based scene graph generation algorithm
Duan Jingwen1, Min Weidong2,3, Yang Ziyuan1, Zhang Yu1, Chen Xinhao1, Yang Shengbao1 (1. School of Information Engineering, Nanchang University, Nanchang 330031, China; 2. School of Software, Nanchang University, Nanchang 330047, China; 3. Jiangxi Key Laboratory of Smart City, Nanchang 330047, China) Abstract
Objective A scene graph represents an image as a graph structure in which objects are nodes and inter-object relationships are edges. However, existing methods focus on visual features and make little use of semantic information, even though semantic information can provide robust features and improve inference. In addition, the dataset suffers from a long-tailed distribution: the 30 common relationships account for 69% of the samples, while the triplets of the 20 uncommon relationships account for only 31%. Most methods therefore cannot maintain qualified results on rare triplets and tend to predict the frequent ones. To improve the reasoning of rare triplets, we propose a scene graph generation algorithm that extracts global semantic information to obtain robust features.
Method The network consists of four modules: semantic encoding, feature encoding, target inference, and relationship reasoning. The semantic encoding module first maps the words in the region descriptions into a low-dimensional space via word embedding. Because the Word2Vec model is trained on a large corpus, it represents word semantics well. We use the Word2Vec network to traverse the region descriptions of the dataset and extract the word embedding vectors of the 150 object classes and 50 relationship classes as semantic information. In addition, this module explicitly computes global statistical knowledge, which captures global characteristics of the dataset, and uses a graph convolutional network to integrate it with the semantic information. The resulting global semantic information strengthens the reasoning of rare triplets. The feature encoding module extracts visual features with a faster region-based convolutional neural network (Faster R-CNN); we remove its classification network and keep its feature extraction network, region proposal network, and region-of-interest pooling layer. In the target reasoning and relationship reasoning modules, visual features and global semantic information are fused into global semantic features by different fusion methods, which sharpens the distinction between objects and relationships and improves performance on rare triplets. In the target reasoning module, we represent the image as a graph and use a gated graph neural network to aggregate context information; after three iteration steps the target features are fully refined, and a classifier predicts the target classes from these final global semantic features. The predicted object classes in turn benefit relationship reasoning. In the relationship reasoning module, both the object classes and the global semantic features of relationships are used: gated recurrent units refine the features and infer the relationships, with each relationship feature aggregating information from its corresponding object pair. Finally, a parser constructs the scene graph as a structured description of the image. Illustrative sketches of the feature encoding, target inference, and relationship reasoning steps follow.
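The feature encoding module keeps only Faster R-CNN's feature pipeline. A minimal sketch using torchvision's detector, assuming the standard fasterrcnn_resnet50_fpn model (the authors' backbone and training details may differ):

```python
import torch
import torchvision

# Load a detector and use only its feature pipeline (backbone + RPN + RoI pooling),
# discarding the classification head, as the feature encoding module describes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def extract_roi_features(images):
    """images: list of 3xHxW float tensors in [0, 1]."""
    image_list, _ = model.transform(images)          # resize + normalize
    features = model.backbone(image_list.tensors)    # FPN feature maps
    proposals, _ = model.rpn(image_list, features)   # region proposals
    # Pooled per-region visual features; the classifier that normally follows is unused.
    roi_feats = model.roi_heads.box_roi_pool(
        features, proposals, image_list.image_sizes)
    return proposals, roi_feats                      # boxes, (N, 256, 7, 7) features
```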
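For the target inference step, a hypothetical GGNN-style sketch: node states are refined by a GRU cell over three message-passing iterations, matching the three iteration steps described above. The fully connected graph and mean aggregation are our assumptions:

```python
import torch
import torch.nn as nn

class GGNNObjectReasoner(nn.Module):
    """Hypothetical sketch: object nodes exchange messages and are refined
    by a GRU cell for T=3 steps, then classified into 150 object classes."""
    def __init__(self, dim=512, num_obj_classes=150, steps=3):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # message transform
        self.gru = nn.GRUCell(dim, dim)  # gated node-state update
        self.cls = nn.Linear(dim, num_obj_classes)
        self.steps = steps

    def forward(self, node_feats):
        # node_feats: (N, dim) global semantic features of the N detected objects.
        h = node_feats
        for _ in range(self.steps):
            # Each node aggregates messages from all other nodes (mean pooling).
            messages = self.msg(h)
            agg = (messages.sum(0, keepdim=True) - messages) / max(h.size(0) - 1, 1)
            h = self.gru(agg, h)  # gated update of node states
        return self.cls(h)        # per-object class logits
```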
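For the relationship reasoning step, a sketch in the same spirit: each candidate pair fuses subject and object features with the predicate's global semantic feature, and a GRU cell refines the relation state before classification. The concatenation-based fusion is an assumption, not the paper's exact design:

```python
import torch
import torch.nn as nn

class RelationReasoner(nn.Module):
    """Hypothetical sketch: refine each object pair's relation feature with a
    GRU cell and classify it into one of 50 predicate classes."""
    def __init__(self, dim=512, num_predicates=50):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)  # subject + object + semantic prior
        self.gru = nn.GRUCell(dim, dim)
        self.cls = nn.Linear(dim, num_predicates)

    def forward(self, subj, obj, rel_feat, sem_prior):
        # subj, obj: (P, dim) features of the object pair (after object inference);
        # rel_feat: (P, dim) union-region features; sem_prior: (P, dim) semantics.
        x = self.fuse(torch.cat([subj, obj, sem_prior], dim=1))
        h = self.gru(x, rel_feat)  # aggregate pair info into the relation state
        return self.cls(h)         # per-pair predicate logits
```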
Result We carried out experiments on the public Visual Genome dataset and compared our method with 10 existing methods on the predicate classification, scene graph classification, and scene graph generation tasks; ablation experiments were also performed. The mean recall reached 44.2% and 55.3% under the constrained and unconstrained settings, respectively. Compared with the neural motifs method, R@50 on the scene graph classification task improved by 1.3%. For the visualization, we show results of the scene graph generation task: object locations and classes are marked in the original image, and objects and relationships are drawn as nodes and edges. Compared with the second-best method in the quantitative analysis, our network significantly enhances the reasoning of rare relationships while also improving the reasoning of objects and common relationships. Conclusion The proposed algorithm improves the reasoning of rare triplets, performs well on common triplets, and generates scene graphs effectively.
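The mean recall reported above averages per-predicate recall, so rare predicates weigh as much as frequent ones. A simplified sketch of the metric (exact-tuple matching; the full evaluation protocol also handles box overlap and per-image averaging):

```python
from collections import defaultdict

def mean_recall_at_k(gt_triplets, pred_triplets, k=50):
    """gt_triplets / pred_triplets: per-image lists of (subj, predicate, obj)
    tuples; predictions are assumed sorted by confidence. Matching here is
    a simplification of the full evaluation protocol."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gts, preds in zip(gt_triplets, pred_triplets):
        top_k = set(preds[:k])
        for triplet in gts:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0
```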
Keywords
scene graph; global semantic information; target inference; relationship reasoning; image interpretation