Image captioning combining hierarchical decoders and a dynamic fusion mechanism

Jiang Wenhui, Zhan Kun, Cheng Yibo, Xia Xue, Fang Yuming (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032, China)

Abstract
Objective Attention mechanisms are widely used in image captioning models: they automatically attend to different regions of the image while dynamically generating the textual description. However, they commonly suffer from a defocusing problem. When generating a word, the model may attend to unimportant parts of an object, attend to the object's context instead of the object itself, or overlook important targets in the image, so the generated description is inaccurate. To address this problem, we propose an image captioning model that combines hierarchical decoders with a dynamic fusion mechanism to improve captioning accuracy. Method We extend the Transformer architecture into a model with three modules: image feature encoding, hierarchical text decoding, and adaptive fusion. The hierarchical decoding structure progressively refines the predicted text, which provides reliable feedback for focusing the attention mechanism, so the attention is continually corrected and the generated descriptions become more accurate. In addition, a text fusion module adaptively merges the coarse-to-fine descriptions so that the outputs of lower-level decoders participate directly in word prediction; this alleviates vanishing gradients during training while keeping the output descriptions rich in detail and syntactically diverse. Result The model is validated with multiple evaluation metrics on the MS COCO (Microsoft common objects in context) and Flickr30K datasets and compared with 12 representative methods. Our model outperforms all of them: on MS COCO it improves BLEU-1 (bilingual evaluation understudy) by 0.5 and CIDEr (consensus-based image description evaluation) by 1.0 over the best competing model, and on Flickr30K it improves BLEU-1 by 0.1 and CIDEr by 0.6. Ablation experiments verify the effectiveness of the cascaded structure and the adaptive fusion module, and qualitative analysis shows that our method generates more accurate descriptions. Conclusion The proposed method achieves the best performance on multiple metrics across both datasets, effectively improves the accuracy of the generated text sequence, and ultimately produces accurate descriptions of image content.
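The hierarchical decoding idea described above can be summarized in code. The following is a minimal PyTorch sketch, assuming a cascade of standard Transformer decoders; all names, layer counts, and dimensions here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    """A cascade of caption decoders: each later decoder re-attends to the
    image conditioned on the refined textual states of the level below."""

    def __init__(self, vocab_size=10000, d_model=512, num_levels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoders = nn.ModuleList([
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=3)
            for _ in range(num_levels)])
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, words, img_feats):
        # words: (B, T) token ids; img_feats: (B, R, d_model) encoded regions
        mask = nn.Transformer.generate_square_subsequent_mask(
            words.size(1)).to(words.device)
        h, levels = self.embed(words), []
        for dec in self.decoders:
            # Refine the draft of the level below while re-attending to the
            # image, correcting attention in a coarse-to-fine manner.
            h = dec(h, img_feats, tgt_mask=mask)
            levels.append(h)
        # One word distribution per level, coarse to fine; the final
        # caption adaptively fuses these levels (see the sketch below).
        return [self.word_head(h) for h in levels]

# Hypothetical usage with random tensors standing in for real inputs.
model = HierarchicalCaptioner()
tokens = torch.randint(0, 10000, (4, 20))   # (B, T) caption prefixes
regions = torch.randn(4, 36, 512)           # (B, R, d) image region features
coarse_logits, fine_logits = model(tokens, regions)
```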
Keywords
Hierarchical decoders and a dynamic fusion mechanism for image captioning

Jiang Wenhui, Zhan Kun, Cheng Yibo, Xia Xue, Fang Yuming (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032, China)

Abstract
Objective Image captioning aims to automatically generate linguistic descriptions of images. It has a wide range of application scenarios, such as image indexing, medical report generation, and human-machine interaction. To generate fluent sentences that gather all the relevant information, an image captioning algorithm must recognize the scenes and entities in an image as well as their relationships. Over the past decade, the deep encoder-decoder framework has been developed for this task: a convolutional neural network (CNN) based encoder extracts image feature vectors, and a recurrent neural network (RNN) based decoder generates the description. Recent progress in image captioning is driven by attention mechanisms, which improve performance by attending to informative image regions. Most attention models predict the next attended region from the previously generated words. Because this textual guidance is weak, most existing models suffer from "attention defocus": they fail to concentrate on the correct image regions when generating the target words. As a result, contemporary models are prone to "hallucinating" objects or missing informative visual clues, which also makes the attention maps less interpretable. To address this, we propose a hierarchical architecture with a dynamic fusion strategy. Method Although it is hard to localize the correct regions from the previously generated words in a single pass, an estimated word provides useful knowledge for grounding regions more accurately. To refine the attention mechanism and improve the predicted words, we design a hierarchical architecture built from a series of captioning decoders, a hierarchical variant of the conventional encoder-decoder framework. The first decoder is a standard captioning model that generates a coarse description as a draft. The later decoder then takes the outputs of the earlier decoder as input; since the earlier decoder provides information that is more predictive of the target word, attention accuracy improves in the later decoders. In this way, the regions attended by the early decoder are validated and corrected by the later decoders in a coarse-to-fine manner, so the final predicted words are properly grounded. Furthermore, we introduce a dynamic fusion strategy that aggregates the coarse-to-fine predictions of the different decoders. A gating mechanism weighs the contribution of each decoder to the final word prediction. Unlike previous gating mechanisms that weigh each pathway independently, our scheme normalizes the contributions of all decoders with a softmax, incorporating contextual information from every decoder to estimate the overall weight distribution. The dynamic fusion strategy yields rich, fine-grained image descriptions and alleviates vanishing gradients, which makes the hierarchical architecture easier to train. Result Our method is evaluated on Microsoft common objects in context (MS COCO) and Flickr30K, the common benchmarks for image captioning. MS COCO contains about 120 K images and Flickr30K about 31 K; each image in both datasets is paired with five reference descriptions. The model is trained and tested with the Karpathy splits. The quantitative evaluation metrics are bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), and consensus-based image description evaluation (CIDEr). We compare our model with 12 recent methods. On MS COCO, it surpasses the best of them by 0.5 in BLEU-1 and 1.0 in CIDEr, and it achieves a CIDEr of 69.94 on Flickr30K. Compared with the Transformer baseline, it improves CIDEr by 4.6 on MS COCO and by 3.8 on Flickr30K, verifying that our method effectively improves the accuracy of the predicted sentences. Qualitative results further show that the proposed method produces richer fine-grained descriptions than other methods: it precisely describes the number of objects when several belong to the same category, and it describes small objects accurately. To further verify the effectiveness of the hierarchical architecture, we visualize the attention maps; our method attends to the discriminative parts of the target objects, whereas the baseline may focus on irrelevant background, which leads directly to false predictions. Conclusion We present a hierarchical architecture with a dynamic fusion strategy for image captioning. The hierarchical architecture consists of a sequence of captioning decoders that refine the attention mechanism, and the dynamic fusion strategy aggregates the decoders to generate final sentences with rich fine-grained information. Ablation studies demonstrate the effectiveness of each module in the proposed network, and comparative experiments on the MS COCO and Flickr30K datasets demonstrate the improvements.
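As a companion to the abstract's description of the dynamic fusion strategy, the sketch below illustrates softmax gating over the decoders' hidden states in PyTorch. The DynamicFusion module and its single-linear gating network are assumptions for illustration; the paper's actual fusion design may differ in detail:

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Adaptively fuses the coarse-to-fine hidden states of several
    decoders; a softmax across decoder levels weighs each contribution."""

    def __init__(self, d_model=512):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)  # one score per decoder level

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, d) tensors, one per decoder level
        h = torch.stack(hidden_states, dim=2)   # (B, T, L, d)
        scores = self.gate(h)                   # (B, T, L, 1)
        weights = torch.softmax(scores, dim=2)  # normalize across the L levels
        return (weights * h).sum(dim=2)         # (B, T, d) fused representation

# Hypothetical usage: fuse two decoder levels, then predict words.
fusion = DynamicFusion(d_model=512)
coarse, fine = torch.randn(4, 20, 512), torch.randn(4, 20, 512)
fused = fusion([coarse, fine])
logits = nn.Linear(512, 10000)(fused)  # (B, T, vocab) word scores
```

Normalizing the gate scores jointly across decoder levels, rather than gating each pathway independently, is what lets every level's context inform the overall weight distribution, as described above.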
Keywords
