A critical review of text-centric image understanding techniques

Zhang Yan1,2, Li Qiang1,2, Shen Huawen1,2, Zeng Gangyan3, Zhou Yu1,2, Ma Can1,2, Zhang Yuan3, Wang Weiping1,2 (1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China; 3. State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China)

Abstract
Text is widely present in document images and natural scene images, and carries rich and critical semantic information. With the development of deep learning, researchers are no longer satisfied with merely extracting the textual content of an image; they increasingly focus on understanding the text in it, so text-centric image understanding techniques have received growing attention. These techniques aim to fully understand text-centric images by exploiting multi-modal information such as text and visual objects. As an interdisciplinary research direction between computer vision and natural language processing, they are of great practical significance. This paper surveys representative text-centric image understanding tasks and, according to the level of understanding and cognition required, divides them into two categories: the first only requires models to extract information, while the second additionally requires certain analysis and reasoning abilities. We review the datasets, evaluation metrics, and classical methods involved in these tasks, compare and analyze them, and discuss open problems and future trends, hoping to provide a reference for subsequent research.
Keywords
Text-centric image analysis techniques: a critical review

Zhang Yan1,2, Li Qiang1,2, Shen Huawen1,2, Zeng Gangyan3, Zhou Yu1,2, Ma Can1,2, Zhang Yuan3, Wang Weiping1,2(1.Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;2.School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China;3.State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China)

Abstract
Text is one of the key carriers of information and is widely present in digital media, from document images to natural scene images. To automatically extract and analyze the information embedded in such images, conventional research has mainly focused on text extraction techniques such as scene text detection and recognition. However, recognizing and analyzing the semantic information of text-centric images, as a downstream task of text spotting, remains challenging due to the difficulty of fully leveraging multi-modal features from both vision and language. To this end, text-centric image understanding has become an emerging research topic, and many related tasks have been proposed. For example, visual information extraction can extract specified content from a given image, which improves productivity in finance, social media, and other fields. In this paper, we introduce five representative text-centric image understanding tasks and conduct a systematic survey of them. According to the level of understanding required, these tasks can be broadly classified into two categories. The first category requires only the basic ability to extract and distinguish information, and includes visual information extraction and scene text retrieval. In contrast, besides this fundamental ability, the second category is more concerned with high-level semantic understanding capabilities such as information aggregation and logical reasoning. With the progress in deep learning and multimodal learning, the second category has attracted considerable attention recently; for it, this survey mainly introduces document visual question answering, scene text visual question answering, and scene text image captioning.

Over the past few decades, the development of text-centric image understanding techniques has gone through several stages. Earlier approaches were based on heuristic rules and often used only unimodal features. Currently, deep learning methods have gained wide popularity and dominate the area, while multimodal features are valued and exploited to improve performance. More specifically, traditional visual information extraction depends on pre-defined templates or specific rules. Traditional text retrieval tends to represent words with pyramidal histograms of characters and to predict the matched image according to the distance between representations. Expanding the conventional visual question answering framework, earlier document visual question answering and scene text visual question answering approaches simply add an optical character recognition branch to extract text information. As integrating knowledge from multimodal signals helps to better understand images, graph neural networks and Transformer-based frameworks have recently been used to fuse multi-modal features. Furthermore, self-supervised pre-training schemes are applied to learn the alignment between modalities, boosting model capabilities by a large margin. For each text-centric image understanding task, we summarize the classical methods and elaborate on their pros and cons.

In addition, we discuss potential problems and further research directions for the community. First, due to the complexity of the features of different modalities, such as variable layouts and diverse fonts, current deep learning architectures still fail to carry out the interaction of multi-modal information efficiently. Second, existing text-centric image understanding methods remain limited in reasoning abilities involving counting, sorting, and arithmetic operations; for instance, in document visual question answering and scene text visual question answering, current models have difficulty predicting accurate answers when doing so requires jointly reasoning over image layout, textual content, and visual appearance. Finally, current text-centric understanding tasks are often trained independently, and the correlation between different tasks has not been effectively leveraged. We hope this survey helps researchers capture the latest progress in text-centric image understanding and inspires the design of advanced models and algorithms.
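To make the traditional retrieval pipeline mentioned above concrete, the following is a minimal sketch of a PHOC-style (pyramidal histogram of characters) word representation: a binary vector marking which characters occur in which region of the word at each pyramid level, with retrieval done by nearest representation. The alphabet, pyramid levels, and Hamming distance here are illustrative assumptions, not the exact configuration of any published system.

```python
# Sketch of a PHOC-style word embedding for text retrieval.
# Assumptions: a lowercase alphanumeric alphabet and pyramid
# levels (2, 3); real systems tune both.

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
LEVELS = (2, 3)  # split the word into 2 regions, then 3 regions

def phoc(word: str) -> list[int]:
    """Binary vector with one slot per (pyramid region, character)."""
    word = word.lower()
    n = len(word)
    vec = []
    for level in LEVELS:
        for region in range(level):
            # index range of the word covered by this region
            lo, hi = region * n / level, (region + 1) * n / level
            # a character belongs to the region its center falls into
            chars = {word[i] for i in range(n) if lo <= i + 0.5 < hi}
            vec.extend(1 if c in chars else 0 for c in ALPHABET)
    return vec

def distance(a: list[int], b: list[int]) -> int:
    """Hamming distance; the image whose recognized word is closest
    to the query representation is returned as the match."""
    return sum(x != y for x, y in zip(a, b))
```

With these settings every word maps to a fixed-length vector of (2 + 3) x 36 = 180 bits, so words of different lengths can be compared directly, which is what made this representation attractive before learned embeddings.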
Keywords
