Research progress in text analysis and recognition for ethnic minority scripts

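Wang Weilan1, Hu Jinshui2, Wei Hongxi3, Ubul Kurban4, Shao Wenyuan5, Bi Xiaojun6, He Jianjun7, Li Zhenjiang8, Ding Kai9, Jin Lianwen10, Gao Liangcai11(1.School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030;2.iFLYTEK Research, Hefei 230001;3.College of Computer Science, Inner Mongolia University, Hohhot 010021;4.School of Computer Science and Technology, Xinjiang University, Urumqi 830046;5.School of Sociology, Shanghai University, Shanghai 200000;6.School of Information Engineering, Minzu University of China, Beijing 100081;7.College of Information and Communication Engineering, Dalian Minzu University, Dalian 116605;8.School of Cyberspace Security, Gansu University of Political Science and Law, Lanzhou 730000;9.INTSIG Information Co., Ltd., Shanghai 200000;10.School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641;11.Wangxuan Computer Institute, Peking University, Beijing 100871)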

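Abstract
The state attaches great importance to the protection and inheritance of ethnic minority historical documents and has emphasized the importance of thoroughly digitizing these non-renewable cultural resources. With continuous advances in document image analysis and recognition technology, research on text analysis and recognition for ethnic minority scripts has attracted wide attention and achieved remarkable results, becoming a hot topic in applied artificial intelligence research. However, because of the large number of minority scripts, the diversity of application scenarios, and the scarcity of datasets, this field still faces many challenges. This paper aims to summarize previous work and support future research. It focuses on tasks such as printed text recognition, online handwriting recognition, historical document recognition, and scene text recognition, and it outlines the development and latest results in minority script recognition at home and abroad. First, it explains the importance and value of text analysis and recognition for ethnic minority scripts and introduces particular minority scripts and the characteristics of their historical documents. Then, it reviews the history and current state of the field, analyzes and summarizes the representative results of traditional methods and their applications, and discusses in detail the comprehensive shift of research focus to deep neural network models and deep learning methods, a shift that has markedly improved recognition performance for each script. Finally, based on these analyses, the paper points out shortcomings in accuracy and generalization ability in document analysis and recognition for different scripts, as well as differences from Chinese text analysis and recognition. Facing the main difficulties and challenges in minority script text recognition, it looks ahead to future research trends and technical development goals.
Keywords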
Survey on text analysis and recognition for multiethnic scripts

Wang Weilan1, Hu Jinshui2, Wei Hongxi3, Ubul Kurban4, Shao Wenyuan5, Bi Xiaojun6, He Jianjun7, Li Zhenjiang8, Ding Kai9, Jin Lianwen10, Gao Liangcai11(1.School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030, China;2.iFLYTEK Research Co., Ltd., Hefei 230001, China;3.College of Computer Science-College of Software, Inner Mongolia University, Hohhot 010021, China;4.School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China;5.School of Sociology and Political Science, Shanghai University, Shanghai 200000, China;6.School of Information Engineering, Minzu University of China, Beijing 100081, China;7.College of Information and Communication Engineering, Dalian Minzu University, Dalian 116605, China;8.School of Cyberspace Security, Gansu University of Political Science and Law, Lanzhou 730000, China;9.INTSIG Information Co., Ltd., Shanghai 200000, China;10.School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China;11.Wangxuan Computer Institute, Peking University, Beijing 100871, China)

Abstract
China’s ethnic scripts differ in structure, period of creation, and region and scope of use. The historical documents and literary materials written, recorded, and printed in these scripts are voluminous and constitute an invaluable resource for exploring the civilization and development history of different ethnic groups. Compared with mainstream languages, the study of ethnic minority scripts often faces low-resource conditions. In recent years, the protection and inheritance of the intangible cultural heritage of ethnic minorities have attracted increasing national attention, which is of great importance and application value for protecting these diverse, irreplaceable cultural resources. By applying traditional image processing, pattern recognition, and machine learning methods, researchers have achieved some results in text and document recognition for Mongolian, Tibetan, Uyghur, Kazakh, Korean, and other major languages. Compared with mainstream languages such as English and Chinese, however, research on character recognition, document image analysis, and application system development for minority languages lags behind. Since the beginning of the 21st century, with the continuous development and application of document image analysis and recognition technologies, research on and application of ethnic script text analysis and recognition have received extensive attention and made remarkable progress, becoming a research hotspot in document analysis and recognition and in artificial intelligence. However, many problems remain unsolved in minority script text analysis and recognition because of the large number of minority scripts, the wide range of application scenarios, and the scarcity of datasets.
This study reviews the development history and recent progress in this field at home and abroad to summarize previous work and support subsequent research. It focuses on four subtasks for several minority scripts: printed text recognition, handwriting recognition, historical document recognition, and scene text recognition. The scripts covered are mainly Tibetan, Mongolian, Uyghur, Yi, Manchu, and Dongba. The studies reviewed fall mainly into the following areas. 1) Document image preprocessing: the system performs a series of operations on the input image, such as binarization, noise removal, skew correction, and image enhancement, with the goal of improving the accuracy of subsequent analysis and recognition. 2) Layout analysis: layout segmentation, text line segmentation, and character segmentation help reveal the organizational structure of documents and extract useful information. 3) Text recognition: one of the core tasks of document image analysis, it identifies the text in a document through various technical approaches, ranging from traditional methods based on single-character classifiers to end-to-end text line recognition with deep learning. 4) Dataset construction: various datasets are built for training and evaluating algorithms, such as document image binarization datasets, layout analysis datasets, text line datasets, and character datasets. Historical documents are particularly difficult to analyze and recognize because the paper is rough, degraded, and damaged, which leads to severe background noise, touching strokes, unclear handwriting, and damaged regions in the document images. At present, a practical recognition system for historical documents is lacking.
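As an illustration of areas 1) and 2) above, the following is a minimal sketch of global binarization followed by projection-profile text line segmentation. It is not a method taken from any of the surveyed works: Otsu thresholding and a row projection are swapped in as simple, well-known stand-ins, and the function names (`otsu_threshold`, `binarize`, `segment_lines`) and the `min_ink` parameter are hypothetical. It assumes an 8-bit grayscale page with dark ink on a light background and horizontal text lines.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold of an 8-bit (uint8) grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # probability of the dark class at each threshold
    mu = np.cumsum(prob * np.arange(256))  # cumulative mean up to each threshold
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)                # between-class variance per threshold
    valid = denom > 0                      # skip thresholds that leave one class empty
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True for ink (dark) pixels."""
    return gray <= otsu_threshold(gray)


def segment_lines(ink: np.ndarray, min_ink: int = 1) -> list:
    """Split a binarized page into text lines using the row projection profile.

    Returns (top, bottom) row ranges with `bottom` exclusive. `min_ink` is an
    assumed noise floor: rows with fewer ink pixels count as blank.
    """
    has_ink = ink.sum(axis=1) >= min_ink
    lines, start = [], None
    for y, flag in enumerate(has_ink):
        if flag and start is None:
            start = y                      # a text line begins
        elif not flag and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(has_ink)))
    return lines


if __name__ == "__main__":
    # Synthetic page: light background with two dark horizontal bands.
    page = np.full((30, 40), 230, dtype=np.uint8)
    page[5:10, 3:35] = 20
    page[18:24, 3:35] = 20
    print(segment_lines(binarize(page)))   # [(5, 10), (18, 24)]
```

For vertically written scripts such as traditional Mongolian, the same idea would apply with the projection taken along columns (`axis=0`). A global threshold and a plain projection profile are unlikely to cope with the degraded, noisy historical pages described above, which is one reason such documents need more robust methods.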
First, the importance and value of minority script text analysis and recognition are explained, and some minority scripts and their characteristics are introduced, with particular attention to historical documents. Then, the development history and current state of research in the field are reviewed, and the representative results of traditional methods and the progress of deep learning methods are analyzed and summarized. Current research objects are expanding in depth and breadth, and processing methods are shifting comprehensively to deep neural network models and deep learning methods. Recognition performance has improved greatly, and application scenarios are constantly expanding. One of the studies achieves effective modeling under low-resource conditions. It further proposes a unified multilingual joint modeling technology that recognizes multiple languages with a single model, which greatly reduces hardware overhead and significantly improves image and text recognition performance and generalization in multilingual scenarios. At present, it can recognize images and text in 18 key languages or ethnic languages, including English, French, German, Japanese, Russian, Korean, Arabic, Uyghur, Kazakh, and Inner Mongolian. Based on these analyses, ethnic script text recognition shows obvious deficiencies in recognition accuracy and generalization ability, and it differs from Chinese text recognition in important ways. The characters and documents of each script differ substantially from Chinese characters and Chinese documents. For example, in the development of the Yi script, variant characters became particularly abundant for various reasons, and “one-to-many, many-to-one” relations between characters and meanings are the norm. The arbitrariness and diversity of historical Yi handwriting pose great challenges for the recognition of historical Yi script.
Moreover, in the cursive Tibetan Ume (dbu-med, “headless”) script, the letter shapes are complex, the strokes intertwine, some strokes even span several preceding and following characters, and the connections between letters are distinctive. Thus, multi-style Tibetan recognition, which is highly complex and difficult, must be solved to achieve true multi-font text recognition. Finally, the main difficulties and challenges in minority text recognition are discussed, and future research trends and technical development goals are outlined. For example, research and application system development should take into account the characteristics of different languages, layout formats, and application scenarios. A gap still exists between recognition for most ethnic languages and Chinese recognition, especially in applications related to education, security, and people’s livelihood. This gap can be addressed by actively expanding into new application directions. Opportunities for expansion are abundant, such as adapting large language models to ethnic minority script and text recognition and developing unified multilingual joint modeling and application systems.
Keywords
