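A handwritten Chinese character error correction method combining radical glyph shape and hierarchical structure
Abstract
Objective Handwritten Chinese character error correction (HCCEC) is a two-fold task: judging whether a character is written correctly and correcting misspelled characters. It is widely applicable in educational settings, where it can help students learn Chinese characters and correct their writing errors. Handwritten Chinese characters have complex spatial structures, diverse writing styles, and a very large vocabulary, and misspelled characters are highly similar to right ones; the key to HCCEC is therefore how to model a character precisely. To this end, a hierarchical radical network (HRN) is proposed. Method Starting from the glyph shape of radicals, the method exploits similarities in radical shape and structure and uses attention modules to capture fine-grained image features that carry radical information, which increases the separability of similar characters. In addition, drawing on the inherent hierarchical structure of Chinese characters, it models the hierarchical positions of radicals with a probability-based decoding approach. Result In experiments on a handwritten Chinese character dataset, HRN improves accuracy over existing approaches by 0.5% on the right-character test set and by 9.8% on the misspelled-character test set, and improves the correction rate on the misspelled-character test set by 15.3%. Visualization of the attention mechanism further confirms that HRN captures fine-grained image features containing radical information, and the Euclidean distances between radical representations show that the learned radical vectors encode the glyph structure of radicals. Conclusion The proposed HRN distinguishes similar radicals better and thus separates right characters from misspelled ones more precisely, with strong robustness and generalization.
Keywords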
A method of radical form and hierarchical structure based handwritten Chinese character error correction
Li Yunqing1, Du Jun1, Hu Pengfei1, Zhang Jianshu2
(1. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China; 2. iFLYTEK CO., LTD., Hefei 230088, China)
Abstract
Objective Handwritten Chinese character error correction (HCCEC) must cope with the complex hierarchical structure, diverse writing styles, and large vocabulary of Chinese characters. The task has two parts: assessment and correction. Assessment determines whether a given isolated handwritten character is written correctly. Correction locates and corrects the specific errors in a misspelled character. HCCEC differs from handwritten Chinese character recognition (HCCR) in three respects. First, the categories of misspelled characters are effectively unbounded, which places a high demand on the generalization ability of the model. We assume that the training samples contain only right characters, whereas the test set contains both right and misspelled characters, so the model must transfer what it has learned to misspelled characters it has never seen. HCCEC can therefore be formulated as a generalized zero-shot learning (GZSL) problem. Compared with zero-shot learning, the GZSL test set contains both seen and unseen classes, which makes the setting more realistic and more challenging. In particular, misspelled characters tend to be misclassified as right ones at test time. Second, misspelled characters can be very similar to the right ones, so the model must be able to capture fine-grained features. Third, beyond what HCCR requires, HCCEC must link each misspelled character to its corresponding right character. Method We exploit the similarities between radicals in shape and structure and propose a hierarchical radical network (HRN). For the analysis of Chinese characters, the key issue is to extract radical and structural information. Similar radicals should lie close to each other in the representation space.
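The requirement that similar radicals lie close together in the representation space can be illustrated with a small distance check. The following is a minimal sketch only: the three-dimensional vectors and the names `radical_embeddings`, `euclidean_distance`, and `nearest_radical` are hypothetical and invented for illustration; they are not the embeddings learned by the HRN.

```python
import math

# Hypothetical 3-dimensional radical embeddings, invented for illustration only.
# Learned embeddings would come from the network and have many more dimensions.
radical_embeddings = {
    "日": [0.90, 0.10, 0.30],
    "目": [0.85, 0.15, 0.35],  # visually close to 日
    "氵": [0.10, 0.80, 0.70],  # visually unrelated to 日
}


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_radical(query, embeddings):
    """Return the radical whose embedding is closest to that of `query`."""
    others = (r for r in embeddings if r != query)
    return min(
        others,
        key=lambda r: euclidean_distance(embeddings[query], embeddings[r]),
    )


if __name__ == "__main__":
    for other in ("目", "氵"):
        d = euclidean_distance(radical_embeddings["日"], radical_embeddings[other])
        print(f"d(日, {other}) = {d:.3f}")
    print("nearest to 日:", nearest_radical("日", radical_embeddings))
```

With these toy vectors the visually similar pair has the smaller distance, which is the property the abstract asks the learned representations to have.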
Complete radical information helps distinguish similar characters, which is crucial for the HCCEC task. Structure refers to the two-dimensional spatial context of the whole character; because Chinese characters are hierarchically structured, they also need to be modeled through hierarchical decomposition. An attention mechanism is used to capture the fine-grained image features needed to tell similar characters apart. Specifically, the HRN consists of a convolutional neural network-based encoder and two attention modules. To obtain radical representations, all radicals in the dictionary are fed into an embedding layer at the input stage. The first attention module computes attention weights, which are used to score the existence of each radical. The radical attention module then balances the weight of each radical across different Chinese characters. Finally, the hierarchical embedding is used to obtain the probability of each character. Result Experiments are carried out on an in-house handwritten Chinese character dataset. It contains 401 400 handwritten samples covering 7 000 common characters and 570 misspelled characters, together with character-level and radical-level labels. Three metrics are used to evaluate the models. The first is the F1 score, which measures assessment ability. The second is accuracy, which measures classification ability. The third is the correction rate, which measures error correction ability. Compared with existing methods, the HRN improves accuracy by 0.5% on the right-character test set and by 9.8% on the misspelled-character test set, and improves the correction rate on the misspelled-character test set by 15.3%.
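The first attention step, in which a radical embedding attends over encoder features to produce an existence score, can be sketched as follows. This is an assumption-laden illustration, not the authors' formulation: the abstract does not give the scoring function, so plain dot-product attention followed by a sigmoid is used here, and `features`, `query_present`, `query_absent`, and `radical_attention` are hypothetical names with toy values.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def radical_attention(radical_query, features):
    """Attend over image positions with one radical embedding as the query.

    Assumption: dot-product attention and a sigmoid existence score.
    The scoring function used by the HRN is not stated in the abstract.
    """
    weights = softmax([dot(radical_query, f) for f in features])
    dim = len(radical_query)
    # Attention-weighted sum of the position features.
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    existence = sigmoid(dot(radical_query, context))
    return weights, existence


# Toy 2x2 feature map flattened to 4 positions, 2 channels each (hypothetical).
# Only position 0 responds strongly on channel 0.
features = [[4.0, -2.0], [0.0, -2.0], [0.0, -2.0], [0.0, -2.0]]
query_present = [1.0, 0.0]  # hypothetical embedding of a radical in the image
query_absent = [0.0, 1.0]   # hypothetical embedding of a radical not in the image

if __name__ == "__main__":
    for name, q in (("present", query_present), ("absent", query_absent)):
        weights, score = radical_attention(q, features)
        print(name, [round(w, 3) for w in weights], round(score, 3))
```

In this toy setting the attention weights for the present radical concentrate on the one position that responds to it, and its existence score is higher than that of the absent radical. In a full model the position features would come from the convolutional encoder.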
In ablation experiments, we verify the effectiveness of the attention modules and of the hierarchical embedding separately. We also conduct experiments on the Chinese Text in the Wild (CTW) dataset, which contains approximately 1 million Chinese character instances annotated in street view images. Accuracy improves by 0.5% there as well. Given the diversity and complexity of CTW, this result indicates the robustness and general applicability of the HRN. Qualitative results show that the attention module can, to a certain extent, capture the position of each radical. Conclusion We develop a hierarchical radical network based on radical shape. It learns a representation of each radical through the attention mechanism and captures fine-grained features more precisely. Similar radicals are distinguished better, and errors in handwritten characters are detected more easily. The proposed model still depends on sufficient and effective training samples. Future research could extend it from isolated characters to text lines.
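The three metrics named in the Result part can be computed along the following lines. This is a sketch under stated assumptions: the abstract does not define the metrics formally, so F1 is taken over the binary right/misspelled judgement with "right" as the positive class, and the correction rate is taken as the fraction of misspelled samples whose proposed correction equals the intended right character. All function names and the toy data are hypothetical.

```python
def f1_score(y_true, y_pred, positive=True):
    """F1 of a binary judgement; `positive` marks the positive class (assumed: 'right')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the label."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for t, p in zip(labels, predictions) if t == p) / len(labels)


def correction_rate(intended, corrected):
    """Assumed definition: share of misspelled samples corrected to the intended character."""
    if not intended:
        raise ValueError("intended must not be empty")
    return sum(1 for t, c in zip(intended, corrected) if t == c) / len(intended)


if __name__ == "__main__":
    # Toy data, invented for illustration. True means "written correctly".
    is_right_true = [True, True, False, False]
    is_right_pred = [True, False, False, True]
    print("F1:", f1_score(is_right_true, is_right_pred))
    print("accuracy:", accuracy(["明", "日"], ["明", "目"]))
    print("correction rate:", correction_rate(["明", "林", "休"], ["明", "林", "体"]))
```

Accuracy and correction rate share the same arithmetic here; they differ in what is compared (predicted class versus label, or proposed correction versus intended character) and in which test samples they are computed over.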
Keywords
handwritten Chinese character error correction (HCCEC); Chinese character recognition; radical analysis; generalized zero-shot learning (GZSL); attention mechanism; convolutional neural network (CNN)