Image text removal combining gate recurrent units and generative adversarial networks

Wang Chaoqun1,2,3, Quan Weize1,2, Hou Shiyu1,2, Zhang Xiaopeng1,2, Yan Dongming1,2,3 (1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 3. State Key Laboratory of Hydro-Science and Engineering, Tsinghua University, Beijing 100084, China)

Abstract
Objective Text information in images is ubiquitous in daily life. While it conveys information, it also raises the problem of information leakage. Image text removal algorithms address this problem well, but existing methods suffer from incomplete text removal and poor visual quality of the filled regions after removal. To this end, this paper proposes an image text removal model based on the gate recurrent unit (GRU), which removes text from images with high quality and high efficiency.

Method A stroke-level binary mask detection module composed of gate recurrent units accurately obtains the stroke-level binary mask of the input image. The obtained stroke-level binary mask is then fed as auxiliary information into a text removal module based on a generative adversarial network (GAN), which removes the text and fills the region with the background color. The text loss function and brightness loss function proposed in this paper improve the quality of text removal, and inverted residual blocks replace standard convolutions to achieve high-efficiency text removal.

Result On 1 080 groups of real-world data obtained through manual processing and 1 000 groups of data synthesized with a text synthesis method, we compare our method with three other text removal methods. The experimental results show that our method achieves better performance on image quality metrics such as peak signal-to-noise ratio and structural similarity, as well as better visual quality.

Conclusion Compared with the baseline methods, the proposed GRU-based image text removal model not only effectively solves the problems of incomplete text removal and inconsistency between the removed region and the background, but also effectively reduces the number of parameters and the computational cost of the model, lowering the overall computation by 72.0%.
Keywords
Gate recurrent unit and generative adversarial networks for scene text removal

Wang Chaoqun1,2,3, Quan Weize1,2, Hou Shiyu1,2, Zhang Xiaopeng1,2, Yan Dongming1,2,3 (1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 3. State Key Laboratory of Hydro-Science and Engineering, Tsinghua University, Beijing 100084, China)

Abstract
Objective Textual information in digital images is ubiquitous in our daily life. While it delivers valuable information, it also runs the risk of leaking private information. For example, when taking photos or collecting data, some private information, such as phone numbers, will inevitably appear in the images. Image text removal technology can protect privacy by removing sensitive information from the images. At the same time, this technology can also be widely used in image and video editing, text translation, and other related tasks. Tursun et al. added a binary mask as auxiliary information to make the model focus on the text area, which yields clear improvements over existing scene text removal methods. However, this binary mask is redundant because it covers a large amount of background between text strokes, which means the removed area (indicated by the binary mask) is larger than what actually needs to be removed (i.e., the text strokes), so this limitation can be improved further. Considering the problems of unclean text removal and poor visual quality after removal in existing methods, we propose a gate recurrent unit (GRU)-based generative adversarial network (GAN) framework that effectively removes text and obtains high-quality results.

Method Our framework is fully end-to-end. Taking the image with text and the binary mask of the corresponding text area as inputs, the stroke-level binary mask of the input image is first obtained accurately by our detection module composed of multiple GRUs. Then, the GAN-based text removal module combines the input image, the text-area mask, and the stroke-level mask to remove the text in the image. Meanwhile, we propose a brightness loss function to further improve visual quality, based on the observation that human eyes are more sensitive to changes in image brightness. Specifically, we transfer the output image from the RGB space to the YCrCb color space and minimize the difference between the brightness (Y) channels of the output image and the ground truth. The weighted text loss function makes the model focus more on the text area; together, the weighted text loss and brightness loss proposed in this paper effectively improve the performance of text removal. In addition, our method applies inverted residual blocks instead of standard convolutions to achieve a high-efficiency text removal model and balance model size against inference performance. The inverted residual structure first uses a 1×1 point-wise convolution to expand the dimension of the input feature map, which prevents too much information from being lost after the activation function because of low dimensionality. Then, a 3×3 depth-wise convolution extracts features, and a 1×1 point-wise convolution compresses the number of channels of the feature map.
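To make the GRU-based detection step concrete, the following is a minimal sketch of one way such a module could be built: a small convolutional encoder feeds a convolutional GRU cell that recurrently refines the coarse text-region mask into a stroke-level mask. This is not the authors' exact architecture; all layer sizes, the `ConvGRUCell` formulation, and the number of refinement steps are illustrative assumptions.

```python
# Sketch only: NOT the paper's exact architecture; sizes/steps are assumptions.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates computed from [input, hidden]."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset gates
        self.hn = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)      # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.hn(torch.cat([x, r * h], dim=1)))
        return (1 - z) * n + z * h

class StrokeMaskDetector(nn.Module):
    def __init__(self, hid_ch=32, steps=3):
        super().__init__()
        self.steps = steps
        # Input: RGB image (3 channels) + coarse text-region mask (1 channel).
        self.enc = nn.Sequential(nn.Conv2d(4, hid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.gru = ConvGRUCell(hid_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 1, 1)  # per-pixel stroke probability

    def forward(self, img, region_mask):
        x = self.enc(torch.cat([img, region_mask], dim=1))
        h = torch.zeros_like(x)
        for _ in range(self.steps):          # recurrent refinement of the mask
            h = self.gru(x, h)
        return torch.sigmoid(self.head(h))   # stroke-level mask in [0, 1]

# Usage: mask = StrokeMaskDetector()(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
```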
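The two loss functions described above can be sketched as follows; the text/background weights (`w_text`, `w_bg`) and the BT.601 luma coefficients used for the Y channel are our assumptions, and the paper's exact formulation may differ.

```python
# Sketch of the weighted text loss and brightness loss; weights are assumptions.
import torch

def weighted_text_loss(pred, target, stroke_mask, w_text=10.0, w_bg=1.0):
    """L1 loss weighted so that text-stroke pixels dominate the objective."""
    weight = w_bg + (w_text - w_bg) * stroke_mask      # (B,1,H,W), mask in [0,1]
    return (weight * (pred - target).abs()).mean()

def brightness_loss(pred, target):
    """L1 on the Y (luma) channel of YCrCb, where the eye is most sensitive."""
    def luma(x):  # x: (B,3,H,W), RGB in [0,1]; BT.601 coefficients assumed
        r, g, b = x[:, 0], x[:, 1], x[:, 2]
        return 0.299 * r + 0.587 * g + 0.114 * b
    return (luma(pred) - luma(target)).abs().mean()
```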
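The inverted residual block follows the MobileNetV2 pattern described above (1×1 expansion, 3×3 depth-wise convolution, 1×1 linear projection). A minimal sketch, with normalization and activation choices assumed rather than taken from the paper:

```python
# Sketch of a MobileNetV2-style inverted residual block; BN/ReLU6 are assumptions.
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch  # skip only when shapes match
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),        # 1x1 point-wise expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # 3x3 depth-wise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),       # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```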
Result We conduct extensive experiments on 1 080 groups of real-world data obtained through manual processing and 1 000 groups of synthetic data generated with the SynthText method to validate the proposed method, comparing it with several state-of-the-art (SOTA) text removal methods. We adopt two kinds of quantitative evaluation measures. The first kind, PSNR (peak signal-to-noise ratio) and SSIM (structural similarity), measures the difference between the text-removal results and the corresponding ground truth. The second kind, recall, precision, and F-measure, measures the model's ability to remove text. The experimental results show that our method consistently performs better in terms of PSNR and SSIM. In addition, we compare our results qualitatively with the SOTA methods, and our method achieves better visual quality. The inverted residual blocks reduce the floating-point operations (FLOPs) by 72.0% with only a slight reduction in performance.

Conclusion We propose a high-quality and efficient text removal method based on the gate recurrent unit, which takes the image with text and the binary mask of the text area as inputs and produces the text-removed image in an end-to-end manner. Compared with existing methods, our method not only effectively alleviates the problems of unclean image text removal and inconsistency between the removed area and the background, but also effectively reduces the model parameters and FLOPs.
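As a note on the evaluation protocol above, the PSNR and SSIM between a text-removal result and its ground truth can be computed with scikit-image as sketched below; the file names are placeholders.

```python
# Sketch of PSNR/SSIM evaluation with scikit-image; file names are placeholders.
from skimage.io import imread
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = imread("result.png")        # H x W x 3, uint8 text-removal output
gt = imread("ground_truth.png")    # H x W x 3, uint8 text-free ground truth

psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```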
Keywords
