Transposed self-attention image super-resolution reconstruction with local feature enhancement

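Sun Yang, Ding Jianwei, Zhang Qi, Deng Qiyao (School of Information Network Security, People's Public Security University of China, Beijing 100038)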

Abstract
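Objective Super-resolution (SR) reconstruction methods that partition the image into windows and apply self-attention within them for feature extraction have achieved remarkable results. However, window partitioning limits the range over which image information can be aggregated and constrains the model's ability to model feature information. To address this, we build a global and local information modeling network based on a transposed self-attention mechanism to capture the dependencies between image pixels.
Method First, a lightweight baseline model performs simple relational modeling of the features. The self-attention mechanism is then moved from the spatial dimension to the channel dimension, and long-range dependencies between pixels are built by computing a cross-covariance matrix. Next, a channel attention block is introduced to supply the local information needed for image reconstruction. Finally, a dual gating mechanism is constructed to control the flow of information through the model, improving the model's feature modeling capability and its robustness.
Result The method was compared with mainstream methods on five benchmark datasets (Set5, Set14, BSD100, Urban100, and Manga109) and achieved the best or second-best results in SR tasks at different scale factors. Compared with SwinIR (image restoration using Swin Transformer) on the ×2 SR task, the peak signal-to-noise ratio on these five datasets improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB, respectively. Structural similarity also improved markedly, and the gain in visual perceptual quality is clearly visible.
Conclusion The proposed network models the global relationships of feature information more fully without losing the local correlations specific to images. The quality of the reconstructed images is clearly higher and their details are richer, which demonstrates the effectiveness and advanced nature of the proposed method.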
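As a rough illustration of moving self-attention from the spatial dimension to the channel dimension, the sketch below builds a C × C cross-covariance map from features flattened to C channels by N = H × W pixels, so the size of the map does not depend on image resolution. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the single head, the absence of learned projections, the `scale` parameter, and the use of ReLU as the scoring function (the English abstract mentions replacing softmax with ReLU) are all assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def transposed_attention(q, k, v, scale=1.0):
    """Channel-wise attention sketch (hypothetical helper).

    q, k, v: C x N matrices, one row per channel and one column per pixel.
    Returns (out, attn), where attn is the C x C map and out is C x N.
    """
    # Q K^T has shape C x C: each entry relates one channel of q to one
    # channel of k, summed over every pixel position in the image.
    attn = matmul(q, transpose(k))
    # Assumption: ReLU scoring keeps positive correlations and zeroes
    # the negative ones, standing in for the usual softmax.
    attn = [[max(0.0, scale * a) for a in row] for row in attn]
    # Each output channel is a weighted mix of the channels of v.
    out = matmul(attn, v)
    return out, attn


if __name__ == "__main__":
    q = [[1, 0], [0, 1]]      # C = 2 channels, N = 2 pixels
    k = [[-1, 0], [0, 1]]
    v = [[3, 4], [5, 6]]
    out, attn = transposed_attention(q, k, v)
    print(attn)
    print(out)
```

Because the attention map is C × C, the two matrix products cost on the order of N·C² operations, whereas attention computed over pixels would form an N × N map at a cost on the order of N²·C.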
Keywords
Images super-resolution reconstruction of transposed self-attention with local feature enhancement

Sun Yang, Ding Jianwei, Zhang Qi, Deng Qiyao (School of Information Network Security, People’s Public Security University of China, Beijing 100038, China)

Abstract
Objective Research on super-resolution image reconstruction based on deep learning has made exceptional progress in recent years. In particular, when the development of traditional convolutional neural networks reached a bottleneck, Transformer, which performs extremely well in natural language processing, was introduced to super-resolution image reconstruction. However, the computational complexity of Transformer scales with the square of the product of the input image's width and height, which prevents Transformer from being fully transferred to low-level computer vision tasks. Recent methods, such as image restoration using Swin Transformer (SwinIR), have achieved excellent performance by dividing the image into windows, performing self-attention within the windows, and exchanging information between the windows. However, the computational burden of this window-based approach increases with the window size. Moreover, window division cannot model the global information of images completely, resulting in partial loss of information. To solve these problems, we model the long-range dependencies of images by constructing a Transformer block while keeping the number of parameters moderate. Excellent super-resolution reconstruction performance is achieved by constructing global dependencies of features.
Method The proposed super-resolution network based on self-attention (SRTSA) consists of four main stages: a shallow feature extraction module, a deep feature extraction module, an image upsampling module, and an image reconstruction module. The shallow feature extraction part consists of a 3 × 3 convolution. The deep feature extraction part mainly consists of global and local information extraction blocks (GLIEB). The proposed GLIEB performs simple relational modeling through a sufficiently lightweight nonlinear activation free block (NAFBlock). Although dropout can improve the robustness of the model, we discard the dropout layer to avoid losing information before modeling the feature information globally. In the global modeling of feature information using the transposed self-attention mechanism, we replace the softmax activation function in the self-attention mechanism with the ReLU activation function, which keeps the features with positive effects on image reconstruction, discards the features with negative effects, and makes the reconstructed global dependencies more robust. Given that an image includes global and local information, a residual channel attention module is used to supplement the local information and enhance the expressive ability of the model. Furthermore, a new dual-channel gating mechanism is introduced to control the flow of information in the model, improving its feature modeling capability and its robustness. The image upsampling module uses subpixel convolution to expand the features to the target dimension, and the reconstruction module employs a 3 × 3 convolution to obtain the final reconstruction results. Although many loss functions have been proposed to optimize model training, we use the same L1 loss function as SwinIR to supervise training so that the comparison reflects the model itself. The L1 loss function provides a stable gradient that allows the model to converge quickly. In the training phase, 800 images from the DIV2K dataset are used. The 800 training images are randomly rotated or horizontally flipped to expand the dataset, and 16 LR image patches of size 48 × 48 pixels are used as input in each iteration. The Adam optimizer is used for training.
Result We test on five datasets commonly used in super-resolution tasks, namely, Set5, Set14, Berkeley segmentation dataset 100 (BSD100), Urban100, and Manga109, to demonstrate the effectiveness and robustness of the proposed method. We also compare the proposed method with the SRCNN, VDSR, EDSR, RCAN, SAN, HAN, NLSA, and SwinIR networks in terms of objective metrics. These networks are supervised using the L1 loss function during training. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated on the Y channel of the YCbCr space of the output image to measure the image reconstruction effect. Experimental results show that the PSNR and SSIM values obtained by our method are both optimal. In the ×2 super-resolution task, compared with SwinIR, the PSNR of the proposed method is improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB on the five datasets, and the SSIM is improved by 0.0004, 0.0016, 0.0009, and 0.0027 on the four datasets other than Manga109. The reconstructed images show that SRTSA can recover more detailed information and more texture structure than most methods. An attribution analysis of the model using local attribution maps (LAM) shows that SRTSA uses a larger range of pixels in the reconstruction process than other methods, such as SwinIR, which illustrates the global modeling capability of SRTSA.
Conclusion The proposed super-resolution image reconstruction algorithm based on a transposed self-attention mechanism converts global relationship modeling in the spatial dimension into global relationship modeling in the channel dimension. This allows it to model the global relationships of feature information fully without losing the local relationships of features. The model thus captures both global and local information, which effectively improves image super-resolution reconstruction performance. The excellent PSNR and SSIM on five datasets and the high quality of the reconstructed images, with rich details and sharp edges, demonstrate the effectiveness and advanced nature of the proposed method.
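The evaluation step described above, PSNR computed on the Y channel of the YCbCr space, can be sketched as follows. The abstract does not say which RGB-to-YCbCr conversion or border cropping is used, so this sketch assumes the ITU-R BT.601 studio-range luma that is common in SR benchmarks, 8-bit RGB input, and no border cropping. The function names `rgb_to_y` and `psnr_y` are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 studio-range luma (16..235) from 8-bit RGB values.

    Assumption: this conversion is chosen for illustration; the
    abstract does not specify one.
    """
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(img_a, img_b, peak=255.0):
    """PSNR in dB between two images, measured on the Y channel only.

    img_a, img_b: equal-sized lists of rows of (r, g, b) tuples, 0..255.
    """
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            diff = rgb_to_y(*pixel_a) - rgb_to_y(*pixel_b)
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical luma: PSNR is unbounded
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    reference = [[(255, 255, 255), (0, 0, 0)]]
    restored = [[(250, 250, 250), (5, 5, 5)]]
    print(psnr_y(reference, restored))
```

A higher value means the restored image is closer to the reference in luma, so a gain such as 0.29 dB on Urban100 corresponds to a lower mean squared error on the Y channel.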
Keywords
