Image super-resolution reconstruction with channel-attention-embedded Transformer
Xiong Wei1, Xiong Chengyi1,2, Gao Zhirong3, Chen Wenqi1, Zheng Ruihua1, Tian Jinwen4 (1. School of Electronic and Information Engineering, South-Central Minzu University, Wuhan 430074, China; 2. Hubei Key Laboratory of Intelligent Wireless Communication, South-Central Minzu University, Wuhan 430074, China; 3. School of Computer Science, South-Central Minzu University, Wuhan 430074, China; 4. State Key Laboratory of Multispectral Information Processing Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Abstract
Objective Research on deep-learning-based image super-resolution reconstruction has made great progress. How to effectively reduce the complexity of the reconstruction model while further improving reconstruction performance, so as to meet the needs of low-cost and real-time applications, is an important concern in this field. To this end, we propose an image super-resolution method with channel-attention-embedded Transformer (CAET). Method Channel attention (CA) is adaptively embedded into both the Transformer features and the convolutional features. This not only makes full use of the respective advantages of convolution and Transformer in image feature extraction, but also adaptively enhances and fuses the corresponding features, effectively improving the learning ability and super-resolution performance of the network. Result On five open test datasets, the proposed method is compared with six representative methods and shows the best performance at all magnification factors. Specifically, at the ×4 magnification factor, compared with the state-of-the-art SwinIR (image restoration using swin Transformer) method, the peak signal-to-noise ratio is improved by 0.09 dB on the Urban100 dataset and by 0.30 dB on the Manga109 dataset, with a clear improvement in subjective visual quality. Conclusion By fusing convolutional features with Transformer features and adaptively embedding channel attention for feature enhancement, the proposed method effectively improves image super-resolution performance while keeping the network model lightweight. Test results on several public experimental datasets verify the effectiveness of the method.
Keywords: super-resolution (SR); Transformer; convolutional neural network (CNN); channel attention (CA); deep learning
Image super-resolution with channel-attention-embedded Transformer
Xiong Wei1, Xiong Chengyi1,2, Gao Zhirong3, Chen Wenqi1, Zheng Ruihua1, Tian Jinwen4 (1. School of Electronic and Information Engineering, South-Central Minzu University, Wuhan 430074, China; 2. Hubei Key Laboratory of Intelligent Wireless Communication, South-Central Minzu University, Wuhan 430074, China; 3. School of Computer Science, South-Central Minzu University, Wuhan 430074, China; 4. State Key Laboratory of Multispectral Information Processing Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Abstract
Objective Research on single image super-resolution reconstruction based on deep learning has made great progress in recent years. However, to improve reconstruction performance, previous studies have mostly focused on building complex networks with large numbers of parameters. How to effectively reduce the complexity of the model while improving reconstruction performance, so as to meet the needs of low-cost and real-time applications, has become an important research direction. State-of-the-art lightweight super-resolution methods are mainly based on convolutional neural networks, and only a few have been designed with Transformer, which has shown excellent performance in image restoration tasks. To solve these problems, we propose a lightweight super-resolution network called image super-resolution with channel-attention-embedded Transformer (CAET), which achieves excellent super-resolution performance with a small number of parameters. Method CAET involves four stages, namely, shallow feature extraction, hierarchical feature extraction, multi-level feature fusion, and image reconstruction. The hierarchical feature extraction stage is built from a basic building block called the channel-attention-embedded Transformer block (CAETB), which adaptively embeds channel attention (CA) into the Transformer and convolutional features, thereby not only taking full advantage of convolution and Transformer for image feature extraction but also adaptively enhancing and fusing the corresponding features. Convolutional layers provide stable optimization and extraction results during early visual feature processing, and convolution layers with spatially invariant filters can enhance the translation equivariance of the network. Stacking convolutional layers also effectively enlarges the receptive field of the network. Therefore, three cascaded convolutional layers, activated by the LeakyReLU function, are placed at the front of each CAETB to process the features output by the previous module. Channel attention is then embedded into the features extracted by the convolution layers. To effectively adjust the channel attention parameters, we adopt a linear weighting method that combines channel attention with features from different levels. These features are then fed into the Swin Transformer layer for further deep feature extraction. Given that performance saturates as network depth increases, we set the number of CAETBs to 4 to maintain a balance between model complexity and super-resolution performance. The hierarchical information produced at different stages is helpful for the final reconstruction. Therefore, CAET combines all the low- and high-level information from the hierarchical feature extraction and multi-level feature fusion stages. In the image reconstruction stage, we use a convolution layer and a pixel-shuffle layer to upsample the features to the dimensions of the high-resolution image. During training, we use 800 images from the DIV2K dataset and augment all training images by random vertical and horizontal flipping to increase the diversity of the training data. For each mini-batch, we randomly crop 64 × 64 pixel patches as the low-resolution (LR) inputs. We optimize the network using the Adam algorithm and apply the L1 loss as our loss function.
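To make the structure of the CAETB concrete, the following PyTorch sketch shows one plausible reading of the block described above: three cascaded LeakyReLU-activated convolutions, a squeeze-and-excitation style channel attention, a learnable linear weighting that blends the attended features with the block input, and a Transformer stage for deep feature extraction. The channel width, the exact form of the attention module, and the use of a plain multi-head encoder layer in place of the windowed Swin Transformer layer are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form of the CA module)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Re-weight each channel by a learned global statistic.
        return x * self.fc(self.pool(x))

class CAETB(nn.Module):
    """Simplified channel-attention-embedded Transformer block (CAETB)."""
    def __init__(self, channels: int = 60, num_heads: int = 6):
        super().__init__()
        # Three cascaded 3x3 convolutions with LeakyReLU, as described above.
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
        )
        self.ca = ChannelAttention(channels)
        # Learnable scalar implementing the linear weighting between feature levels.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        # Stand-in for the Swin Transformer layer (window partitioning omitted).
        self.transformer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.ca(self.convs(x))                       # CA-enhanced conv features
        fused = self.alpha * local + (1 - self.alpha) * x    # linear weighting
        tokens = fused.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        deep = self.transformer(tokens)                      # deep feature extraction
        return deep.transpose(1, 2).reshape(b, c, h, w) + x  # residual connection
```

In the full network, four such blocks would be stacked, their outputs concatenated and fused, and the result upsampled by a convolution plus pixel shuffle, as the abstract describes.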
Result We conduct experiments on five public datasets, namely, Set5, Set14, Berkeley segmentation dataset (BSD100), Urban100, and Manga109, to compare the performance of our proposed method with that of six state-of-the-art models, including the super-resolution convolutional neural network (SRCNN), cascading residual network (CARN), information multi-distillation network (IMDN), super-resolution with lattice block (LatticeNet), and image restoration using swin Transformer (SwinIR). We measure the performance of these methods using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. Given that humans are highly sensitive to image brightness, we compute these metrics on the Y channel of the image. Experimental results show that the proposed method achieves the highest PSNR and SSIM values and recovers more detailed information and more accurate textures than the state-of-the-art methods at the ×2, ×3, and ×4 amplification factors. At the ×4 amplification factor, the PSNR of the proposed method is improved by 0.09 dB on the Urban100 dataset and by 0.30 dB on the Manga109 dataset compared with that of SwinIR. In terms of model complexity, CAET achieves better performance with fewer parameters and multiply-accumulate operations than SwinIR, which also uses Transformer as its backbone. Although CAET consumes more parameters and multiply-accumulate operations than IMDN and LatticeNet, it achieves significantly higher PSNR and SSIM. Conclusion The proposed CAET effectively improves image super-resolution reconstruction performance by fusing convolutional and Transformer features and adaptively embedding channel attention to enhance them, while keeping the complexity of the whole network under control. Experimental results on several public datasets verify the effectiveness of our method.
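As a concrete illustration of the evaluation protocol, the sketch below computes PSNR on the Y channel using the BT.601 luma conversion that is standard in super-resolution benchmarks. The border-shaving convention (cropping `scale` pixels on each side) is an assumption, as the abstract does not state the exact cropping.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Convert an RGB image with values in [0, 255] to the BT.601 Y (luma) channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, scale: int) -> float:
    """PSNR on the Y channel, shaving `scale` border pixels per side (assumed convention)."""
    y_sr = rgb_to_y(sr.astype(np.float64))[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr.astype(np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Calling psnr_y(sr_img, hr_img, scale=4) would reproduce the ×4 evaluation setting; SSIM is computed analogously on the same Y channel.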
Keywords
super-resolution (SR); Transformer; convolutional neural network (CNN); channel attention (CA); deep learning