Super-resolution reconstruction of binocular images based on a multi-level fusion attention network

Xu Lei1,2, Song Huihui1,2, Liu Qingshan1,2 (1. Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective With the wide application of deep convolutional neural networks to binocular stereo image super-resolution, information fusion between the two views has become a research focus in recent years. Existing binocular super-resolution algorithms, however, learn little of the internal information of each single image. To address this problem, we propose a binocular image super-resolution reconstruction algorithm based on a multi-level fusion attention network, which learns the rich internal information of each image on top of stereo matching. Method First, a feature extraction module captures low-frequency features of the left and right images at different scales and depths. These low-frequency features are then fed into a mixed attention module, which first applies a second-order channel non-local attention module to learn the channel and spatial features within each image, and then applies a parallax attention module to perform stereo matching between the left and right feature maps. Next, a multi-level fusion module captures the correlations among features of different depths, further guiding the network toward high-quality reconstruction. Sub-pixel convolution then upsamples the feature maps, which are added to the upscaled features of the low-resolution left image to obtain the reconstructed features. Finally, a single convolutional layer produces the reconstructed high-resolution image. Result The algorithm is trained on 800 images from the Flickr1024 dataset and 60 Middlebury images downsampled by a factor of 2, with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Quantitative and qualitative evaluations are carried out on three benchmark test sets: Middlebury, KITTI2012, and KITTI2015. The experimental results show that our algorithm produces the sharpest images. At an upscaling factor of 2, compared with PASSRnet (learning parallax attention for stereo image super-resolution), our algorithm improves PSNR by 0.56 dB, 0.31 dB, and 0.26 dB on the three datasets, respectively, and improves SSIM by 0.005 on each. Conclusion The proposed network model fully learns the rich internal information of each image and effectively guides the stereo matching of the left and right feature maps. It also continuously fuses high- and low-frequency information, achieving good reconstruction quality.
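For reference, the PSNR metric behind the figures above can be computed as follows. This is a minimal NumPy sketch; the function name and interface are illustrative and are not the paper's evaluation code:

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.56 dB corresponds to roughly a 12% reduction in MSE against the ground truth.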
Super-resolution reconstruction of binocular image based on multi-level fusion attention network

Xu Lei1,2, Song Huihui1,2, Liu Qingshan1,2 (1. Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective Information fusion between the two views of a binocular image pair has been studied intensively since deep convolutional neural networks (CNNs) were applied to binocular stereo image super-resolution. However, current stereo image super-resolution algorithms learn little of the internal information of a single image. To resolve this problem, we develop a binocular image super-resolution reconstruction algorithm based on a multi-level fusion attention network, which learns the rich internal information of each image on top of stereo matching. Method Our network is composed of four modules: 1) feature extraction, 2) mixed attention, 3) multi-level fusion, and 4) reconstruction. The feature extraction module consists of 1) a convolutional layer, 2) residual units, and 3) residual-dense atrous spatial pyramid pooling modules. Specifically, a convolutional layer extracts the shallow features of the low-resolution image, and the residual units and residual-dense atrous spatial pyramid pooling modules then process the shallow features alternately. The residual-dense atrous spatial pyramid pooling module connects three atrous convolutions with dilation rates of 1, 4, and 8 in parallel to form a spatial pyramid pooling group. First, three spatial pyramid pooling groups of the same structure are cascaded, and the output and input features of each group are passed on to the next group in a densely connected manner. Then, a convolutional layer at the end of each spatial pyramid pooling group performs feature fusion and channel reduction. At the end of the module, dense feature fusion and a global residual connection fuse the output features of all spatial pyramid pooling groups, and the result is linearly superimposed on the module's input features.
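The parallel atrous branches with different dilation rates can be illustrated in one dimension. The following NumPy sketch is a simplification under stated assumptions: the names `dilated_conv1d` and `aspp_group` are our own, the kernel is fixed rather than learned, and the actual module operates on 2-D feature maps:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """'Same'-padded 1-D convolution with a dilated kernel (cross-correlation form)."""
    k = len(kernel)
    span = dilation * (k - 1)  # effective receptive field minus one
    padded = np.pad(signal, (span // 2, span - span // 2))
    return np.array([
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ])

def aspp_group(signal, kernel, rates=(1, 4, 8)):
    """Parallel dilated branches fused by summation, mimicking one pyramid pooling group."""
    return sum(dilated_conv1d(signal, kernel, r) for r in rates)
```

Increasing the dilation rate widens the receptive field without adding parameters, which is why rates 1, 4, and 8 capture context at three different scales.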
The mixed attention module is mainly composed of 1) a second-order channel non-local attention module and 2) a parallax attention module. The second-order channel non-local attention module is further divided into 1) a second-order channel and spatial attention module and 2) an efficient non-local module. The second-order channel and spatial attention module extracts useful information from the features along the channel and spatial dimensions. The input features are processed along both dimensions simultaneously. In the channel dimension, global covariance pooling is first applied, and convolutions then increase and reduce the channel dimensionality to obtain the correlations between channels, which serve as the channel attention map; the input features are finally reweighted by this map. In the spatial dimension, the module first applies global average pooling and global max pooling to the input feature map in parallel, concatenates the resulting maps, and then obtains the spatial attention map through a convolution and a sigmoid function; this spatial attention map in turn reweights the input features. The efficient non-local module uses non-local operations to learn the global correlations of the features, enlarging the receptive field and capturing contextual information. The parallax attention module first processes the left and right feature maps with a convolutional layer and a residual unit, and then uses the parallax attention mechanism to capture the stereo correspondence between the left and right images for stereo matching.
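The stereo matching step of the parallax attention mechanism can be sketched as row-wise attention along epipolar lines of rectified stereo features. The NumPy code below is an illustrative simplification; the function names and (H, W, C) layout are assumptions, and the actual module applies learned query/key projections before the matching:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallax_attention(feat_left, feat_right):
    """
    Row-wise attention between rectified stereo features of shape (H, W, C).
    For each epipolar line h, every left-view position attends over all
    right-view positions on the same line. Returns the right features warped
    toward the left view and the (H, W, W) attention maps.
    """
    scores = np.einsum('hwc,hvc->hwv', feat_left, feat_right)  # (H, W_left, W_right)
    attn = softmax(scores, axis=-1)                            # one distribution per left pixel
    warped = np.einsum('hwv,hvc->hwc', attn, feat_right)       # right-to-left aggregation
    return warped, attn
```

Restricting attention to the same row exploits the epipolar constraint of rectified stereo pairs, reducing the match space from H×W to W positions per pixel.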
The multi-level fusion module takes the residual dense block as its basic unit and uses an attention mechanism to explore the relations between features of different depths, assigning them different attention weights to improve their representational power. To obtain the reconstructed features, sub-pixel convolution upsamples the feature maps, which are added to the upscaled features of the low-resolution left image. Finally, a single convolutional layer produces the reconstructed high-resolution image. Result The model is trained on 800 images from the Flickr1024 dataset and 60 Middlebury images downsampled by a factor of 2. Bicubic interpolation downsamples the high-resolution images to generate low-resolution images, which are cropped into patches with a stride of 20; the high-resolution images are cropped accordingly. The test set comprises 5 images from the Middlebury dataset, 20 images from the KITTI2012 dataset, and 20 images from the KITTI2015 dataset. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation metrics to quantitatively assess the reconstruction quality and compare it with other methods. We compare our model with several single-image super-resolution methods and with recent stereo image super-resolution methods, namely StereoSR, PASSRnet, SRResNet + SAM, SRResNet + DFAM, and CVCnet, on the three benchmark test sets under the same scales and conditions. Taking the KITTI2012 test set at scale ×2 as an example, the PSNR and SSIM are 0.17 dB and 0.002 higher than those of CVCnet, respectively. Conclusion Our model fully learns rich and effective internal information and can effectively guide the stereo matching of the left and right feature maps.
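The sub-pixel convolution upsampling mentioned above rearranges channels into spatial positions (pixel shuffle). Below is a minimal NumPy sketch, assuming a channels-first (C·r², H, W) layout; the real layer additionally learns the convolution that produces those channels:

```python
import numpy as np

def pixel_shuffle(x, r):
    """
    Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r), the
    upsampling step performed after a sub-pixel convolution.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    assert c * r * r == c_r2, "channel count must be divisible by r*r"
    x = x.reshape(c, r, r, h, w)    # split channels into the r*r sub-pixel grid
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Because the upscaling happens only at the very end, all preceding modules operate at low resolution, which keeps the computational cost modest.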
Furthermore, high- and low-frequency information is fused continuously, and good reconstruction quality is achieved. Our model still has room to better exploit the rich information within a single image and the complementary information between the left and right images. Future work may design a dedicated single-image feature extraction module and a further left-right feature fusion module.