Video super-resolution reconstruction with a lightweight attention-constrained alignment network

Jin Yutong, Song Huihui, Liu Qingshan (Nanjing University of Information Science and Technology, Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective Deep learning has shown excellent performance in video super-resolution reconstruction. This paper proposes a lightweight attention-constrained deformable alignment network that aims to reconstruct realistic high-resolution video frames with a model that has few parameters.

Method The network consists of three parts: a feature extraction module, an attention-constrained alignment sub-network, and a dynamic fusion branch. 1) The shared-weight feature extraction module fully extracts multi-scale semantic information from the input frames without increasing the number of parameters. 2) The extracted features are fed into the attention-constrained alignment sub-network to generate aligned features with accurate matching relationships. 3) The concatenated aligned features are fed into the dynamic fusion branch as a shared condition, which fuses the temporally aligned features of the reference frame in the feed-forward network with the spatial features of the original low-resolution (LR) frames at different stages. 4) High-resolution (HR) frames are reconstructed through upsampling.

Result Experiments are evaluated quantitatively on two benchmark test datasets, Vid4 and REDS4 (realistic and diverse scenes dataset). Compared with state-of-the-art video super-resolution networks, our method obtains better scores on the image quality metrics peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) and recovers finer super-resolved detail. At the same PSNR, our network uses nearly 50% fewer model parameters.

Conclusion The polar axis constraint greatly reduces the number of parameters of the attention alignment network while still allowing it to capture long-distance information for feature alignment and produce efficient spatio-temporal features, and the designed dynamic fusion mechanism yields high-quality reconstruction results.
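As a rough, hypothetical illustration of the multi-scale extraction idea (the English abstract below names residual atrous spatial pyramid pooling, res_ASPP, blocks), a minimal PyTorch sketch of such a block might look as follows; the dilation rates, channel width, and activation are assumptions of ours, not taken from the paper.

import torch
import torch.nn as nn

class ResASPP(nn.Module):
    """Sketch of a residual ASPP block; rates and width are assumed, not the paper's."""

    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel 3x3 convolutions with different dilation rates view the same
        # input at several receptive-field sizes (multi-scale context).
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        ])
        # A 1x1 convolution fuses the concatenated multi-scale branches.
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        multi_scale = torch.cat([self.relu(b(x)) for b in self.branches], dim=1)
        # Residual connection; note that the paper's blocks omit batch normalization.
        return x + self.fuse(multi_scale)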
Keywords
Super-resolution video frame reconstruction through lightweight attention constraint alignment network

Jin Yutong, Song Huihui, Liu Qingshan (Nanjing University of Information Science and Technology, Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China)

Abstract
Objective Current deep learning technology benefits video super-resolution (SR) reconstruction. Existing methods are constrained by the accuracy of motion estimation and compensation based on optical flow, and their reconstruction of large-scale moving targets is poor. A deformable convolutional alignment network captures a target's motion information by learning adaptive receptive fields, providing a new solution for video super-resolution reconstruction. To reconstruct realistic high-resolution (HR) video frames, our lightweight attention-constrained deformable alignment network aims to make full use of the redundant information between the reference frame and its adjacent frames while using far fewer model parameters.

Method Our attention constraint alignment network (ACAN) consists of three key components: a feature extraction module, an attention constraint alignment sub-network, and a dynamic fusion branch. First, the shared-weight feature extraction module is designed with five layers: three residual blocks without batch normalization (BN) layers and two residual atrous spatial pyramid pooling (res_ASPP) blocks. To extract multi-scale and multi-level information without increasing the number of parameters, the two res_ASPP blocks and the three residual blocks are connected alternately. Next, the polar axis constraint and the attention mechanism are integrated into a lightweight attention constraint alignment sub-network (ACAS). Under the polar axis constraint, ACAS regulates the input features of the deformable convolution by capturing the global correspondence between adjacent frames and the reference frame in the time domain, and it generates reasonable offsets to achieve implicit alignment. Specifically, ACAS combines deformable convolution with attention under the polar axis constraint: three attention constraint blocks (ACBs) constrain the features of neighboring frames along the horizontal axis, encoding the feature correlation between any two positions along a horizontal line to find the most similar features. At the same time, an effective mask is designed to handle the occlusion that is unavoidable in video. The features produced by the feature extraction module are sent to the alignment module to generate aligned features with exact matching relationships. Our ablation experiments verify that a single ACB layer already captures the matching relationship between the reference frame and adjacent frames well, whereas the cascaded three-layer ACB additionally handles large motion in the video; we therefore adopt the cascaded three-layer design. Finally, we introduce a dynamic fusion branch composed of 16 dynamic fusion blocks, each made of two spatial feature transformation (SFT) layers and two 1×1 convolutions. This branch fuses the temporally aligned features of the reference frame in the feed-forward network with the spatial features of the original low-resolution (LR) frame at different stages, after which the high-resolution frame is reconstructed.
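To make the polar axis constraint concrete, here is a minimal, hypothetical PyTorch sketch of attention restricted to positions on the same horizontal line; the paper's actual query/key/value design, occlusion mask, and offset generation for the deformable convolution are not reproduced here, so treat this as illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalLineAttention(nn.Module):
    """Attention restricted to the same image row (a polar-axis-style constraint)."""

    def __init__(self, channels=64):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)  # projects reference-frame features
        self.key = nn.Conv2d(channels, channels, 1)    # projects neighboring-frame features
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, ref_feat, nbr_feat):
        # ref_feat, nbr_feat: (B, C, H, W)
        b, c, h, w = ref_feat.shape
        q = self.query(ref_feat).permute(0, 2, 3, 1).reshape(b * h, w, c)
        k = self.key(nbr_feat).permute(0, 2, 1, 3).reshape(b * h, c, w)
        v = self.value(nbr_feat).permute(0, 2, 3, 1).reshape(b * h, w, c)
        # Row-wise affinity: a W x W matrix per line instead of the
        # (H*W) x (H*W) matrix of unconstrained attention, which is where
        # the efficiency gain described in the abstract comes from.
        attn = F.softmax(q @ k / c ** 0.5, dim=-1)
        out = (attn @ v).reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out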
Vimeo-90K is a widely used training dataset that is commonly paired with the Vid4 test set for evaluation. In the training process, the network is trained on the Vimeo-90K dataset and tested on the Vid4 and REDS4 datasets. The Charbonnier penalty function is used as the sole loss function. The channel size of each layer is set to 64. For the final comparison, the alignment module cascades three attention constraint alignment layers, while a single-layer variant is used in the ablation. Additionally, the network takes seven consecutive frames as input. RGB patches of size 64×64 are used as input, with the mini-batch size set to 16. We use the Adam optimizer to update the network parameters, with the initial learning rate set to 4e-4. All experiments are conducted on PyTorch 1.0 with four NVIDIA Tesla T4 GPUs.

Result Our experiments are evaluated quantitatively on two benchmark datasets, Vid4 and the realistic and diverse scenes dataset (REDS4), and the proposed method obtains better results on the image quality indicators peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). We compare our model with 10 recognized super-resolution models, including single image super-resolution (SISR) and video super-resolution (VSR) methods, on the two common datasets (Vid4, REDS4). The quantitative evaluation uses PSNR and SSIM, and reconstructed images from each method are provided for comparison. The reconstruction results show that the proposed model recovers precise details. The effectiveness of the proposed alignment module with the polar axis constraint is verified by comparing no alignment against one and three layers of attention-constrained alignment: without alignment, the PSNR score is 22.11 dB; one ACB layer raises it by 1.81 dB; and the cascaded three-layer ACB raises it by a further 1.21 dB. This result proves the effectiveness of the attention constraint alignment blocks and shows that the cascaded three-layer ACB network captures long-distance spatial information. The dynamic fusion (DF) module is also verified, and the comparative experiment shows that the DF module improves reconstruction performance. On the Vid4 dataset, the PSNR score increases by more than 0.33 dB over EDVR_M, an improvement of about 1.2%; on the REDS4 dataset, it increases by 0.49 dB, about 1.6%. Moreover, at the same PSNR, the proposed model has nearly 50% fewer parameters than the recurrent back-projection network (RBPN). At the same number of parameters, our PSNR value is much higher than that of the dynamic upsampling filter (DUF). Although our model has slightly more parameters than EDVR_M, its PSNR is 0.21 dB higher.

Conclusion The polar axis constraint dramatically reduces the number of parameters of the attention alignment network. Long-distance information can be captured for feature alignment, and the network integrates the spatio-temporal features of video frames to achieve high-quality reconstruction results.
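For concreteness, the training recipe quoted above (Charbonnier penalty loss, Adam optimizer, initial learning rate 4e-4, mini-batch size 16) could be wired up as in the minimal sketch below; ACAN and the data loader are hypothetical placeholders, and the eps value inside the loss is an assumption.

import torch

def charbonnier_loss(pred, target, eps=1e-6):
    # Differentiable variant of the L1 loss: sqrt((x - y)^2 + eps^2);
    # eps is an assumed value, not quoted in the abstract.
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))

# model = ACAN()                                    # hypothetical network class
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
# for lr_frames, hr_frame in loader:                # 7 consecutive 64x64 RGB LR patches,
#     pred = model(lr_frames)                       # mini-batch size 16
#     loss = charbonnier_loss(pred, hr_frame)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()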
Keywords
