
Deep learning based video-related super-resolution technique: a survey

Jiang Junjun1, Cheng Hao1, Li Zhenyu1, Liu Xianming1, Wang Zhongyuan2 (1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; 2. School of Computer, Wuhan University, Wuhan 430072, China)

Abstract

Video super-resolution (VSR) aims to restore a high-resolution video from its low-resolution version. It has been developing rapidly and is applied in domains such as satellite remote sensing, video surveillance, medical imaging, and consumer electronics. To reconstruct high-resolution frames, conventional VSR methods estimate the underlying motion and the blur kernel parameters, which makes them hard to apply across heterogeneous scenes. Because they can fully exploit the spatio-temporal information in video to quickly reconstruct realistic, natural textures, deep learning based VSR algorithms have advanced dramatically. We systematically review and analyze the current state of deep learning based video super-resolution. First, popular YCbCr datasets are introduced, such as YUV25, YUV21, and ultra video group (UVG), together with RGB datasets such as video 4 (Vid4), realistic and dynamic scenes (REDS), and Vimeo90K. The profile of each dataset is summarized, including its name, year of publication, number of videos, number of frames, and resolution. Next, the main evaluation metrics for VSR algorithms are introduced in detail: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), video quality model for variable frame delay (VQM_VFD), and learned perceptual image patch similarity (LPIPS). We then contrast video super-resolution with single image super-resolution: the former can draw on the rich motion information shared between video frames. If a video is processed frame by frame with a single image super-resolution method, the reconstructed video contains a large number of artifacts.
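As an illustration of the first metric listed above, PSNR can be sketched in a few lines. This is a minimal, dependency-free sketch rather than the evaluation code used in the surveyed papers; the function name `psnr`, the list-of-rows frame format, and the default peak value of 255 (8-bit data) are assumptions made for the example.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Return the PSNR in dB between two equally sized 2D frames.

    Frames are lists of rows of gray values (hypothetical format chosen
    for this sketch); `peak` is the maximum possible pixel value.
    """
    if len(reference) != len(restored) or any(
        len(a) != len(b) for a, b in zip(reference, restored)
    ):
        raise ValueError("frames must have the same shape")

    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for r, o in zip(ref_row, out_row):
            squared_error += (r - o) ** 2
            count += 1

    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```

For example, `psnr([[100, 100]], [[110, 90]])` has a mean squared error of 100 and evaluates to about 28.13 dB. For video, the per-frame values are commonly averaged; whether they are computed on RGB or only on the luminance channel varies between benchmarks.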
We then analyze deep learning based video super-resolution methods, whose two key technical challenges are image alignment and feature fusion. Because methods differ greatly in whether and how they use an image alignment module, we divide them into alignment methods and non-alignment methods. Multi-frame information is fused by the network structure, such as generative adversarial networks (GAN), recurrent neural networks (RNN), and Transformer. Alignment methods use different motion estimation and motion compensation modules to process video features and align neighboring frames with the target frame. They fall into three categories: optical flow alignment, kernel-based alignment, and deformable convolution alignment. Optical flow alignment computes the motion between two frames from the temporal changes in pixel intensity, and a motion compensation module then warps the neighboring frames accordingly. According to the structure of the deep convolutional neural network (CNN) they use, we further divide optical flow alignment methods into four categories: 2D convolution, RNN, GAN, and Transformer. For 2D convolution methods with optical flow alignment, we mainly introduce the video efficient sub-pixel convolutional network (VESPCN) and later improvements to its optical flow estimation and motion compensation networks, such as ToFlow and the spatial-temporal transformer network (STTN). For RNN methods with optical flow alignment, we analyze the residual recurrent convolutional network (RRCN), the recurrent back-projection network (RBPN), and related methods that use optical flow to align neighboring frames at the image level, which helps overcome the limitations of sliding-window methods.
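The motion compensation step described above, in which a neighboring frame is warped toward the target frame with an estimated optical flow, can be sketched as backward warping with bilinear interpolation. This is an illustrative sketch only: the name `warp_frame`, the list-based frame format, the flow convention (target pixel `(x, y)` samples the neighbor at `(x + dx, y + dy)`), and the border clamping are assumptions, and the flow itself is taken as given rather than estimated by a network.

```python
import math


def warp_frame(neighbor, flow):
    """Backward-warp `neighbor` toward the target frame.

    neighbor: 2D list (H x W) of gray values.
    flow: 2D list (H x W) of (dx, dy) pairs. For target pixel (x, y), the
          matching location in `neighbor` is assumed to be (x + dx, y + dy).
    """
    height, width = len(neighbor), len(neighbor[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the frame (border padding).
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation between the four surrounding pixels.
            top = (1 - wx) * neighbor[y0][x0] + wx * neighbor[y0][x1]
            bottom = (1 - wx) * neighbor[y1][x0] + wx * neighbor[y1][x1]
            warped[y][x] = (1 - wy) * top + wy * bottom
    return warped
```

With a constant flow of `(1, 0)`, every output pixel takes the value one pixel to its right in the neighboring frame, with the last column repeated at the border. Image-level alignment applies this kind of warp to the frames themselves, while feature-level alignment applies the same operation to intermediate feature maps.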
We also cover BasicVSR (basic video super-resolution), IconVSR (information-refill mechanism and coupled propagation video super-resolution), and other networks that warp neighboring frames at the feature level and thereby achieve excellent reconstruction performance. The optical flow alignment based TecoGAN (temporal coherence via self-supervision for GAN-based video generation) and VSR Transformer methods are introduced in detail as well. Because only a few methods use kernel-based or deformable convolution based alignment, we do not classify them further by network structure. The convolution kernel size limits the range of motion that can be estimated, so the reconstruction performance of kernel-based alignment methods is relatively poor. Deformable convolution improves on the sampling of conventional convolution, but it still has drawbacks such as high computational complexity and demanding convergence conditions. Non-alignment methods instead rely on the network structure itself to capture, to a certain extent, the correlation between video frames. We review and analyze non-aligned 3D convolution, non-aligned RNN, non-aligned GAN, and non-local methods. The non-aligned RNN methods include recurrent latent space propagation (RLSP), the recurrent residual network (RRN), and omniscient video super-resolution (OVSR), which show that a balance can be achieved between reconstruction speed and visual quality. For non-aligned non-local methods, we focus on improved non-local modules that reduce the computational cost. All models are evaluated at 4× downsampling under two degradations: bicubic interpolation (BI) and blur downsampling (BD). Quantitative results and speed comparisons of the super-resolution methods on several datasets, including REDS4, UDM10, and Vid4, are summarized as well. Some of these results still leave room for improvement.
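To make the BD degradation mentioned above concrete, the following sketch blurs a frame with a separable Gaussian kernel and then keeps every fourth pixel in each direction. It is a hedged illustration, not the exact protocol of any surveyed method: the function names, the standard deviation of 1.6, the kernel radius of 6, and the replicated borders are all assumptions chosen for the example.

```python
import math


def gaussian_kernel(sigma=1.6, radius=6):
    """Return a normalized 1D Gaussian kernel (sigma and radius are assumed)."""
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with `kernel`, replicating the border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(col) for col in zip(*image)]


def bd_downsample(frame, scale=4, sigma=1.6):
    """Gaussian blur followed by subsampling every `scale`-th pixel."""
    kernel = gaussian_kernel(sigma)
    blurred = blur_rows(frame, kernel)  # horizontal pass
    blurred = transpose(blur_rows(transpose(blurred), kernel))  # vertical pass
    return [row[::scale] for row in blurred[::scale]]
```

Applied to an 8×8 frame, `bd_downsample` returns a 2×2 frame. The BI degradation differs in that it resamples with bicubic interpolation instead of blurring and subsampling, so models trained under one degradation are usually reported separately from those trained under the other.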
Overall, the reconstruction performance of video super-resolution networks has improved steadily, model sizes have gradually shrunk, and training and inference have become faster. However, the application of deep learning to video super-resolution still needs further development. We expect that the adaptability of the networks needs to improve and that their results need further validation. We outline nine directions for future work: network training and optimization, video super-resolution for ultra-high-resolution video, compressed video super-resolution, video rescaling methods, self-supervised video super-resolution, arbitrary-scale video super-resolution, spatio-temporal video super-resolution, auxiliary task-guided video super-resolution, and scenario-customized video super-resolution.
