Progressive iteration network for hole filling in virtual view rendering
Abstract
Objective: Depth image-based rendering (DIBR) is the key technology for synthesizing virtual view images, but the rendered virtual views suffer from cracks and holes. To address the pixel blending and blurring that traditional algorithms produce in large hole regions, this paper applies a deep learning model to hole filling in virtual view rendering and proposes a progressive iterative network for this task. Method: First, partial convolutions progressively repair large holes. Then, a U-Net is adopted as the backbone to encode and decode the hole regions, and a knowledge-consistent attention module is embedded to strengthen the network's use of valid features. Next, the feature maps generated in each progressive iteration are fused by a weighted merging method, protecting early features from corruption. Finally, a contextual feature propagation loss is incorporated to improve the robustness of the network's matching process. Result: Quantitative and qualitative evaluations were conducted on two multi-view 3D (three-dimension) video sequences provided by Microsoft Labs and four 3D-HEVC (3D high efficiency video coding) sequences, with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. The experimental results show that the proposed algorithm outperforms existing methods both subjectively and objectively. Compared with the second-best model, it improves PSNR by 1.302 dB, 1.728 dB, 0.068 dB, and 0.766 dB and SSIM by 0.007, 0.002, 0.002, and 0.033 on the Ballet, Breakdancers, Lovebird1, and Poznan_Street datasets, and improves PSNR by 0.418 dB and 0.793 dB and SSIM by 0.011 and 0.007 on the Newspaper and Kendo datasets. Ablation experiments further verify the effectiveness of the method. Conclusion: The proposed progressive iterative network resolves the tedious processing and severe foreground texture infiltration of traditional hole-filling algorithms in virtual view rendering and achieves highly competitive filling results.
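To make the partial-convolution step in the abstract above concrete, the following is a minimal sketch of a partial convolution layer with mask updating, in the spirit of the progressive repair described there. PyTorch is assumed; the class name, kernel size, and bias-free design are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a partial convolution with mask update, assuming
# PyTorch. Names and hyperparameters are illustrative, not the paper's
# exact design; bias is omitted to keep the renormalization simple.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        # Fixed all-ones kernel that only counts valid pixels per window.
        self.register_buffer("ones",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding
        self.window = float(kernel_size * kernel_size)

    def forward(self, x, mask):
        # mask: (B, 1, H, W), 1 for known pixels, 0 inside holes.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones,
                             stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)
        # Renormalize by the fraction of valid pixels under each window
        # so that hole pixels do not dilute the response.
        out = out * (self.window / valid.clamp(min=1.0))
        # Mask update: a position becomes valid once any known pixel
        # falls inside its receptive field, so the hole shrinks each pass.
        new_mask = (valid > 0).float()
        return out * new_mask, new_mask
```

Applied repeatedly, the returned mask shrinks step by step until the hole is closed, which matches the progressive repair behavior described above.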
Keywords
Progressive iteration network for hole filling in virtual view rendering
Liu Jiaxi, Zhou Yang, Lin Kun, Yin Haibing, Tang Xianghong (School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)
Abstract
Objective: Depth image-based rendering (DIBR) makes full use of the depth information in a reference image and combines the color image and depth information organically, making it faster and less complex than general rendering methods. DIBR has therefore been selected by ISO as the primary virtual view rendering technology for 3D multimedia video. The principal challenge in virtual view rendering is that 3D warping of the reference view exposes background regions previously occluded by the foreground; because these regions have no pixel values, they appear as holes in the virtual view. Finding an effective way to fill these missing regions in the rendered view is thus a critical problem in virtual view rendering technology. Traditional algorithms mainly fill holes using spatial-consistency or temporal-consistency methods. Filtering can effectively remove cracks and some small holes but cannot handle large-area holes. Patch-based methods can fill large-area holes, but the process is tedious, the amount of data is large, and the accuracy of searching for the best matching patch is low, so texture belonging to the foreground may be incorrectly filled into hole regions that belong to the background. Temporal-consistency methods reconstruct the vacant background with background models and reposition the foreground to the virtual viewpoint, which reduces computational complexity and increases adaptability to the scene. However, a moving-camera scene contains both stationary and moving objects, so parts of the foreground are easily modeled as background, causing foreground and background pixels to blend. Therefore, this study applies a deep learning model to hole filling in virtual view rendering and proposes a progressive iterative network to address the pixel blending and blurring that traditional algorithms produce in large hole regions.

Method: A progressive iterative network based on a convolutional neural network is built. The model mainly consists of a knowledge-consistent attention module, a contextual feature propagation loss module, and a weighted merging module. First, partial convolutions are used in the initial stage of the network to progressively repair large holes. Each partial convolution operates only on the valid pixels around the hole region, and the updated mask is carried over and shrunk at every iteration, which benefits the extraction of shallow valid features. Then, a U-Net is used as the backbone to encode and decode the hole regions, with skip connections cascading shallow and deep information to compensate for missing information. To select effective features, a knowledge-consistent attention module is embedded. This module computes the attention score by weighting the current score with the score obtained in the previous iteration, which establishes correlations between patches across successive iterations and effectively avoids the foreground-background pixel blending seen in traditional algorithms.
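The score-reuse idea behind the knowledge-consistent attention module can be sketched as follows. This is an interpretation under stated assumptions (PyTorch, a fixed blending weight `lam`), not the paper's exact formulation.

```python
# Illustrative sketch of knowledge-consistent attention scoring:
# the current patch-similarity scores are blended with the scores
# from the previous iteration. `lam` is assumed fixed here; in
# practice it could be made learnable.
import torch
import torch.nn.functional as F

def kc_attention_scores(sim, prev_scores=None, lam=0.5):
    """sim: (B, N, N) raw similarities between valid background patches
    and hole patches; prev_scores: blended scores from the last iteration."""
    scores = F.softmax(sim, dim=-1)
    if prev_scores is not None:
        # Blending ties successive iterations together, so the patches
        # chosen to fill a hole stay consistent and foreground texture
        # is less likely to leak into background regions.
        scores = lam * scores + (1.0 - lam) * prev_scores
    return scores
```

Because both terms of the blend are normalized score maps, their convex combination remains a valid attention distribution across iterations.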
The contextual feature propagation loss module is also used in the progressive iterative network alongside the attention module. It complements the knowledge-consistent attention module by reducing the difference between the reconstructed images in the encoder and decoder, enhancing the robustness of the network's matching process. In addition, it creates semantically consistent patches to fill background holes by using auxiliary images as guidance. Furthermore, a pre-trained VGG-16 (Visual Geometry Group) feature extractor is employed so that L1 loss, perceptual loss, style loss, and smoothing loss jointly guide the model, ultimately enhancing the resemblance between the reference and target views. Lastly, the feature maps produced in each successive iteration are integrated via a weighted merging approach: an adaptive map is learned by concatenating a soft weight map with the output feature maps of each iteration, and the merged result preserves original feature information, protects early features from corruption, and thereby prevents gradient erosion (a minimal code sketch of this merging step follows the Conclusion below).

Result: The experiments were evaluated quantitatively and qualitatively on two multi-view 3D video sequences provided by Microsoft Labs and four 3D high efficiency video coding (3D-HEVC) sequences. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics were used to measure performance, and a set of hole masks suitable for virtual view rendering was collected for training. The experimental results demonstrate that our model yields the most reasonable images in terms of subjective perceptual quality. Compared with the second-best model, our model improves PSNR by 1.302 dB, 1.728 dB, 0.068 dB, and 0.766 dB and SSIM by 0.007, 0.002, 0.002, and 0.033 on the Ballet, Breakdancers, Lovebird1, and Poznan_Street datasets, respectively. Similarly, compared with the second-best deep learning model, PSNR increases by 0.418 dB and 0.793 dB and SSIM by 0.011 and 0.007, respectively, on the Newspaper and Kendo datasets. In addition, a series of ablation experiments verifies the effectiveness of each component of the model, including the knowledge-consistent attention module, the contextual feature propagation loss module, the weighted merging module, and the number of iterations.

Conclusion: In this study, deep learning is applied to hole filling in virtual view rendering, and the proposed progressive iterative network model was validated experimentally. The model performs exceptionally well in avoiding tedious processing and minimizing foreground texture infiltration, leading to superior filling results. However, it exhibits some limitations: although it can focus on effective texture features, its overall efficiency still requires improvement. Moreover, the depth maps associated with 3D video sequences could be used as guidance, enabling the convolutional neural network to capture more intricate structures and further improving performance. Future work may combine frame interpolation and inpainting techniques to exploit the motion-related information of objects over time.
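As referenced in the Method above, here is a minimal sketch of one plausible realization of the adaptive weighted merging step: a learned soft weight map gates a convex combination of the previous and current iterations' feature maps. PyTorch is assumed, and the 1x1-convolution weight predictor is a hypothetical design choice, not necessarily the paper's architecture.

```python
# Minimal sketch of adaptive weighted merging of per-iteration features,
# assuming PyTorch. The 1x1-conv weight predictor is a hypothetical
# design choice, not necessarily the paper's exact architecture.
import torch
import torch.nn as nn

class WeightedMerge(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_weight = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=1),
            nn.Sigmoid())

    def forward(self, feat_prev, feat_curr):
        # Soft weight map in [0, 1], predicted from both feature maps.
        g = self.to_weight(torch.cat([feat_prev, feat_curr], dim=1))
        # Convex combination keeps early features available instead of
        # overwriting them, which helps prevent gradient erosion.
        return g * feat_curr + (1.0 - g) * feat_prev
```

The convex combination means neither iteration's features are discarded outright; the network learns, per spatial location, how much of the early reconstruction to retain.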
Keywords