A survey of visual perception and understanding in adverse scenarios

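Wang Wenjing1, Yang Wenhan2, Fang Yuming3, Huang Hua4, Liu Jiaying1 (1. Wangxuan Institute of Computer Technology, Peking University, Beijing 100871; 2. Department of Strategic and Advanced Interdisciplinary, PengCheng Laboratory, Shenzhen 518055; 3. School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032; 4. School of Artificial Intelligence, Beijing Normal University, Beijing 100875)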

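Abstract
Images and videos captured in adverse scenarios suffer from complex visual degradation, which both impairs visual presentation and perceptual experience and makes visual analysis and understanding considerably harder. To this end, this paper systematically analyzes important recent research progress, both international and domestic, in visual perception and understanding in adverse scenarios, covering image and video degradation modeling, visual enhancement in adverse scenarios, and visual analysis and understanding in adverse scenarios. The part on visual data and degradation modeling discusses methods for modeling images, videos, and the degradation process under different degradation scenarios, including noise modeling, downsampling modeling, illumination modeling, and rain and fog modeling. The part on traditional visual enhancement discusses early non-deep-learning enhancement algorithms, including histogram equalization, Retinex theory, and filtering methods. The part on deep-learning-based visual enhancement is organized from the perspective of model architecture innovation and discusses convolutional neural networks, Transformer models, and diffusion models. Whereas traditional visual enhancement aims to comprehensively improve how the human eye perceives images and videos, the new generation of visual enhancement and analysis methods considers how well machine vision understands images and videos in degraded scenarios. The part on visual understanding in adverse scenarios discusses visual understanding datasets, deep-learning-based visual understanding, and the collaborative computation of visual enhancement and understanding. The paper reviews in detail the challenges of the above research and traces the development and frontier trends of the technology at home and abroad. Finally, based on this analysis, it looks ahead to future directions for visual perception and understanding in adverse scenarios.
Keywords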
Visual perception and understanding in degraded scenarios

Wang Wenjing1, Yang Wenhan2, Fang Yuming3, Huang Hua4, Liu Jiaying1 (1. Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China; 2. Department of Strategic and Advanced Interdisciplinary, PengCheng Laboratory, Shenzhen 518055, China; 3. School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032, China; 4. School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China)

Abstract
Visual media such as images and videos are crucial means for humans to acquire, express, and convey information. The widespread application of foundational technologies such as artificial intelligence and big data has gradually integrated systems for the perception and understanding of images and videos into all aspects of production and daily life. However, the emergence of massive applications also brings challenges. Specifically, in open environments, various applications generate vast amounts of heterogeneous data, and the resulting images and videos exhibit complex visual degradation. For instance, adverse weather conditions such as heavy fog reduce visibility, which results in the loss of details. Data captured in rainy or snowy weather can exhibit deformations of objects or people caused by raindrops, which act as structured noise. Low-light conditions can cause severe loss of details and structural information in images. Visual degradation not only diminishes the visual presentation and perceptual experience of images and videos but also significantly affects the usability and effectiveness of existing visual analysis and understanding systems. In today’s era of intelligence and information technology, with explosive growth in visual media data, visual perception and understanding technologies for challenging scenarios hold considerable scientific significance and practical value. Traditional visual enhancement techniques can be divided into two categories: spatial domain-based and frequency domain-based. Spatial domain methods directly process 2D spatial data and include grayscale transformation, histogram transformation, and spatial domain filtering. Frequency domain methods transform data into the frequency domain, for example through the Fourier transform, process it there, and then restore it to the spatial domain.
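The two traditional families described above can be illustrated with a minimal sketch. It is not taken from the surveyed works: histogram equalization stands in for a spatial-domain histogram transformation, and a Gaussian low-pass filter applied through the Fourier transform stands in for a frequency-domain method. The function names and the `sigma` parameter are hypothetical choices made for illustration, and grayscale input is assumed.

```python
import numpy as np


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Spatial-domain example: histogram equalization of a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied level
    denom = cdf[-1] - cdf_min
    if denom == 0:                     # constant image: nothing to stretch
        return gray.copy()
    # Map each gray level through the normalized CDF (lookup table of 256 entries).
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255.0), 0, 255).astype(np.uint8)
    return lut[gray]


def gaussian_lowpass_fft(gray: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Frequency-domain example: transform, attenuate high frequencies, transform back.

    `sigma` (in cycles per pixel) is an illustrative parameter, not a value from the survey.
    """
    h, w = gray.shape
    fy = np.fft.fftfreq(h)[:, None]    # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]    # horizontal frequencies
    transfer = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * sigma ** 2))
    spectrum = np.fft.fft2(gray.astype(np.float64))
    return np.real(np.fft.ifft2(spectrum * transfer))
```

In this sketch the equalization stretches a dark image over the full 0 to 255 range, while the low-pass filter leaves the mean intensity unchanged (the transfer function equals 1 at zero frequency) and suppresses fine detail and noise together, which is one reason such simple filters struggle in heavily degraded scenes.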
The development of computer vision technology has facilitated the emergence of better-designed and more robust visual enhancement algorithms, such as dehazing algorithms based on the dark channel prior. Since the 2010s, rapid advances in artificial intelligence have enabled the development of many visual enhancement methods based on deep learning models. These methods not only reconstruct damaged visual information but also further improve the visual presentation, which comprehensively enhances the perceptual experience of images and videos captured in challenging scenarios. As computer vision technology becomes more widespread, intelligent visual analysis and understanding are penetrating various aspects of society, such as face recognition and autonomous driving. However, visual enhancement in traditional digital image processing frameworks mainly focuses on improving visual effects and ignores the impact on high-level analysis tasks. This oversight severely reduces the usability and effectiveness of existing visual understanding systems. In recent years, several visual understanding datasets for challenging scenarios have been established, which has led to the development of numerous visual analysis and understanding algorithms for these scenarios. Domain transfer methods from ordinary scenes to challenging scenes are gaining attention as a way to further reduce reliance on datasets. Coordinating and optimizing the relationship between visual perception and visual presentation, which are two different task objectives, is also an important research problem in the field of visual computing. To address the development needs of the visual computing field in challenging scenarios, this study extensively reviews the challenges of the aforementioned research, outlines the developmental trends, and surveys frontier developments.
Specifically, this study reviews the technologies related to visual degradation modeling, visual enhancement, and visual analysis and understanding in challenging scenarios. The section on visual data and degradation modeling discusses methods for modeling image and video degradation processes in different degradation scenarios. These methods include noise modeling, downsampling modeling, illumination modeling, and rain and fog modeling. For noise modeling, Poissonian-Gaussian noise modeling is the most commonly used. For downsampling modeling, classical methods include bicubic interpolation and blur kernels. Noise, including that from JPEG compression, is also considered. A recent comprehensive model jointly uses blurring, downsampling, and noise. For illumination modeling, the Retinex theory is one of the most widely used; it decomposes images into illumination and reflectance. For rain and fog modeling, images are generally decomposed into rain and background layers. The traditional visual enhancement section covers the numerous algorithms developed to address the degradation of image and video information in adverse scenarios. Early algorithms often employed simple strategies, such as super-resolution methods based primarily on interpolation. However, these methods are constrained by linear models and struggle to restore high-frequency details. Researchers have proposed more sophisticated algorithms to address the complex degradation issues in adverse scenarios, including histogram equalization, Retinex theory, and filtering methods. Deep neural networks have shown remarkable performance in various fields such as image classification, object detection, and facial recognition. They also demonstrate excellent performance in low-level computer vision tasks such as super-resolution, style transfer, color conversion, and texture transfer.
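The degradation models named above can be sketched as forward simulators that turn a clean image into a degraded one. The code below is a hedged illustration under stated assumptions, not the formulation of any particular surveyed paper: images are grayscale floats in [0, 1]; the `gain`, `read_sigma`, `scale`, and `noise_sigma` parameters are hypothetical; a box average replaces the bicubic and blur kernels mentioned in the text; and the rain model assumes the common additive composition of a background layer and a rain layer.

```python
import numpy as np


def poisson_gaussian_noise(clean, gain=0.01, read_sigma=0.02, rng=None):
    """Poissonian-Gaussian noise: signal-dependent shot noise plus signal-independent read noise."""
    rng = np.random.default_rng() if rng is None else rng
    shot = rng.poisson(clean / gain) * gain              # variance grows with intensity
    read = rng.normal(0.0, read_sigma, size=clean.shape)  # constant variance
    return shot + read


def blur_downsample_noise(clean, scale=2, noise_sigma=0.01, rng=None):
    """Joint degradation: blur, downsample, then add noise.

    Assumption: a scale x scale box average stands in for the blur kernel, and
    block averaging performs the blur and the decimation in a single step.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = clean.shape
    h, w = h - h % scale, w - w % scale                  # crop to a multiple of scale
    blocks = clean[:h, :w].reshape(h // scale, scale, w // scale, scale)
    low = blocks.mean(axis=(1, 3))
    return low + rng.normal(0.0, noise_sigma, size=low.shape)


def retinex_compose(reflectance, illumination):
    """Retinex model: observed image = reflectance * illumination (element-wise)."""
    return reflectance * illumination


def add_rain_layer(background, rain):
    """Layer model (assumed additive): observed image = background layer + rain layer."""
    return np.clip(background + rain, 0.0, 1.0)
```

Enhancement methods can be read as attempts to invert these forward models, for example by estimating the illumination map in `retinex_compose` or separating the two layers in `add_rain_layer`, which is ill-posed because many clean images map to the same degraded observation.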
With the continuous evolution of deep neural network frameworks, researchers have proposed diverse visual enhancement methods. The section on visual enhancement based on deep learning models is organized from the perspective of model architecture innovation. It discusses architectures such as convolutional neural networks, Transformer models, and diffusion models. Unlike traditional visual enhancement, which aims to comprehensively improve human visual perception of images and videos, the new generation of visual enhancement and analysis methods considers how well machine vision understands images and videos in degraded scenarios. The section on visual understanding technology in challenging scenarios discusses deep-learning-based visual understanding and its corresponding datasets. It also explores the collaborative computation of visual enhancement and understanding in challenging scenarios. Finally, based on this analysis, the study outlines prospects for the future development of visual perception and understanding in adverse scenarios. In complex degradation scenarios, real-world images may be simultaneously affected by various factors such as heavy rain and fog, dynamic changes in lighting, low-light environments, and image corruption. This condition requires models to handle unknown and diverse image features. The current challenge is that most existing models are designed for specific degradation scenarios. Such designs build in a significant amount of prior knowledge, which makes them difficult to adapt to other degradation scenarios. The construction of existing visual understanding models in adverse scenarios relies on downstream task information, including the target domain data distribution, degradation priors, and pre-trained models for downstream tasks. This reliance makes it difficult to achieve robustness across arbitrary tasks and analysis models.
Moreover, most methods are limited to a specific downstream machine analysis task and cannot generalize to new downstream tasks. Finally, large models have achieved significant results in various fields in recent years. Many studies have demonstrated the unprecedented potential of large models in enhancement, reconstruction, and other low-level vision tasks. However, the high complexity of large models also presents challenges, including substantial computational resource requirements, long training times, and difficulties in model optimization. At the same time, the generalization capability of models in adverse scenarios is a pressing challenge that requires more comprehensive data construction strategies and more effective model optimization methods. How to improve the performance and reliability of large visual models in visual perception and understanding tasks in adverse scenarios remains a key unsolved problem.
Keywords
