A survey of weakly supervised semantic segmentation methods based on deep learning

Xiang Weikang1,2, Zhou Quan1,2, Cui Jingcheng1, Mo Zhiyi2, Wu Xiaofu1, Ou Weihua3, Wang Jingdong4, Liu Wenyu5 (1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. Guangxi Colleges and Universities Key Laboratory of Intelligent Software, Wuzhou University, Wuzhou 543003, China; 3. School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550025, China; 4. Baidu, Beijing 100085, China; 5. School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430071, China)

Abstract
Semantic segmentation is a fundamental task in computer vision that aims to assign a semantic category label to every pixel, achieving pixel-level understanding of images. Thanks to the development of deep learning, fully supervised semantic segmentation methods based on deep learning have made great progress. However, these methods usually require large amounts of training data with pixel-level annotations, whose enormous labeling cost limits their application in practical scenarios such as autonomous driving, medical image analysis, and industrial control. To reduce annotation cost and further broaden the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation (WSSS) based on deep learning, which seeks to predict pixel-level segmentation from weak annotations such as image-level labels, bounding boxes, scribbles, and points. This survey first briefly introduces the semantic segmentation task and analyzes the dilemma faced by fully supervised semantic segmentation, thereby motivating WSSS. It then introduces the relevant datasets and evaluation metrics. Next, according to the type of weak annotation and the degree of research attention it has received, the survey reviews and discusses the research progress of WSSS from three aspects: image-level annotations, other weak annotations, and assistance from large-scale models, where the second category covers WSSS based on bounding-box, scribble, and point annotations. Finally, it analyzes the remaining problems and challenges in the field and suggests possible future research directions, aiming to further advance research on WSSS.
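The standard evaluation metric for the semantic segmentation benchmarks mentioned above is mean intersection over union (mIoU): the per-class overlap between predicted and ground-truth label maps, averaged over classes. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection over union between predicted and ground-truth
    label maps, averaged over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

A perfect prediction scores 1.0; on PASCAL VOC 2012, the average is taken over 21 classes (20 object categories plus background).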
Keywords
Weakly supervised semantic segmentation based on deep learning

Xiang Weikang1,2, Zhou Quan1,2, Cui Jingcheng1, Mo Zhiyi2, Wu Xiaofu1, Ou Weihua3, Wang Jingdong4, Liu Wenyu5(1.School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;2.Guangxi Colleges and Universities Key Laboratory of Intelligent Software, Wuzhou University, Wuzhou 543003, China;3.School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550025, China;4.Baidu, Beijing 100085, China;5.School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430071, China)

Abstract
Semantic segmentation is an important and fundamental task in the field of computer vision. Its goal is to assign a semantic category label to each pixel in an image, achieving pixel-level understanding. It has wide applications in areas such as autonomous driving, virtual reality, and medical image analysis. Thanks to the development of deep learning in recent years, remarkable progress has been achieved in fully supervised semantic segmentation, which requires a large amount of training data with pixel-level annotations. However, accurate pixel-level annotations are difficult to provide because they demand substantial time, money, and human labeling effort, thus limiting the widespread application of fully supervised methods in practice. To reduce the cost of annotating data and further expand the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation (WSSS) based on deep learning. The goal is to develop a semantic segmentation model that utilizes weak annotation information instead of dense pixel-level annotations to predict pixel-level segmentation accurately. Weak annotations mainly include image-level, bounding-box, scribble, and point annotations. The key problem in WSSS lies in how to utilize the limited annotation information, incorporate appropriate training strategies, and design powerful models to bridge the gap between weak supervision and pixel-level annotations. This study aims to classify and summarize WSSS methods based on deep learning, analyze the challenges and problems encountered by recent methods, and provide insights into future research directions. First, we introduce WSSS as a solution to the limitations of fully supervised semantic segmentation. Second, we introduce the related datasets and evaluation metrics.
Third, we review and discuss the research progress of WSSS in three categories: image-level annotations, other weak annotations, and assistance from large-scale models, where the second category includes bounding-box, scribble, and point annotations. Specifically, image-level annotations only provide the object categories contained in an image, without specifying the positions of the target objects. Existing methods typically follow a two-stage training process: producing a class activation map (CAM), also known as the initial seed region, which is used to generate high-quality pixel-level pseudo labels; and training a fully supervised semantic segmentation model using the produced pixel-level pseudo labels. According to whether the pixel-level pseudo labels are updated during training in the second stage, WSSS based on image-level annotations can be further divided into offline and online approaches. Offline approaches treat the two stages independently: the initial seed regions are optimized to obtain more reliable pixel-level pseudo labels that remain unchanged throughout the second stage. They are often divided into six classes according to the optimization strategy: ensembles of CAMs, image erasing, co-occurrence relationship decoupling, affinity propagation, additional supervised information, and self-supervised learning. In online approaches, the pixel-level pseudo labels keep updating during the entire training process of the second stage, and the production of pixel-level pseudo labels and the semantic segmentation model are jointly optimized. The online counterparts can be trained end to end, making the training process more efficient. Compared with image-level annotations, other weak annotations, including bounding boxes, scribbles, and points, are stronger supervision signals. Among them, bounding-box annotations not only provide object category labels but also include information about object positions.
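The first stage described above can be illustrated with the classic CAM formulation, which weights the final convolutional feature maps by the classifier weights of a chosen class. A minimal NumPy sketch, assuming a network with global average pooling before the classification layer (array shapes and names are illustrative):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features:   (K, H, W) feature maps from the last conv layer.
    fc_weights: (C, K) classification-layer weights applied after
    global average pooling. Returns an (H, W) map in [0, 1] that
    highlights regions supporting class `class_idx`."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)       # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()      # normalize for seed thresholding
    return cam
```

Thresholding the normalized map (e.g., keeping pixels above a fixed value) yields the initial seed regions from which pixel-level pseudo labels are then grown.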
The regions outside the bounding box are generally considered background, while the box region contains both foreground and background. Therefore, for bounding-box annotations, existing research mainly focuses on accurately distinguishing foreground from background within the bounding box, thereby producing more accurate pixel-level pseudo labels for training subsequent semantic segmentation networks. Scribble and point annotations not only indicate the categories of objects contained in the image but also provide local positional information about the target objects. For scribble annotations, more complete pseudo labels can be produced to supervise semantic segmentation by inferring the categories of unlabeled regions from the annotated scribbles. For point annotations, the associated semantic information is expanded to the entire image through label propagation, distance metric learning, and loss function optimization. In addition, given the rapid development of large-scale models, this paper further discusses recent research on using large-scale models to assist WSSS. Large-scale models can leverage their pretrained universal knowledge to understand images and generate accurate pixel-level pseudo labels, thus improving the final segmentation performance. This paper also reports quantitative segmentation results on the pattern analysis, statistical modeling and computational learning visual object classes 2012 (PASCAL VOC 2012) dataset to evaluate the performance of different WSSS methods. Finally, four challenges and potential future research directions are identified. First, a performance gap remains between weakly supervised and fully supervised methods; to bridge it, research should keep improving the accuracy of pixel-level pseudo labels. Second, when WSSS models are applied to real-world scenarios, they may encounter object categories that never appeared in the training data, which requires the models to adapt so as to identify and segment unknown objects. Third, existing research mainly focuses on improving accuracy without considering the model size and inference speed of WSSS networks, posing a major challenge for deploying models in real-world applications that require real-time estimation and online decisions. Fourth, the scarcity of datasets for evaluating WSSS models and algorithms is another major obstacle, which leads to performance degradation and limits generalization capability. Therefore, large-scale, high-quality WSSS datasets with great diversity and wide variation of image types must be constructed.
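The bounding-box treatment described in the abstract (outside every box is certain background; a box interior mixes foreground and background) can be sketched as a first pseudo-labeling step. The `IGNORE` index and box format below are illustrative assumptions; the ignored in-box pixels would later be assigned the box's class or background by a foreground/background separation method:

```python
import numpy as np

IGNORE = 255  # conventional "ignore" index in segmentation training

def box_pseudo_label(height, width, boxes):
    """boxes: list of (class_id, x0, y0, x1, y1) with half-open pixel
    ranges. Pixels outside every box are certain background (class 0);
    pixels inside a box are marked IGNORE until a later step separates
    foreground (assigned class_id) from background."""
    label = np.zeros((height, width), dtype=np.uint8)  # background everywhere
    for _class_id, x0, y0, x1, y1 in boxes:
        label[y0:y1, x0:x1] = IGNORE  # uncertain box interior
    return label
```

This cheap initialization already supervises most background pixels, which is why bounding boxes are a much stronger signal than image-level labels.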
Keywords
