Boundary-guided video salient object detection with multi-feature aggregation
Zhang Rongguo1, Zheng Xiaoge1, Wang Lifang1, Hu Jing1, Liu Xiaojun2 (1. School of Computer Science and Technology, Taiyuan University of Science and Technology; 2. School of Mechanical Engineering, Hefei University of Technology)
Abstract
Objective Video salient object detection aims to identify and highlight important objects or regions in a video. Existing methods fall short in exploiting the correlation between boundary cues and spatiotemporal features, and they do not adequately consider relevant contextual information during feature aggregation, which leads to imprecise detection results. We therefore propose a boundary-guided network with multi-feature aggregation that enables complementary collaboration between salient object boundary information and salient object spatiotemporal information. Method First, the spatial and motion features of salient objects in video frames are extracted, and the salient object boundary features are coupled with the salient object spatiotemporal features at different resolutions, highlighting the boundary features of moving objects and locating salient objects in the video more accurately. Second, a multi-layer feature attention aggregation module is adopted to improve the representation capability of the different features so that each distinct feature can be fully exploited. In addition, a mixed loss is employed during training to help the network learn, so that the salient boundary regions of moving objects are segmented more accurately and the expected salient objects are obtained. Result Experiments compare the proposed method with five existing methods on four datasets, and the proposed method achieves higher F-measure values than the comparison methods on all four datasets. On the DAVIS (densely annotated video segmentation) dataset, the F-measure is 0.2% higher than that of the best-performing model, while the S-measure is 0.7% below the best value. On the FBMS (Freiburg-Berkeley motion segmentation) dataset, the F-measure is 0.9% higher than the second-best value. On the ViSal dataset, the MAE (mean absolute error) is only 0.1% behind that of the best method STVS, and the F-measure is 0.2% higher than that of STVS. On the MCL dataset, the proposed method achieves the best MAE of 2.2%, and its S-measure and F-measure are 1.6% and 0.6% higher, respectively, than those of the second-best method SSAV (saliency-shift-aware VSOD). Conclusion The experiments show that the proposed method can effectively improve the boundary quality of detected salient objects in videos.
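To make the boundary coupling step described above more concrete, the following is a minimal PyTorch-style sketch of how a boundary feature could be built from low-level edge cues and high-level location cues and then fused with a spatiotemporal feature at one resolution. The module name BoundaryCouple, the channel sizes, and all layer choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoundaryCouple(nn.Module):
    """Hypothetical sketch: derive a boundary feature from low-level edge cues and
    high-level location cues, then couple it with a spatiotemporal feature map."""

    def __init__(self, low_ch, high_ch, st_ch, mid_ch=64):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, kernel_size=1)
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, kernel_size=1)
        self.boundary_conv = nn.Sequential(
            nn.Conv2d(mid_ch * 2, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch + st_ch, st_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(st_ch), nn.ReLU(inplace=True))

    def forward(self, low_feat, high_feat, st_feat):
        # Upsample the high-level (global location) feature to the low-level resolution.
        high_up = F.interpolate(self.reduce_high(high_feat),
                                size=low_feat.shape[2:], mode='bilinear',
                                align_corners=False)
        # Combine local edge cues and global location cues into a boundary feature.
        boundary = self.boundary_conv(
            torch.cat([self.reduce_low(low_feat), high_up], dim=1))
        # Resize to the spatiotemporal feature resolution and couple the two.
        boundary = F.interpolate(boundary, size=st_feat.shape[2:],
                                 mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([boundary, st_feat], dim=1))
```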
Keywords
Boundary-guided video salient object detection with multi-feature aggregation
Zhang Rongguo1, Zheng Xiaoge1, Wang Lifang1, Hu Jing1, Liu Xiaojun2 (1. School of Computer Science and Technology, Taiyuan University of Science and Technology; 2. School of Mechanical Engineering, Hefei University of Technology)
Abstract
Objective The purpose of video salient object detection is to identify and highlight important objects or regions in a video. This task has been widely applied in various computer vision tasks such as target tracking, medical analysis, and video surveillance. Over the past few decades, significant progress has been made in the field of video salient object detection, thanks to the development of deep learning technologies, especially the widespread application of convolutional neural networks. Deep learning models can automatically learn feature representations of salient objects from large amounts of annotated data, thereby achieving efficient detection and localization of salient objects. However, existing methods often fall short in exploring the correlation between boundary cues and spatiotemporal features. Additionally, they fail to adequately consider relevant contextual information during feature aggregation, which leads to imprecise detection results. Therefore, we propose a boundary-guided video salient object detection network with multi-feature aggregation (MFABG), which integrates salient object boundary information and salient object spatiotemporal information within a unified model, fostering complementary collaboration between them. Method First, two adjacent video frames are used to generate optical flow maps, and the spatial and motion features of salient objects are extracted from the RGB images and the optical flow maps of the video frames, respectively. By integrating low-level local edge information and high-level global position information from the spatial features, the boundary features of salient objects in video frames are obtained. At different resolutions, the boundary features of salient objects are coupled with the features of the salient objects themselves. The interaction and cooperation between the boundary features and the salient object features enhance the complementarity between the two types of information, emphasizing and refining the boundary features of the objects and thus localizing salient objects in video images more accurately. Then, to fully utilize the extracted multi-level features and achieve selective, dynamic aggregation of multi-level features that are inconsistent in semantics and scale, a multi-layer feature attention aggregation module is used to enhance the representation capability of the features. This is done by varying the spatial pooling size to obtain channel attention and by using point-wise convolutions to aggregate local and global contextual information along the channel dimension, which makes the network attend both to large objects with a more global distribution and to small objects with a more local distribution, facilitating the recognition and detection of salient objects under extreme scale variations and thereby generating the final salient object detection map. In addition, random rotation and multi-scale training (scale values set to {0.75, 1, 1.25}) are employed in the training stage, together with a mixed loss. The mixed loss combines the WBCE (weighted binary cross-entropy) loss, the SSIM (structural similarity) loss, the Dice loss from the boundary guidance module, and the IOU (intersection over union) loss, helping the network learn the mapping between input images and ground truth at the pixel, block, and image levels, with the aim of segmenting salient object regions with clear boundaries more accurately.
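As a rough illustration of the mixed loss described above, the sketch below combines WBCE, SSIM, IOU, and Dice terms in PyTorch for tensors of shape (B, 1, H, W). The equal weighting of the four terms, the single-scale average-pooling approximation of SSIM, and the pixel_weight argument (e.g., weights emphasizing hard or boundary pixels) are assumptions made for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def wbce_loss(logits, target, weight):
    # Weighted binary cross-entropy on raw logits.
    return (weight * F.binary_cross_entropy_with_logits(
        logits, target, reduction='none')).mean()


def ssim_loss(logits, target, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    # 1 - SSIM, computed with an average-pooling window (simplified, single scale).
    pred = torch.sigmoid(logits)
    mu_x = F.avg_pool2d(pred, window, 1, window // 2)
    mu_y = F.avg_pool2d(target, window, 1, window // 2)
    sigma_x = F.avg_pool2d(pred * pred, window, 1, window // 2) - mu_x ** 2
    sigma_y = F.avg_pool2d(target * target, window, 1, window // 2) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, window, 1, window // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return 1 - ssim.mean()


def dice_loss(logits, target, eps=1.0):
    pred = torch.sigmoid(logits)
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()


def iou_loss(logits, target, eps=1.0):
    pred = torch.sigmoid(logits)
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3)) - inter
    return (1 - (inter + eps) / (union + eps)).mean()


def mixed_loss(sal_logits, boundary_logits, sal_gt, boundary_gt, pixel_weight):
    # Pixel-level (WBCE), block-level (SSIM), and map-level (IOU) terms on the
    # saliency map, plus a Dice term on the boundary branch; equal weights assumed.
    return (wbce_loss(sal_logits, sal_gt, pixel_weight)
            + ssim_loss(sal_logits, sal_gt)
            + iou_loss(sal_logits, sal_gt)
            + dice_loss(boundary_logits, boundary_gt))
```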
Result The proposed method is evaluated on the DAVIS (densely annotated video segmentation), FBMS (Freiburg-Berkeley motion segmentation), ViSal (video saliency), and MCL datasets using three evaluation metrics, namely MAE (mean absolute error), S-measure, and F-measure, and is compared with five existing methods. The results indicate that the proposed method is capable of generating saliency maps with clear boundaries. The proposed method outperforms the comparison methods in terms of F-measure on all four datasets. On the DAVIS dataset, compared with the best-performing DSNet (dynamic spatiotemporal network) model, the proposed method achieves the same MAE value, with the S-measure being 0.7% lower and the F-measure 0.2% higher. On the FBMS dataset, the proposed method achieves the best MAE value of 3.7%, matching the SCANet method, and improves the S-measure by 0.3% and the F-measure by 0.9% over the second-best method. On the ViSal dataset, the MAE value is only 0.1% behind that of the best method STVS, while the F-measure is 0.2% higher than that of STVS. On the MCL dataset, the proposed method achieves the best MAE value of 2.2%, and its S-measure and F-measure are 1.6% and 0.6% higher, respectively, than those of the second-best method SSAV (saliency-shift-aware VSOD). To observe the detection results of each method more intuitively, the video salient object detection results of the proposed method and the comparison methods are visualized. The visualization shows that previous methods mostly yield detection results with accurate regions but rough and blurry boundaries, whereas the proposed method can generate detection results with clear boundaries. In addition, ablation experiments are conducted to demonstrate the effectiveness of the individual modules. Conclusion In this study, a network capable of achieving interactive collaboration between salient object boundary information and spatiotemporal information is proposed. The proposed method performs well on four public datasets and effectively enhances the boundary quality of detected salient objects in video frames. However, the method also has certain limitations. For instance, it may fail to detect or may misidentify salient objects when multiple such objects are present in the video frames. In future work, we plan to explore more efficient spatiotemporal feature extraction schemes to capture all salient object features in video frames, thereby improving the detection capability of the current algorithm.
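For reference, two of the evaluation metrics reported above, MAE and F-measure, can be computed as in the NumPy sketch below. The adaptive threshold (twice the mean saliency) and beta^2 = 0.3 are common conventions in salient object detection evaluation rather than choices stated in this abstract, and the more involved S-measure is omitted here.

```python
import numpy as np


def mae(sal_map, gt):
    # Mean absolute error between a saliency map and ground truth, both scaled to [0, 1].
    return np.abs(sal_map.astype(np.float64) - gt.astype(np.float64)).mean()


def f_measure(sal_map, gt, beta2=0.3):
    # F-measure with an adaptive threshold of twice the mean saliency value.
    thr = min(2.0 * sal_map.mean(), 1.0)
    pred = sal_map >= thr
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```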
Keywords