Image semantic segmentation with multi-scale strip pooling and channel attention
Abstract
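Objective Image semantic segmentation in natural scenes is easily affected by the diversity of object shapes, distance, and illumination. To address this problem, this paper proposes a new dual-branch semantic segmentation network based on strip pooling and a channel attention mechanism (strip pooling and channel attention net, SPCANet). Method SPCANet extracts image features from two aspects: space and content. First, the spatial-perception sub-net improves the strip pooling technique by introducing 1D dilated convolution and a multi-scale design, further enlarging the receptive field in the horizontal and vertical directions during the encoding stage. Second, to improve the content-perception ability of the model, VGG16 (Visual Geometry Group 16-layer network) pre-trained on the ImageNet dataset is used as the content-perception sub-net; it assists the spatial-perception sub-net in optimizing the embedded features for semantic segmentation and alleviates the loss of image detail caused by the spatial-perception sub-net. In addition, second-order channel attention is used to further optimize feature selection in the middle and high-level layers of the network and, to some extent, to mitigate the effect of illumination-induced color differences on the segmentation results. Result Using Cityscapes as the experimental data, the proposed method is compared with other segmentation methods based on deep neural networks, and the results are analyzed in terms of both visualization and evaluation metrics. SPCANet improves the target segmentation metric mIoU (mean intersection over union) by 1.2%. Conclusion The proposed dual-branch semantic segmentation network optimizes image semantic segmentation using the improved strip pooling technique, the content-perception auxiliary network, and the channel attention mechanism, all of which contribute positively to the experimental results.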
Keywords
Semantic image segmentation by using multi-scale strip pooling and channel attention
Ma Jiquan1, Zhao Shumin1, Kong Fanhui2
(1. School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China; 2. School of Data Science and Technology, Heilongjiang University, Harbin 150080, China)
Abstract
Objective Semantic segmentation of real-world images is easily affected by the varied shapes of objects, their distance from the camera, and illumination. Current semantic segmentation methods classify pedestrians, buildings, road signs, and other objects inaccurately because these objects are either small in scale or wide in range. In addition, existing methods do not discriminate well between objects that show color differences: they tend to split a single object with uneven color into several objects, or to merge different objects with similar colors into one class. To improve the performance of semantic image segmentation, we propose a new dual-branch semantic segmentation network based on strip pooling and an attention mechanism, the strip pooling and channel attention net (SPCANet). Method SPCANet extracts image features from two perspectives, spatial perception and content perception. First, the spatial-perception sub-net enlarges the receptive field in the horizontal and vertical directions during the down-sampling stage by combining dilated convolution with multi-scale strip pooling. Specifically, starting from the strip pooling module (a pooling operation whose kernel size is n × 1 or 1 × n), we add four parallel one-dimensional dilated convolutions with different dilation rates to both the horizontal and the vertical branch, which enhances the perception of large-scale objects in the image. Next, to improve the content-perception ability of the model, we use VGG16 (Visual Geometry Group 16-layer network) pre-trained on the ImageNet dataset as the content-perception sub-net, which assists the spatial-perception sub-net in optimizing the embedded features for semantic segmentation. Combined with the spatial-perception sub-net, the content-perception sub-net strengthens the feature representation.
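The multi-scale strip pooling idea described above can be illustrated with a small dependency-free Python sketch on a single-channel feature map. It is not the SPCANet implementation: the kernel values, the dilation rates (1, 2, 4, 8), the summation of the four parallel branches, and the sigmoid gating used for fusion are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def dilated_conv1d(signal, kernel, rate):
    """1D convolution with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # taps are spaced `rate` positions apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def multi_scale_branch(strip, kernel, rates):
    """Sum of parallel dilated 1D convolutions over one pooled strip (assumed fusion)."""
    total = [0.0] * len(strip)
    for rate in rates:
        for i, v in enumerate(dilated_conv1d(strip, kernel, rate)):
            total[i] += v
    return total

def multi_scale_strip_pooling(fmap, kernel=(0.25, 0.5, 0.25), rates=(1, 2, 4, 8)):
    """fmap: H x W list of lists (one channel). Returns a gated H x W map."""
    h, w = len(fmap), len(fmap[0])
    # Strip pooling with a 1 x W window: one average per row.
    row_strip = [sum(row) / w for row in fmap]
    # Strip pooling with an H x 1 window: one average per column.
    col_strip = [sum(fmap[y][x] for y in range(h)) / h for x in range(w)]
    # Four parallel dilated convolutions per branch (rates are assumptions).
    row_ctx = multi_scale_branch(row_strip, kernel, rates)
    col_ctx = multi_scale_branch(col_strip, kernel, rates)
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            # Broadcast both strips back to H x W, add them, and gate the input.
            gate = 1.0 / (1.0 + math.exp(-(row_ctx[y] + col_ctx[x])))
            out_row.append(fmap[y][x] * gate)
        out.append(out_row)
    return out
```

Because each strip spans a whole row or column, every output position receives context from the full width and height of the map, and larger dilation rates widen the span along the strip without increasing the number of kernel weights per convolution.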
In addition, second-order channel attention is used to further optimize feature selection in the middle and high-level layers of the network. During training, target information is emphasized with larger weights, while irrelevant information is suppressed with smaller weights. In this way, correlations in the embedded features are activated. To enhance the expression of channel information, we implement the second-order channel attention with covariance statistics and a gating mechanism. The model works in four steps: 1) a three-channel color image is taken as input; 2) the image is passed through the spatial-perception and content-perception sub-nets, which encode features in the embedding space; 3) the two sets of features are fused by concatenation; and 4) the fused features are sent to a prediction module (head) for classification, which completes the segmentation task. Result We use the popular Cityscapes benchmark as the test data and compare our results with other deep neural network-based methods, including networks published on the official Cityscapes website and networks reproduced locally from GitHub. We evaluate the performance qualitatively and quantitatively: the qualitative analysis is based on visualization, and the quantitative analysis uses widely adopted public metrics. 1) In the visualized segmentation results, the proposed method shows a strong perception of wide-range objects in the image, and the overall segmentation quality is clearly improved. 2) The segmentation metrics reflect the same trend: commonly used metrics such as accuracy (Acc) and mean intersection over union (mIoU) are improved, with mIoU increased by 1.2% and Acc increased by 0.7%.
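The second-order channel attention described above combines covariance statistics with a gating mechanism. A minimal dependency-free Python sketch of that general pattern follows. It is only an illustration under stated assumptions: the row-mean summary of the covariance matrix, the two-layer bottleneck gate, and the weight matrices `w_down` and `w_up` are hypothetical choices, and any covariance normalization the actual network may apply is omitted.

```python
import math

def second_order_channel_attention(features, w_down, w_up):
    """features: C x N list (C channels, N = H*W flattened positions).
    w_down: R x C and w_up: C x R gating weights (hypothetical parameters).
    Returns the rescaled features and the per-channel gates."""
    c, n = len(features), len(features[0])
    means = [sum(ch) / n for ch in features]
    centered = [[v - m for v in ch] for ch, m in zip(features, means)]
    # Channel covariance matrix (C x C): the second-order statistics.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / n
            for j in range(c)] for i in range(c)]
    # Summarize each channel by the mean of its covariance row (assumption).
    stats = [sum(row) / c for row in cov]
    # Gating: bottleneck layer with ReLU, then sigmoid weights in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, stats))) for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Rescale every channel by its gate.
    scaled = [[v * g for v in ch] for ch, g in zip(features, gates)]
    return scaled, gates
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a small gate are suppressed, which corresponds to assigning larger weights to target information and smaller weights to irrelevant information.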
The ablation studies validate the effectiveness of our modules. Among them, the improved strip pooling module yields the most obvious gain in segmentation results. With an input size of 512 × 512 × 3 on the batch-train dataset, it improves mIoU by 4%; when the input size is changed to 768 under otherwise identical experimental conditions, it improves mIoU by 5%. Second-order channel attention makes the model more sensitive to regions with color differences during training. In the visualization results on the Cityscapes batch-train dataset, the classification of classes such as pedestrians is clearly improved, while the stability on other classes needs to be strengthened further. For the content-perception sub-net, we consider three networks pre-trained on ImageNet as candidates: VGG16, ResNet101, and DenseNet101. The pre-trained VGG16 achieves the best performance as the content-perception sub-net. This auxiliary content-perception sub-net enhances the representation ability of the feature maps. Conclusion We develop an image semantic segmentation algorithm based on an attention mechanism, multi-scale strip pooling, and feature fusion. Segmentation is optimized by an improved strip pooling technique (which enlarges the receptive field without adding parameters), second-order channel attention (which exploits inter-channel information), and a content-perception auxiliary network. Our model alleviates the inaccurate segmentation caused by objects appearing at multiple scales. Jointly modeling receptive fields and channel information is beneficial to semantic image segmentation in real-world scenes. To reduce the labor cost of data labeling, the approach could be extended in future work to learn a more general semantic segmentation network in a weakly supervised or unsupervised manner.
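For readers unfamiliar with the metrics reported above, the following Python sketch shows one common way to compute Acc and mIoU from a confusion matrix. The abstract does not define Acc, so treating it as overall pixel accuracy is an assumption, as are the `ignore_label=255` default and the function names.

```python
def confusion_matrix(pred, target, num_classes, ignore_label=255):
    """pred, target: flat sequences of class indices, one entry per pixel."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_label:  # unlabeled pixels are excluded (assumed convention)
            continue
        cm[t][p] += 1  # rows: ground truth, columns: prediction
    return cm

def pixel_accuracy(cm):
    """Fraction of labeled pixels whose predicted class is correct."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total if total else 0.0

def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN)."""
    ious = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(cm))) - tp
        fn = sum(cm[i]) - tp
        denom = tp + fp + fn
        if denom:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

Because mIoU averages per-class scores, small classes such as pedestrians count as much as large classes such as road, which is why it is more sensitive than pixel accuracy to errors on small objects.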
Keywords