Bottom-up panoptic segmentation fusing improved ASPP and polarized self-attention

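Li Xinye1,2, Chen Ding1 (1. Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003; 2. Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003)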

Abstract
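Objective In ASPP (atrous spatial pyramid pooling), atrous convolution becomes less effective as the dilation rate grows, and ResNet (residual neural network), a classic image classification model, is not well suited to fine-grained image segmentation. To address both problems, a bottom-up panoptic segmentation method based on improved ASPP and polarized self-attention is proposed. Method The ASPP module is redesigned. The output of the convolution with a small dilation rate is concatenated (concat) with the original input, and the result is passed as the new input to the convolution with a larger dilation rate. The outputs of the convolutions with different dilation rates are then concatenated, and this result is finally concatenated with the outputs of the other modules in ASPP. This alleviates the degradation of atrous convolution caused by large dilation rates in ASPP, so that a sufficient receptive field is obtained while multi-scale information is still encoded. An improved polarized self-attention module is introduced after the output of the backbone network to strengthen pixel-level self-attention, so that the resulting features are directly suitable for fine-grained pixel segmentation. Result The method is tested on the validation set of the Cityscapes dataset. Compared with the reproduced baseline network Panoptic-DeepLab (58.26%), improving the ASPP module raises the segmentation accuracy PQ (panoptic quality) to 58.61%, a gain of 0.35 percentage points, while the runtime increases from 103 ms to 124 ms, with no marked change in speed. Introducing polarized self-attention raises PQ to 58.86%, a further gain of 0.25 percentage points, and the runtime increases to 187 ms. Further improving the attention module raises PQ to 59.36%, another 0.50 percentage points above 58.86%, and the runtime increases to 192 ms. The speed drops slightly, but real-time performance is still better than that of most methods. Conclusion With the improved ASPP and polarized self-attention modules, the method extracts features suited to fine-grained pixel segmentation more effectively and encodes multi-scale information while keeping a sufficient receptive field, which improves panoptic segmentation performance.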
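The dense wiring described in the Method part can be illustrated with some simple bookkeeping. The sketch below is not the authors' implementation: it only tracks, for each atrous branch, how many input channels the concatenation produces and how large the receptive field is relative to the module input. The dilation rates (6, 12, 18), the channel counts (2048 in, 256 per branch), and all function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    rate: int             # dilation rate of this 3x3 atrous convolution
    in_channels: int      # channels after concatenating input + earlier outputs
    receptive_field: int  # receptive field relative to the module input


def effective_kernel(kernel: int, rate: int) -> int:
    """Extent covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel + (kernel - 1) * (rate - 1)


def aspp_receptive_fields(rates, kernel=3):
    """Plain ASPP: every atrous branch reads the original input in parallel."""
    return [effective_kernel(kernel, r) for r in rates]


def iaspp_branches(in_channels, branch_channels, rates, kernel=3):
    """Dense wiring: branch i reads the original input concatenated with the
    outputs of all earlier (smaller-rate) branches."""
    branches = []
    for i, rate in enumerate(rates):
        # Concatenation adds the channels of every earlier branch output.
        channels = in_channels + i * branch_channels
        # The widest path into this branch runs through the previous branch.
        prev_rf = branches[-1].receptive_field if branches else 1
        rf = prev_rf + effective_kernel(kernel, rate) - 1
        branches.append(Branch(rate, channels, rf))
    return branches


if __name__ == "__main__":
    rates = (6, 12, 18)  # assumed rates, not taken from the paper
    print("parallel ASPP:", aspp_receptive_fields(rates))
    for b in iaspp_branches(2048, 256, rates):
        print(b)
```

Under these assumed rates, the last densely connected branch sees a wider context than any parallel branch with the same largest dilation rate, which is the effect the redesigned module aims for. The price is a wider input to each later convolution, consistent with the reported increase in runtime.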
Keywords
The improved atrous spatial pyramid pooling and polarized self-attention based bottom-up panoptic segmentation

Li Xinye1,2, Chen Ding1(1.Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China;2.Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China)

Abstract
Objective Panoptic segmentation is a challenging task in computer vision and image segmentation. It segments every object in an image, covering both foreground “thing” categories and background “stuff” categories. By unifying semantic segmentation and instance segmentation, it benefits vision applications such as autonomous driving, simultaneous localization and mapping (SLAM), and multi-object tracking and segmentation (MOTS). Most panoptic segmentation methods follow a top-down path and the principle of detection before segmentation: they build on instance segmentation or object detection and add a semantic branch for semantic segmentation. These models segment well, but they need a complex post-processing stage to resolve conflicts between and within branches, which slows inference. Another category of methods follows the bottom-up idea, which takes semantic segmentation as the basis and treats the image as a whole at the pixel level, thereby avoiding tedious post-processing. Recently, a bottom-up method, Panoptic-DeepLab, divided the panoptic segmentation task into two branches, each with its own decoder network and segmentation head network. The semantic segmentation head outputs the semantic segmentation result, while two instance heads with the same structure predict the instance centers and the offsets simultaneously. It achieves good segmentation accuracy and speed. However, its decoder network still uses the atrous spatial pyramid pooling (ASPP) module to enlarge the receptive field. To obtain a large enough receptive field, ASPP needs a sufficiently large dilation rate, yet the larger the dilation rate, the worse the atrous convolution performs.
On the other hand, a residual neural network (ResNet) is used as the shared encoder, which may be sub-optimal for fine-grained image segmentation. To resolve these problems, we develop a new panoptic segmentation model with better segmentation performance. Method A bottom-up panoptic segmentation method is developed based on improved ASPP and polarized self-attention. First, we redesign ASPP as improved atrous spatial pyramid pooling (IASPP). Specifically, 1) the output of the 3×3 convolution with dilation rate rate1 is concatenated with the original input and fed into the 3×3 convolution with dilation rate rate2; 2) the outputs of the 3×3 convolutions with dilation rates rate1 and rate2 are concatenated with the original input and fed into the 3×3 convolution with dilation rate rate3. The outputs of the convolutions with different dilation rates are then concatenated, and finally this result is concatenated with the outputs of the other ASPP modules. Through this series of atrous convolutions and feature concatenations, the final output of IASPP obtains a larger receptive field without the kernel degradation seen in ASPP. Furthermore, IASPP does not significantly increase the model size, and the inference speed does not drop dramatically. In addition, polarized self-attention (PSA) is used to further enhance the feature extraction ability of the shared backbone. An improved polarized self-attention (IPSA) module is introduced after the fourth layer of ResNet-50 to extract pixel-level features, which strengthens the ability of ResNet to extract pixel-level information cost-efficiently. The output features preserve pixel-level information and can be applied directly to typical fine-grained image segmentation tasks to estimate the highly nonlinear pixel-wise semantics. Result The method is tested on the Cityscapes dataset.
The Cityscapes dataset covers 19 categories: 11 background (“stuff”) and 8 foreground (“thing”) categories. It contains 2 975 training, 500 validation, and 1 525 test images, each with a size of 1 024×2 048 pixels. The training set is used to train the network and the validation set to test it. Compared with the baseline, adding the improved atrous spatial pyramid pooling (IASPP) module improves the panoptic quality (PQ) of the model from 58.26% to 58.61%, while the runtime increases from 103 ms to 124 ms. Adding polarized self-attention (PSA) further improves PQ from 58.61% to 58.86%, at the cost of the runtime rising from 124 ms to 187 ms. With the improved polarized self-attention (IPSA), PQ rises from 58.86% to 59.36% and the runtime reaches 192 ms. We also carried out visual experiments, including visualization of the segmented images, a performance comparison across categories, and a comparison with other related methods. Conclusion To improve the bottom-up panoptic segmentation method, a panoptic segmentation method based on improved ASPP (IASPP) and improved polarized self-attention (IPSA) is developed. The redesigned ASPP effectively resolves the failure of atrous convolution caused by increasing dilation rates in ASPP. The introduction of IPSA improves the ability of ResNet-50 to extract pixel-level fine-grained features, so that rich pixel-level feature information is preserved during feature extraction to estimate the highly nonlinear pixel-wise semantics. The method thus improves the overall performance of panoptic segmentation: it achieves better segmentation accuracy while maintaining good speed.
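For readers unfamiliar with the PQ metric reported above, the sketch below shows how it is commonly computed. It follows the standard definition, which is assumed here to be the one used in the paper. A predicted segment and a ground-truth segment of the same class count as a match (true positive) when their IoU exceeds 0.5. Per-class PQ is the sum of matched IoUs divided by |TP| + ½|FP| + ½|FN|, and the final score is the mean over classes. The function names and the toy numbers are illustrative only.

```python
from statistics import mean

IOU_MATCH_THRESHOLD = 0.5  # a segment pair matches only if IoU > 0.5


def class_pq(matched_ious, num_fp, num_fn):
    """Panoptic quality for one class.

    matched_ious: IoU of each matched (prediction, ground truth) pair
    num_fp:       predicted segments without a match
    num_fn:       ground-truth segments without a match
    Returns None when the class has no segments at all.
    """
    if any(iou <= IOU_MATCH_THRESHOLD for iou in matched_ious):
        raise ValueError("matched pairs must have IoU > 0.5")
    denominator = len(matched_ious) + 0.5 * num_fp + 0.5 * num_fn
    if denominator == 0:
        return None  # class absent from both prediction and ground truth
    return sum(matched_ious) / denominator


def panoptic_quality(per_class_stats):
    """Mean of per-class PQ, skipping classes that never occur."""
    scores = [class_pq(*stats) for stats in per_class_stats.values()]
    return mean(s for s in scores if s is not None)


if __name__ == "__main__":
    # Toy statistics, not results from the paper.
    stats = {
        "car": ([0.9, 0.7], 0, 0),  # two good matches, no errors
        "road": ([0.6], 1, 1),      # one match, one FP, one FN
    }
    print(f"PQ = {100 * panoptic_quality(stats):.2f}%")
```

Because unmatched segments enter the denominator, PQ penalizes both missed and spurious segments as well as imprecise masks, which is why gains of a fraction of a percentage point, as reported here, can still reflect a meaningful change.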
Keywords
