Real-time semantic segmentation with dilated separable convolution and attention mechanism
Abstract
Objective To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method based on a dilated separable convolution module and an attention mechanism is proposed. Method Depthwise separable convolution is combined with dilated convolutions of different dilation rates to design a dilated separable convolution module, which reduces the model's computational cost while extracting features more efficiently. A channel attention module and a spatial attention module are added at the network output to enhance the representation of the channel and spatial information of the features, and their outputs are fused with the original features to further improve feature expressiveness. The fused features are upsampled to the original image size to predict per-pixel categories and achieve semantic segmentation. Result Experiments on the Cityscapes and CamVid datasets achieve segmentation accuracies of 70.4% and 67.8%, respectively, at a speed of 71 frames/s, with only 0.66 M model parameters. Without affecting speed, segmentation accuracy improves on the original method by 1.2% on each dataset, validating the effectiveness of the approach; the method also compares favorably with recent real-time semantic segmentation methods. Conclusion By adopting the dilated separable convolution module and the attention modules, the proposed method reduces computation while extracting features more efficiently, improves segmentation accuracy while preserving real-time performance, and strikes an effective balance between accuracy and speed.
Real-time semantic segmentation based on dilated separable convolution and attention mechanism
Wang Nan1, Hou Zhiqiang1, Pu Lei2, Ma Sugang1, Cheng Huanhuan1
(1. College of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 2. College of Information and Navigation, Air Force Engineering University, Xi'an 710077, China)
Abstract
Objective Image semantic segmentation is an essential task in computer vision, underpinning autonomous driving, scene recognition, medical image analysis, and unmanned aerial vehicle (UAV) applications. To acquire global information more effectively, current semantic segmentation models aggregate the context of different regions with a pyramid pooling module. Multi-scale feature extraction based on dilated (atrous) convolution enlarges the receptive field at different rates without adding parameters or reducing spatial resolution, and feature pyramid networks extract features through a multi-scale pyramid structure. Both approaches improve segmentation accuracy, but their practical application is constrained by network size and inference speed, so designing a compact, fast, and efficient real-time semantic segmentation network remains a challenging problem. To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method based on a dilated separable convolution module and an attention mechanism is proposed. Method First, depthwise separable convolution is combined with dilated convolutions of different rates to design a dilated separable convolution module. Next, a channel attention module and a spatial attention module are added at the end of the network to enhance the representation of the channel and spatial information of the features; their outputs are fused with the original features to further improve feature expressiveness. Finally, the fused features are upsampled to the size of the original image to predict per-pixel categories and achieve semantic segmentation. The method can be divided into a feature extraction stage and a feature enhancement stage.
In the feature extraction stage, the input image is processed by dilated separable convolution modules for dense feature extraction. The module first applies a channel split operation that divides the channels in half, forming two branches. In each branch, depthwise separable convolution replaces standard convolution to extract features more efficiently and reduce the number of model parameters. Meanwhile, dilated convolutions with different rates are used in the convolutional layers of the two branches to enlarge the receptive field and capture multi-scale context information effectively. In the feature enhancement stage, the extracted features are re-aggregated to strengthen the representation of feature information, as follows. First, a channel attention branch and a spatial attention branch are added to enhance the expression of the channel and spatial information of the features. Next, a global average pooling branch injects global context information to further improve segmentation performance. Finally, the features of all branches are fused and upsampled to match the resolution of the input image. Result The method is evaluated on the Cityscapes and CamVid datasets. The segmentation accuracy is 70.4% on Cityscapes and 67.8% on CamVid, the running speed is 71 frames/s, and the model has only 0.66 M parameters. Compared with the original method, segmentation accuracy improves by 1.2% on each dataset without loss of speed. Conclusion To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method based on the dilated separable convolution module and the attention mechanism is proposed.
The redesigned module combines efficient depthwise separable convolution with dilated convolution, applying a different dilation rate in each separable branch to obtain receptive fields of different sizes, and incorporates the channel attention and spatial attention modules. The method thus reduces the number of model parameters while learning richer feature information, and, together with the deeper network and the context aggregation module, achieves high-quality real-time semantic segmentation.
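The dilated separable convolution module described above (channel split into two branches, depthwise dilated convolution with a different rate per branch, pointwise convolution to fuse channels) can be sketched in plain NumPy. This is a minimal illustration of the dataflow only: the kernel shapes, the rate pairing (1, 2), and the absence of normalization/activation are assumptions, not the paper's exact configuration.

```python
import numpy as np

def depthwise_dilated_conv(x, kernels, rate):
    """Per-channel (depthwise) dilated convolution with 'same' padding.
    x: (C, H, W) feature map; kernels: (C, k, k); rate: dilation rate."""
    C, H, W = x.shape
    k = kernels.shape[1]
    pad = rate * (k // 2)                      # keeps spatial size unchanged
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):                         # each channel has its own kernel
        for i in range(k):
            for j in range(k):
                # dilation: kernel taps are spaced `rate` pixels apart
                out[c] += kernels[c, i, j] * xp[c, i*rate:i*rate+H, j*rate:j*rate+W]
    return out

def pointwise_conv(x, w):
    """1x1 convolution mixing channels. w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def dilated_separable_block(x, dw1, dw2, pw, rate1=1, rate2=2):
    """Channel split -> two depthwise dilated branches -> concat -> pointwise."""
    C = x.shape[0]
    a, b = x[:C // 2], x[C // 2:]              # channel split operation
    a = depthwise_dilated_conv(a, dw1, rate1)  # small receptive field
    b = depthwise_dilated_conv(b, dw2, rate2)  # larger receptive field
    y = np.concatenate([a, b], axis=0)
    return pointwise_conv(y, pw)
```

With a 3×3 kernel, the branch at rate 2 covers a 5×5 neighborhood at the same parameter cost as the rate-1 branch, which is how the module gathers multi-scale context cheaply.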
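The feature enhancement stage (channel attention, spatial attention, and global average pooling branches fused with the original features) can likewise be sketched. Here the learned gating layers are replaced by parameter-free pooling plus a sigmoid, so this shows only the branch-and-fuse dataflow, not the actual attention modules or their trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Gate each channel by a scalar derived from global average pooling.
    (A learned excitation MLP is omitted in this sketch.)"""
    w = sigmoid(x.mean(axis=(1, 2)))       # (C,) one gate per channel
    return x * w[:, None, None]

def spatial_attention(x):
    """Gate each spatial position by a map pooled across channels.
    (A learned conv on the pooled map is omitted in this sketch.)"""
    m = sigmoid(x.mean(axis=0))            # (H, W) one gate per position
    return x * m[None, :, :]

def enhance(x):
    """Fuse the attention branches and a global-context branch with the input."""
    g = x.mean(axis=(1, 2), keepdims=True)  # global average pool context
    return x + channel_attention(x) + spatial_attention(x) + g
```

After this fusion, the enhanced map would be bilinearly upsampled to the input resolution for per-pixel classification; the upsampling step is standard and omitted here.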
Keywords
real-time semantic segmentation; depthwise separable convolution; dilated convolution; channel attention; spatial attention