X-ray prohibited item detection fusing multi-scale features and global context information
Abstract
Objective X-ray image detection of prohibited items has long been a fundamental problem in security inspection. Prohibited items take many forms and vary greatly in scale, and the transmissive nature of X-ray imaging causes overlap and occlusion when large numbers of objects are stacked, so traditional image processing models are prone to missed and false detections and suffer from low recall. To address these problems, a feature enhancement fusion network (FEFNet) that fuses multi-scale features with global context information is proposed for X-ray prohibited item detection.

Method First, a spatial coordinate attention mechanism is added to the darknet53 feature backbone: position information is embedded into the channel attention, and features are aggregated along the two spatial directions separately, which strengthens the extractor's ability to capture prohibited targets and suppresses background noise. Then, the features output by the backbone are encoded as one-dimensional vectors, and self-supervised second-order fusion is used to obtain the pixel correlation matrix of the feature space, yielding complete global context information that guides visually occluded regions. To handle the varying scales of prohibited items, a multi-scale feature pyramid fusion module is proposed, adding a prediction level with a small receptive field to improve the detection of small prohibited targets. Finally, global context information and local multi-scale detail features are fused to resolve the visual occlusion between prohibited items.

Result The model is trained and validated on the SIXray-Lite (security inspection X-ray) dataset and compared with the SSD (single shot detection), Faster R-CNN, RetinaNet, YOLOv5 (you only look once), and ACMNet (asymmetrical convolution multi-view neural network) models. The results show that the model reaches 85.64% mAP (mean average precision) on SIXray-Lite; the feature enhancement fusion module and the multi-scale feature pyramid fusion module improve on the original model by 6.73% and 5.93% respectively, and the overall detection accuracy is 11.24% higher than that of the original detection network.

Conclusion The proposed feature enhancement fusion detection model extracts salient discriminative features more effectively, reduces background noise interference, and improves the detection of multi-scale and small prohibited items. Combining global context information with multi-scale local features effectively alleviates the visual occlusion between prohibited items and improves the overall detection accuracy while ensuring real-time performance.
Keywords
prohibited items detection; X-ray image; feature enhancement fusion; attention mechanism; multi-scale fusion; global context features
Integrating multi-scale features and global context for X-ray detection of prohibited items
Li Chen1, Zhang Hui1,2, Zhang Zouquan1, Che Aibo1, Wang Yaonan2 (1. Changsha University of Science and Technology, Changsha 410114, China; 2. Hunan University, Changsha 410082, China)
Abstract
Objective X-ray image detection of prohibited items is essential in security inspection, where the items come in many forms, vary greatly in scale, and are often hard to identify. Traditional image processing models are prone to missed and false detections, which results in low recall and unsatisfactory real-time analysis. Unlike regular optical images, X-ray images are transmissive, so large numbers of stacked objects produce overlapping phenomena. Extracting effective information about multiple overlapping objects is challenging for deep learning models: the overlapping objects can be mistaken for a single new object, which degrades classification and lowers detection accuracy. We therefore propose a feature enhancement fusion network (FEFNet) for X-ray detection of prohibited items based on multi-scale features and global context.

Method First, FEFNet improves the feature extractor darknet53 of you only look once v3 (YOLOv3) by adding a spatial coordinate attention mechanism. The improved extractor, called coordinate darknet, embeds position information into the channel attention and aggregates features along the two spatial directions, so that it extracts more salient and discriminative information. Specifically, the coordinate attention module, which contains two pooling branches, is integrated into the last four residual stages of the original darknet53. The feature map is adaptively pooled along its width and along its height to obtain feature vectors for the two directions; these vectors are passed through batch normalization and activation layers to obtain direction-wise attention vectors, which are then applied to the input feature map so that the model attends to detailed information. Next, a bilinear second-order fusion module extracts global context features. The module encodes the highest-level semantic features output by the feature extraction backbone into one-dimensional vectors and applies bilinear pooling to perform second-order fusion, obtaining a spatial pixel correlation matrix. The correlation matrix is multiplied with the input features, up-sampled, and spliced into the feature pyramid to output the final global context features. In the bilinear pooling operation, the two one-dimensional vectors at each position are first bilinearly fused (multiplied) into a fusion matrix, the matrices are then sum-pooled over all positions, and L2 normalization followed by a softmax operation is finally applied to the fused feature. Finally, the feature pyramid is improved to address the varying scales of prohibited items. The proposed cross-scale fusion feature pyramid module strengthens detection of multi-scale prohibited items: it outputs four feature maps of different scales as predictions, sized 13×13, 26×26, 52×52, and 104×104 pixels from small to large. The small-scale feature maps predict large targets, and the large-scale feature maps improve the prediction of small targets. In addition, the concatenation operation is replaced with element-wise addition, which keeps more activation maps from coordinate darknet. Meanwhile, the global context feature obtained by second-order fusion is directly connected to the other local features, and this information alleviates the blur and occlusion phenomena.
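To make the coordinate attention step above concrete, the following PyTorch sketch pools the input along height and width separately, derives a direction-wise attention vector for each axis, and reweights the input feature map. It is a minimal illustration of the mechanism as described, not the paper's implementation; the module name, the channel-reduction ratio, and the choice of activations are our assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal sketch: pool along H and W separately, build a
    direction-wise attention vector for each axis, reweight the input."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction is assumed
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # keep height, squeeze width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # keep width, squeeze height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)        # batch normalization layer
        self.act = nn.ReLU(inplace=True)     # activation layer
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)          # joint encoding of both axes
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w  # apply both attention maps to the input
```

In FEFNet such a block would sit inside the last four residual stages of darknet53, as the Method section describes.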
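In equation form, the bilinear second-order fusion step can be sketched as follows; the notation is ours, with u_i and v_i the two one-dimensional vectors at spatial position i and N the number of positions, following the step-by-step description above.

```latex
% Sketch of the bilinear (second-order) pooling; notation is ours.
\begin{align}
  B_i     &= u_i\, v_i^{\top} \in \mathbb{R}^{c \times c}
            && \text{bilinear fusion (multiplication) at position } i,\\
  B       &= \sum_{i=1}^{N} B_i
            && \text{sum-pooling over all positions},\\
  \hat{B} &= \operatorname{softmax}\!\bigl(B / \lVert B \rVert_{2}\bigr)
            && \text{L2 normalization followed by softmax}.
\end{align}
```

The normalized matrix then plays the role of the correlation matrix that is multiplied with the input features before up-sampling.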
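The cross-scale pyramid with addition in place of concatenation can likewise be sketched in PyTorch. Only the four prediction scales (13, 26, 52, and 104 for a 416×416 input) and the additive top-down fusion come from the description above; the lateral 1×1 convolutions and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScalePyramid(nn.Module):
    """Sketch of a 4-level prediction pyramid: 13x13, 26x26, 52x52 and an
    added 104x104 level for small targets (assuming a 416x416 input)."""

    def __init__(self, chs=(1024, 512, 256, 128), out_ch=128):  # channels assumed
        super().__init__()
        # 1x1 lateral convs project each backbone stage to a common width
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in chs)

    def forward(self, c5, c4, c3, c2):
        # c5..c2: backbone features from coarse (13x13) to fine (104x104)
        p5 = self.laterals[0](c5)
        # element-wise addition replaces concatenation at each fusion step
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.laterals[2](c3) + F.interpolate(p4, scale_factor=2)
        p2 = self.laterals[3](c2) + F.interpolate(p3, scale_factor=2)
        return p5, p4, p3, p2  # predictions from large-object to small-object scale
```

The small 13×13 map serves large targets, while the added 104×104 map carries the small receptive field used for small prohibited items.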
Result The model is trained and verified on the SIXray-Lite (security inspection X-ray) dataset, which includes 7 408 training samples and 1 500 test samples. FEFNet is compared with other object detection models, namely single shot detection (SSD), Faster R-CNN, RetinaNet, YOLOv5, and the asymmetrical convolution multi-view neural network (ACMNet). The experimental results show that our method achieves 85.64% mean average precision (mAP) on SIXray-Lite, which is 11.24% higher than the original YOLOv3. The average detection precision is 95.15% for gun, 81.43% for knife, 81.65% for wrench, 85.95% for plier, and 84.00% for scissor. The comparative analysis demonstrates the advantage of the proposed model: 1) compared with SSD, the mAP of FEFNet is 13.97% higher; 2) compared with RetinaNet, 7.40% higher; 3) compared with Faster R-CNN, 5.48% higher; 4) compared with YOLOv5, 3.61% higher; and 5) compared with ACMNet, 1.34% higher.

Conclusion FEFNet extracts salient discriminative features, reduces background noise interference, and improves the detection of multi-scale and small prohibited items. Combining global context information with multi-scale local features effectively alleviates the visual occlusion and blur between prohibited items and improves the overall detection accuracy while ensuring real-time performance.
Keywords
prohibited items detection; X-ray image; feature enhancement fusion; attention mechanism; multi-scale fusion; global context features