融合密度和精细分数的行人检测
摘 要
目的 行人检测是指使用矩形框和置信度找出图像或者视频中的所有行人。传统的图像行人检测方法对于姿态各异或者相互遮挡的行人无能为力。深度神经网络(deep neural networks,DNN)在目标检测领域表现出色,然而依然难以解决行人检测中的一些问题。本文提出一种融合密度和精细分数的行人检测方法DC-CSP(density map and classifier modules with center and scale prediction)。方法 首先,在CSP(center and scale prediction)网络的基础上添加密度图模块(density map module,DMM)和分类器模块(classifier module,CM),得到DC-CSP网络;然后,针对置信度不精确问题,利用不同模块对分数预测结果的互补性质,设计阶段分数融合(stage score fusion,SSF)规则对检测分数进行更新,使得行人置信度上升、背景置信度下降;最后,基于NMS(non-maximum suppression),利用估计的行人密度图,设计改进的自适应NMS(improved adaptive NMS,IAN)后处理方法,能够进一步改善检测结果,对相互遮挡行人提高交并比(intersection over union,IOU)阈值从而减少漏检,对单个行人降低IOU阈值从而减少错检。结果 在公开数据集Citypersons和Caltech上进行定量和定性分析。在定量分析中,与其他方法相比,本文方法在Citypersons数据集的Reasonable、Heavy、Partial以及Bare子集上,对数平均漏检率分别下降了0.8%、1.3%、1.0%和0.8%,在Caltech数据集的Reasonable和All子集上分别下降了0.3%和0.7%;在定性分析中,可视化结果表明,本文方法在一定程度上解决了各种不同场景下存在的相互遮挡行人漏检、单个行人错检以及置信度不精确等一系列问题。此外,消融实验证明了所设计模块及其对应规则的有效性。结论 本文方法使用联合多个模块的卷积神经网络(convolutional neural network,CNN),针对密度特征、分类特征分别设计IAN方法和SSF规则,在一定程度上解决了相互遮挡行人漏检、单个行人错检以及置信度不精确的问题,在多个数据集上证明了方法的有效性和鲁棒性。
关键词
城市场景；行人检测；卷积神经网络(CNN)；密度图；分数融合；自适应后处理
Pedestrian detection method based on density and score refinement
Zhen Ye, Wang Zilei, Wu Feng (National Engineering Laboratory of Brain-Inspired Intelligence Technology and Application, University of Science and Technology of China, Hefei 230027, China)
Abstract
Objective Pedestrian detection involves locating all pedestrians in images or videos by using rectangular boxes with confidence scores. Traditional pedestrian detection methods cannot handle pedestrians with varied postures or mutual occlusion. In recent years, deep neural networks have performed well in object detection, but they are still unable to solve some challenging issues in pedestrian detection. In this study, we propose a method called DC-CSP (density map and classifier modules with center and scale prediction) that enhances pedestrian detection by combining pedestrian density and score refinement. Under an anchor-free architecture, our method first refines the classification to obtain more accurate confidence scores and then uses different IoU (intersection over union) thresholds to handle varying pedestrian densities, thereby reducing the missed detection of occluded pedestrians and the false detection of single pedestrians. Method First, our DC-CSP network is primarily composed of a center and scale prediction (CSP) subnetwork, a density map module (DMM), and a classifier module (CM). The CSP subnetwork includes a feature extraction module and a detection head module. The feature extraction module uses ResNet-50 as its backbone, in which the output feature maps are down-sampled by 4, 8, 16, and 16 with respect to the input image. The shallower features provide more precise localization information, whereas the deeper features contain more semantic information with larger receptive fields. Thus, we fuse the multi-scale feature maps from all stages into a single feature map by using deconvolution layers. After the feature maps are concatenated, the detection head module first applies a 3×3 convolutional layer to reduce the channel dimension to 256 and then two sibling 1×1 convolutional layers to produce the center heat map and the scale map. On the basis of the CSP subnetwork, we design the DMM, which first utilizes the concatenated feature maps to generate 128-channel features via a 1×1 convolutional layer and then concatenates them with the center heat map and the scale map to predict a pedestrian density map with a 5×5 convolutional kernel. The DMM integrates diverse features and applies a large kernel to take surrounding information into account, thereby generating accurate density maps. Moreover, the CM is designed to take as input the bounding boxes transformed from the center heat map and the scale map. This module utilizes the concatenated feature maps to produce 256-channel features via a 3×3 convolutional layer and then classifies the produced features by using a convolutional layer with a 1×1 kernel. Because the majority of the background confidence scores fall below a certain value, a threshold can be obtained that easily distinguishes pedestrians from the background. Second, the detection scores produced by CSP are relatively low, whereas the CM discriminates better between pedestrians and the background. Therefore, to increase the confidence scores of pedestrians and simultaneously decrease those of the background in the final decision, we design a stage score fusion (SSF) rule that updates the detection scores by utilizing the complementarity of the detection head module and the CM. In particular, when the classifier judges a sample as a pedestrian, the SSF rule slightly boosts its detection score; by contrast, when the classifier judges a sample as background, the SSF rule slightly lowers it.
In other cases, a comprehensive judgment is made by averaging the scores from both modules. Third, an improved adaptive non-maximum suppression (NMS) post-processing method, called improved adaptive NMS (IAN) and based on the estimated pedestrian density map, is also proposed to further improve the detection results. In particular, a high IoU threshold is used for mutually occluded pedestrians to reduce missed detection, and a low IoU threshold is used for a single pedestrian to reduce false detection. In contrast with adaptive NMS, our IAN method fully considers various scenes. In addition, IAN is based on NMS rather than on soft NMS, and thus involves a lower computational cost. Result To verify the effectiveness of the proposed modules, we conduct a series of ablation experiments in which C-CSP, D-CSP, and DC-CSP respectively denote the addition of the CM, the DMM, and both modules to the CSP subnetwork. We conduct quantitative and qualitative analyses on two widely used public datasets, i.e., Citypersons and Caltech, for each setting. The experimental results of C-CSP verify the rationality of the SSF rule and demonstrate that the confidence scores of pedestrians can be increased while those of the background can be decreased. Meanwhile, the experimental results of D-CSP demonstrate the effectiveness of the IAN method, which can considerably reduce missed detection and false detection. For the quantitative analyses of DC-CSP, its log-average miss rate decreases by 0.8%, 1.3%, 1.0%, and 0.8% on the Reasonable, Heavy, Partial, and Bare subsets of Citypersons, respectively, and by 0.3% and 0.7% on the Reasonable and All subsets of Caltech, respectively, compared with those of other methods. For the qualitative analyses of DC-CSP, the visualization results show that our method works well in various scenes, such as pedestrians occluded by other objects, small pedestrians, vertical structures, and false reflections. Pedestrians in different scenes are detected more accurately, and the confidence scores are more convincing. Furthermore, our method avoids numerous false detections in situations with a complex background. Conclusion In this study, we propose a deep convolutional neural network with multiple novel modules for pedestrian detection. In particular, the IAN method and the SSF rule are designed to utilize density and classification features, respectively. Our DC-CSP method can considerably alleviate issues in pedestrian detection, such as missed detection, false detection, and inaccurate confidence scores. Its effectiveness and robustness are verified on multiple benchmark datasets.
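To make the density map module concrete, the following is a minimal PyTorch-style sketch of the head described above, assuming the center heat map and scale map are each single-channel; the class name DensityMapModule and the channel bookkeeping are our own assumptions, not code released with the paper.

```python
import torch
import torch.nn as nn

class DensityMapModule(nn.Module):
    """Sketch of the density map head (illustrative; names and shapes assumed).

    cat_feats:  concatenated multi-scale backbone features (B, C, H, W)
    center_map: 1-channel center heat map from the CSP detection head
    scale_map:  1-channel scale map from the CSP detection head
    """
    def __init__(self, in_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 128, kernel_size=1)          # 1x1 conv -> 128 channels
        self.predict = nn.Conv2d(128 + 2, 1, kernel_size=5, padding=2)    # 5x5 conv over fused maps

    def forward(self, cat_feats, center_map, scale_map):
        x = self.reduce(cat_feats)
        x = torch.cat([x, center_map, scale_map], dim=1)  # fuse with the CSP outputs
        return self.predict(x)                            # per-pixel pedestrian density
```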
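The stage score fusion rule can likewise be sketched as follows, assuming NumPy arrays of per-box scores; the thresholds t_pos, t_neg and the adjustment delta are hypothetical placeholders rather than values specified in the abstract.

```python
import numpy as np

def stage_score_fusion(det_scores, cls_scores, t_pos=0.7, t_neg=0.3, delta=0.1):
    """Fuse detection-head scores with classifier-module scores (illustrative only)."""
    fused = det_scores.copy()
    # Classifier is confident the sample is a pedestrian: slightly boost the score.
    pos = cls_scores >= t_pos
    fused[pos] = np.minimum(fused[pos] + delta, 1.0)
    # Classifier is confident the sample is background: slightly lower the score.
    neg = cls_scores <= t_neg
    fused[neg] = np.maximum(fused[neg] - delta, 0.0)
    # Ambiguous cases: average the two scores as a comprehensive judgment.
    mid = ~(pos | neg)
    fused[mid] = 0.5 * (det_scores[mid] + cls_scores[mid])
    return fused
```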
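Finally, a hedged sketch of the IAN post-processing: standard (hard) NMS whose IoU threshold is raised where the estimated density map indicates a crowd and lowered for isolated pedestrians. The mapping from density to threshold and the bounds lo/hi are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box [x1, y1, x2, y2] against an array of boxes (N, 4)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def improved_adaptive_nms(boxes, scores, density_map, base_thr=0.5, lo=0.4, hi=0.7):
    """Hard NMS with a density-adaptive IoU threshold (sketch, not the released code)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        cx = int((boxes[i, 0] + boxes[i, 2]) / 2)
        cy = int((boxes[i, 1] + boxes[i, 3]) / 2)
        # Crowded region -> higher threshold (suppress less, fewer missed detections);
        # isolated pedestrian -> lower threshold (suppress more, fewer false detections).
        thr = np.clip(base_thr + density_map[cy, cx], lo, hi)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thr]
    return keep
```

In this sketch the density map is assumed to be at input-image resolution; a real implementation would predict it at a lower resolution and scale the box-center lookup accordingly.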
Keywords
urban scenes; pedestrian detection; convolutional neural network (CNN); density map; score fusion; adaptive post-processing