Multi-object tracking with adaptive IoU loss and hierarchical association
Abstract
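Objective To address identity switches caused by ambiguous pedestrian features and the loss of tracking accuracy caused by occlusion between targets in complex scenes, the AIoU-Tracker multi-object tracking algorithm is proposed. Method First, a dedicated adaptive intersection over union (AIoU) regression loss function is designed for the detection head of the backbone network. It measures boxes in terms of three factors (overlap area, center-point distance, and aspect ratio), which alleviates identity switches caused by the insufficient discriminability of ambiguous pedestrian features. Second, a simple and effective hierarchical association strategy is proposed: after high-score and low-score detection boxes have each been associated, the embedding information around the detection boxes that failed to associate is used for a further round of association, which improves association accuracy under occlusion. Result In a series of comparative experiments against FairMOT, the proposed AIoU-Tracker raises higher order tracking accuracy (HOTA) from 58.3% to 59.8%, ID F1 score (IDF1) from 72.6% to 73.1%, and multi-object tracking accuracy (MOTA) from 69.3% to 74.4% on the MOT16 dataset; on the MOT17 dataset, HOTA rises from 59.3% to 59.9% and IDF1 from 72.3% to 72.9%. Conclusion The proposed feature-balancing tracking method achieves a better balance among the bounding-box size feature, the heat-map feature, and the center-point offset feature during training and testing, which makes the multi-object tracking results more accurate.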
Multi-object tracking using adaptive-IoU loss and hierarchical association
Guo Wen, Liu Qigui, Ding Xinmiao (School of Information and Electronic Engineering, Shandong Technology and Business University, Yantai 264005, China)
Abstract
Objective Multiple object tracking (MOT) is a mainstream task in computer vision. It aims to estimate the tracklets of multiple objects in videos and has important applications in autonomous driving, human-computer interaction, and human activity recognition. Many methods focus on improving tracking performance given a set of detection results. Re-ID based trackers can be divided into two categories: separate detection and embedding (SDE) tracking models and joint detection and embedding (JDE) tracking models. An SDE tracking model tunes the detection model and the Re-ID model separately, but this separation means that the SDE tracking model cannot run in real time. A JDE tracking model outputs object locations and appearance embeddings while performing object detection and passes both to the subsequent association step, which improves the algorithm's running speed. However, JDE tracking methods suffer from identity switches caused by ambiguous pedestrian features and from degraded tracking accuracy caused by occlusion between objects in complex scenes. An adaptive intersection-over-union (AIoU)-Tracker multi-object tracking algorithm is proposed to address these issues. Method First, we design a dedicated AIoU regression loss function for the detection head of the backbone network. The loss measures the overlap area, center point distance, and aspect ratio, which helps alleviate identity switches caused by ambiguous pedestrian features. Second, we propose a simple and effective hierarchical association method. After the high-score and low-score detection boxes have been associated separately, the embedding information around the detection boxes that failed to associate is used for a further round of association, which improves the association accuracy of multi-object tracking under occlusion.
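The abstract does not give the AIoU formula, so the following is only a minimal sketch of how a regression loss can combine the three factors named above (overlap area, center point distance, and aspect ratio). It uses the widely known Complete-IoU (CIoU) formulation as a stand-in; it is not the authors' AIoU loss, and the function name `ciou_loss` and the `(x1, y1, x2, y2)` box format are assumptions made for illustration.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU-style loss for two boxes given as (x1, y1, x2, y2).

    Stand-in for illustration only; the AIoU loss itself is not
    specified in the abstract.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Factor 1: overlap area, measured by plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)

    # Factor 2: squared center distance, normalized by the squared
    # diagonal of the smallest box that encloses both inputs.
    dx = (ax1 + ax2) / 2 - (bx1 + bx2) / 2
    dy = (ay1 + ay2) / 2 - (by1 + by2) / 2
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    center_term = (dx ** 2 + dy ** 2) / (enclose_w ** 2 + enclose_h ** 2 + eps)

    # Factor 3: aspect-ratio consistency between the two boxes.
    angle_a = math.atan((ax2 - ax1) / (ay2 - ay1 + eps))
    angle_b = math.atan((bx2 - bx1) / (by2 - by1 + eps))
    v = (4 / math.pi ** 2) * (angle_a - angle_b) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_term + alpha * v


# Hypothetical boxes: a perfect match and a disjoint pair.
loss_same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
loss_far = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
```

Unlike plain IoU, the center-distance term still produces a useful signal when the boxes do not overlap at all, which is one reason losses of this family are used for box regression.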
We use a variant of the DLA-34 network architecture as the backbone network. Parameters pretrained on the common objects in context (COCO) dataset are used to initialize the model. The experiments are conducted on a system running Ubuntu 16.04 with 64 GB of memory and an RTX 2080 Ti GPU. The software configuration includes CUDA 10.2. We train the model using the Adam optimizer for 30 epochs, with an initial learning rate of 10⁻⁴. The learning rate is decayed to 10⁻⁵ after 20 epochs, and the batch size is set to 16. We apply standard data augmentation techniques, including rotation, scaling, and color jittering. The input image is resized to 1088 × 608 pixels, and the feature map resolution is 272 × 152 pixels, one quarter of the input in each dimension. We evaluate our approach on the MOTChallenge benchmark, specifically the MOT16 and MOT17 datasets. Training uses the CrowdHuman dataset and the MIX dataset (ETH, CityPersons, CUHK-SYSU, Caltech, and PRW). The ETH and CityPersons datasets provide only bounding box annotations, so we train only the detection branch on these datasets. The Caltech, MOT17, CUHK-SYSU, and PRW datasets provide both bounding box positions and ID annotations, which allows both branches to be trained. To ensure a fair comparison, we remove the videos in the ETH dataset that overlap with the MOT17 test set. The CrowdHuman dataset contains only bounding box annotations, so we perform self-supervised training on it. To evaluate tracking performance, we use several well-defined metrics: higher-order tracking accuracy (HOTA), multi-object tracking accuracy (MOTA), ID F1 score (IDF1), false positives (FP), false negatives (FN), and the number of identity switches (IDs).
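For reference, the metrics listed above can be sketched from their standard benchmark definitions. The functions below are a minimal illustration, not the evaluation code used in the paper; the function names and the counts in the usage lines are hypothetical.

```python
import math


def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDs) / (number of ground-truth boxes)."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a, ass_a):
    """HOTA at a single localization threshold is the geometric mean of
    detection accuracy (DetA) and association accuracy (AssA); the
    reported HOTA averages this value over a range of thresholds."""
    return math.sqrt(det_a * ass_a)


# Hypothetical counts for one sequence, for illustration only.
example_mota = mota(fn=20, fp=10, idsw=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

Because FN and FP usually far outnumber identity switches, MOTA is dominated by detection quality, whereas IDF1 is built from identity-level matches and therefore reflects association quality.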
MOTA primarily assesses the performance of the detection branch, IDF1 evaluates identity preservation and thus the association performance, and HOTA provides a comprehensive evaluation of both detection and data association. Result We compare our method with existing methods on two datasets. 1) On the MOT16 dataset, our HOTA is 59.8%, 1.5 percentage points higher than FairMOT; our MOTA is 74.4%, 5.1 percentage points higher than FairMOT; and our IDF1 is 73.1%, 0.5 percentage points higher than FairMOT. 2) On the MOT17 dataset, our HOTA is 59.9%, 0.6 percentage points higher than FairMOT, and our IDF1 is 72.9%, 0.6 percentage points higher than FairMOT (72.3%). In addition, we conduct ablation studies on the MOT17 dataset to verify the effectiveness of the individual components of our method. In the ablation studies, adding the AIoU regression loss function reduces the number of identity switches. We also visualize the predicted Re-ID feature extraction positions, the bounding box size feature, the heat map feature, and the center point offset feature. The visualization results show that our method is more robust than FairMOT. Moreover, our hierarchical association method makes the association more robust; for example, an occluded identity can still be re-associated two frames later. Overall, the results indicate that the proposed method outperforms the compared methods in multiple object tracking. Conclusion The proposed feature balancing tracking method achieves a better balance among the bounding box size feature, heat map feature, and center point offset feature during training and testing, resulting in more accurate multi-object tracking results. In this study, we propose two improvement measures for the FairMOT framework.
First, we design an AIoU regression loss module for the detection branch, enabling it to regress targets based on the current optimal distance and to extract more accurate appearance features. Second, we improve the Re-ID branch through a hierarchical association strategy module, which uses three-level matching to enhance the association performance of the tracking system. Experimental results demonstrate significant improvements on the MOT17 dataset, with HOTA increasing to 59.9%, IDF1 increasing to 72.9%, and MOTA increasing to 70.8%. However, the detection and Re-ID branches compete with each other in the JDE tracking model, which can lead to a decrease in MOTA. Future research will focus on investigating this competition in the JDE tracking model.
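The three-level matching mentioned above can be sketched roughly as follows. The abstract does not specify the cost functions, thresholds, or assignment solver, so everything concrete here is an assumption for illustration: greedy matching stands in for whatever solver the authors use, the third level simply retries leftover detections on embedding distance with a looser threshold, and all names and threshold values are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def iou_distance(a, b):
    """1 - IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return 1.0 - (inter / union if union > 0 else 0.0)


def greedy_match(track_ids, det_ids, cost, max_cost):
    """Pair tracks and detections in order of increasing cost."""
    pairs = sorted((cost(t, d), t, d) for t in track_ids for d in det_ids)
    matches, used_t, used_d = [], set(), set()
    for c, t, d in pairs:
        if c > max_cost:
            break  # every remaining pair is even more expensive
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return matches, used_t, used_d


def hierarchical_associate(tracks, dets, high_thresh=0.6,
                           emb_max=0.4, iou_max=0.7, retry_max=0.6):
    """tracks and dets are lists of dicts with 'box' and 'emb';
    dets also carry a detection 'score'. All thresholds are hypothetical."""
    def emb_cost(t, d):
        return cosine_distance(tracks[t]["emb"], dets[d]["emb"])

    def box_cost(t, d):
        return iou_distance(tracks[t]["box"], dets[d]["box"])

    high = [i for i, d in enumerate(dets) if d["score"] >= high_thresh]
    low = [i for i, d in enumerate(dets) if d["score"] < high_thresh]
    free_tracks = list(range(len(tracks)))
    matches, leftover = [], []

    # Level 1: high-score boxes by appearance; level 2: low-score boxes by overlap.
    for det_ids, cost, max_cost in ((high, emb_cost, emb_max),
                                    (low, box_cost, iou_max)):
        m, used_t, used_d = greedy_match(free_tracks, det_ids, cost, max_cost)
        matches += m
        free_tracks = [t for t in free_tracks if t not in used_t]
        leftover += [d for d in det_ids if d not in used_d]

    # Level 3 (assumed form): retry the detections that failed to associate,
    # again on embeddings but with a looser threshold.
    m, used_t, used_d = greedy_match(free_tracks, leftover, emb_cost, retry_max)
    matches += m
    free_tracks = [t for t in free_tracks if t not in used_t]
    leftover = [d for d in leftover if d not in used_d]
    return matches, free_tracks, leftover
```

In this sketch a detection that fails the strict appearance test at the first level, for instance because occlusion has degraded its embedding, gets a second chance at the third level instead of immediately starting a new identity.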
Keywords
multi-object tracking (MOT); data association; regression loss; feature balance; hierarchical association method