Fast convergence network for target pose tracking driven by synthetic data
Abstract
Objective Affected by occlusion and accumulated error, existing real-time 6-dimensional (6D) object pose tracking methods perform poorly in complex scenes. To address this, a highly robust real-time 6D pose tracking network for rigid objects is proposed. Method In the overall network design, the current-frame color and depth (RGB-D) image and the previous frame's pose estimate are processed by dimension-raising residual sampling filtering and feature encoding to obtain the pose difference, which is combined with the previous frame's pose estimate to compute the current 6D pose of the target. In the design of the residual sampling filtering module, the self-gated Swish activation function is adopted to retain target detail features and improve the accuracy of pose tracking. In the design of the feature aggregation module, the extracted features are decomposed into horizontal and vertical components that capture long-range dependencies in time and space while preserving position information, generating a set of complementary position- and time-aware feature maps that strengthen the feature extraction ability and accelerate network convergence. Result Experiments use the YCB-Video (Yale-CMU-Berkeley video) and YCBInEoAT (Yale-CMU-Berkeley in end-of-arm-tooling) datasets. Results show that the tracking speed of the proposed method reaches 90.9 Hz, and its tracking accuracy in terms of the average distance of model points (ADD) and the average closest point distance (ADD-S) reaches 93.24 and 95.84, respectively, outperforming other current rigid object pose tracking methods in both tracking accuracy and tracking speed. Compared with the se(3)-TrackNet network, the ADD and ADD-S of the proposed method are 25.95 and 30.91 higher when trained on only 6 000 sets of synthetic data, 31.72 and 28.75 higher with 8 000 sets, and 35.57 and 21.07 higher with 10 000 sets, respectively; the method also achieves highly robust 6D pose tracking of targets under severe occlusion. Conclusion Driven by synthetic data, the proposed network tracks the 6D pose of targets accurately in real time and converges quickly; the experimental results verify the effectiveness of the method.
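To make the pose-update step above concrete, the following is a minimal Python/numpy sketch of composing the current pose from a predicted pose difference via the se(3) exponential map. The twist layout (translation first, then rotation) and all variable names are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def se3_exp(xi):
    # Exponential map from a twist xi = (rho, phi) in se(3), with
    # translation part rho and rotation part phi, to a 4 x 4 SE(3) matrix.
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])  # skew-symmetric matrix of phi
    if theta < 1e-8:  # small-angle approximation
        R = np.eye(3) + K
        V = np.eye(3) + 0.5 * K
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta ** 2
        c = (1.0 - a) / theta ** 2
        R = np.eye(3) + a * K + b * (K @ K)  # Rodrigues' rotation formula
        V = np.eye(3) + b * K + c * (K @ K)  # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

# One tracking step: the network regresses a small pose difference
# (a twist delta_xi) between consecutive frames; the current pose is
# the predicted increment composed with the previous frame's pose.
T_prev = np.eye(4)  # previous frame's pose (placeholder value)
delta_xi = np.array([0.01, 0.0, 0.002, 0.0, 0.03, 0.0])  # hypothetical network output
T_curr = se3_exp(delta_xi) @ T_prev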
Keywords
Fast convergence network for target pose tracking driven by synthetic data
Peng Hong1, Wang Qian1, Jia Di1,2, Zhao Jinyuan1, Pang Yuheng1
(1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China)
Abstract
Objective Rigid object pose estimation is one of the fundamental and most challenging problems in computer vision and has garnered substantial attention in recent years. Researchers seek methods to localize the degrees of freedom of rigid objects in a 3D scene, namely translation in position and rotation in orientation. Progress in rigid object pose estimation has been considerable alongside the development of computer vision techniques, and the task has become increasingly important in applications including robotics, on-orbit servicing, autonomous driving, and augmented reality. Rigid object pose estimation can be divided into two stages: the traditional pose estimation stage (e.g., feature-based, template-matching, and 3D-coordinate-based methods) and the deep-learning-based pose estimation stage (e.g., improved traditional methods and direct or indirect estimation methods). Although existing methods and their improved variants achieve high tracking accuracy, their precision deteriorates substantially when they are applied to new scenes or novel target objects, and they perform poorly in complex environments. In such cases, a large amount of training data is required for deep learning across multiple scenarios, incurring high costs for data collection and network training. To address this issue, this paper proposes a synthetic-data-driven real-time tracking network for rigid object 6D pose with fast convergence and high robustness. The network provides long-term stable 6D pose tracking of target rigid objects and greatly reduces the cost of data collection and the time required for network convergence.
Method The network convergence speed is improved mainly through the overall network design, the residual sampling filtering module, and the feature aggregation module. The rigid 6D pose transformation is calculated using Lie group and Lie algebra theory. The current-frame RGB-D image and the previous frame's pose estimate are transformed into a pair of 4D tensors and input into the network. The pose difference is obtained through residual sampling filtering and feature encoding, and the current 6D pose of the target is computed jointly with the previous frame's pose estimate. In the design of the residual sampling filtering module, the self-gated Swish activation function is used to retain target detail features, and the translation vector and rotation matrix are obtained by decoupling the target pose through the feature encoder and decoder, which improves the accuracy of target pose tracking. In the design of the feature aggregation module, the features are decomposed into horizontal and vertical components and aggregated into 1D feature encodings, capturing long-range dependencies across time and space while preserving position information. A set of complementary position- and time-aware feature maps is generated to strengthen the feature extraction ability of the network and thereby accelerate its convergence.
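The feature aggregation module described above, which pools features along the horizontal and vertical axes and re-weights the input with position-aware maps, closely resembles coordinate-attention-style designs. The following PyTorch sketch illustrates the idea under that assumption; the class name, reduction ratio, and layer sizes are hypothetical rather than the authors' exact architecture.

import torch
import torch.nn as nn

class DirectionalAggregation(nn.Module):
    # Sketch of the described aggregation: pool the feature map along the
    # horizontal and vertical axes, encode the two resulting 1D strips
    # jointly, and re-weight the input with position-aware attention maps.
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # aggregate over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # aggregate over height
        self.encode = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.SiLU(),  # Swish activation, as named in the paper
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Decompose into vertical (height) and horizontal (width) descriptors
        fh = self.pool_h(x)                          # (n, c, h, 1)
        fw = self.pool_w(x).permute(0, 1, 3, 2)      # (n, c, w, 1)
        y = self.encode(torch.cat([fh, fw], dim=2))  # joint 1D encoding
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.attn_h(yh))                      # (n, c, h, 1)
        aw = torch.sigmoid(self.attn_w(yw.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * ah * aw  # complementary position-aware re-weighting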
Result To ensure consistent training and testing environments, all experiments are conducted on a desktop computer with an Intel Core i7-8700 @ 3.2 GHz processor and an NVIDIA RTX 3060 GPU. Each target in the complete dataset contains approximately 23 000 sets of 176 × 176 pixel images, totaling about 15 GB. During training and validation, the batch size is set to 80 and the model is trained for 300 epochs. The initial learning rate is set to 0.01, with decay rates of 0.9 and 0.99 applied starting from the 100th and the 200th epochs, respectively. When evaluating tracking performance, the average distance of model points (ADD) metric is commonly used to assess pose estimation accuracy for non-symmetric objects: the Euclidean distance between each predicted model point and the corresponding ground-truth point is computed, and these distances are averaged. However, the ADD metric is unsuitable for symmetric objects, because multiple correct poses may exist for a symmetric object in the same image. In such cases the ADD-S metric is used instead: for each point of the ground-truth model, the distance to the closest point of the predicted model is computed, and these closest-point distances are averaged, which makes the metric appropriate for evaluating the pose tracking results of symmetric objects. The Yale-CMU-Berkeley video (YCB-Video) dataset and the Yale-CMU-Berkeley in end-of-arm-tooling (YCBInEoAT) dataset are used to evaluate the performance of the relevant methods. The YCB-Video dataset contains complex scenes captured by a moving camera under severe occlusion, whereas the YCBInEoAT dataset involves rigid objects manipulated by a robotic arm; together, the two datasets validate the generality and robustness of the network across different scenarios. Experimental results show that the tracking speed of the proposed method reaches 90.9 Hz, and the ADD and average closest point distance (ADD-S) reach 93.24 and 95.84, respectively, both higher than those of comparable methods. Compared with se(3)-TrackNet, the method with the highest tracking accuracy among existing approaches, the ADD and ADD-S of the proposed method are 25.95 and 30.91 higher when trained on 6 000 sets of synthetic data, 31.72 and 28.75 higher with 8 000 sets, and 35.57 and 21.07 higher with 10 000 sets, respectively. The method achieves highly robust 6D pose tracking of targets in severely occluded scenes.
Conclusion A novel fast-converging network is proposed for tracking the pose of rigid objects, combining the residual sampling filtering module and the feature aggregation module. The network provides long-term, effective 6D pose tracking of objects with only one initialization. Using a small amount of synthetic data, it quickly reaches convergence and achieves desirable performance in complex scenes involving severe occlusion and drastic displacement, while demonstrating outstanding real-time pose tracking efficiency and accuracy. Experimental results on different datasets validate the superiority and reliability of the approach. In future work, we will continue to optimize the model, further improve object tracking accuracy and network convergence speed, address the network's dependence on computer-aided design (CAD) models, and achieve category-level pose tracking.
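For reference, the ADD and ADD-S metrics used in the evaluation above can be computed over a model point cloud as in the following numpy sketch; the function names are illustrative.

import numpy as np

def transform(points, T):
    # Apply a 4 x 4 rigid transform to an (N, 3) model point cloud.
    return points @ T[:3, :3].T + T[:3, 3]

def add_metric(points, T_pred, T_gt):
    # ADD: mean distance between corresponding model points under the
    # predicted and ground-truth poses (non-symmetric objects).
    diff = transform(points, T_pred) - transform(points, T_gt)
    return np.linalg.norm(diff, axis=1).mean()

def add_s_metric(points, T_pred, T_gt):
    # ADD-S: for each ground-truth point, the distance to the closest
    # predicted point, averaged (suitable for symmetric objects).
    p = transform(points, T_pred)  # (N, 3)
    g = transform(points, T_gt)    # (N, 3)
    d = np.linalg.norm(g[:, None, :] - p[None, :, :], axis=2)  # (N, N)
    return d.min(axis=1).mean()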
Keywords