Transformer network for stereo matching of weak texture objects
Abstract
Objective In recent years, performing stereo matching with neural networks has become a research hotspot in computer vision. However, existing methods lack global representations for weak-texture objects. To address this, this paper proposes a dense feature extraction network based on the Transformer architecture. Method First, a spatial pooling window strategy allows the Transformer layers to capture broad contextual representations while maintaining linear computational complexity, compensating for the feature scarcity caused by locally weak textures. Second, overlapping patch embedding is implemented with convolution and transposed convolution, so that every feature point captures as many neighboring features as possible, facilitating fine-grained matching. Third, a skip-query strategy is applied to the feature fusion between the encoder and decoder to achieve efficient information transfer. Finally, to handle occlusions in stereo pairs, the matching probabilities within a fixed region are truncated and summed to output more reasonable occlusion confidence. Result Ablation experiments on the Scene Flow dataset show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an occlusion prediction intersection over union (IoU) of 98%. To verify the effectiveness of the model in real road scenes, supplementary comparative experiments were conducted on the KITTI-2015 dataset, where the proposed method achieves an average outlier percentage of 1.78%; all of the above metrics outperform mainstream methods such as STTR (stereo Transformer). In addition, tests on the KITTI-2015, MPI-Sintel (Max Planck Institute Sintel), and Middlebury-2014 datasets demonstrate the strong generalization ability of the model. Conclusion This paper presents a purely Transformer-based dense feature extractor that reduces the spatial scale of attention computation with a spatial pooling window strategy and effectively fuses encoder and decoder features with a skip-query strategy, markedly improving feature extraction performance under the Transformer architecture.
Transformer network for stereo matching of weak texture objects
Jia Di1, Cai Peng1, Wu Si2, Wang Qian1, Song Huilun1
1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. State Grid Huludao Electric Power Supply Company, Huludao 125000, China
Abstract
Objective In recent years, the use of neural networks for stereo matching has become a major research topic in computer vision. Stereo matching is a classic and computationally intensive vision task that underpins advanced applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of rectified stereo images, the goal is to match corresponding pixels along epipolar lines and compute their horizontal displacement, known as disparity. Many researchers have explored deep learning-based stereo matching methods and achieved promising results, with convolutional neural networks commonly used to build the feature extractors. Although convolution-based feature extractors have yielded significant performance gains, such networks remain constrained by their fundamental operation, the convolution. By definition, convolution is a linear operator with a limited receptive field, so obtaining a sufficiently broad contextual representation requires stacking many convolutional layers in a deep architecture. This limitation is particularly pronounced in stereo matching, where captured image pairs inevitably contain large weak-texture areas and substantial computational resources are needed to build comprehensive global representations through repeated stacking of convolutional layers. To address this issue, we build a dense feature extraction Transformer (FET) for stereo matching, which combines Transformer and convolution blocks. Method In the context of stereo matching, FET offers three key advantages. First, when processing high-resolution stereo image pairs, the spatial pooling window inside each Transformer block maintains linear computational complexity while still capturing a sufficiently broad contextual representation, addressing the feature scarcity caused by locally weak textures. Second, we use convolution and transposed convolution blocks to implement subsampling and upsampling overlapping patch embeddings, ensuring that every feature point captures as many neighboring features as possible to facilitate fine-grained matching. Third, we employ a skip-query strategy for feature fusion between the encoder and the decoder to transmit information efficiently. Finally, we adopt the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture; this strategy truncates the summation of matching probabilities within a fixed region to output more reasonable occlusion confidence values.
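To make the first point concrete, the minimal PyTorch sketch below shows one way a spatial pooling window can keep attention linear in the number of pixels: keys and values are average-pooled to a fixed-size grid, so each query attends to a constant number of positions. The module and parameter names (PooledWindowAttention, pool_hw, dim) are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (not the paper's code): single-head attention whose keys and
# values are average-pooled to a fixed pool_hw x pool_hw grid, so the cost per
# query is O(pool_hw^2) instead of O(H*W) -- linear overall in pixel count.
import torch
import torch.nn as nn

class PooledWindowAttention(nn.Module):
    def __init__(self, dim: int, pool_hw: int = 7):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Pool the key/value feature map to a fixed pool_hw x pool_hw grid.
        self.pool = nn.AdaptiveAvgPool2d(pool_hw)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))          # (B, H*W, C)
        pooled = self.pool(x).flatten(2).transpose(1, 2)  # (B, M, C), M = pool_hw^2
        k, v = self.kv(pooled).chunk(2, dim=-1)           # (B, M, C) each
        attn = (q @ k.transpose(1, 2)) * self.scale       # (B, H*W, M)
        out = attn.softmax(dim=-1) @ v                    # (B, H*W, C)
        return self.proj(out).transpose(1, 2).reshape(b, c, h, w)

# Usage: y = PooledWindowAttention(dim=64)(torch.randn(1, 64, 96, 320))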
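The overlapping patch embedding of the second point can be sketched as a strided convolution whose kernel exceeds its stride, paired with a transposed convolution of the same geometry for upsampling; the kernel and stride values here are assumptions chosen for illustration, not the paper's settings.

# Sketch of overlapping patch embedding: kernel 7 with stride 4 means each
# embedded patch shares pixels with its neighbours, and the matching transposed
# convolution restores the original spatial resolution.
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 64, patch: int = 7, stride: int = 4):
        super().__init__()
        pad = patch // 2
        self.down = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=stride, padding=pad)
        self.up = nn.ConvTranspose2d(dim, in_ch, kernel_size=patch, stride=stride,
                                     padding=pad, output_padding=stride - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(x)   # (B, dim, H/stride, W/stride)

m = OverlapPatchEmbed()
x = torch.randn(1, 3, 256, 512)
tokens = m(x)      # (1, 64, 64, 128): overlapping downsampled embedding
recon = m.up(tokens)   # (1, 3, 256, 512): transposed conv restores the size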
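The truncated summation for occlusion can be illustrated as follows: for each left-image pixel, only the matching probability inside a fixed window around the most likely disparity is summed, and low in-window mass signals occlusion. The tensor shapes and window half-width w below are assumptions for the sketch; the paper's exact formulation may differ.

# Sketch of truncated summation (STTR-style matching assumed): rows of
# match_prob are left pixels on an epipolar line, columns are candidate
# right-image positions, and each row sums to at most 1.
import torch

def occlusion_confidence(match_prob: torch.Tensor, w: int = 2) -> torch.Tensor:
    best = match_prob.argmax(dim=-1, keepdim=True)           # (B, W, 1)
    offsets = torch.arange(match_prob.size(-1), device=match_prob.device)
    in_window = (offsets.view(1, 1, -1) - best).abs() <= w   # (B, W, W) mask
    in_mass = (match_prob * in_window).sum(dim=-1)           # truncated sum
    return 1.0 - in_mass                                     # high => likely occluded

probs = torch.softmax(torch.randn(1, 64, 64), dim=-1)
conf = occlusion_confidence(probs)   # (1, 64), values in [0, 1]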
Result We implemented the model in PyTorch and trained it on an NVIDIA RTX 3090 GPU, employing mixed precision to reduce GPU memory consumption and speed up training. However, training a pure Transformer architecture in mixed precision proved unstable: the loss diverged after only a few iterations. To address this, we modified the computation order of the attention scores to suppress the related overflows, restructuring the attention calculation based on the invariance of softmax to constant shifts. Ablation experiments were conducted on the Scene Flow dataset. Results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an occlusion prediction intersection over union (IoU) of 98%. Additional comparative experiments on the KITTI-2015 dataset validate the effectiveness of the model in real-world driving scenarios: the proposed method achieves an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets demonstrate the strong generalization capability of the model. Subsequently, because publicly available datasets offer only a limited definition of weak-texture levels, we employed a clustering approach to filter images from the Scene Flow test set. Each pixel was treated as a sample, with its RGB values as the feature dimensions; the number of distinct pixel clusters found in each image quantifies how strongly or weakly the image is textured. The images were then categorized into “difficult”, “moderate”, and “easy” cases according to the number of clusters. Comparative analysis shows that our approach consistently outperforms existing methods across all three categories, with a particularly notable improvement in the “difficult” cases. Conclusion For the stereo matching task, we propose a feature extractor based on the Transformer architecture. First, we transplant the Transformer encoder-decoder architecture into the feature extractor, effectively combining the inductive bias of convolutions with the global modeling capability of the Transformer; the Transformer-based extractor captures a broader range of contextual representations, partially alleviating the region ambiguity caused by locally weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder to achieve efficient information transfer and mitigate the semantic discrepancy between them. We also design a spatial pooling window strategy to reduce the heavy computational burden introduced by overlapping patch embeddings, keeping the attention computation within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded-region prediction, and domain generalization compared with related methods.
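The overflow fix mentioned in the Result section exploits the fact that softmax is invariant to adding a constant to all of its inputs: softmax(z) == softmax(z - c). Since the exact reordering used in the paper is not spelled out in the abstract, the sketch below is an illustrative reconstruction: the query is scaled before the matrix product, and the row maximum is subtracted before exponentiation, so half-precision logits never grow large.

# Illustrative reconstruction, not the paper's exact code.
import torch

def stable_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    scale = q.size(-1) ** -0.5
    # Scale q before the matmul (rather than scaling the finished q @ k^T),
    # keeping the intermediate products small in half precision.
    logits = (q * scale) @ k.transpose(-2, -1)
    # Subtract the per-row max; the softmax result is unchanged by this shift.
    logits = logits - logits.amax(dim=-1, keepdim=True)
    return logits.softmax(dim=-1) @ v

q, k, v = (torch.randn(2, 64, 32) for _ in range(3))
out = stable_attention(q, k, v)   # (2, 64, 32)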
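The texture-grading protocol described above can be approximated with a simple greedy color clustering: each pixel is a sample with RGB features, and the number of resulting clusters scores the image's texture richness. The tolerance, sample budget, and category thresholds below are assumptions; the abstract does not state the exact clustering settings used.

# Sketch only: greedy (leader) clustering of subsampled RGB pixels; few
# clusters => weak texture, many clusters => rich texture.
import numpy as np

def count_color_clusters(image: np.ndarray, tol: float = 20.0, max_samples: int = 2000) -> int:
    # image: (H, W, 3) uint8. Assign each sampled pixel to the first cluster
    # centre within `tol` in RGB distance, creating a new cluster otherwise.
    rng = np.random.default_rng(0)
    pixels = image.reshape(-1, 3).astype(np.float32)
    idx = rng.choice(len(pixels), size=min(max_samples, len(pixels)), replace=False)
    centres = []
    for p in pixels[idx]:
        if not centres or np.linalg.norm(np.stack(centres) - p, axis=1).min() > tol:
            centres.append(p)
    return len(centres)

# Illustrative grading (thresholds assumed):
# n = count_color_clusters(img)
# grade = "difficult" if n < 8 else "moderate" if n < 20 else "easy"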
Keywords
stereo matching; weak texture object; Transformer; spatial pooling window; skip query; truncated summation; Scene Flow; KITTI-2015