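Towards universal person re-identification: a survey of the applications of large-scale pre-trained model techniques in person re-identification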
Feng Zhanxiang, Lai Jianhuang, Yuan Zang, Huang Yuli, Lai Peijie (Sun Yat-sen University)
Abstract
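Person re-identification (re-id) aims to match the identities of pedestrians captured by cameras with non-overlapping fields of view. It is a research hotspot in computer vision, with significant research value and broad application prospects in security surveillance. Because annotation is prohibitively expensive, pedestrian datasets are small, the performance of current re-id models still falls short of practical requirements, and universal person re-id remains a distant goal. In recent years, large-scale pre-trained models have attracted wide attention and developed rapidly, and their core techniques are increasingly applied to person re-id. This paper comprehensively reviews the applications of large-scale pre-trained model techniques in person re-id. We first introduce the research background: starting from the current state of re-id research and the difficulties it faces, we briefly describe pre-training techniques and large-scale pre-trained models, and analyze the research significance and application prospects of these techniques for re-id. On this basis, we describe re-id research based on large-scale pre-trained models in detail, dividing existing work into three categories: large-scale self-supervised pre-training for re-id, re-id guided by large-scale pre-trained models, and re-id based on prompt learning. We also compare the effectiveness and performance of state-of-the-art algorithms on multiple datasets. Finally, we summarize the task, analyze the limitations of current research, and discuss future research directions. Overall, large-scale pre-trained model techniques are indispensable for achieving universal person re-id. Current research is still exploratory, and the integration of re-id with large-scale pre-trained model techniques is not yet tight. How to combine pedestrian priors with these techniques to achieve universal re-id requires joint thought and effort from academia and industry.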
Keywords
Towards universal person re-identification: a survey on the applications of large-scale self-supervised pre-trained models for person re-identification
Feng Zhanxiang, Lai Jianhuang, Yuan Zang, Huang Yuli, Lai Peijie (Sun Yat-sen University)
Abstract
Person re-identification (re-id) aims to recognize target pedestrians across non-overlapping camera views. It is a research hotspot in computer vision, with significant research value and broad application prospects in security surveillance. The performance of re-id techniques has grown rapidly in recent years, and state-of-the-art (SOTA) methods outperform human beings. Furthermore, researchers are paying increasing attention to re-id in challenging uncontrolled environments, including visible-infrared, occluded, cloth-changing, low-resolution, and aerial person re-id. However, the performance of re-id models is still far from satisfactory and does not meet the requirements of real applications. There are two major reasons. First, existing re-id models are trained on closed datasets that cover a single scenario and provide sufficient labeled pedestrians. In real application environments, however, scenarios vary widely, conditions differ greatly from camera to camera, and labeled pedestrians are expensive to collect. Therefore, the performance, robustness, and generalization ability of existing methods are insufficient to support realistic applications. Second, because of the high annotation cost, re-id datasets are small. The number of training samples for re-id is much smaller than for other vision tasks such as face recognition, object recognition, and segmentation. As a result, re-id models may overfit the training images and generalize poorly. Consequently, there is still a long way to go to reach universal person re-id.
Recently, large-scale pre-trained models have attracted significant attention and developed rapidly, and their key techniques are important for the development of re-id. In this paper, we survey the applications of large-scale pre-training techniques to person re-id. First, we introduce the background of large-scale pre-trained models. Self-supervised pre-training has achieved great success in natural language processing (NLP), where the Transformer architecture has shown superior performance in extracting robust features. GPT and BERT are pioneering large-scale pre-trained models based on the Transformer and have proven to generalize well to downstream tasks. GPT-3 shows that large-scale pre-trained models can be competitive with SOTA supervised models without annotations. Following the success of GPT-3, many researchers have tried to apply self-supervised pre-training to vision tasks, and pioneering work has been done on vision-language cross-modal tasks. ViLBERT is among the first attempts to learn the relationships between vision and language. The CLIP model shows strong generalization ability on zero-shot vision tasks. MAE adopts masked modeling to train a pre-trained model with good generalization ability. Large-scale pre-training can thus improve the performance and generalization ability of baseline models using large-scale unlabeled data. This is very important for re-id, because it removes the need to collect large numbers of labeled pedestrians, which is expensive. Moreover, the knowledge contained in large-scale pre-trained models can be exploited to improve the performance of re-id models.
Because self-supervised pre-training can benefit re-id models, some researchers have made pioneering efforts in this direction. Here we introduce the existing research on large-scale pre-training for re-id. We categorize the current literature into three types: self-supervised pre-training re-id methods, large-scale pre-trained model based re-id methods, and prompt learning based re-id methods. We describe these methods in detail and compare the effectiveness and performance of the SOTA methods on various benchmarks. Self-supervised pre-training re-id methods employ self-supervised pre-training techniques and large-scale unlabeled pedestrian benchmarks to train a robust pre-trained model. Because labeled pedestrians are scarce and expensive, some researchers have constructed weakly supervised or unsupervised benchmarks for studying self-supervised pre-training for re-id. SYSU-30K is the first large-scale weakly supervised re-id dataset; it consists of over 30 million images and 30,000 identities collected from 1,000 downloaded videos. The challenges of SYSU-30K include low resolution, view changes, occlusion, and changing illumination. LUPerson is the first large-scale unsupervised person benchmark; it contains more than 4.2 million unlabeled pedestrian images from 46,000 scenes and covers the challenges of illumination variation, changing resolution, and occlusion. The researchers then applied tracking to the LUPerson data and constructed the weakly supervised dataset LUPerson-NL, which contains more than 10 million pedestrian images and 430,000 noisy identities. With the emergence of large-scale unsupervised datasets, researchers have begun to apply self-supervised techniques to re-id. Some studies use the contrastive learning framework to learn robust re-id models from unlabeled pedestrians. The MoCo framework and the catastrophic forgetting score are used to improve the generalization ability of re-id models.
In addition, some studies exploit prior knowledge about pedestrians to improve self-supervised pre-training. Local structure, view information, and color information are used to incorporate such priors into pre-training re-id methods. Large-scale pre-trained model based re-id methods exploit the knowledge of multi-modal large-scale models and use the interaction between vision and language to improve the performance of re-id models. Because the CLIP model has shown superior performance on zero-shot vision tasks, most of the related studies use CLIP to learn discriminative and robust re-id models. Llama2 has also been adopted for the re-id task. Prompt learning based re-id methods introduce prompt learning to learn a robust re-id model. First, these methods use the relationships between text descriptions and visual features to learn a more discriminative and robust model. Second, researchers employ prompts to make the model adaptive to different environments, with the goal of obtaining a universal re-id model that can cope with changing environments. Experimental results show that self-supervised techniques, large-scale pre-trained models, and prompt learning methods can significantly improve the performance and generalization ability of re-id models, yielding a more universal re-id model for unseen scenarios.
Finally, we conclude the overview of the current literature, analyze the limitations of existing research, and discuss potential directions for future research. In conclusion, large-scale pre-training techniques are essential for universal re-id. Existing research is pioneering but immature, and the connection between re-id and large-scale pre-trained models is still loose. How to combine pedestrian priors with the knowledge of large-scale models to achieve universal re-id still requires joint thought and effort from academia and industry.
Keywords