Vision Transformer-based recognition tasks: a critical review

Zhou Lijuan, Mao Jianing (School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China)

Abstract
The Transformer model has achieved excellent results in natural language processing and, because of its ability to better connect vision and language, has also aroused great interest in the computer vision community. This paper surveys more than 100 representative vision Transformer methods across a variety of recognition tasks, compares model performance within each task, and on that basis summarizes the strengths, weaknesses, and open challenges of each class of models. According to recognition granularity, we consider global-recognition methods such as image classification and video classification, as well as local-recognition methods such as object detection and visual segmentation. Given the wide popularity of existing methods in three specific recognition tasks, we also summarize methods for face recognition, action recognition, and pose estimation, together with the state of research on general-purpose methods applicable to multiple vision tasks or independent of a particular domain. Transformer-based models enable many end-to-end approaches and continually pursue a balance between accuracy and computational cost. For global recognition tasks, Transformer models have explored patch-sequence splitting and token feature representation; for local recognition tasks, Transformer models perform well because they capture global information more effectively. In face recognition and action recognition, the attention mechanism reduces errors in feature representation and can handle rich and diverse features. Transformers can resolve the feature-misalignment problem in pose estimation, improve the performance of regression-based methods, and reduce the ambiguity introduced by depth mapping in 3D estimation. Extensive exploration demonstrates the effectiveness of vision Transformers in recognition tasks, and improvements in feature representation and network structure help to boost performance.
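The attention mechanism credited above with reducing feature-representation errors can be made concrete with a minimal single-head scaled dot-product self-attention sketch in NumPy; the weights and dimensions here are purely illustrative and do not correspond to any surveyed model's actual parameters:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token
    sequence x of shape (n_tokens, d_model)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # pairwise token affinities
    # softmax over the key dimension (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # each token becomes a weighted mix of all tokens

rng = np.random.default_rng(0)
x = rng.standard_normal((197, 64))  # e.g. a ViT-style token sequence
w = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (197, 64)
```

Because every token attends to every other token, the output at each position mixes information from the whole sequence, which is the long-range modeling ability the surveyed methods exploit.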
Keywords
Vision Transformer-based recognition tasks: a critical review

Zhou Lijuan, Mao Jianing(School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China)

Abstract
Owing to its ability to model long-range dependencies, its adaptive self-attention mechanism, its scalability to large models and big data, and its capacity to connect vision and language, the Transformer has proved valuable for both natural language processing and computer vision. To bring the Transformer into vision tasks, vision Transformer methods have been developed intensively. The existing literature can be summarized and analyzed across multiple application-oriented methods; however, these applications are heterogeneous, comparative analysis has mostly been limited to Transformers versus traditional convolutional neural networks (CNNs), and different Transformer models are rarely compared with one another. We summarize and compare more than 100 popular vision Transformer methods for various recognition tasks. Global recognition-based methods are reviewed for image and video classification, and local recognition-based methods for object detection and visual segmentation. We further summarize methods for the three widely studied recognition tasks of face recognition, action recognition, and pose estimation. In addition, we survey single-task and domain-independent methods applicable to image classification, object detection, and other vision tasks, and we compare and analyze the performance of these Transformer-based models on public datasets. For image classification, features are mostly represented in terms of visual and class tokens; the vision Transformer (ViT) and data-efficient image Transformers (DeiT) illustrate the potential of such models on the ImageNet datasets. Object detection tasks require detecting target objects in the input visual data and predicting the coordinates and labels of a series of bounding boxes.
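The visual-token and class-token representation mentioned above can be sketched in a few lines of NumPy: an image is split into non-overlapping patches, each patch is linearly embedded, and a class token is prepended to the sequence. The random projection below stands in for the learned embedding, and all sizes are illustrative:

```python
import numpy as np

def image_to_tokens(image, patch_size, embed_dim, rng=None):
    """Split an image into non-overlapping patches, linearly embed each
    flattened patch, and prepend a class token (ViT-style)."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image must divide evenly into patches"
    # (H/p * W/p) patches, each flattened to a p*p*c vector
    patches = (image.reshape(h // p, p, w // p, p, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, p * p * c))
    # random projection standing in for the learned patch embedding
    proj = rng.standard_normal((p * p * c, embed_dim)) * 0.02
    tokens = patches @ proj                            # (N, embed_dim)
    cls = rng.standard_normal((1, embed_dim)) * 0.02   # class token
    return np.concatenate([cls, tokens], axis=0)       # (N + 1, embed_dim)

img = np.zeros((224, 224, 3))
seq = image_to_tokens(img, patch_size=16, embed_dim=768)
print(seq.shape)  # (197, 768): 14 * 14 patches plus one class token
```

After Transformer encoding, the class token's output typically serves as the global image representation fed to the classification head.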
Object detection is exemplified by the detection Transformer (DETR), which removes the indirectness of earlier pipelines that classify and regress through proposals, anchors, or windows. Subsequent work has focused on improving DETR's feature maps, computational complexity, and convergence speed to varying degrees, as in conditional DETR, deformable DETR, and unsupervised pre-training DETR (UP-DETR). Transformer-based models have also been applied to salient object detection, point cloud 3D detection, and few-shot object detection. Semantic segmentation assigns a class label to each pixel in the image, and the bounding box of an object, as in object detection, can be predicted and further optimized. However, semantic segmentation determines only pixel classes, and distinguishing multiple instances among similar pixels remains challenging. The Transformer has also received attention as a way to improve U-Net for medical image segmentation. It can be combined with a pyramid network, or different decoder structures can be designed for pixel-by-pixel segmentation, such as the segmentation Transformer with progressive upsampling (SETR-PUP) and the segmentation Transformer with multi-level feature aggregation (SETR-MLA). Mask classification methods are commonly used in instance segmentation and can also serve semantic segmentation through a Transformer structure such as Segmenter. Instance segmentation resembles a combination of object detection and semantic segmentation: compared with the bounding box output of object detection, its output is a mask, which segments object edges and distinguishes different instances of similar objects, extending the capability of semantic segmentation to some extent. The Transformer enables more end-to-end instance segmentation methods, and mask quality can be exploited and improved during the segmentation process.
The Transformer provides an alignment-free method for face recognition and can handle noise related to facial expressions and racial bias. Action recognition classifies human actions in input videos; it is similar to image classification, but additional processing of the temporal dimension is unavoidable. Transformers model long-term temporal and spatial dependencies for action recognition, going beyond two-stream networks and three-dimensional convolution. Pose estimation is usually formulated as locating human body keypoints and identifying the spatial relationships between body parts. It comprises 2D pose estimation, which determines the two-dimensional coordinates of body parts, and 3D pose estimation, which adds depth information to those coordinates. Transformers are used to refine keypoint features for pose estimation and to optimize the modeling of intra-frame joint relationships and inter-frame temporal relationships. Research on Transformer-based multi-task models focuses on integrating image classification, object detection, and semantic segmentation tasks, and other popular models applicable to both vision and language domains have also been proposed. Extensive research has shown the effectiveness of the vision Transformer in recognition tasks, and optimization of feature representation or network structure is beneficial for improving its performance. Future research directions include effective and efficient methods that preserve accuracy in the context of positional encoding, self-supervised learning, multimodal integration, and reduced computational cost.
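To illustrate the 2D pose estimation output described above, the sketch below decodes per-joint heatmaps into keypoint coordinates by taking each map's peak location. This is a generic heatmap-decoding step, not the method of any specific surveyed paper; the 17-joint layout is an assumption borrowed from common human-pose datasets:

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Turn per-joint heatmaps of shape (n_joints, H, W) into 2D (x, y)
    coordinates by locating each map's maximum response."""
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.divmod(flat_idx, w)        # row-major index -> (row, col)
    return np.stack([xs, ys], axis=1)      # (n_joints, 2) as (x, y)

hm = np.zeros((17, 64, 48))                # e.g. 17 body joints
hm[0, 10, 20] = 1.0                        # peak for joint 0 at x=20, y=10
print(decode_keypoints(hm)[0])  # [20 10]
```

3D pose estimation then lifts such 2D coordinates by adding a depth component, which is where the depth-mapping ambiguity discussed above arises.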
Keywords
