Few-Shot Classification Based on Local Feature Fusion
Abstract
Objective: Few-shot learning is a challenging task that aims to classify new categories using only a limited number of labeled samples. Metric-based meta-learning methods are currently the mainstream approach to few-shot classification, but they typically use only the global features of an image, and their classification performance depends heavily on the feature extraction network. To fully exploit local image features and improve the generalization ability of the model, we propose a few-shot classification method based on local feature fusion. Method: First, the input image is partitioned into multi-scale grid blocks and fed into the feature extraction network to obtain local features. Second, a local feature fusion module based on the Transformer architecture is designed to produce locally enhanced features that incorporate global information, improving the generalization ability of the model. Finally, classification is performed by computing the Euclidean distance between the feature vectors of query samples and the class prototypes of the support set. Result: Compared with state-of-the-art methods on three datasets commonly used in few-shot classification, under the 5-way 1-shot and 5-way 5-shot settings the proposed method improves classification accuracy over the second-best results by 2.96% and 2.9% on MiniImageNet and by 3.22% and 1.77% on CUB (Caltech-UCSD Birds-200-2011), respectively, while its accuracy on TieredImageNet is comparable to the best result. The experimental results demonstrate the effectiveness of the proposed method. Conclusion: The proposed few-shot classification method makes full use of local image features and improves the feature extraction and generalization abilities of the model, yielding more accurate few-shot classification.
Keywords
Local feature fusion network-based few-shot image classification
Dong Yangyang1, Song Beibei1, Sun Wenfang2 (1. School of Information Engineering, Chang'an University, Xi'an 710064, China; 2. School of Aerospace Science and Technology, Xidian University, Xi'an 710126, China)
Abstract
Objective Convolutional neural network based (CNN-based) deep learning techniques now benefit many image-related tasks, such as recognition, detection, and segmentation. However, the learning ability of a CNN depends on a large number of labeled samples, and the model suffers from over-fitting when labeled samples for some categories are insufficient. Collecting labeled samples is time-consuming and costly. Humans, by contrast, can learn from a small number of samples: given only a few pictures of each image category, a person can easily recognize new images of those categories. To give CNN models a similar learning ability, a machine learning paradigm called few-shot learning has attracted increasing attention. Few-shot learning classifies new image categories from a limited amount of annotated data. Metric-based meta-learning methods are currently among the most effective approaches to few-shot learning. However, they operate on global features, which cannot adequately represent image structure; local feature information, which provides discriminative and transferable cues across categories, should also be exploited. Some metric methods based on local feature representations obtain pixel-level deep local descriptors by removing the last global average pooling layer of the CNN, but these descriptors sacrifice the contextual information of the image, which limits classification performance. In addition, with limited labeled instances, it is difficult for the feature extraction network to learn a good feature representation and generalize to new categories.
To exploit the local features of images and improve the generalization ability of the model, we develop a few-shot classification method based on local feature fusion. Method First, to obtain local features, the input image is divided into H×W local blocks that are fed into the feature extraction network. This representation captures both the local information of the image and its context; multi-scale grid blocks are used. Second, to learn and fuse the relationships among the multiple local feature representations, we design a local feature fusion module based on the Transformer architecture, because the self-attention mechanism in the Transformer can effectively capture and fuse the relationships among input sequences. After fusion, each local feature incorporates information from the other local features and thus carries global information. We concatenate the multiple local feature representations of each image as the final output. This enhances the feature representation of the original input image and improves the generalization ability of the model. Finally, the Euclidean distance between the query image embedding and each support class prototype is computed to classify the query image. Our training process consists of two stages: pre-training and meta-training. In the pre-training stage, the backbone network with an attached Softmax layer is trained to classify all images in the training set. To improve the generalization ability of the model, we use data augmentation with random cropping, horizontal flipping, and color jittering. After pre-training, the backbone network is initialized with the pre-trained weights, and the other components are fine-tuned. In the meta-training stage, the episodic training strategy is used.
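The pipeline above can be sketched end to end in NumPy. This is an illustrative toy, not the paper's implementation: the 2×2 grid stands in for one scale of the multi-scale split, a single random linear projection (`W_embed`) stands in for the ResNet12 backbone, and a single self-attention step with identity Q/K/V projections stands in for the full Transformer fusion module.

```python
import numpy as np

rng = np.random.default_rng(0)

def grid_partition(img, gh, gw):
    """Split an image (C, H, W) into gh*gw non-overlapping local blocks."""
    c, h, w = img.shape
    bh, bw = h // gh, w // gw
    return [img[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            for i in range(gh) for j in range(gw)]

def embed(block, W_embed):
    """Stand-in for the backbone: flatten a block, project it to d dims."""
    return block.reshape(-1) @ W_embed  # (d,)

def self_attention_fuse(tokens):
    """Single-head self-attention (identity Q/K/V for simplicity) over the
    sequence of local-feature tokens, so each token mixes in global context."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)      # (n, n) similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    return attn @ tokens                         # (n, d) fused tokens

def image_embedding(img, gh, gw, W_embed):
    """Grid-split, embed each block, fuse with self-attention, concatenate."""
    tokens = np.stack([embed(b, W_embed)
                       for b in grid_partition(img, gh, gw)])
    return self_attention_fuse(tokens).reshape(-1)

def classify(query_emb, prototypes):
    """Nearest class prototype under Euclidean distance."""
    dists = np.linalg.norm(prototypes - query_emb, axis=1)
    return int(np.argmin(dists))

# Toy 5-way 1-shot episode: 5 support images (one per class) and 1 query.
gh = gw = 2                      # one scale of the multi-scale grid split
c, h, w, d = 3, 8, 8, 16
W_embed = rng.standard_normal((c * (h // gh) * (w // gw), d)) * 0.1

support = [rng.standard_normal((c, h, w)) for _ in range(5)]
prototypes = np.stack([image_embedding(s, gh, gw, W_embed) for s in support])
query = support[2] + 0.01 * rng.standard_normal((c, h, w))  # near class 2
print(classify(image_embedding(query, gh, gw, W_embed), prototypes))
```

In a K-shot episode (K > 1) the prototype would be the mean of the K fused support embeddings per class, exactly as in the Euclidean-distance metric described above.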
For a fair comparison with other few-shot classification methods, the ResNet12 structure is used as the backbone feature extractor, and the cross-entropy classification loss is optimized by stochastic gradient descent (SGD). The initial learning rate is set to 5×10⁻⁴; we train for 100 epochs in total, halving the learning rate every 10 epochs, with 100 training episodes and 600 validation episodes per epoch. Because the TieredImageNet dataset contains more samples and larger domain differences, more iterations are required for the model to converge; we therefore train for 200 epochs and halve the learning rate every 20 epochs. In the test stage, 5 000 episodes are randomly sampled from the test set to evaluate the average classification accuracy. Result Comparative experiments are conducted on three benchmark datasets for few-shot classification. On MiniImageNet, the average classification accuracy improves by 2.96% and 2.9% under the 5-way 1-shot and 5-way 5-shot settings, respectively. On CUB, the average classification accuracy increases by 3.22% and 1.77%. On TieredImageNet, the proposed method is on par with the state-of-the-art method in average classification accuracy. A large number of ablation experiments further verify the effectiveness of the proposed method. Conclusion We develop a local feature fusion based method for few-shot classification. It makes full use of local features and improves both the feature extraction ability and the generalization ability of the model. The proposed Transformer-based local feature fusion module further enhances feature representation and can potentially be embedded into other few-shot classification methods.
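The step-decay schedule described above (halve the rate every 10 epochs on MiniImageNet/CUB, every 20 on TieredImageNet) can be written as a one-line function; the function name is ours, but the constants are those stated in the text.

```python
def lr_at_epoch(epoch, base_lr=5e-4, half_every=10):
    """Step schedule from the training setup: start at base_lr and halve
    the learning rate every `half_every` epochs (10 for MiniImageNet and
    CUB over 100 epochs, 20 for TieredImageNet over 200 epochs)."""
    return base_lr * 0.5 ** (epoch // half_every)
```

The same effect is obtained in common frameworks with a built-in step scheduler (e.g. a StepLR-style scheduler with step size 10 or 20 and decay factor 0.5).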
Keywords