Fine-grained few-shot learning incorporating weakly-supervised object localization

He Xiaojian, Lin Jinfu (School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China)

Abstract
Objective Few-shot learning aims to learn novel categories from only one or a few images. Many current few-shot learning methods rely on global image representations and perform well on conventional few-shot image classification. Fine-grained image classification, however, depends on local image features, which global-representation-based methods cannot capture effectively; as a result, many few-shot learning methods handle fine-grained few-shot image classification poorly. To address this, we propose a fine-grained few-shot learning method that incorporates weakly-supervised object localization. Method When data are limited, object localization is an effective means of directly providing the most discriminative regions. Motivated by this, we propose a self-attention based complementary localization module that performs weakly-supervised object localization and generates a filtering mask for selecting feature descriptors. Based on the selected descriptors, we design a semantic alignment distance that measures the correlation between the most discriminative regions of two images, thereby completing fine-grained few-shot image classification. Result On the miniImageNet dataset, the proposed method outperforms the second-best method by 0.56% and 5.02% under the 1-shot and 5-shot settings, respectively. On the fine-grained Stanford Dogs and Stanford Cars datasets, it improves over the second-best method by 4.18%, 7.49% and 16.13%, 5.17% under the 1-shot and 5-shot settings, respectively. On the CUB 200-2011 (Caltech-UCSD Birds) dataset, it improves over the second-best method by 1.82% under the 5-shot setting. Generalization experiments also show that the proposed method handles conventional and fine-grained few-shot learning equally well. In addition, visualization results show that the proposed weakly-supervised object localization module localizes objects more completely. Conclusion The proposed fine-grained few-shot learning method with weakly-supervised object localization significantly improves fine-grained few-shot image classification and can handle both conventional and fine-grained few-shot image classification.
Keywords
Weakly-supervised object localization based fine-grained few-shot learning

He Xiaojian, Lin Jinfu (School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China)

Abstract
Objective Few-shot learning (FSL) aims to recognize novel visual categories from only one or a few labeled samples. In a typical few-shot scenario, the model is trained with a classification strategy in the meta-train phase and is then required to recognize previously unseen classes from few labeled samples in the meta-test phase. Most current few-shot image classification methods learn a robust global representation, which works well for common few-shot image classification but struggles with fine-grained image classification, because a global representation cannot capture the local and subtle features that are critical for fine-grained recognition. Moreover, fine-grained datasets contain few samples per class owing to the high cost of annotation, which naturally fits the few-shot setting; fine-grained image recognition therefore suffers from a lack of annotated data. Fine-grained recognition generally depends on locating the most discriminative regions and exploiting their discriminative features, yet many fine-grained recognition methods cannot be transferred directly to the fine-grained few-shot task because the extra annotations they require (e.g., bounding boxes) are unavailable. It is therefore necessary to handle the general few-shot learning task and the fine-grained few-shot learning task jointly. Method Weakly-supervised object localization (WSOL) is well suited to fine-grained few-shot classification. Most fine-grained few-shot datasets provide only image-level labels because pixel-level annotation is expensive, and WSOL can directly provide the most discriminative regions, which benefits both general and fine-grained image classification. However, many existing WSOL methods cannot localize objects completely. For instance, the class activation map (CAM) approach retrains only the last few layers of the classification network (global pooling followed by a fully connected layer) and tends to highlight only the most discriminative part of the object. To tackle these issues, we propose a self-attention based complementary module (SACM) to perform WSOL. SACM consists of a channel-based attention module (CBAM) and a classifier module. Using spatial attention over the feature maps, CBAM directly generates a saliency mask, and a complementary non-saliency mask is obtained at the same time by thresholding. The saliency mask and the complementary non-saliency mask are multiplied spatial-wise with the feature maps to obtain the saliency and non-saliency feature maps, respectively. By forcing the classifier to assign both the saliency and the non-saliency feature maps to the same category, a more complete class activation map is obtained. We then use this class activation map to filter the local feature descriptors, keeping only those useful for classification as the descriptor-based representation of an image.
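To make this design concrete, the following PyTorch sketch illustrates the complementary-masking idea under stated assumptions: the single-convolution spatial-attention head, the threshold tau, and the CAM computation are simplified stand-ins for the authors' CBAM-based module, not their released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SACM(nn.Module):
    """Sketch of a self-attention based complementary module: a saliency mask
    and its complement are applied to the feature maps, and both masked maps
    are classified into the same category to obtain a more complete CAM."""

    def __init__(self, in_channels, num_classes, tau=0.5):
        super().__init__()
        # Simplified spatial-attention head (stand-in for CBAM's spatial attention).
        self.attn = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.classifier = nn.Linear(in_channels, num_classes)
        self.tau = tau  # threshold separating salient from non-salient positions

    def forward(self, feat):                          # feat: (B, C, H, W)
        saliency = torch.sigmoid(self.attn(feat))     # (B, 1, H, W), values in [0, 1]
        sal_mask = (saliency >= self.tau).float()     # saliency mask
        non_sal_mask = 1.0 - sal_mask                 # complementary non-saliency mask
        sal_feat = feat * sal_mask                    # spatial-wise multiplication
        non_sal_feat = feat * non_sal_mask
        # Shared classifier on globally pooled features of both branches; during
        # training, both sets of logits receive the same ground-truth label.
        sal_logits = self.classifier(F.adaptive_avg_pool2d(sal_feat, 1).flatten(1))
        non_sal_logits = self.classifier(F.adaptive_avg_pool2d(non_sal_feat, 1).flatten(1))
        # CAM-style class activation maps from the shared classifier weights.
        cam = torch.einsum('kc,bchw->bkhw', self.classifier.weight, feat)
        return sal_logits, non_sal_logits, saliency, cam
```

The class activation map returned by this sketch is what would then be thresholded into the filtering mask used for descriptor selection, as illustrated after the next paragraph.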
In addition, the metrics used for common few-shot classification cannot be applied directly to fine-grained few-shot image classification. Inspired by the naive Bayes nearest neighbor (NBNN) algorithm, we therefore define a semantic alignment distance over the filtered feature descriptors to measure the distance between two fine-grained images. First, for each filtered query feature descriptor, we find its nearest descriptor in the support set under the cosine distance, denoted as the nearest-neighbor cosine distance. Then, we accumulate the nearest-neighbor cosine distances of all filtered query descriptors to obtain the semantic alignment distance. These two steps constitute the semantic alignment module (SAM). Each feature descriptor of the query image is aligned with a support feature descriptor through the nearest-neighbor cosine distance, which ensures that the content of the query image and the support image is semantically aligned. Meanwhile, the set of feature descriptors spans a much larger search space than a single high-dimensional feature vector, which is equivalent to classifying in a relatively "high-data" regime and thus improves the robustness of the metric to noise. Result We conducted extensive experiments to verify the performance of the proposed method. On the miniImageNet dataset, it outperforms the second-best method by 0.56% and 5.02% under the 1-shot and 5-shot settings, respectively. On the fine-grained Stanford Dogs and Stanford Cars datasets, it improves by 4.18%, 7.49% and 16.13%, 5.17% under the 1-shot and 5-shot settings, respectively. On CUB 200-2011, it also improves by 1.82% under the 5-shot setting. The proposed approach can be applied to both general few-shot learning and fine-grained few-shot learning. The ablation study shows that filtering the feature descriptors with the SACM-based class activation map improves fine-grained few-shot recognition, and that the proposed semantic alignment distance outperforms the Euclidean distance for few-shot classification under the same conditions. Further visualizations illustrate that the proposed SACM can localize objects more completely using only image-level labels. Conclusion The proposed WSOL-based fine-grained few-shot learning method significantly improves fine-grained few-shot image classification and handles both common and fine-grained few-shot learning well.
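A minimal sketch of the descriptor filtering and the semantic alignment distance follows. The binary localization mask, the function names, and the use of an accumulated cosine-similarity score are illustrative assumptions consistent with the abstract's NBNN-style description, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def filter_descriptors(feat, mask):
    """feat: (C, H, W) backbone feature map; mask: (H, W) binary localization
    mask derived from the class activation map. Returns the (N, C) local
    feature descriptors located inside the mask."""
    desc = feat.flatten(1).t()            # (H*W, C) local descriptors
    keep = mask.flatten().bool()          # keep only the most discriminative positions
    return desc[keep]

def semantic_alignment_distance(query_desc, support_desc):
    """For each filtered query descriptor, find its nearest support descriptor
    under cosine similarity and accumulate these nearest-neighbor similarities
    (NBNN-style); a larger score means the query image is semantically closer
    to the support class."""
    q = F.normalize(query_desc, dim=1)    # (Nq, C) unit-norm query descriptors
    s = F.normalize(support_desc, dim=1)  # (Ns, C) unit-norm support descriptors
    cos = q @ s.t()                       # (Nq, Ns) pairwise cosine similarities
    nearest, _ = cos.max(dim=1)           # best support match per query descriptor
    return nearest.sum()                  # accumulated semantic alignment score
```

In an N-way episode, a query image would then be assigned to the class whose support descriptors yield the largest accumulated score.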
Keywords
