双分支注意和特征交互的小样本细粒度学习
文浪, 苟光磊, 白瑞峰, 缪宛谕(重庆理工大学)
摘 要
目的 细粒度图像分类旨在区分视觉上高度相似但语义不同的类别。在实际应用中,获取大量带标签的数据往往成本昂贵且需要专业技能。传统的分类方法难以捕捉细粒度图像中的微小差异,使得在少量样本环境下,细粒度图像分类的性能较差。因此,研究如何应用小样本学习方法来解决细粒度问题显得尤为重要。为此,提出了一种双分支注意和特征交互的小样本细粒度图像分类方法。

方法 首先,在特征提取网络中设计了双分支注意力模块,该模块通过空间和通道两个路径动态调整模型对不同部分的关注程度,从而学习细粒度图像中的更多细节特征和辨别特征。其次,采用随机抽样策略生成查询子集,并引入特征交互模块计算查询子集与支持集样本之间的相关性,进而对支持特征自适应分配权重,以增强样本特征中最具区分性的区域。最后,通过结合关系网络度量和余弦相似度来测量查询样本与支持集类原型之间的相关分数,实现图像分类。

结果 在CUB-200-2011数据集上,本文方法在5-way 1-shot与5-way 5-shot任务设置下的分类准确率相较于次优方法高出5.95%和1.21%。在Stanford Dogs数据集上,本文方法在1-shot与5-shot下的分类准确率相较于次优方法提高了4.15%和2.29%。在Stanford Cars数据集上,本文方法的分类准确率优于绝大多数对比实验方法。复杂度分析实验表明,本文提出的双分支注意力模块在内存开销和训练时间方面表现良好。此外,可视化实验结果显示该模块在捕捉细粒度长距离依赖关系方面的有效性,从而更全面地识别细粒度图像特征。

结论 所提出的小样本细粒度分类方法在未显著增加模型复杂度的情况下,增强了样本特征的表达能力。同时,该方法优化了样本分布,使同类样本在特征空间中更加紧密相邻,而不同类别样本则相对更为远离。
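为直观说明上述双分支注意力模块的基本思路,下面给出一个基于 PyTorch 的最小示意草图(仅为示意,并非本文的原始实现):通道分支与空间分支分别对特征重新加权,两路输出沿通道维拼接后再经 1×1 卷积融合;其中压缩比 reduction、7×7 卷积核以及示例中的 Conv-4 风格特征图尺寸均为原文未给出的假设参数。

```python
import torch
import torch.nn as nn


class BifurcatedAttention(nn.Module):
    """双分支注意力示意实现:通道分支与空间分支并行,输出沿通道维拼接后融合(结构与超参数均为假设)。"""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # 通道注意力分支:全局平均池化 + 两层 1x1 卷积(SE 风格)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 空间注意力分支:沿通道聚合均值/最大值图,再用 7x7 卷积得到空间权重
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # 拼接后的 2C 通道特征经 1x1 卷积压回 C 通道
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 通道分支:按通道重要性重新加权
        channel_out = x * self.channel_gate(x)
        # 空间分支:按空间位置重要性重新加权
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        spatial_out = x * self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        # 两个分支沿通道维拼接后融合
        return self.fuse(torch.cat([channel_out, spatial_out], dim=1))


if __name__ == "__main__":
    feat = torch.randn(4, 64, 21, 21)               # 假设的 Conv-4 骨干输出特征图
    print(BifurcatedAttention(64)(feat).shape)      # torch.Size([4, 64, 21, 21])
```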
关键词
小样本学习;细粒度图像分类;注意力机制;度量学习;元学习;特征交互
Bifurcated attention and feature interaction for few-shot fine-grained learning
Wen Lang, Gou Guanglei, Bai Ruifeng, Miao Wanyu (Chongqing University of Technology)
Abstract
Objective Fine-grained image classification seeks to differentiate between categories that are visually very similar yet semantically distinct, which makes it markedly different from general image classification. In practice, collecting a large amount of labeled fine-grained image data is both time-consuming and costly, and accurate annotation typically requires specialist expertise, further raising the difficulty of building such datasets. Traditional classification methods are limited in their ability to capture the minute variations between categories in fine-grained images, which often leads to unsatisfactory performance, particularly when only a few samples are available. It is therefore critical to explore how few-shot learning methods can be applied to fine-grained image classification. Few-shot fine-grained image classification addresses the challenge of learning from a very limited number of labeled examples while accurately differentiating between similar categories. Metric-based meta-learning approaches are currently among the leading few-shot learning methods. However, they tend to rely on global image features, making it difficult to fully capture the intricate structures and subtle differences that characterize fine-grained images. Moreover, existing few-shot learning techniques often struggle with the challenges specific to fine-grained classification, namely large intra-class variability and high inter-class similarity, which can severely limit classification performance. To address these issues, we propose a novel few-shot fine-grained image classification method that integrates bifurcated attention and feature interaction mechanisms.

Method Our approach begins with a bifurcated attention module embedded in the feature extraction network. This module dynamically adjusts the model's attention to different parts of the image through two pathways: spatial attention and channel attention. The spatial attention path lets the model focus on important regions of the image according to their spatial relevance, while the channel attention path adjusts the focus according to the importance of different feature channels. The features extracted from the two attention branches are then concatenated along the channel dimension, allowing the model to capture the detailed and discriminative features that are crucial for fine-grained image classification. By concentrating on informative areas of the image, the bifurcated attention module avoids unnecessary computation on irrelevant or less informative regions, which lowers the overall computational burden while better highlighting the subtle differences between fine-grained categories. In addition, we employ a random sampling strategy to generate query subsets, reducing the number of parameters involved in the classification process and making the model more efficient in handling few-shot tasks. After generating these query subsets, we average the features of each category within each subset.
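To make the random-sampling step above concrete, here is a minimal sketch assuming a PyTorch-style episode; the helper name sample_query_subset, the per-class sample count k_sample, and the Conv-4-like feature shapes are illustrative assumptions rather than the exact procedure reported in the paper.

```python
import torch


def sample_query_subset(query_feats, query_labels, n_way, k_sample):
    """Randomly draw k_sample query examples per class and average their features,
    yielding one query-subset prototype per class (hypothetical helper, not the
    paper's exact procedure)."""
    subset_protos = []
    for c in range(n_way):
        idx = torch.where(query_labels == c)[0]                 # indices of class c
        picked = idx[torch.randperm(idx.numel())[:k_sample]]    # random subset
        subset_protos.append(query_feats[picked].mean(dim=0))   # per-class average
    return torch.stack(subset_protos)                           # (n_way, C, H, W)


if __name__ == "__main__":
    # Toy 5-way episode: 15 query images per class with Conv-4-like feature maps
    feats = torch.randn(5 * 15, 64, 21, 21)
    labels = torch.arange(5).repeat_interleave(15)
    print(sample_query_subset(feats, labels, n_way=5, k_sample=5).shape)
```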
We then introduce a feature interaction module that calculates the correlation between the query subset and the support set samples, enabling the model to better capture the relationships between them. Using the computed correlation, the model adaptively assigns weights to the support features, emphasizing the most distinctive regions in the feature space of the samples. At the same time, we take into account the inter-channel dependencies within the support features themselves, selectively highlighting the most important channels so that the model focuses on the aspects of the image that are most relevant for separating similar categories. To further improve classification, we combine a relation-network metric with cosine similarity to measure the correlation between query samples and support-set class prototypes. Incorporating both metrics lets the model assess the similarities and differences between samples more accurately, ultimately improving few-shot fine-grained image classification performance.

Result Our method demonstrates strong performance across several benchmark datasets. On the CUB-200-2011 dataset, its classification accuracy surpasses that of the second-best method by 5.95% and 1.21% in the 5-way 1-shot and 5-way 5-shot settings, respectively, a notable improvement in the few-shot regime where only a limited number of labeled examples are available. On the Stanford Dogs dataset, our method improves classification accuracy over the second-best method by 4.15% in the 1-shot task and 2.29% in the 5-shot task, further validating its effectiveness for few-shot fine-grained classification. On the Stanford Cars dataset, our method outperforms most comparative methods, showing that it generalizes across different fine-grained image datasets. Complexity analysis experiments show that the proposed bifurcated attention module performs well in terms of memory overhead and training time: despite capturing more detailed features, it does not introduce significant computational complexity. Visualization experiments further confirm the module's effectiveness in capturing long-range dependencies within fine-grained images, allowing the distinctive features of fine-grained images to be identified more comprehensively and thus classified more accurately.

Conclusion The few-shot fine-grained image classification method proposed in this study enhances the feature representation of samples without significantly increasing model complexity. By incorporating bifurcated attention and feature interaction mechanisms, it better captures the subtle differences between fine-grained categories, leading to improved classification performance. The method also optimizes the distribution of samples in feature space, so that samples from the same category cluster more tightly while samples from different categories are more clearly separated. Compared with the baseline methods, our approach achieves superior classification accuracy and overall performance on few-shot fine-grained classification tasks.
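As a concrete illustration of the feature interaction module and the combined relation/cosine metric described in the Method section, the following is a minimal PyTorch-style sketch; the gating layers in FeatureInteraction, the small RelationHead, and the blending weight alpha are assumptions made for illustration and are not the paper's reported architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureInteraction(nn.Module):
    """Reweights support features by their correlation with a query-subset
    prototype (a sketch of the idea; layer sizes are assumptions)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel-wise gate driven by both support and query-subset statistics
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, support, query_proto):
        # support: (N, C, H, W); query_proto: (C, H, W)
        s_vec = support.mean(dim=(2, 3))                        # (N, C)
        q_vec = query_proto.mean(dim=(1, 2)).expand_as(s_vec)   # (N, C)
        weights = self.gate(torch.cat([s_vec, q_vec], dim=1))   # (N, C)
        return support * weights.unsqueeze(-1).unsqueeze(-1)


class RelationHead(nn.Module):
    """Small relation module that scores a (query, prototype) feature pair."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1),
        )

    def forward(self, query, proto):
        return self.net(torch.cat([query, proto], dim=1)).squeeze(-1)


def combined_score(query, proto, relation_head, alpha=0.5):
    """Blend a learned relation score with cosine similarity (alpha is assumed)."""
    cos = F.cosine_similarity(query.flatten(1), proto.flatten(1), dim=1)
    rel = relation_head(query, proto)
    return alpha * rel + (1.0 - alpha) * cos


if __name__ == "__main__":
    C, H, W, n_way = 64, 21, 21, 5
    interact, head = FeatureInteraction(C), RelationHead(C)
    support = torch.randn(n_way, C, H, W)        # one shot per class
    q_proto = torch.randn(C, H, W)               # query-subset prototype
    query = torch.randn(1, C, H, W)              # a single query image
    protos = interact(support, q_proto)          # reweighted class prototypes
    scores = combined_score(query.expand(n_way, -1, -1, -1), protos, head)
    print(scores.shape)                          # torch.Size([5])
```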
Keywords
few-shot learning; fine-grained image classification; attention mechanism; metric learning; meta-learning; feature interaction