Fine-grained classification with a complementary attention diversity feature fusion network


Abstract
Objective Transformer-based networks have shown excellent performance in image classification. However, attention mechanisms tend to focus only on the salient features of an image and ignore sub-salient information in other regions, and self-attention-based Transformers are no exception. To obtain more effective information and learn more discriminative features from distinguishable latent features, we propose a complementary attention diversity feature fusion network (CADF), which enhances attention to feature diversity by attending to sub-salient features and jointly encoding channel and spatial features. Method CADF consists of a potential feature module (PFM) and a diversity feature fusion module (DFFM). PFM obtains salient features by aggregating regions of interest across space and channels, and then suppresses their saliency to force the network to mine latent features, strengthening its perception of subtle discriminative cues. DFFM explores the correlations among features and models the interactions between features of different scales to obtain richer complementary information and thus produce stronger fine-grained features. Result The method can be trained end to end and requires neither bounding boxes nor multi-stage training. It is validated on four benchmark datasets, CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Dogs, Stanford Cars, and FGVC-Aircraft (fine-grained visual classification of aircraft), where it reaches accuracies of 92.6%, 94.5%, 95.3%, and 93.5%, respectively. Experimental results show that the method outperforms current mainstream approaches and performs well across multiple datasets. Ablation studies verify the effectiveness of each module in the model. Conclusion The proposed method performs strongly: complementary attention effectively increases feature diversity, so that discriminative features are captured as fully as possible and classification becomes more accurate.
Complementary attention diversity feature fusion network for fine-grained classification

Huang Gang1, Zheng Yuanlin1,2, Liao Kaiyang1, Lin Guangfeng1, Cao Congjun1,2,3, Song Xuefang1(1.Faculty of Printing, Packaging Engineering and Digital Media Technology, Xi'an University of Technology, Xi'an 710048, China;2.Key Laboratory of Printing and Packaging Engineering of Shaanxi Province, Xi'an 710048, China;3.Printing and Packaging Engineering Technology Research Centre of Shaanxi Province, Xi'an 710048, China)

Abstract
Objective Fine-grained visual classification aims to divide a broad category, such as wild birds or vehicles, into more detailed subcategories. Because inter-category differences are subtle and intra-category variation is large, it is challenging to capture the subtle differences in the specific regions that matter for classification. Although Transformer-based networks perform well in image classification, the attention mechanism, including the self-attention in Transformers, still tends to attend only to the most salient features in an image, and most latent features are ignored. To obtain more effective information, discriminative feature representations must also be learned from latent features. We therefore develop a complementary attention diversity feature fusion (CADF) network, which extracts multi-scale features and models the channel and spatial feature interactions of images. Method The proposed network consists of two modules. 1) Potential feature module (PFM): it attends to the features of different parts, suppressing the most salient responses so that latent features are preserved and mined alongside the salient ones. 2) Diversity feature fusion module (DFFM): it models the channel and spatial information interactions among multiple features, so that information from specific parts is enriched through feature fusion. The multi-scale features exchange mutually beneficial information, which makes the features more robust and more discriminative. The network is implemented in PyTorch and trained on an NVIDIA 2080Ti GPU. The model weights are initialized with Swin Transformer parameters pre-trained on the ImageNet classification dataset.
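The saliency-suppression idea behind PFM can be illustrated with a minimal sketch. The top-k masking rule, the suppression ratio, and the toy attention map below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def suppress_salient(feature_map, ratio=0.25):
    """Zero out the top `ratio` fraction of activations, forcing a
    subsequent attention pass onto sub-salient (latent) regions."""
    flat = feature_map.flatten()
    k = max(1, int(ratio * flat.size))
    top_idx = np.argpartition(flat, -k)[-k:]  # indices of the k largest values
    mask = np.ones_like(flat)
    mask[top_idx] = 0.0
    return (flat * mask).reshape(feature_map.shape)

# Toy 4x4 attention map with values 1..16; the 4 largest (13..16) are suppressed.
fmap = np.arange(1, 17, dtype=float).reshape(4, 4)
suppressed = suppress_salient(fmap, ratio=0.25)
print(suppressed.max())  # 12.0: the former peaks are zeroed out
```

With the peaks masked out, a second attention pass over `suppressed` can only respond to the remaining sub-salient regions, which is the mechanism the module uses to diversify attention.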
Optimization uses the AdamW optimizer with a momentum of 0.9 and a cosine annealing scheduler. The batch size is set to 6, the learning rate of the backbone layers to 0.0001, the learning rate of the newly added layers to 0.00001, and the weight decay to 0.05. For training, input images are resized to 550×550 pixels, randomly cropped to 448×448 pixels, and further augmented with random horizontal flips. For testing, input images are resized to 550×550 pixels and center-cropped to 448×448 pixels. The hyperparameters are set to λ = 1 and β = 0.5. Result To verify its effectiveness, the method is evaluated on four fine-grained datasets: CUB-Birds, Stanford Dogs, Stanford Cars, and FGVC-Aircraft, where the classification accuracy reaches 92.6%, 94.5%, 95.3%, and 93.5%, respectively. Ablation experiments further verify the effectiveness of the PFM and DFFM modules. Adding the PFM module alone already improves accuracy considerably over the baseline framework: Swin-B + PFM improves accuracy by 1.4%, 1.4%, and 0.8% on the CUB-Birds, Stanford Dogs, and Stanford Cars datasets, respectively. Adding the feature exchange fusion module on top of PFM (Swin-B + PFM + DFFM) improves accuracy by a further 0.4%, 0.5%, and 0.3%. These results show that the CADF model has strong feature extraction ability, and the effectiveness of each structure in the network is verified on the datasets. Feature visualization is also conducted to show intuitively which regions the attention mechanism focuses on. Conclusion To address the insufficiency of attention-based feature extraction, we develop a latent feature extraction method for fine-grained image classification.
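The cosine annealing mentioned in the training settings follows the standard closed form. In this sketch, the 0.0001 backbone learning rate comes from the settings above, while the 100-step horizon is an illustrative assumption:

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine annealing: decay from lr_max to lr_min over total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# Backbone learning rate 0.0001 as in the training settings; 100 steps is illustrative.
lrs = [cosine_annealed_lr(t, 100, 1e-4) for t in range(101)]
print(lrs[0], lrs[50], lrs[100])  # starts at 1e-4, halves at the midpoint, reaches 0
```

In practice this per-step computation corresponds to PyTorch's built-in `torch.optim.lr_scheduler.CosineAnnealingLR`, which applies the same formula to every parameter group of the optimizer.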
Keywords
