Cross-modality pedestrian re-identification with a dual-grained feature fusion network
Abstract
Objective Visible-infrared cross-modality pedestrian re-identification aims to match visible and infrared images of the same pedestrian identity. Existing methods mainly narrow the modality gap through modality-shared feature learning or modality transformation; the former usually focuses only on global or local feature representations, while the latter suffers from unreliable generated modalities. In fact, the contour has a certain cross-modality invariance and is also a relatively reliable cue for pedestrian identification. To effectively exploit contour information to reduce the inter-modality discrepancy, this paper takes the contour as an auxiliary modality and proposes a contour-guided dual-grained feature fusion network for cross-modality pedestrian re-identification. Method At the global granularity, fusion from the pedestrian image to its contour image enhances the global feature representation of the contour and yields the augmented contour features. At the local granularity, fusion between the augmented contour features and part-based local features combines global and local features and produces the fused image representation. Result The model is evaluated on two public datasets for visible-infrared cross-modality pedestrian re-identification and outperforms several representative methods. On the SYSU-MM01 (Sun Yat-sen University multiple modality 01) dataset, the proposed method achieves a rank-1 accuracy of 62.42% and a mean average precision (mAP) of 58.14%. On the RegDB (Dongguk body-based person recognition database) dataset, the rank-1 accuracy and mAP are 84.42% and 77.82%, respectively. Conclusion This paper introduces contour information into cross-modality pedestrian re-identification and proposes a contour-guided dual-grained feature fusion network that performs feature fusion at both the global and local granularities to learn discriminative features. Its performance surpasses several representative methods of recent years, verifying the effectiveness of the contour cue and the way it is used.
Keywords
Cross-modality pedestrian re-identification with a dual-grained feature fusion network
Ma Xiaofeng1, Cheng Wengang1,2 (1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China; 2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding 071003, China)
Abstract
Objective Visible-infrared cross-modality pedestrian re-identification (VI-ReID) aims to match visible and infrared images of the same identity. As a popular technique for intelligent surveillance, it remains challenging because of the cross-modality discrepancy. Beyond the intra-class variations in RGB image-based pedestrian re-identification (RGB-ReID), one crucial challenge in VI-ReID is to bridge the modality gap between the RGB and infrared (IR) images of the same identity. Current methods mainly follow modality-shared feature learning or modality transformation approaches. Specifically, modality-shared feature learning methods map the RGB and IR inputs into a common embedding space for cross-modality feature alignment; a two-stream convolutional neural network (two-stream CNN) architecture is widely adopted, and several discriminative constraints have been developed as well. However, since each filter only covers a small region, convolutions struggle to capture long-range spatial concepts, and quantitative studies show that CNNs are strongly biased toward textures rather than shapes. Moreover, existing VI-ReID methods focus on either global or local feature representations only. The other line, modality transformation, generates cross-modality pedestrian images or transforms images into an intermediate modality; generative adversarial networks (GANs) and encoder-decoder structures are commonly used for these methods. However, owing to the distorted IR-to-RGB translation and additional noise, the generated images are unreliable and GAN models are difficult to converge. RGB images consist of three color channels, while IR images contain only a single channel reflecting the thermal radiation emitted from the human body and its surroundings. Considering the missing colors and textures in IR images, we revisit the VI-ReID problem and recognize the contour as a relatively effective feature. Furthermore, the contour is a modality-shared cue, as it remains consistent across IR and RGB images and is more accurate and reliable than a generated intermediate modality. We therefore integrate the contour into VI-ReID by taking it as an auxiliary modality to narrow the modality gap. Meanwhile, we introduce part-based local features into our model to collaborate with the global ones. Method A contour-guided dual-grained feature fusion network (CGDGFN) is developed for VI-ReID. It involves two types of fusion. The first is image-to-contour fusion at the image level, called global-grained fusion (G-Fusion), which outputs the augmented contour features. The other fuses the augmented contour features with local features at a mixed image-and-part level; as local features are involved, it is called local-grained fusion (L-Fusion) for simplicity. The proposed CGDGFN consists of four branches: 1) RGB images, 2) IR images, 3) RGB contours, and 4) IR contours. The network takes a pair of RGB and IR images as input and feeds them into the RGB branch and the IR branch, while a contour detector generates their contour images. The contour images of the two modalities are then fed into the RGB-contour branch and the IR-contour branch.
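The abstract does not name the contour detector used to produce the RGB-contour and IR-contour inputs. The following minimal Python sketch illustrates one possible way to generate them, using OpenCV's Canny edge detector purely as a stand-in and hypothetical file names; it is not the paper's actual detector.

```python
# Minimal sketch of contour-image generation for the two contour branches.
# Canny edge detection is only an illustrative stand-in; the actual contour
# detector used by the paper is not specified in this abstract.
import cv2

def to_contour(image_path, low=100, high=200):
    """Read an RGB or IR pedestrian image and return a binary contour map."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # IR is single-channel; RGB collapses to gray
    gray = cv2.GaussianBlur(gray, (5, 5), 0)              # suppress noise before edge detection
    return cv2.Canny(gray, low, high)                      # binary contour image

# Hypothetical file names: one RGB-IR pair and its two contour images
rgb_contour = to_contour("rgb_0001.jpg")
ir_contour = to_contour("ir_0001.jpg")
```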
ResNet50 is used as the backbone architecture for each branch. The first convolutional layer in each branch has independent parameters to capture modality-specific information, while the remaining blocks share weights to learn modality-invariant features. In addition, the average pooling layers of the RGB branch and the IR branch are restructured for part-based feature extraction. G-Fusion fuses an image into its corresponding contour image. After G-Fusion, the augmented contour features are produced by the global average pooling layers of the RGB-contour branch and the IR-contour branch. Meanwhile, the RGB branch and the IR branch output the corresponding local features. Each RGB or IR local feature is an array of feature vectors whose length is determined by the partition setting. Two local feature extraction methods are involved: 1) uniform partition and 2) soft partition. L-Fusion is responsible for fusing the augmented contour features with the corresponding local ones. The implementation of our method is based on the PyTorch framework. We adopt ResNet50 pre-trained on ImageNet as the backbone network, and the stride of the last convolutional layer is set to 1 to obtain feature maps with a larger spatial size. The batch size is set to 64. For each batch, we randomly select 4 identities, and each identity contributes 8 visible images and 8 infrared images. The input images are resized to 288×144 pixels, and random cropping and random horizontal flipping are used for data augmentation. The stochastic gradient descent (SGD) optimizer with a momentum of 0.9 is used for optimization. We first train the model for 60 epochs; the initial learning rate is set to 0.01 and a warmup strategy is applied to enhance performance. For soft partition, we fine-tune the model for an additional 20 epochs. Result The proposed CGDGFN is compared with state-of-the-art (SOTA) VI-ReID approaches, including global-feature, local-feature, and image-generation methods, on two databases: SYSU-MM01 (Sun Yat-sen University multiple modality 01) and RegDB (Dongguk body-based person recognition database). The standard cumulated matching characteristics (CMC) and mean average precision (mAP) are employed to evaluate the performance. The proposed method obtains a 62.42% rank-1 identification rate and a 58.14% mAP score on SYSU-MM01, and the rank-1 and mAP values on RegDB reach 84.42% and 77.82%, respectively, comparing favorably with popular SOTA approaches on both datasets. Conclusion We introduce the contour cue into VI-ReID. To leverage contour information, we take the contour as an auxiliary modality and develop a contour-guided dual-grained feature fusion network (CGDGFN). Global-grained fusion (G-Fusion) enhances the original contour representation and produces augmented contour features, while local-grained fusion (L-Fusion) fuses the part-based local features with the augmented contour features to output a more powerful image representation.
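To make the dual-grained fusion pipeline described above more concrete, the PyTorch sketch below outlines one possible forward pass of the four-branch network. It is a simplified illustration under stated assumptions: G-Fusion is shown as element-wise addition of image features into the contour branch, L-Fusion as concatenation of the augmented contour feature with uniformly partitioned local features, six horizontal parts are assumed, and losses and classifiers are omitted; the paper's exact fusion operators and part settings are not given in this abstract.

```python
# Simplified PyTorch sketch of the four-branch CGDGFN forward pass.
# Assumptions (not specified in the abstract): G-Fusion = element-wise addition,
# L-Fusion = concatenation, 6 horizontal parts, single-channel inputs replicated
# to 3 channels before the stem.
import torch
import torch.nn as nn
import torchvision


def shared_resnet50():
    """ResNet50 backbone with the last stage's stride set to 1, as stated in the abstract."""
    net = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    net.layer4[0].conv2.stride = (1, 1)
    net.layer4[0].downsample[0].stride = (1, 1)
    return net


class CGDGFN(nn.Module):
    def __init__(self, num_parts=6):
        super().__init__()
        backbone = shared_resnet50()
        # Modality-specific first convolutional layer (plus BN/ReLU/pooling) per branch
        self.stem = nn.ModuleDict({
            name: nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
            for name in ("rgb", "ir", "rgb_contour", "ir_contour")
        })
        # Remaining blocks share weights to learn modality-invariant features
        self.blocks = nn.Sequential(backbone.layer1, backbone.layer2,
                                    backbone.layer3, backbone.layer4)
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global pooling for contour branches
        self.parts = nn.AdaptiveAvgPool2d((num_parts, 1))  # part pooling for RGB/IR branches

    def _branch(self, x, name):
        return self.blocks(self.stem[name](x))             # (B, 2048, h, w)

    def forward(self, rgb, ir, rgb_c, ir_c):
        f_rgb, f_ir = self._branch(rgb, "rgb"), self._branch(ir, "ir")
        f_rgb_c, f_ir_c = self._branch(rgb_c, "rgb_contour"), self._branch(ir_c, "ir_contour")

        # G-Fusion: fuse each image feature map into its contour feature map,
        # then pool to obtain the augmented contour features
        aug_rgb = self.gap(f_rgb_c + f_rgb).flatten(1)      # (B, 2048)
        aug_ir = self.gap(f_ir_c + f_ir).flatten(1)

        # Part-based local features from the image branches (uniform partition)
        loc_rgb = self.parts(f_rgb).flatten(1)               # (B, 2048 * num_parts)
        loc_ir = self.parts(f_ir).flatten(1)

        # L-Fusion: combine the augmented contour features with the local features
        return torch.cat([aug_rgb, loc_rgb], 1), torch.cat([aug_ir, loc_ir], 1)


# Shape check with the 288×144 input size mentioned above; in practice the IR and
# contour inputs would be replicated to 3 channels before being fed to the network.
model = CGDGFN()
x = torch.randn(2, 3, 288, 144)
f_rgb, f_ir = model(x, x, x, x)
print(f_rgb.shape)  # torch.Size([2, 14336])
```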
Keywords