A deep progressive infrared and visible image fusion network

Qiu Defen, Hu Xingyu, Liang Pengwei, Liu Xianming, Jiang Junjun (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)

Abstract
Objective The goal of infrared and visible image fusion is to obtain a high-quality fused image that fully represents the scene. Because deep features offer good generalization, robustness and development potential, many deep learning based fusion methods have been proposed that perform image fusion in the deep feature space and achieve good results. In addition, inspired by traditional fusion methods based on multi-scale decomposition, features at different scales help preserve more information from the source images. On this basis, we propose a novel progressive infrared and visible image fusion framework (ProFuse). Method The framework uses U-Net as the backbone to extract multi-scale features and then fuses them progressively: fusion is performed both on high-level features that contain global information and on low-level features that contain more details, and both at the original resolution (preserving more details) and at smaller scales (preserving semantic information); the fused image is finally reconstructed layer by layer. Result Experiments on the TNO (Toegepast Natuurwetenschappelijk Onderzoek) and INO (Institut National d'Optique) datasets compare the proposed method with six other methods on six selected objective metrics. Our method improves mutual information (MI) by 115.64% over FusionGAN (generative adversarial network for infrared and visible image fusion), standard deviation (STD) by 19.93% over GANMcC (generative adversarial network with multiclassification constraints for infrared and visible image fusion), edge preservation (Qabf) by 1.91% over DWT (discrete wavelet transform) and entropy (EN) by 1.30% over GANMcC. In terms of subjective results, the fusion results of our method show higher contrast, more details and clearer targets. Conclusion Extensive experiments demonstrate the effectiveness and generalization of our method. Compared with other state-of-the-art methods, it achieves better results in both subjective and objective evaluations.
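The abstract describes ProFuse only at a high level. Below is a minimal PyTorch-style sketch of the described pipeline (shared U-Net-like encoder, per-scale fusion from small scales to large scales, layer-by-layer decoding). The class and argument names are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of the progressive multi-scale fusion pipeline described above.
# Encoder, fuse_blocks and decoder are assumed, hypothetical modules.
import torch.nn as nn

class ProgressiveFusionNet(nn.Module):
    def __init__(self, encoder, fuse_blocks, decoder):
        super().__init__()
        self.encoder = encoder            # image -> list of multi-scale features
        self.fuse_blocks = fuse_blocks    # nn.ModuleList, one fusion block per scale
        self.decoder = decoder            # fused multi-scale features -> fused image

    def forward(self, ir, vis):
        # Extract multi-scale features from each modality with a shared encoder.
        feats_ir = self.encoder(ir)       # [f1, f2, ..., fK], shallow (large) to deep (small)
        feats_vis = self.encoder(vis)

        # Fuse progressively, from small-scale semantic features to
        # large-scale detail features.
        fused = []
        for f_ir, f_vis, fuse in zip(reversed(feats_ir), reversed(feats_vis),
                                     self.fuse_blocks):
            fused.append(fuse(f_ir, f_vis))
        fused = list(reversed(fused))     # restore shallow-to-deep order for the decoder

        # Reconstruct the fused image layer by layer.
        return self.decoder(fused)
```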
Keywords
A deep progressive infrared and visible image fusion network

Qiu Defen, Hu Xingyu, Liang Pengwei, Liu Xianming, Jiang Junjun(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)

Abstract
Objective Multi-modal images have been developed based on multiple imaging techniques. The infrared image captures the radiation information of the target in the infrared band, whereas the visible image is better suited to human visual perception owing to its higher spatial resolution, richer effective information and lower noise. Infrared and visible image fusion (IVIF) can integrate the complementary information of multiple sensors to alleviate the limitations of hardware equipment and obtain richer information at low cost for high-quality images. IVIF is used in a wide range of applications such as surveillance, remote sensing and agriculture. However, several challenges remain in multi-modal image fusion, for instance, how to extract effective information from different modalities and how to design fusion rules for the complementary information of those modalities. Current research can be roughly divided into two categories: 1) traditional methods and 2) deep learning based methods. Traditional methods decompose the infrared and visible images into a transform domain so that the decomposed representations have special properties that benefit fusion, perform fusion in the transform domain, which suppresses information loss and avoids the artifacts caused by direct pixel manipulation, and finally reconstruct the fused image. Traditional methods rely on assumptions about the source image pair and on hand-designed decompositions to extract features. However, these hand-crafted features are not comprehensive; they may be sensitive to high-frequency or primary components and introduce image distortion and artifacts. In recent years, data-driven deep learning based image fusion methods have developed rapidly, and most of them perform infrared and visible image fusion in the deep feature space. Deep learning based fusion methods can be divided into two categories: 1) methods that use a convolutional neural network (CNN) for fusion and 2) methods that use a generative adversarial network (GAN) to generate fused images. CNN-based methods do not fully exploit the information extracted by intermediate layers, and GAN-based methods struggle to adequately preserve image details. Method We develop a novel progressive infrared and visible image fusion framework (ProFuse), which extracts multi-scale features with U-Net as the backbone, merges the multi-scale features and reconstructs the fused image layer by layer. Our network is composed of three parts: 1) an encoder, 2) a fusion module and 3) a decoder. First, a series of multi-scale feature maps is generated from the infrared image and the visible image by the encoder. Next, the multi-scale features of the infrared and visible image pair are fused in the fusion layer to obtain fused features. Finally, the fused features pass through the decoder to construct the fused image. The network architecture of the encoder and decoder is designed based on U-Net. The encoder consists of repeated applications of a recurrent residual convolutional unit (RRCU) and max pooling operations. Each down-sampling step doubles the number of feature channels so that more features can be extracted. The decoder aims to reconstruct the final fused image. Every step in the decoder consists of an up-sampling of the feature map followed by a 3 × 3 convolution that halves the number of feature channels, a concatenation with the corresponding feature maps from the encoder, and an RRCU.
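As an illustration of the encoder/decoder building blocks described above, here is a hedged PyTorch sketch of an RRCU and of one decoder step (up-sampling, 3 × 3 convolution halving the channels, concatenation with the encoder skip connection, then an RRCU). The layer details (recurrence depth t = 2, BatchNorm + ReLU, bilinear up-sampling) are assumptions, not specifics stated in the abstract.

```python
# Hedged sketch of the Method's building blocks: an RRCU and one decoder step.
import torch
import torch.nn as nn

class RRCU(nn.Module):
    """Recurrent residual convolutional unit: a residual connection around
    a convolution that is applied recurrently t times (details assumed)."""
    def __init__(self, channels, t=2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        out = self.conv(x)
        for _ in range(self.t - 1):
            out = self.conv(x + out)   # recurrent refinement
        return x + out                  # residual connection

class DecoderStep(nn.Module):
    """Up-sample, 3x3 conv halving the channels, concatenate the encoder
    skip features, then apply an RRCU, as described in the abstract."""
    def __init__(self, in_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.reduce = nn.Conv2d(in_channels, in_channels // 2, 3, padding=1)
        self.rrcu = RRCU(in_channels)   # after concatenation, channel count is restored

    def forward(self, x, skip):
        x = self.reduce(self.up(x))       # halve the number of feature channels
        x = torch.cat([x, skip], dim=1)   # concat corresponding encoder feature maps
        return self.rrcu(x)
```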
At the fusion layer, a spatial attention based fusion method is used to deal with the image fusion task. This method has two advantages. First, it performs fusion both on high-level features that contain global information (at the bottleneck semantic layer) and on low-level features that contain details (at the shallow layers). Second, it performs fusion not only at the original scale (maintaining more details) but also at smaller scales (maintaining semantic information). The progressive fusion design is therefore reflected in two aspects: 1) fusion is conducted progressively from high-level to low-level features and 2) progressively from small scales to large scales. Result To evaluate the fusion performance of our method, we conduct experiments on the publicly available Toegepast Natuurwetenschappelijk Onderzoek (TNO) dataset and compare it with several state-of-the-art (SOTA) fusion methods, including DenseFuse, discrete wavelet transform (DWT), FusionGAN, ratio of low-pass pyramid (RP), generative adversarial network with multiclassification constraints for infrared and visible image fusion (GANMcC) and curvelet transform (CVT). All competitors are implemented from their public code, with parameters set according to the original papers. Our method is compared with the other methods in a subjective evaluation, and several quality metrics are used to evaluate the fusion performance objectively. Generally speaking, the fusion results of our method have higher contrast, more details and clearer targets. Compared with the other methods, our method maximally preserves the detailed information of the visible image and the infrared radiation, while introducing very little noise and few artifacts. We evaluate the performance of the different fusion methods quantitatively using six metrics, i.e., entropy (EN), structural similarity (SSIM), edge-based similarity measure (Qabf), mutual information (MI), standard deviation (STD) and sum of the correlations of differences (SCD). Our method achieves larger values on EN, Qabf, MI and STD. The maximum EN value indicates that our method retains richer information than the other competitors. Qabf is an objective quality metric for fused images; the higher its value, the better the quality of the fused image. STD measures the richness of image information: the larger the value, the more scattered the gray-level distribution, the more information the image carries and the better the quality of the fused image. The larger the MI value, the more information is obtained from the source images and the better the fusion effect. Our method improves MI by 115.64% compared with the generative adversarial network for infrared and visible image fusion (FusionGAN) method, STD by 19.93% compared with GANMcC, edge preservation (Qabf) by 1.91% compared with DWT and EN by 1.30% compared with GANMcC. This indicates that our method is effective for the IVIF task. Conclusion Extensive experiments demonstrate the effectiveness and generalization of our method. Compared with other state-of-the-art methods, it shows better results in both qualitative and quantitative evaluations.
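The abstract states that the fusion layer is spatial-attention based but does not give the exact formulation. The sketch below shows one common choice (per-pixel activity from the mean absolute channel activation of each modality, normalized with a softmax); it is an assumption for illustration only and may differ from the rule actually used in ProFuse.

```python
# Hedged sketch of a spatial-attention fusion rule applied at one scale.
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    def forward(self, f_ir, f_vis):
        # Per-pixel activity maps: mean absolute activation over channels.
        a_ir = f_ir.abs().mean(dim=1, keepdim=True)    # (B, 1, H, W)
        a_vis = f_vis.abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)

        # Softmax across the two modalities gives spatial attention weights.
        weights = torch.softmax(torch.cat([a_ir, a_vis], dim=1), dim=1)
        w_ir, w_vis = weights[:, :1], weights[:, 1:]

        # Weighted sum of the infrared and visible feature maps.
        return w_ir * f_ir + w_vis * f_vis

# Usage with random tensors standing in for one scale of encoder features:
if __name__ == "__main__":
    f_ir = torch.randn(1, 64, 32, 32)
    f_vis = torch.randn(1, 64, 32, 32)
    fused = SpatialAttentionFusion()(f_ir, f_vis)
    print(fused.shape)  # torch.Size([1, 64, 32, 32])
```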
Keywords
