Deep attention guided image cropping with fine-grained feature aggregation
Abstract
Objective Cropping a better-composed region from an image is one of the most effective means of improving image aesthetics, and it is also a highly challenging problem in computer vision. To improve the visual quality of automatic image cropping, this paper proposes a deep attention guided image cropping network with fine-grained feature aggregation (DAIC-Net). Method The overall model consists of three modules, namely semantic feature extraction with channel calibration (ECC), fine-grained feature aggregation (FFA), and contextual attention fusion (CAF), and is trained end to end. The core idea is to progressively enhance regional features of different granularities at multiple scales, fuse global and local attention features, and strengthen the representation of contextual semantic information. The ECC module adaptively calibrates generic semantic features along the channel dimension by incorporating channel attention; the FFA module cascades complementary multiscale regional features to produce representations rich in image composition and spatial location information; the CAF module mimics the way human eyes view images and explicitly encodes the memory context relations between different spatial pixel blocks from different directions and at different scales. In addition, multiple loss functions are defined to guide model training in a multi-task supervised manner. Result The proposed method is compared with six state-of-the-art methods on three datasets and outperforms existing automatic cropping methods. On the recent GAICD (grid anchor based image cropping database) dataset, the Spearman and Pearson correlation metrics improve by 2.0% and 1.9%, respectively, and the other best-return metrics improve by up to 4.1%. Cross-dataset tests on ICDB (image cropping database) and FCDB (Flickr cropping database) further demonstrate the generalization ability of the proposed DAIC-Net. Moreover, ablation experiments verify the effectiveness of each module, and a user study together with qualitative analysis shows that DAIC-Net produces visually better cropping results. Conclusion The proposed DAIC-Net achieves the best prediction results on multiple evaluation metrics on the GAICD dataset, exhibits strong generalization ability on the ICDB and FCDB test sets, and effectively improves cropping quality.
Keywords
automatic image cropping; image aesthetics assessment (IAA); region of interest (RoI); spatial pyramid pooling (SPP); attention mechanism; multi-task learning
Deep attention guided image cropping with fine-grained feature aggregation
Fang Yuming, Zhong Yu, Yan Jiebin, Liu Lixia (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032, China)
Abstract
Objective Image cropping plays an important role in the aesthetics of photographic composition, aiming at cropping a region of interest (RoI) with a better aesthetic composition. It has been widely used in photography, printing, thumbnail generation, and other related fields, especially in image processing and computer vision tasks that must handle a large number of images simultaneously. However, modeling the aesthetic properties of image composition is highly challenging due to the subjectivity of image aesthetic assessment (IAA). In the past few years, many researchers tried to crop a target region that maximizes visually important information with the help of salient object detection or eye-fixation prediction. The results are often not in line with human preferences because the integrity of image composition is not taken into account. Recently, owing to the powerful representation ability of deep learning (mainly convolutional neural networks (CNNs)), many data-driven image cropping methods have been proposed and have achieved great success. Nevertheless, the candidate RoI crops of one image are highly similar to one another, which makes distinguishing their aesthetics more difficult than in generic IAA. Most existing CNN-based methods focus only on the feature of each cropped RoI and use rough location information, which is not robust enough to complex scenes, spatial deformation, and translation. Few methods consider fine-grained features together with local and global context dependence, both of which are remarkably beneficial to image composition understanding. Motivated by this, a novel deep attention guided image cropping network with fine-grained feature aggregation, namely DAIC-Net, is proposed. Method In an end-to-end learning manner, the overall model structure of DAIC-Net consists of three modules: semantic feature extraction with channel calibration (ECC), fine-grained feature aggregation (FFA), and global-to-local contextual attention fusion (CAF). Our main idea is to combine multiscale features and incorporate global and local contexts, which enhances informative contextual representation from coarse to fine. First, a backbone is used in ECC to extract high-level semantic feature maps of the input. Three popular architectures, namely Visual Geometry Group 16-layer network (VGG16), MobileNetV2, and ShuffleNetV2, are tested, and all of the variants achieve competitive performance. The output of the backbone is followed by a squeeze-and-excitation module, which exploits the attention between channels to calibrate channel features adaptively. Then, the FFA module concatenates multiscale regional information to generate various fine-grained features. This operation is designed to capture higher semantic representations and the complex composition rules of images. Almost no additional running time is incurred because the FFA module shares the low-dimensional semantic features. Moreover, to mimic the human visual attention mechanism, the CAF module is proposed to recalibrate the fine-grained features, generating contextual knowledge for each pixel by selectively scanning from different directions and scales. The input features of the CAF module are explicitly re-encoded by fusing global and local attention features, generating top-to-bottom and left-to-right contextual regional attention for each pixel, which yields richer context features and facilitates the final decision.
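Before the training objective is described, the following minimal PyTorch sketch makes the ECC-FFA-CAF pipeline above concrete. The MobileNetV2 backbone, the squeeze-and-excitation reduction ratio, the multi-grid RoI pooling sizes, and the way directional context is formed are all illustrative assumptions of ours, not the authors' released implementation.

```python
# Minimal PyTorch sketch of the ECC -> FFA -> CAF pipeline described above.
# Channel sizes, pooling grids, and module internals are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2
from torchvision.ops import roi_align

class ChannelCalibration(nn.Module):
    """ECC: squeeze-and-excitation style attention that reweights backbone channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                     # squeeze: global pooling
        return x * w.unsqueeze(-1).unsqueeze(-1)            # excite: per-channel scaling

class ContextualAttentionFusion(nn.Module):
    """CAF: fuse global context with row-wise (left-to-right) and column-wise
    (top-to-bottom) summaries, loosely mimicking directional scanning."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        g = x.mean(dim=(2, 3), keepdim=True).expand_as(x)   # global context
        row = x.mean(dim=3, keepdim=True).expand_as(x)      # horizontal summary
        col = x.mean(dim=2, keepdim=True).expand_as(x)      # vertical summary
        attn = torch.sigmoid(self.fuse(torch.cat([g, row, col], dim=1)))
        return x * attn                                     # re-encoded features

class DAICNetSketch(nn.Module):
    """Scores candidate crops; FFA is realized here as multi-grid RoI pooling whose
    outputs are concatenated into one fine-grained descriptor per crop."""
    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # 1280-channel output
        self.ecc = ChannelCalibration(1280)
        self.caf = ContextualAttentionFusion(1280)
        self.scales = scales
        feat_dim = 1280 * sum(s * s for s in scales)
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True), nn.Linear(256, 1))

    def forward(self, images, rois):     # rois: (N, 5) = (batch_idx, x1, y1, x2, y2)
        f = self.caf(self.ecc(self.backbone(images)))
        pooled = [roi_align(f, rois, output_size=s).flatten(1)  # (N, 1280*s*s)
                  for s in self.scales]                         # FFA: multiscale cascade
        return self.scorer(torch.cat(pooled, dim=1)).squeeze(-1)  # one score per crop

net = DAICNetSketch()
imgs = torch.randn(1, 3, 224, 224)
rois = torch.tensor([[0., 0., 0., 5., 5.]])   # box given in feature-map coordinates
print(net(imgs, rois).shape)                  # torch.Size([1])
```

Note that all candidate crops share one calibrated feature map and only the cheap multi-grid pooling is repeated per crop, which is consistent with the near-zero extra running time attributed to FFA above.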
Finally, considering the particularity of score regression in image cropping, a multi-task loss function incorporating score regression, pairwise comparison, and correlation ranking is defined to train the proposed DAIC-Net. This multi-task loss can explicitly rank the aesthetics of every two different regions to model the relations between them. An NVIDIA GeForce GTX 1060 device is used to train and test the proposed DAIC-Net. Result The performance of our method is compared with six state-of-the-art methods on three public datasets, namely, the grid anchor based image cropping database (GAICD), the image cropping database (ICDB), and the Flickr cropping database (FCDB). The quantitative evaluation metrics on GAICD are the average Pearson correlation coefficient (PCC), the average Spearman's rank-order correlation coefficient (SRCC), the best return metrics (AccK/N), and the rank-weighted best return metrics (wAccK/N); higher is better for all of them. Intersection over union (IoU) and boundary displacement error (BDE) are adopted as evaluation metrics on the two other datasets. The GAICD dataset is split into 2 636 training images, 200 validation images, and 500 test images. ICDB and FCDB contain 950 and 348 test images, respectively, which are not used for training by any of the compared methods. Experimental results demonstrate the effectiveness of DAIC-Net compared with other state-of-the-art methods. Specifically, SRCC and PCC increase by 2.0% and 1.9%, respectively, and the best return metrics increase by up to 4.1% on GAICD. The proposed DAIC-Net also outperforms most of the other methods on ICDB and FCDB despite the very limited room for improvement there. Qualitative analysis and a user study are also provided for comparison; the results demonstrate that the proposed DAIC-Net generates better composed views than the compared methods. Conclusion In this paper, a new automatic image cropping method with fine-grained feature aggregation and contextual attention is presented. The ablation study demonstrates the effectiveness of each module in DAIC-Net, and further experiments show that DAIC-Net obtains better results than other methods on the GAICD dataset. Comparison experiments on the ICDB and FCDB datasets verify the generalization ability of DAIC-Net.
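Returning to the multi-task objective described in the Method above, the sketch below combines the three named terms (score regression, pairwise comparison, and correlation ranking). The margin, the term weights, and the specific surrogate for rank correlation are our own illustrative choices, not the paper's exact formulation.

```python
# Illustrative sketch of a three-term cropping objective: score regression,
# pairwise comparison, and correlation ranking; margin/weights are assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(pred, target, margin=0.1, w_pair=1.0, w_rank=1.0):
    """pred, target: (N,) predicted and annotated scores for N crops of one image."""
    # 1) Score regression: smooth L1 between predicted and annotated scores.
    l_reg = F.smooth_l1_loss(pred, target)

    # 2) Pairwise comparison: hinge loss over ordered pairs so that a crop with a
    #    higher annotated score is predicted higher by at least `margin`.
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # dp[i, j] = pred[j] - pred[i]
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # dt[i, j] = target[j] - target[i]
    mask = dt > 0
    l_pair = F.relu(margin - dp[mask]).mean() if mask.any() else pred.new_zeros(())

    # 3) Correlation ranking: maximize a differentiable Pearson correlation between
    #    predictions and annotations as a surrogate for rank agreement.
    pc, tc = pred - pred.mean(), target - target.mean()
    corr = (pc * tc).sum() / (pc.norm() * tc.norm() + 1e-8)
    l_rank = 1.0 - corr

    return l_reg + w_pair * l_pair + w_rank * l_rank

scores_pred = torch.randn(16, requires_grad=True)   # scores for 16 candidate crops
scores_gt = torch.randn(16)
multitask_loss(scores_pred, scores_gt).backward()
```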
Keywords
automatic image cropping; image aesthetics assessment (IAA); region of interest (RoI); spatial pyramid pooling (SPP); attention mechanism; multi-task learning