A Survey of Visual-Language Multimodal Pre-training
Abstract
In multimodal machine learning, manually annotated data produced for a specific task is expensive, and models are difficult to transfer across tasks, so substantial retraining is required; training multiple tasks is therefore inefficient and wasteful of resources. Pre-trained models are trained on large-scale data, typically in a self-supervised manner, to extract and fuse the information of the different modalities in a dataset and to learn the general knowledge representations it contains, which then serve a wide range of related downstream visual-language multimodal tasks; this approach has gradually become mainstream across the fields of artificial intelligence. Relying on the large-scale image-text pairs and video data obtainable from the Internet, together with advances in pre-training methods such as self-supervised learning, visual-language multimodal pre-trained models have largely broken down the barriers between different visual-language tasks, improving the efficiency of multi-task training and boosting the performance of specific tasks. This paper surveys progress in visual-language multimodal pre-training. We first summarize the common pre-training datasets and pre-training methods, then give a systematic overview of both the latest and the classical methods, grouped by input source into two categories, image-text pre-trained models and video-text multimodal models; we describe the commonalities and differences between the methods and compile the experimental results of each model on specific downstream tasks. Finally, we summarize the challenges facing visual-language pre-training and its future development trends.
Keywords
multimodal machine learning; visual language multimodality; pre-training; self-supervised learning; image-text pre-training; video-text pre-training
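As a minimal, purely illustrative sketch of the self-supervised image-text objective referred to above (a CLIP-style symmetric contrastive loss, which many of the surveyed image-text models build on), the PyTorch fragment below shows how paired images and captions supervise each other without manual labels. The class name, encoders, and feature dimensions are assumptions for exposition, not components of any specific surveyed model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImageTextContrastive(nn.Module):
        # Symmetric InfoNCE over a batch of paired image/text embeddings.
        def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
            super().__init__()
            self.image_encoder = image_encoder   # maps images   -> (B, D)
            self.text_encoder = text_encoder     # maps captions -> (B, D)
            # learnable temperature, initialized to log(1/0.07) as in CLIP
            self.logit_scale = nn.Parameter(torch.tensor(2.6593))

        def forward(self, images, texts):
            img = F.normalize(self.image_encoder(images), dim=-1)
            txt = F.normalize(self.text_encoder(texts), dim=-1)
            logits = self.logit_scale.exp() * img @ txt.t()  # (B, B) similarities
            targets = torch.arange(logits.size(0), device=logits.device)
            # pull matched (image, caption) pairs together, push the rest apart
            return (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

    # Toy usage with linear "encoders" over pre-extracted features:
    model = ImageTextContrastive(nn.Linear(2048, 256), nn.Linear(768, 256))
    loss = model(torch.randn(8, 2048), torch.randn(8, 768))
    loss.backward()

Because every image in a batch is contrasted against every caption, web-scale image-text pairs supply the supervision signal by themselves, which is what makes the large-scale Internet data discussed above usable for pre-training.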
Comprehensive review of visual-language-oriented multimodal pre-training methods
Zhang Haoyu1, Wang Tianbao1, Li Mengze1, Zhao Zhou1, Pu Shiliang2, Wu Fei1
(1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China; 2. Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China)

Abstract
Multimodal machine learning is hampered by the high cost of labor-intensive manual annotation and by the difficulty of migrating data and models across tasks, which forces extensive retraining and results in low efficiency and wasted resources when multiple tasks are trained. To learn internal knowledge representations and serve a wide range of related downstream visual-language multimodal tasks, pre-trained models are trained on large-scale data, chiefly through self-supervision, extracting and integrating the information of the dataset's multiple modalities. Because human labels are expensive, research on pre-trained models has focused on cheaply labeled data: the model is first pre-trained on cheap labels and then fine-tuned with a small amount of costly human annotation. Since cheaply labeled data carries less information and more noise, pre-training usually requires large-scale data and long training schedules. A model pre-trained on large-scale unlabeled data not only transfers more general knowledge to the target task but also provides a better parameter initialization. Future multimodal applications hold promise in scenarios such as learning and demonstration, sentiment analysis, and task-oriented large-scale human-computer interaction. Multimodal pre-trained models can serve as a pathway along which weak artificial intelligence develops from local to general capabilities, making it possible to transfer multi-task learning results to unsupervised multi-domain data automatically and quickly. Plain-text pre-trained models cover only a fraction of the data available online, while richer multimodal data have not been fully utilized and learned; multimodal contexts benefit information gathering, context perception, knowledge learning, and demonstration. Toward general-purpose artificial intelligence models, pre-training has therefore been developing from single-modal to multimodal, and since 2019 the rapid growth of pre-trained models has extended to the field of visual-textual interaction. Thanks to the large-scale image-text pairs and video data available online and to the progress of pre-training techniques such as self-supervised learning, visual-language multimodal pre-trained models have bridged the gaps between different visual-language tasks, optimizing multi-task training and improving the performance of specific tasks. Current multimodal research still faces the challenges of organizing intelligent systems, perceiving multimodal information, and bridging the semantic gap. We review existing pre-training datasets and pre-training methods and give a systematic overview of both the latest and the classical approaches; the commonalities and differences between the methods are critically analyzed, and the experimental results of each model on specific downstream tasks are summarized. Finally, the challenges and future research directions of visual-language pre-training are outlined.
Keywords
multimodal machine learning; visual language multimodality; pre-training; self-supervised learning; image-text pre-training; video-text pre-training
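To make the pre-train-then-fine-tune paradigm described in the abstract concrete, the sketch below transfers a pre-trained encoder to a downstream classification task. It is an assumption-laden illustration: pretrained_image_encoder is a hypothetical stand-in for a real backbone (for instance, the image tower from the sketch above), and the head size and learning rates are arbitrary example choices, not recommendations from the survey.

    import torch
    import torch.nn as nn

    # Stand-in for a backbone whose weights come from pre-training (hypothetical).
    pretrained_image_encoder = nn.Linear(2048, 256)

    class DownstreamClassifier(nn.Module):
        # Pre-trained encoder plus a lightweight, randomly initialized task head.
        def __init__(self, encoder, dim, num_classes):
            super().__init__()
            self.encoder = encoder
            self.head = nn.Linear(dim, num_classes)

        def forward(self, x):
            return self.head(self.encoder(x))

    model = DownstreamClassifier(pretrained_image_encoder, dim=256, num_classes=10)
    # Fine-tune with a small learning rate on the encoder so its weights stay
    # close to the good initial point learned during pre-training; the fresh
    # head is trained with a larger rate. Freezing the encoder entirely would
    # give the "linear probing" variant instead.
    optimizer = torch.optim.AdamW([
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.head.parameters(), "lr": 1e-3},
    ])
    criterion = nn.CrossEntropyLoss()
    logits = model(torch.randn(8, 2048))             # a small labeled batch
    loss = criterion(logits, torch.randint(0, 10, (8,)))
    loss.backward()
    optimizer.step()

This division of labor, cheap web-scale data for the encoder and a small amount of expensive human annotation for the task head, is precisely the efficiency argument the abstract makes for pre-training.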