A Survey of Human Face Reenactment Methods

Liu Jin1,2, Chen Peng1,2, Wang Xi1, Fu Xiaomeng1,2, Dai Jiao1, Han Jizhong1 (1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
With the development of image generation research in computer vision, face reenactment has attracted wide attention. This technique aims to synthesize new talking-head images or videos from the identity of a source face image together with the mouth shape, expression, and pose provided by driving information. Face reenactment has a broad range of applications, such as virtual anchor generation, online teaching, game avatar customization, lip synchronization for dubbed videos, and video conference compression. Although the technique is young, a large body of research has emerged. However, there are currently almost no surveys, domestic or international, that focus specifically on face reenactment; overviews of the field appear only as DeepFake-related content within surveys on DeepFake detection. In view of this, this paper reviews and summarizes the development of face reenactment. Starting from the face reenactment model, we describe the open problems, the taxonomy of models, and the representation of driving facial features; we list and introduce the datasets commonly used to train face reenactment models and the metrics used to evaluate them; we summarize, analyze, and compare recent research; and we conclude with the evolution trends, current challenges, future directions, potential harms, and countermeasures of face reenactment.
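The generation process described above (identity from the source face, motion such as mouth shape, expression, and pose from the driving signal, combined by a generator) can be sketched schematically. The encoder and generator names below are hypothetical placeholders for illustration, not components of any specific surveyed model:

```python
import numpy as np

def reenact(source_face, driving_frame, id_encoder, motion_encoder, generator):
    """Schematic face-reenactment forward pass (hypothetical placeholders).

    id_encoder     : extracts identity/background features from the source face
    motion_encoder : extracts motion features (mouth shape, expression, pose)
                     from the driving frame, e.g. landmarks or 3DMM coefficients
    generator      : synthesizes the reenacted face from both feature sets
    """
    id_feat = id_encoder(source_face)
    motion_feat = motion_encoder(driving_frame)
    return generator(id_feat, motion_feat)
```

Concrete methods differ mainly in what `motion_encoder` produces (landmarks, 3DMM coefficients, a predicted motion field, or a disentangled latent code) and in how `generator` fuses the two feature sets.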
Keywords
Critical review of human face reenactment methods

Liu Jin1,2, Chen Peng1,2, Wang Xi1, Fu Xiaomeng1,2, Dai Jiao1, Han Jizhong1(1.Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;2.School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
Image and video data have been increasing dramatically with the rise of artificial intelligence (AI)-generated content, and face reenactment has developed on the basis of generated facial images and videos. Given source face information and driving motion information, face reenactment aims to generate a reenacted face image or video that follows the driving motion in terms of expression, mouth shape, eye gaze, and head pose while preserving the identity of the source face. Face reenactment methods can generate a wide variety of feature- and motion-conditioned face videos, are applicable with few constraints, and have become a research focus in the field of face generation. However, almost no existing reviews are dedicated to face reenactment. In view of this, we present a critical review of the development of face reenactment beyond DeepFake detection contexts. Our review covers nine perspectives: 1) the universal pipeline of face reenactment models; 2) facial information representation; 3) key challenges and barriers; 4) the classification of related methods; 5) an introduction to representative face reenactment methods; 6) evaluation metrics; 7) commonly used datasets; 8) practical applications; and 9) conclusions and future prospects.

In the universal pipeline, identity and background information are extracted from the source face, motion features are extracted from the driving information, and the two are combined to generate the reenacted face. Latent codes, 3D morphable face model (3DMM) coefficients, facial landmarks, and facial action units all serve as motion features. Several challenges recur in related research. The identity mismatch problem refers to the inability of a face reenactment model to preserve the identity of the source face. Temporal or background inconsistency means that the generated face videos suffer from cross-frame jitter or obvious artifacts between the facial contour and the background. Identity constraints originate from the model design and training procedure, such that a model can only reenact the specific persons seen in its training data.

According to the modality of the driving information, face reenactment methods fall into image-driven and cross-modality-driven methods. Based on how the driving information is represented, image-driven methods can be divided into four categories: facial landmarks, 3DMM coefficients, motion field prediction, and feature decoupling. The landmark-based and 3DMM-based methods can be further subdivided by whether the model can generate subjects unseen during training, i.e., whether identity is restricted. We illustrate each category, its model flowchart, and follow-up improvements in detail. As for cross-modality-driven methods, text-driven and audio-driven methods are introduced. These are ill-posed problems, because the facial motion corresponding to a given audio or text may have multiple valid solutions; for instance, different head poses or motions of the same identity can produce essentially the same audio. Cross-modality face reenactment is therefore challenging and has attracted increasing attention, and it is also introduced comprehensively. Text-driven methods have progressed through three stages according to the driving content: methods requiring extra audio, restricted text-driven methods, and arbitrary text-driven methods. Audio-driven methods can be further divided into two categories depending on whether additional driving information, such as eye-blinking labels or head pose videos, is required to provide auxiliary cues during generation.

Moreover, comparative experiments are conducted to evaluate the performance of various methods, considering both image quality and facial motion accuracy. Peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), cumulative probability of blur detection (CPBD), Fréchet inception distance (FID), and other traditional image generation metrics are adopted. To judge facial motion accuracy, landmark differences, action unit detection analysis, and pose differences are utilized; in most cases, the landmarks, the presence of action units, and the Euler angles are all predicted by corresponding pre-trained models. For audio-driven methods, the degree of lip synchronization is also estimated with the aid of a pre-trained evaluation model. Apart from objective evaluations, subjective metrics such as user studies are applied as well. Furthermore, the datasets commonly used in face reenactment are illustrated; each contains face images or videos with various expressions, view angles, and illumination conditions, or the corresponding talking audio. The videos are usually collected from interviews, news broadcasts, or actor recordings. To reflect different levels of difficulty, the image and video datasets cover indoor and outdoor scenarios: indoor scenarios typically involve white or grey walls, while outdoor scenarios denote actual moving scenes or news studios.

In the conclusion, practical applications and potential threats are critically discussed. Face reenactment can contribute to the entertainment industry, e.g., movie dubbing, video production, game character avatars, and old photo colorization; it can also be used for video conference compression, online customer service, virtual anchors, and 3D digital humans. However, misuse of face reenactment by lawbreakers for defamation, spreading false information, or creating harmful DeepFake media content can damage social stability and cause panic on social media; it is therefore important to consider the ethical issues of face reenactment. Furthermore, the development status of each category and the corresponding future directions are presented. Overall, model optimization and robustness across generation scenarios are the two main concerns. Optimization focuses on alleviating data dependence, feature disentanglement, real-time inference, and improving evaluation metrics. Robustness improvement denotes generating high-quality reenacted faces under conditions such as face occlusion, outdoor scenes, large head poses, and complicated illumination. In a word, this critical review covers the universal pipeline of face reenactment models, the main challenges, the classification and detailed explanation of each category of methods, the evaluation metrics and commonly used datasets, and current research analysis and prospects, and is intended to serve as an introduction and guide to face reenactment research.
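As a concrete illustration of the image-quality metrics mentioned above (not taken from the surveyed paper), PSNR and a simplified SSIM can be sketched in a few lines of NumPy. Note that `ssim_global` computes SSIM over a single global window; practical implementations average SSIM over local (often Gaussian-weighted) windows:

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Simplified SSIM computed over one global window (for illustration)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2  # stability constants
    x, y = ref.astype(np.float64), gen.astype(np.float64)
    mx, my = x.mean(), y.mean()          # luminance terms
    vx, vy = x.var(), y.var()            # contrast terms
    cov = ((x - mx) * (y - my)).mean()   # structure term
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

A perfect reconstruction yields infinite PSNR and SSIM of 1.0; lower values indicate stronger degradation of the generated frame relative to the ground truth.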
Keywords
