Advances in Digital Character Stylization, Multimodal Animation, and Interaction
Ye Pan, Shaoxu Li, Shuai Tan, Junjie Wei, Guangtao Zhai, Xiaokang Yang (Shanghai Jiao Tong University)
Abstract
Stylized digital humans are a rapidly developing topic in computer graphics, visual arts, and game design. In recent years, techniques for designing and producing digital characters have advanced markedly, enabling more lifelike appearance and behavior as well as better adaptation to diverse artistic styles and contexts. Centered on the stylized-digital-human task, this paper systematically surveys the state of development, frontier dynamics, and open problems in three core research directions: stylized generation of digital humans, multimodal driving, and user interaction. For stylized generation, methods are categorized by the two 3D representations of digital humans, explicit and implicit 3D models. Explicit 3D digital-human stylization is analyzed mainly through optimization-based methods, methods based on generative adversarial networks, and engine-based methods; implicit 3D digital-human stylization is reviewed through general implicit scene stylization methods and face-specific implicit stylization. For digital-human driving, methods are reviewed by driving source: audio-driven, text-driven, and video-driven. By implementation algorithm, methods are reviewed as intermediate-representation-based and encoder-decoder-based; the intermediate-representation methods are further divided, according to the representation used, into keypoint-based, 3D-face-based, and optical-flow-based approaches. For user interaction with digital humans, voice interaction is currently the mainstream mode; the voice-interaction module is reviewed from the perspectives of automatic speech recognition and text-to-speech synthesis, and the dialogue-system module from natural language understanding and natural language generation. On this basis, future trends in stylized-digital-human research are discussed, providing a reference for subsequent work.
Keywords
stylization, digital characters, facial animation, human-computer interaction, 3D modeling, deep learning, neural networks
Advances in Digital Character Stylization, Multimodal Animation, and Interaction
Ye Pan, Shaoxu Li, Shuai Tan, Junjie Wei, Guangtao Zhai, Xiaokang Yang (Shanghai Jiao Tong University)
Abstract
Stylized digital characters have emerged as a fundamental force in reshaping the landscape of computer graphics, visual arts, and game design. Their unparalleled ability to mimic human appearances and behaviors, coupled with their flexibility in adapting to a wide array of artistic styles and narrative frameworks, has underscored their growing importance in crafting immersive and engaging digital experiences. This survey examines the complex world of stylized digital humans, charting their current development status, identifying the latest trends, and addressing the pressing challenges that lie ahead in three foundational research domains: creation of stylized digital humans, multimodal driving mechanisms, and user interaction modalities. The first domain, creation of stylized digital humans, examines the methodologies employed in generating lifelike yet stylistically diverse characters that can seamlessly integrate into various digital environments. From advancements in 3D modeling and texturing to the integration of artificial intelligence for dynamic character development, this section provides a thorough analysis of the tools and technologies that are pushing the boundaries of what digital characters can achieve. In the realm of multimodal driving mechanisms, the paper investigates the evolving techniques for animating and controlling digital humans through a range of inputs such as audio, text, and video. This section examines how these mechanisms not only enhance the realism of character interactions but also open up new avenues for creators to involve users in interactive narratives in more meaningful ways. Lastly, the discussion on user interaction modalities explores the various ways in which end users can engage with and influence the behavior of digital humans.
From immersive virtual and augmented reality experiences to interactive web and mobile platforms, this segment evaluates the effectiveness of different modalities in creating a two-way interaction that enriches the user's experience and deepens their connection to the digital characters.
At the heart of this exploration lies the creation of stylized digital humans, a field that has witnessed remarkable progress in recent years. The generation of these characters can be broadly classified into two categories: explicit 3D models and implicit 3D models. Explicit 3D digital human stylization encompasses a range of methodologies, including optimization-based approaches that meticulously refine digital meshes to conform to specific stylistic attributes. These techniques often involve iterative processes that adjust geometric details, textures, and lighting to achieve the desired aesthetic. Generative adversarial networks (GANs), as a cornerstone of deep learning, have revolutionized this landscape by enabling the automatic generation of novel stylized forms that capture intricate nuances of various artistic styles. Furthermore, engine-based methods harness the power of advanced rendering engines to apply artistic filters and effects in real-time, offering unparalleled flexibility and control over the final visual output. Implicit 3D digital human stylization, on the other hand, draws inspiration from the realm of implicit scene stylization, particularly through the lens of neural implicit representations. These approaches offer a more holistic and flexible way to represent and manipulate 3D geometry and appearance, enabling stylization that transcends traditional mesh-based limitations. Within this framework, facial stylization holds a special place, requiring a profound understanding of facial anatomy, expression dynamics, and cultural nuances. Specialized methods have been developed to capture and manipulate facial features in a nuanced and artistic manner, fostering a level of realism and emotional expressiveness that is crucial for believable digital humans.
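The optimization-based explicit stylization described above can be illustrated with a minimal sketch. The loss below is hypothetical and purely for illustration: mesh vertices are pulled by gradient descent toward an exaggerated "caricature" target while a regularizer keeps them near the original geometry; real methods optimize far richer geometric, texture, and perceptual style terms.

```python
import numpy as np

def stylize_vertices(verts, target, style_weight=1.0, reg_weight=0.1,
                     lr=0.1, steps=200):
    """Gradient descent on a toy stylization objective:
       style_weight * ||v - target||^2  +  reg_weight * ||v - verts||^2."""
    v = verts.copy()
    for _ in range(steps):
        grad = 2 * style_weight * (v - target) + 2 * reg_weight * (v - verts)
        v -= lr * grad
    return v

# Example: exaggerate a 2D face outline by scaling it away from its centroid.
verts = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
centroid = verts.mean(axis=0)
target = centroid + 1.5 * (verts - centroid)   # 1.5x exaggerated proportions
stylized = stylize_vertices(verts, target)
```

Because the objective is quadratic, the iterates converge to the closed-form blend (style_weight * target + reg_weight * verts) / (style_weight + reg_weight), which makes the role of the regularizer explicit: it trades stylistic exaggeration against fidelity to the source mesh.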
Animating and controlling the behavior of stylized digital humans necessitates the use of diverse driving signals, which serve as the lifeblood of these virtual beings. This paper examines three primary sources of these signals: audio drivers, text drivers, and video drivers. Audio drivers leverage speech recognition and prosody analysis to synchronize digital human movements with spoken language, enabling characters to lip-sync and gesture in a natural and expressive manner. Text drivers, by contrast, rely on natural language processing (NLP) techniques to interpret textual commands or prompts and convert them into coherent actions, allowing for a more directive form of control. Video drivers, perhaps the most advanced in terms of realism, employ computer vision algorithms to track and mimic the movements of real-world actors, providing a seamless bridge between the virtual and the physical worlds. Supporting these drivers are sophisticated implementation algorithms, most of which rely either on intermediate representations or on encoder-decoder structures. Among the intermediate-representation methods, keypoint-based approaches play a pivotal role in capturing and transferring motion, allowing for the precise replication of movements across different characters. 3D-face-based approaches focus on facial animation, utilizing detailed facial models and advanced animation techniques to achieve highly realistic expressions and emotions. Optical-flow-based techniques offer a dense, holistic approach to motion estimation and synthesis, capturing and reproducing complex motion patterns across the entire digital human body.
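The keypoint-based transfer idea above can be sketched in a few lines. This is a simplified, relative-motion formulation (an assumption, not any specific paper's method): the displacement of each driving keypoint with respect to a reference frame is replayed on the target character's own reference keypoints, so the target keeps its stylized proportions while inheriting the motion.

```python
import numpy as np

def transfer_motion(src_ref, src_cur, tgt_ref, scale=1.0):
    """Relative keypoint motion transfer: apply the driving displacement
    (current minus reference frame) to the target's reference keypoints."""
    displacement = src_cur - src_ref        # motion observed in the driving video
    return tgt_ref + scale * displacement   # same motion on the target layout

src_ref = np.array([[0.0, 0.0], [1.0, 0.0]])   # e.g. two mouth corners at rest
src_cur = np.array([[0.0, 0.2], [1.0, 0.2]])   # mouth moves up in the video
tgt_ref = np.array([[0.0, 0.0], [2.0, 0.0]])   # wider, stylized mouth
tgt_cur = transfer_motion(src_ref, src_cur, tgt_ref)
# tgt_cur -> [[0.0, 0.2], [2.0, 0.2]]: the target keeps its width but moves
```

Transferring displacements rather than absolute positions is what lets a single driving video animate characters whose keypoint layouts differ substantially from the actor's.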
The true magic of stylized digital humans lies in their ability to engage with users in meaningful and natural interactions. Voice interaction, currently the mainstream mode of communication, relies heavily on automatic speech recognition (ASR) for accurate speech-to-text conversion and text-to-speech synthesis (TTS) for generating natural-sounding synthetic speech. The dialogue system module, a cornerstone of virtual human interaction, emphasizes the importance of natural language understanding (NLU) for interpreting user inputs and natural language generation (NLG) for crafting appropriate responses. When these capabilities are seamlessly integrated, stylized digital humans are capable of engaging in fluid and contextually relevant conversations with users, fostering a sense of intimacy and connection.
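The ASR-NLU-NLG-TTS loop described above can be made concrete with a stub pipeline. Every function below is a placeholder standing in for a real model; the names, the byte-string "audio", and the rule-based intent logic are all illustrative assumptions, not a real API.

```python
def asr(audio: bytes) -> str:
    """Automatic speech recognition: audio -> text (stub: bytes are the text)."""
    return audio.decode("utf-8")

def nlu(text: str) -> dict:
    """Natural language understanding: text -> intent frame (rule-based stub)."""
    intent = "greet" if "hello" in text.lower() else "unknown"
    return {"intent": intent, "text": text}

def nlg(frame: dict) -> str:
    """Natural language generation: dialogue state -> response text."""
    if frame["intent"] == "greet":
        return "Hello! How can I help you?"
    return "Sorry, could you rephrase that?"

def tts(text: str) -> bytes:
    """Text-to-speech synthesis: text -> audio (stub: text as bytes)."""
    return text.encode("utf-8")

def interact(audio_in: bytes) -> bytes:
    """One full voice-interaction round trip driving the digital human."""
    return tts(nlg(nlu(asr(audio_in))))

reply = interact(b"Hello there")  # -> b"Hello! How can I help you?"
```

In a deployed system each stub would be replaced by a trained model, and the NLG output would additionally drive the audio-driven animation module so that the character's lips and gestures match the synthesized speech.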
Looking ahead, the study of stylized digital characters promises to continue its ascendancy, fueled by advancements in deep learning, computer vision, and NLP. Future research may delve into integrating multiple modalities for richer and more nuanced interactions, pushing the boundaries of what is possible in virtual human communication. Innovative stylization techniques that bridge the gap between reality and fiction will also be explored, enabling the creation of digital humans that are both fantastical and relatable. Moreover, the development of intelligent agents capable of autonomous creativity and learning will revolutionize the way stylized digital humans can contribute to various industries, including entertainment, education, healthcare, and beyond. As technology continues to evolve, stylized digital humans will undoubtedly play an increasingly significant role in shaping how we engage with digital content and each other, ushering in a new era of digital creativity and expression. This paper serves as a valuable resource for researchers and practitioners alike, offering a comprehensive overview of the current state of the art and guiding the way forward in this dynamic and exciting field.
Keywords
stylization, digital characters, facial animation, human-computer interaction, 3D modeling, deep learning, neural networks