[INTERSPEECH'24] Two papers have been accepted


Two papers have been accepted to INTERSPEECH 2024.

Title: Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert
Authors: Han EunGi* (POSTECH), Oh Hyun-Bin* (POSTECH), Kim Sung-Bin (POSTECH), Corentin Nivelet Etcheberry (POSTECH), Suekyeong Nam (KRAFTON), Janghoon Ju (KRAFTON), Tae-Hyun Oh (POSTECH)
* denotes equal contribution
Speech-driven 3D facial animation has recently garnered attention due to its cost-effectiveness in multimedia production. However, most current advances overlook the intelligibility of lip movements, limiting the realism of facial expressions. In this paper, we introduce a speech-driven 3D facial animation method that generates accurate lip movements, proposing an audio-visual multimodal perceptual loss. This loss guides the training of speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts. To compute the audio-visual perceptual loss, we devise an audio-visual lip reading expert that leverages prior knowledge of the correlations between speech and lip motions. We validate the effectiveness of our approach through extensive experiments, showing noticeable improvements in lip synchronization and lip readability. Code is available at https://3d-talking-head-avguide.github.io/.
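As a rough illustration of how a perceptual loss from a frozen lip reading expert might be wired up (the feature shapes and the cosine-distance formulation here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def audio_visual_perceptual_loss(lip_feats, speech_feats):
    """Hypothetical perceptual loss: penalize misalignment between
    per-frame lip-motion features and speech features, both assumed
    to come from a frozen audio-visual lip reading expert ([T, D])."""
    # L2-normalize each frame's feature vector
    v = lip_feats / np.linalg.norm(lip_feats, axis=1, keepdims=True)
    a = speech_feats / np.linalg.norm(speech_feats, axis=1, keepdims=True)
    # 1 - cosine similarity per frame, averaged over the sequence
    return float(np.mean(1.0 - np.sum(v * a, axis=1)))

x = np.random.default_rng(0).standard_normal((10, 64))
print(audio_visual_perceptual_loss(x, x))  # 0.0 for perfectly aligned features
```

In a real training loop this scalar would be added to the animator's reconstruction loss while the expert's weights stay frozen, so gradients only shape the generated lip motions.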
Title: MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset
Authors: Kim Sung-Bin* (POSTECH), Lee Chae-Yeon* (POSTECH), Gihun Son* (POSTECH), Oh Hyun-Bin (POSTECH), Janghoon Ju (KRAFTON), Suekyeong Nam (KRAFTON), Tae-Hyun Oh (POSTECH)
* denotes equal contribution
Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 200 hours of talking videos in 20 languages. Using this dataset, we present a baseline model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance.
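A minimal sketch of language-specific style conditioning, assuming a learned lookup table of one style vector per language that is broadcast over the audio frames (the dimensions and the concatenation scheme are assumptions for illustration, not the paper's architecture):

```python
import numpy as np

NUM_LANGUAGES, STYLE_DIM, AUDIO_DIM = 20, 16, 64

# Hypothetical learned table: one style embedding per language
rng = np.random.default_rng(1)
style_table = rng.standard_normal((NUM_LANGUAGES, STYLE_DIM))

def condition_on_language(audio_feats, lang_id):
    """Append the language's style embedding to every audio frame so a
    downstream animator can specialize mouth shapes per language.
    audio_feats: [T, AUDIO_DIM] -> returns [T, AUDIO_DIM + STYLE_DIM]."""
    num_frames = audio_feats.shape[0]
    style = np.tile(style_table[lang_id], (num_frames, 1))
    return np.concatenate([audio_feats, style], axis=1)

feats = rng.standard_normal((50, AUDIO_DIM))
print(condition_on_language(feats, lang_id=3).shape)  # (50, 80)
```

At inference time, swapping `lang_id` while keeping the same audio would let the model render language-appropriate articulation styles.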