Carnegie Mellon University

File(s) under embargo until file(s) become available

Talking us into the Metaverse: Towards Realistic Streaming Speech-to-Face Animation

posted on 2024-04-19, 17:49 authored by Salvador Medina

Speech-to-face animation aims to create a realistic visual representation of a person’s face based on their voice. The main challenges in developing realistic facial animation from a speech signal are accurately capturing and reproducing the face’s complex motions, including the tongue’s motion. In this work, we use a combination of speech-pathology study techniques and machine learning to analyze a person’s speech patterns and map them onto a digital avatar. The ultimate objective of our research is to animate realistic avatars in real time. To achieve this goal, we introduce a large-scale dataset that comprises 2.55 hours of corresponding speech and motion data, with the tongue, lips, and jaw captured using electromagnetic articulography and the facial motion captured through stereo video. As an initial step, we conduct an exploratory analysis of different deep-learning-based methods for accurate and generalizable speech-to-tongue animation. We evaluate several encoder-decoder network architectures and audio features ranging from traditional to self-supervised audio representations. The best model achieves a temporal mean error of 1.77 mm when predicting the tongue, lips, and jaw motion and delivers realistic animations on singing audio despite being trained only on neutral speech from a single actor. Although adept at predicting tongue movement, this approach was limited in facial animation, prompting the evolution towards the IMFT’23 dataset, which captures intricate facial motion by pairing 2.28 hours of audio and video with 3D landmarks of the face, lips, jaw, and tongue. Our proposed Phonetically Informed Speech-Animation Network (PhISANet) produces animations with a sub-millimeter mean vertex error. It does this by incorporating WavLM feature encodings and by pioneering the use of a Connectionist Temporal Classifier (CTC) through multi-task learning in the speech-to-animation field of study.
PhISANet generalizes across voices of different ages, genders, and languages. Perceptual user studies confirm that a CTC offers superior animation accuracy and realism, with perceptible improvements in tongue and lip animations. Furthermore, to ensure animation quality, we advocate using visual speech recognition networks, specifically the AV-HuBERT model, as a benchmark. This research pushes the boundaries of realistic speech-to-animation, emphasizing the promise of real-time applications and setting a precedent in the speech-to-animation field of study.
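
The abstract reports errors in millimeters but does not define the metric. As a rough illustration only, the sketch below assumes that a mean vertex error is the Euclidean distance between predicted and ground-truth 3D landmarks, averaged over every vertex of every frame. The function name `mean_vertex_error` and the nested-list data layout are hypothetical and are not taken from the dissertation.

```python
import math


def mean_vertex_error(predicted, target):
    """Average Euclidean distance over all vertices of all frames.

    Assumed (hypothetical) layout: a list of frames, each frame a list of
    (x, y, z) tuples. The result is in the same unit as the input, e.g. mm.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("frames must have the same number of vertices")
        for p, t in zip(pred_frame, true_frame):
            total += math.dist(p, t)  # distance for one vertex in one frame
            count += 1
    if count == 0:
        raise ValueError("no vertices to compare")
    return total / count


# Toy data: two frames with two landmarks each, coordinates in mm.
target = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
predicted = [
    [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0)],
]
error_mm = mean_vertex_error(predicted, target)
print(f"mean vertex error: {error_mm:.2f} mm")
```

Under this assumed definition, a sub-millimeter result means the predicted landmarks lie, on average, less than 1 mm from their captured positions.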




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Alexander Hauptmann
