Carnegie Mellon University
Wen_cmu_0041E_10802.pdf (8.26 MB)

Reconstruction of Human Faces from Voice

Posted on 2023-01-23, 21:07, authored by Yandong Wen

Voices and faces play pivotal roles in our social interactions. Despite their different physical manifestations, voices and faces carry highly similar types of information, including linguistic information (phonemes for voices and visemes for faces), affective state, and identity characteristics (weight, gender, age, etc.). For this reason, the associations between voices and faces have attracted significant research interest in psychology, cognitive science, artificial intelligence, and many other fields.

In this thesis, we attempt to explore the identity associations between voices and faces by developing computational mechanisms for reconstructing faces from voices. More specifically, the task is designed to answer the question: Given an unheard audio clip spoken by an unseen person, can we algorithmically picture a face that has as many associations as possible with the speaker, in terms of identity? 

The link between voice and face has been established from many perspectives. Direct relationships include the effect of the underlying skeletal and articulatory structures of the face, and of the tissue covering them, all of which govern the shape, size, and acoustic properties of the vocal tract that produces the voice. Less directly, the same genetic, physical, and environmental influences that affect the development of the face also affect the voice. Given these demonstrable dependencies, it is reasonable to hypothesize that faces can be reconstructed from voice signals algorithmically. Our hypothesis is that if any facial parameter influences the speaker’s voice, its effects on the voice must be discoverable by a properly designed computational model.

This thesis approaches the goal of generating faces from voices in three stages. First, we consider the cross-modal matching problem: given a voice recording, one must select the speaker’s face from a gallery of face images. To this end, we propose disjoint mapping networks to learn representations of voices and faces in a shared space, such that their representations can be compared to one another. The matching results empirically demonstrate that faces can be disambiguated from voices. Second, we address the problem of reconstructing 2D face images from voices. We propose a simple but effective computational framework based on generative adversarial networks (GANs). The generated face images are visually plausible and have identity associations with the true speaker. Last, we investigate the problem of reconstructing 3D facial shapes from voices. We propose an anthropometry-guided framework that identifies which anthropometric measurements (AMs) are predictable from voice, and then reconstructs the 3D facial shapes from those predictable AMs. Compared to baseline methods, our results demonstrate notable improvements, especially in reconstructing the shapes of speakers’ noses.
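The cross-modal matching step described above can be illustrated with a minimal sketch. It assumes that voice and face embeddings in the shared space have already been computed by some trained model; the toy vectors, the function names, and the choice of cosine similarity as the comparison score are all illustrative assumptions, not details taken from the thesis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_voice_to_gallery(voice_embedding, face_gallery):
    """Return the index of the best-scoring gallery face and all scores.

    Hypothetical helper: assumes both modalities are already embedded
    in the same space, so they can be compared directly.
    """
    scores = [cosine_similarity(voice_embedding, face) for face in face_gallery]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy embeddings (made up for illustration only).
voice = [0.9, 0.1, 0.2]
gallery = [
    [0.1, 0.9, 0.0],  # face 0
    [0.8, 0.2, 0.3],  # face 1, closest in direction to the voice vector
    [0.0, 0.1, 1.0],  # face 2
]

best_index, scores = match_voice_to_gallery(voice, gallery)
print("selected face:", best_index)
```

In this sketch the gallery face whose embedding points in the direction most similar to the voice embedding is selected; any other similarity or distance measure could be substituted without changing the overall structure.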




Degree Type

  • Dissertation


Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

Rita Singh, Bhiksha Raj
