Speech2Face synthesizes an image of a person's face from a recording of their speech. We train it on 2 million video clips covering nearly 100,000 different people.
This work is an effort to better understand the capabilities of machine perception of the speech-face association. When we hear a voice on the radio or over the phone, we humans often build a mental model of how the speaker looks. Our work can be viewed as a machine replication of this mental model. It has been poorly understood how much facial information we humans can actually parse from a voice, and whether our inferences are correct or merely noisy bias; the faces reconstructed by Speech2Face can serve as a proxy for studying these questions.
We can imagine a range of applications, including for privacy-minded people who prefer to keep real photos of themselves off the internet or out of video calls.