Making the Heads Speak
So we made the likenesses of our heads using Metahuman and cloned our voices using ElevenLabs. The missing piece of the puzzle is getting the heads to ‘speak’ the voices. This process is called lip-syncing, and it is traditionally a time-consuming job done by skilled animators. It involves matching the sounds of the voice, called phonemes, to mouth shapes. In cartoons it’s often quite crude - South Park, for instance, simply alternates between open and closed mouth shapes. It has become more sophisticated in recent years with computer games, but it still requires skill and expensive software. We need a way to automate this process.
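To make the idea concrete, here is a toy sketch of the traditional approach: a lookup table from phonemes to mouth shapes (animators call these ‘visemes’). The phoneme symbols and viseme names below are illustrative inventions, not any particular standard, and real lip-syncing also handles timing and blending between shapes.

```python
# Toy phoneme-to-viseme lookup. The symbols and shape names are
# illustrative only - real systems use standardized phoneme sets
# and blend smoothly between many more mouth shapes.
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father" - wide open mouth
    "M":  "closed",     # lips pressed together
    "B":  "closed",
    "P":  "closed",
    "F":  "teeth-lip",  # lower lip against upper teeth
    "V":  "teeth-lip",
    "OW": "rounded",    # rounded lips, as in "go"
    "UW": "rounded",
}

def lipsync(phonemes):
    """Return one mouth shape per phoneme, defaulting to a neutral pose."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# The word "map" might be transcribed as the phonemes M, AA, P:
print(lipsync(["M", "AA", "P"]))  # ['closed', 'open', 'closed']
```

Doing this by hand for every line of dialogue is exactly the drudgery that a tool like Audio2Face automates, and it goes further by inferring the shapes directly from the audio rather than from a phoneme transcript.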
AI to the rescue again. Nvidia is one of the leading makers of the graphics cards used in computers for playing games and crypto mining, and also for AI training and inference (running AI models). It is in their interest to develop software for their graphics cards that gives producers and consumers new capabilities, and that is what they have done with an AI tool called Audio2Face. It’s free, it works with Unreal and Metahumans (among other software), and if you have ever tried to do lip-syncing by hand, it is nothing short of amazing.
Audio2Face - ‘mark’ being animated by an audio stream.
Audio2Face is not only able to lip-sync the face, but can also detect and add ‘emotion’ based on the intonation and tone of the audio file fed into it. It even makes the head move a little and blink when nothing is going on. Amazing! It’s free; you just need to register with Nvidia and download the rather large package. There are video tutorials available on how to hook it up to a Metahuman in the Unreal Engine.