Cloning our voices
As an artist / hacker who has been mucking about with computers since the 1970s, I always look first for an open source solution to any problem. Partially because it's free, but also because it's 'open', which means you have more control in customising the software, and you know you can create a long-term archive of it, whereas commercial software is a black box and companies come and go quite rapidly in the tech world.
There has been a lot in the news in recent years about "deep fake" videos, and yes, the software is out there to make these. But these tools often require a lot of computing resources, time, and trial and error to produce the videos you may have seen on the internet. We needed to produce tens of hours of cloned voice for this project, so they just aren't practical. Open source projects are usually hosted on a website called GitHub, which is easy to access and download from, but using these tools does require some hacker skills in being able to build software and fix problems.
There have been some remarkable advances in voice cloning research in the last few years from big companies like OpenAI and DeepMind, with impressive demonstrations, but the actual code and trained models haven't been made available because of concerns about deep fakes.
There are two strategies one can take to produce an artificial copy of a voice. The traditional way is to collect a large sample of the voice (at least an hour's worth) with a transcription and then train or fine-tune a neural network on it. The other, 'cloning', approach uses a much larger model that has been trained on a wide variety of voices in a more general way; this approach allows you to provide just a quite short sample of the voice you want to recreate.
There are some notable open source projects available though:
We have used Whisper for recognising speech and transcribing the conversations Svenja and I have had. It was released into the wild by OpenAI, and there are a bunch of apps and open source projects available that use it. It's pretty awesome in its accuracy compared to anything I have used in the past.
Bark, released into the wild by a company called Suno AI, is capable of very good voice cloning, but is a bit unpredictable and often takes several tries to get a good result.
Tortoise TTS is also quite good for cloning, but is very sloooow.
We considered another project called Piper, in which you train a voice. The trained voice sounds quite good, and it's very fast and consistent, but it lacks intonation and expressiveness; it sounds a bit flat and robotic.
For this artwork, as the voice is a big element, it's important that it is expressive, so in the end we decided to use a commercial service called ElevenLabs. There are quite a few other companies offering similar services, but ElevenLabs is currently considered to be the best and most consistent. It is quite expensive though, and we spent a large chunk of our software budget on generating the audio.
Below is an example of Svenja's voice trained for Piper, and with ElevenLabs voice cloning.