Don’t worry, this technology isn’t very convincing… er, yet
This technology, when perfected, will be ideal for generating fake audio clips of people saying things they never actually said. In the words of Red Dwarf’s Kryten: file that under ‘B’ for blackmail.
Chinese internet giant Baidu’s AI team is well known for its work on generating realistic-sounding speech from text scripts. Now, its latest research project, revealed this week, shows how a generative model can learn the characteristics of a person’s voice and recreate that sound to make the person say something else entirely.
In the first example here, the original clip features a woman’s voice saying: “the regional newspapers have outperformed national titles.” After her voice is cloned, she now appears to be saying: “the large items have to be put into containers for disposal”.
So, as you can hear, the results aren’t perfect. The best clips generated from the model are still pretty noisy and lower quality than the original speech. But the “neural cloning system” developed by the researchers manages to retain the British accent and sounds quite similar.
The researchers introduce two different approaches to building a neural cloning system: speaker adaptation and speaker encoding.
Speaker adaptation involves training a model on various speakers with different voices. The team used the LibriSpeech dataset, containing 2,484 speakers, to do this. The system learns to extract features from a person’s speech in order to mimic the subtle details of their pronunciation and rhythm.
Speaker encoding involves training a model to learn the particular voice embeddings of a speaker, and then reproducing audio samples with a separate system that has been trained on many speakers.
After training on LibriSpeech, the system is given up to ten audio samples of a speaker taken from another dataset, VCTK, which contains clips from 109 native English speakers with different accents. In other words, having been trained on voices from LibriSpeech, it has to copy the voices of speakers it has not heard before in VCTK.
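To make the speaker encoding idea a little more concrete, here is a minimal toy sketch in Python. It is not Baidu's model: the function names, the four hand-picked statistics standing in for a learned embedding, and the cap of ten clips are all assumptions made for illustration. A real system would use a trained neural encoder and feed the resulting embedding to a multi-speaker speech generator.

```python
import math


def toy_clip_embedding(clip):
    """Hypothetical stand-in for a trained speaker encoder.

    Summarises one clip (a list of float samples, at least two long)
    as a fixed-length vector of simple statistics.
    """
    n = len(clip)
    mean = sum(clip) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in clip) / n)
    peak = max(abs(x) for x in clip)
    # Fraction of adjacent sample pairs whose sign differs.
    zero_cross = sum(
        1 for a, b in zip(clip, clip[1:]) if (a < 0) != (b < 0)
    ) / (n - 1)
    return [mean, std, peak, zero_cross]


def speaker_embedding(clips, max_clips=10):
    """Average the per-clip embeddings of up to max_clips cloning samples."""
    used = clips[:max_clips]
    if not used:
        raise ValueError("need at least one clip")
    embeddings = [toy_clip_embedding(c) for c in used]
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def cosine_similarity(u, v):
    """Compare two embeddings; values near 1.0 suggest similar voices."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The point of the sketch is the shape of the pipeline: a handful of clips from an unseen speaker go in, one fixed-length vector comes out, and no retraining of the speech generator is needed.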
Sercan Arik, co-author of the paper and a research scientist at Baidu Research, explained to The Register that the speaker encoding method is much easier than the speaker adaptation technique to implement in real-life applications such as digital assistants.
“A speaker adaptation approach requires users to read a specific utterance from a given text, whereas speaker encoding works with random utterances. This means speaker adaptation may not be as easily deployed on user devices, as it has more challenges to scale up to many users. Instead, speaker encoding is much easier for deployment purposes – it can even be deployed on a smartphone – as it is fast and has low memory requirements.”
The idea that AI can be manipulated to spread false information is a real concern to many in the industry. The recent 100-page report on the ways machine learning can be used maliciously, written by a panel of experts, has fueled the debate on the future of fake news.
The latest research from Baidu shows that while it’s possible to generate fake speech, the current performance is not yet good enough to fool humans.
Arik said a more varied dataset that is of higher quality is one way to improve the end results. He stressed that there was “still some room for improvement in the voice cloning deep learning model itself”, and that the paper “does not claim production-quality results that are indistinguishable from human voices yet”.
But it’s not all bad news. The voice cloning technology can also be used for good-natured purposes.
“For example, a mom can easily configure an audiobook reader with her own voice to read bedtime stories for her kids when she is not available. However, as this technology improves and becomes widespread, we do see the need to implement precautions and measures to ensure this technology is not taken advantage of and used as intended,” he warned. ®