By Ilya Feige, Scott Stevenson and Alex Moore
Political disinformation is not new. But over the next few years, developments in artificial intelligence will make it possible to synthesise persuasive video and audio content in which public figures can be made to say anything.
Working with the BBC’s R&D team, we set out to show how easy it already is to synthesise a public figure’s voice (in this case that of Donald Trump) in very high fidelity, just by typing in text. The result is both a warning and a crucial first step towards systems that can detect fake content.
You can judge for yourself how convincing the results are in this BBC report.
Generating high-fidelity synthetic human speech directly from written text is challenging for two reasons. Firstly, written text, which is essentially a sequence drawn from a small set of characters, carries very different information from the vast spectrum of sound waves that constitute audible speech, and matching the two requires a highly sophisticated mapping.
Secondly, the human ear is sensitive to sound waves across a very wide range of frequencies, so generating human-quality speech requires the algorithm to correctly predict the sound wave many thousands of times per second. By comparison, video is typically shown at only around 30 frames per second.
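To give a sense of scale, here is a rough back-of-the-envelope sketch comparing the two rates. The 16 kHz sample rate is an assumption (a common choice for speech audio, not a figure from this project), and the 30 frames per second is the figure quoted above.

```python
# Rough comparison of how many values must be predicted for audio vs video.
# SAMPLE_RATE_HZ is an assumed, commonly used speech sample rate; it is not
# a figure taken from the project described in this article.
SAMPLE_RATE_HZ = 16_000
VIDEO_FPS = 30


def predictions_needed(duration_seconds, rate_per_second):
    """Number of values a model must generate for a clip of the given length."""
    return int(duration_seconds * rate_per_second)


clip_seconds = 10
audio_predictions = predictions_needed(clip_seconds, SAMPLE_RATE_HZ)
video_frames = predictions_needed(clip_seconds, VIDEO_FPS)

print(f"Audio samples for {clip_seconds}s: {audio_predictions}")
print(f"Video frames for {clip_seconds}s: {video_frames}")
print(f"Ratio: {audio_predictions / video_frames:.0f}x")
```

Under these assumptions, a ten-second clip needs 160,000 audio samples but only 300 video frames, which is why a single mispredicted stretch of the waveform is so easy to hear.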
To overcome these challenges we used an algorithm based on a model known as WaveNet. This algorithm predicts the sound wave simultaneously at many frequency levels, so that the resulting speech contains the higher harmonics required to convincingly replicate human speech. Even with a good model, a great deal of training is required: it takes weeks, even with the massively parallel processing power of a GPU.
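For readers curious about what is under the hood, the published WaveNet design is built from stacked "dilated causal convolutions", which let each predicted sample draw on context at several timescales without ever looking into the future. The sketch below is a minimal pure-Python illustration of that building block, not the model we trained; the function names, the fixed weights and the tiny three-layer stack are all illustrative assumptions.

```python
# Minimal sketch of a dilated causal convolution (kernel size 2), the basic
# building block described in the published WaveNet design. Names and weights
# here are illustrative assumptions, not taken from any real implementation.

def causal_dilated_conv(signal, weights, dilation):
    """out[t] = w_past * x[t - dilation] + w_now * x[t], zero-padded on the left.

    Only current and past samples are used, so the output at time t never
    depends on the future.
    """
    w_past, w_now = weights
    out = []
    for t in range(len(signal)):
        past = signal[t - dilation] if t - dilation >= 0 else 0.0
        out.append(w_past * past + w_now * signal[t])
    return out


def receptive_field(dilations, kernel_size=2):
    """How many input samples can influence one output sample of the stack."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


# Doubling the dilation at each layer grows the context window exponentially.
small_stack = [1, 2, 4]
impulse = [1.0] + [0.0] * 9

response = impulse
for d in small_stack:
    response = causal_dilated_conv(response, (1.0, 1.0), d)

print(response)                      # the impulse spreads over 8 samples
print(receptive_field(small_stack))  # 8

# Ten doubling layers (1, 2, 4, ..., 512) cover 1024 past samples.
print(receptive_field([2 ** i for i in range(10)]))
```

Doubling the dilation at each layer is what makes long context affordable: ten layers see 1,024 samples, whereas ten undilated layers of the same kernel size would see only 11.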
Once the model is trained, any new sentence can be synthesised almost in real time. Furthermore, these models can replicate an individual’s voice with only a few hours of that individual’s speech to train on. In this case we did so by first training the model on 25 hours of generic data from other North American speakers. Only then did we move on to the final phase of training, in which the model learned Trump’s particular style of speech. This means the technology can be used to "fake" the voice of anyone who has a few hours of audio already in the public domain.
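The two-phase recipe, generic training followed by a short speaker-specific phase, can be illustrated with a deliberately tiny toy problem. The sketch below is not our speech model: it fits a single number by gradient descent, and the data, learning rates and function name are all made up for illustration. It only shows why starting from a pretrained model helps when the target data is scarce.

```python
# Toy illustration of "pretrain on plentiful generic data, then fine-tune on
# a little target data". Everything here (data, learning rates, names) is a
# made-up stand-in; a real speech model has millions of parameters.

def gradient_descent_fit(data, weight, learning_rate, epochs):
    """Fit y ~ weight * x by gradient descent on the mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
    return weight


# Plenty of "generic" examples with slope 2.0 ...
generic_data = [(x / 10, 2.0 * x / 10) for x in range(1, 101)]
# ... but only a handful of "target" examples with a slightly different slope.
target_data = [(x / 10, 2.3 * x / 10) for x in range(1, 6)]

# Phase 1: learn the general behaviour from the large generic set.
pretrained = gradient_descent_fit(generic_data, weight=0.0, learning_rate=0.01, epochs=200)

# Phase 2: a short fine-tuning run on the small target set, starting from phase 1.
fine_tuned = gradient_descent_fit(target_data, weight=pretrained, learning_rate=0.1, epochs=20)

# Baseline: the same short run on the target set, but starting from nothing.
from_scratch = gradient_descent_fit(target_data, weight=0.0, learning_rate=0.1, epochs=20)

print(f"pretrained:   {pretrained:.3f}")
print(f"fine-tuned:   {fine_tuned:.3f}")
print(f"from scratch: {from_scratch:.3f}")
```

With the same small training budget on the target data, the fine-tuned weight ends up much closer to the target slope of 2.3 than the one trained from scratch, because the generic phase has already done most of the work.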
As you can hear in the demonstration model we built for the BBC, the machine-generated audio is still far from perfect. Unlike a human mimic, however, it is likely to improve rapidly with more time and more training, to the point where the synthesised version will eventually be indistinguishable from the human original.
The implications of this are immense. Not only will it make it easier for hostile state actors to produce disinformation, but it also means that anyone, from teenagers in their bedrooms to activist groups, will be able to generate fake videos of public figures talking and use them to spread disinformation around the web.
However, AI also offers a solution to this potential threat. New AI algorithms can be developed to detect and flag the fake content, to stop it ever being uploaded and spread. Over the coming months, we hope to explore further how this technology can protect our democracies from political disinformation.
Photo credit: Gage Skidmore