Man vs Machine - Artificial Intelligence Produces Human Voice - Raises Several Questions
Now that artificial intelligence systems need only a short audio sample to train a viable artificial voice that impersonates an individual's speaking style and tone, the opportunity for misuse increases.
The long-drawn tussle between man and machine has reached another milestone: the astounding pace at which artificial intelligence systems can clone a human voice. Chinese technology leader Baidu's ‘Deep Voice’ can generate new speech, accents, and tones from only 3.7 seconds of sample audio, compared with the 30 minutes of audio the company’s voice cloning tool required a year earlier. This shows how quickly the technology for producing artificial voices has advanced in a short span of time. It also indicates that the capabilities are getting stronger and more realistic, which may lead to abuse of the technology.
Power of AI Voice Generation
As with most artificial intelligence algorithms, the more data voice cloning tools such as Deep Voice are given to train on, the more realistic their results become. If you listen to the audio demos on GitHub, you will appreciate the range of what the technology can do, including switching the gender of a voice and modifying accents and styles of speech.
Google has been applying its AI systems to text-to-speech conversion, building on WaveNet, the company’s deep neural network for speech generation, and it recently released Tacotron 2. In Tacotron 2, one network turns text into a spectrogram, a visual representation of audio, and a WaveNet-based network turns that spectrogram into audio. WaveNet is used to create the voice for Google Assistant. This iteration of the technology is so good that it is nearly impossible to tell an AI-generated voice from a human one. The algorithm has learned to enunciate complex words and names whose mispronunciation would once have been a tell-tale sign of a machine, and to articulate words better overall.
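To make the idea of a spectrogram concrete, here is a minimal sketch in Python that uses only the standard library. It is not Google's code and has nothing to do with how WaveNet or Tacotron 2 are implemented; the function name, frame size, hop size and test tone are all illustrative assumptions. It simply slices a signal into short overlapping frames and measures how much energy each frequency carries in each frame, which is the grid of numbers a spectrogram image is drawn from.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists the magnitude of every frequency bin."""
    # A Hann window tapers the edges of each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive discrete Fourier transform, keeping non-negative frequencies only.
        bins = []
        for k in range(frame_size // 2 + 1):
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            bins.append(abs(total))
        frames.append(bins)
    return frames


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz (not real speech).
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = magnitude_spectrogram(signal, frame_size=FRAME_SIZE, hop=32)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames, peak at {peak_hz} Hz")
```

Real speech systems typically use a fast Fourier transform and a perceptual frequency scale rather than this naive transform, but the principle is the same: time runs along one axis, frequency along the other.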
These advances in Google’s voice generation technology have allowed Google Assistant to offer celebrity cameos. John Legend's voice is now an option on any Google Assistant device in the United States, such as Google Home, Google Home Hub, and smartphones. Legend’s voice responds only to specific questions such as "How's the weather" and "How far is the Sun from the earth", and it can sing happy birthday on command. Google is working on making more celebrity cameos available to choose from.
As another example of how precise the technology has become, an AI model of Jordan Peterson (the author of 12 Rules for Life) sounds just like him rapping Eminem’s song "Lose Yourself". The creator of the algorithm used only six hours of Peterson talking, taken from recordings readily available online, to train the machine learning model that produced the audio. The algorithm accepts short audio clips and learns to synthesize speech in the style of the speaker. If you listen to it, you will hear just how successful it was.
This advanced technology opens the door for companies such as Lyrebird to provide new services and products. Lyrebird uses artificial intelligence to create voices for chatbots, audiobooks, video games, text readers and more. The company acknowledges on its website that “with great innovation comes great responsibility”, underlining that the pioneers of this technology must take appropriate steps to prevent its misuse.
How Can the Technology Be Misused and What Are the Precautions?
Like other nascent technologies, artificial voice can have many benefits, but it can also be used to mislead individuals or pose a threat to society. As the AI algorithms improve and it becomes harder to tell real from artificial, people with malicious intent may find more opportunities to use the technology to manipulate the truth.
According to research, our brains do not register differences between real and artificial voices as readily as they do between real and artificial images. In other words, people find it harder to detect a fake voice than a fake image.
Now that these AI systems need only a short audio sample to train a viable artificial voice that impersonates an individual's speaking style, pitch and tone, the opportunity for misuse increases. So far, researchers have not identified a neural mechanism by which the brain distinguishes real voices from fake ones. Consider how artificial voices might be used in an interview, news segment or press conference to make listeners believe they are hearing an authority figure in the government or the chairperson of a company. Criminals could use the technology to fabricate evidence, conceal facts or misrepresent the sequence of events.
Increasing awareness that this technology exists, and of how advanced it is, will be the first step in protecting listeners from falling for artificial voices used to mislead or manipulate. The real fear is that people can be manipulated into acting on something spurious because it sounds as if it comes from someone real. Technology evangelists are attempting to find a technical solution to safeguard society. The reassuring part about any technology, however advanced it gets, is that its output depends on its input. Cyber laws and governance will have to be in place to enforce ethical practices, and countermeasures can be used to monitor and expose threats. However, no technical solution will be 100 percent foolproof and secure. The human ability to think critically, to assess a situation, evaluate the source of information and verify its validity, will play an increasingly important role as these technologies take exponential leaps.
Share your thoughts, connect with me @Asamanyakm