S1E8 - Voice Editing Software - All The Things You Didn't Say


This week, we are going to dive into voice editing software. Some of the big names in the industry are Adobe's Voco, Lyrebird, Google's WaveNet and Baidu, all of which are making it possible to capture speech samples and use those samples to mimic a person's voice and create speeches or statements that the person never said.

I’m pretty excited to be covering this. Let’s jump right in!

Voice editing software has been around for quite some time. We use it while making movies, podcasts, and radio ads, as well as a whole bunch of other creative projects.

Software such as Audacity or Adobe Audition has made it incredibly easy for individuals to produce quality audio content for what feels like forever. The one problem that creatives run into while editing is that sometimes the recording ends up being unusable for various reasons.

 

For instance, during an interview the speaker mumbles or a key statement is inaudible, or the recording becomes corrupted or destroyed at some point afterward, and redoing the whole interview to get one sound bite is just impossible.

Well, companies recognized these issues and formulated a solution. In late 2016, Adobe made an announcement during the Sneaks session at its Adobe MAX conference that led to some widespread concern.

They were developing voice editing software named Project Voco that could take a sample of someone’s speech, then analyze and alter the sample to include words that were never in the speech to begin with and would sound exactly like the person.

BBC covered the event in an article that states the following:

“At a live demo in San Diego on Thursday, Adobe took a digitized recording of a man saying "and I kissed my dogs and my wife" and changed it to say "and I kissed Jordan three times". The edit took seconds and simply involved the operator overtyping a transcript of the speech and then pressing a button to create the synthesized voice track.”

The new audio sample would be almost indistinguishable from the original and as they improve on the software, the synthesized audio could one day be impossible to identify.

Around the same time, Google announced rival software called WaveNet. The software, which was researched and developed by Google's DeepMind, is described by the company as follows.

“WaveNet is a deep generative model of raw audio waveforms. WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.”

From my understanding, WaveNet takes the sample, analyzes the waveform for patterns, and then uses this information to build a model of how the voice behaves. That model is then used to create new audio one tiny piece at a time, by predicting what the next bit should sound like so that it fits with the original sample. The DeepMind website does an incredible job explaining their development and research process. The link is in the show notes on Lshompole.com if you’d like to read through it for a deeper understanding.
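For anyone who wants to see the "predict the next bit" idea in action, here is a minimal Python sketch. To be clear, this is a toy and not WaveNet: the real system uses a deep neural network trained on a lot of recorded speech, while this example fits a tiny two-number predictor to a pure tone. The function names (`fit_two_tap_predictor`, `generate`) and the test tone are made up for illustration. What it has in common with the description above is the loop: each new audio sample is predicted from the samples before it, then fed back in to predict the next one.

```python
import math


def fit_two_tap_predictor(samples):
    """Least-squares fit of x[n] ~ a*x[n-1] + b*x[n-2].

    This stands in for the 'learn the patterns' step. A real system
    would train a neural network here instead of solving two equations.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for n in range(2, len(samples)):
        prev1, prev2, target = samples[n - 1], samples[n - 2], samples[n]
        s11 += prev1 * prev1
        s12 += prev1 * prev2
        s22 += prev2 * prev2
        t1 += prev1 * target
        t2 += prev2 * target
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("sample is too short or too uniform to fit")
    a = (t1 * s22 - t2 * s12) / det
    b = (t2 * s11 - t1 * s12) / det
    return a, b


def generate(samples, count, coeffs):
    """Create `count` new samples, each predicted from the two before it."""
    a, b = coeffs
    audio = list(samples)
    for _ in range(count):
        # Predict the next sample, then append it so that it becomes
        # part of the context for the following prediction.
        audio.append(a * audio[-1] + b * audio[-2])
    return audio[len(samples):]


# Hypothetical "recording": 200 samples of a pure tone.
recorded = [math.sin(0.2 * n) for n in range(200)]
coeffs = fit_two_tap_predictor(recorded)
continuation = generate(recorded, 50, coeffs)
print("first generated samples:", [round(x, 3) for x in continuation[:3]])
```

A pure tone is about the easiest "voice" there is to continue, which is why two numbers are enough here. Real speech needs a far richer model, but the generate-one-sample-then-repeat loop is the same idea.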

 

Shortly after the announcements from both Google and Adobe, another company jumped into the ring. Lyrebird, a small Canadian company, came out of the gates with significant improvements over Adobe’s Project Voco. Lyrebird would only need to capture 60 seconds of sample speech to be able to convincingly mimic and alter the sample.

Not only that, Lyrebird also added emotion to the tech. The software can infuse emotion that was not originally in the sample to the new synthesized speech!

Sure, there are a lot of great uses for this technology. James Vincent, a writer for “The Verge,” spoke with Lyrebird reps early in 2017, shortly after the string of announcements from Google and Adobe. He writes: “The resulting speech can be put to a wide range of uses, says Lyrebird, including ‘reading of audio books with famous voices, for connected devices of any kind, for speech synthesis for people with disabilities, for animation movies or for video game studios.’”

Google now also boasts that this tech can be used to improve AI customer service speech, and its website makes little mention of the negative uses for this technology.

And on the Adobe blog, the company described the very issue I brought up earlier this episode.  

“When recording voiceovers, dialogue, and narration, wouldn’t you love the option to edit or insert a few words without the hassle of recreating the recording environment or bringing the voiceover artist in for another session? #VoCo allows you to change words in a voiceover simply by typing new words.”

Voco would make the life of every audio creative simpler, easier, and more efficient. This is undoubtedly true. After the initial learning curve, podcasters, voiceover creatives, movie producers and musicians will have more options when it comes to their art.

Although the benefits of this tech are substantial, so is the potential for misuse. This software poses some serious ethical dilemmas and will change how we interact with information.

The ones that I find the most intriguing and most likely are:

1. Hacking

2. Misinformation and the spread of propaganda

3. The impact on evidence and the ability to identify fake audio

4. The simultaneous use of video editing and voice editing software and its impact on our understanding and perception of the real world

Let’s start off with hacking.

For quite some time, biometrics and voice analytics have been used as a method to verify an individual’s identity. The U.S. government has been using the tech at border crossings since 1996. Currently, some banks also use it to identify callers.

For example,

“In May 2013 it was announced that Barclays Wealth was to use passive speaker recognition to verify the identity of telephone customers within 30 seconds of normal conversation. The system used had been developed by voice recognition company Nuance (that in 2011 acquired the company Loquendo, the spin-off from CSELT itself for speech technology), the company behind Apple's Siri technology. A verified voiceprint was to be used to identify callers to the system and the system would in the future be rolled out across the company.”

So what happens when your voice sample is used to hack into your bank account?

With software like Lyrebird, all a hacker would need to do is call you up on the phone and keep you on the line for 60 seconds of regular conversation.

Or, in a simpler example, they could be one of your Instagram followers waiting for you to post a one-minute video of you hanging out and talking to your friends. That’s all the software would need to effectively synthesize anything and everything you could say. In the future, this technology could even use your voice, emotion and all, to speak clearly and believably in any language known to man.

Adobe and Lyrebird have both addressed the possibility of hackers using their software to access individuals’ personal information or credentials. The two companies are taking very different approaches to safeguarding our data.

Adobe is researching and developing a digital watermark that would allow voice biometric software to detect when their voice editing software is being used.

However, my concern is that the digital watermarking tech will always have to stay ahead of whatever hacking tech is developed to outsmart it. I expect that no matter how complex and intelligent the watermark is, there will always be a hacker out there capable of circumventing it, as we have seen with just about every security technology currently available to us.
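To make the watermark idea a little more concrete, here is a minimal Python sketch of one classic approach: mix a very quiet, key-derived pattern into the audio, then check for it later by correlating against that same pattern. This is only an illustration of the general concept and an assumption on my part, not Adobe's actual method. The function names, the `key`, the `strength` value and the test tone are all hypothetical.

```python
import math
import random


def watermark_pattern(key, length, strength=0.01):
    """Pseudo-random +/- pattern that only someone with `key` can regenerate."""
    rng = random.Random(key)
    return [strength * rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed_watermark(samples, key, strength=0.01):
    """Add the quiet key-derived pattern on top of the audio samples."""
    pattern = watermark_pattern(key, len(samples), strength)
    return [s + p for s, p in zip(samples, pattern)]


def watermark_score(samples, key):
    """Correlate the audio with the key's +/-1 pattern.

    Marked audio scores close to the embedding strength,
    while unmarked audio scores close to zero.
    """
    pattern = watermark_pattern(key, len(samples), strength=1.0)
    return sum(s * p for s, p in zip(samples, pattern)) / len(samples)


def is_watermarked(samples, key, strength=0.01):
    """Flag the audio if the score is more than halfway to `strength`."""
    return watermark_score(samples, key) > strength / 2


# Hypothetical clip: a quiet tone standing in for synthesized speech.
audio = [0.1 * math.sin(0.05 * n) for n in range(20000)]
marked = embed_watermark(audio, key=1234)
print("marked clip flagged:  ", is_watermarked(marked, key=1234))
print("original clip flagged:", is_watermarked(audio, key=1234))
```

This toy version also shows why the cat-and-mouse worry above is a fair one. A mark this simple would probably not survive compression, re-recording, or someone deliberately adding noise, so a real watermark would have to be much more robust.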

Lyrebird takes a different approach; they intend to make their software completely available to everyone.

The Verge article states that Lyrebird’s “solution is to release the technology publicly and make it ‘available to anyone.’ That way, they say, the damage will be lessened because ‘everyone will soon be aware that such technology exists.’”

While speaking to “The Verge,” Alexandre de Brébisson of Lyrebird adds: “The situation is comparable to Photoshop. People are now aware that photos can be faked. I think in the future, audio recordings are going to become less and less reliable [as evidence].”

While making everyone aware that audio can be faked is a first step to protecting individuals from falling victim to hacking, it doesn’t do much more than that.

Individuals will just be more aware that, at any moment, their voice can be used over the phone to con companies into believing that they authorized things like wire transfers or the removal of fraud notifications on their bank accounts.

 

Now, the possibility of misinformation and the spread of propaganda through audio and video editing is my favorite topic on this list.

In another article on “The Verge,” James Vincent showed just how easy it can be to edit voice and video to create a buzz. Discussing Jordan Peele’s fake Obama video, Vincent writes:

“Using some of the latest AI techniques, Peele ventriloquizes Barack Obama, having him voice his opinion on Black Panther (‘Killmonger was right’) and call President Donald Trump ‘a total and complete dipshit.’

“The video was made by Peele’s production company using a combination of old and new technology: Adobe After Effects and the AI face-swapping tool FakeApp. The latter is the most prominent example of how AI can facilitate the creation of photorealistic fake videos. It started life on Reddit as a tool for making fake celebrity porn, but it has since become a worrying symbol of the power of AI to generate misinformation and fake news.”

Sure, audio made with this software is believable by itself, but imagine having video that reinforces and matches the message. Obama discussing Black Panther can be deemed harmless.

But Donald Trump effectively declaring the onset of World War 3? That video would go viral in seconds, and the consequences could be extreme.

Our ability to judge the authenticity of the information we receive through the internet falls further behind with every improvement in technology and AI.

Every single round of software updates to technology like Lyrebird or Project Voco will mean that more and more of us are unable to take in information and come to an accurate conclusion.

We saw a version of this during the 2016 election, as I covered in Episode 7, when Russian bots were used in an effort to sway the outcome of the presidential election in the U.S.

Voice editing could amplify that effect tenfold, because audio is so readily understood and believed. Used in conjunction with bots, edited or synthesized video, and the internet, it could leave us in a mess, all from the spread of fake news (as Trump calls it).

Last of all, let’s discuss the impact on evidence.

Video and audio have been used for years in trials, disputes, and even insurance claims.

The ability to record sound and video as an event occurs has been incredibly valuable for not only police officers and judges, but also for the media.

As human beings, seeing and hearing is believing. So what happens when the audio is altered? What happens if, let’s say, a recorded interaction between a police officer and an individual is altered so that the individual sounds uncooperative or threatening?

What happens when, in that interaction, the police officer ends up shooting and killing the individual?

What happens when that altered audio is now used as evidence in a case?

The possibility of evidence tampering is significant, and it’s safe to say that there is enough motive and incentive in these types of cases for tampering to occur.

Lyrebird, Adobe’s Project Voco, and Google’s WaveNet create opportunity, both good and bad.

In my opinion, as technology develops, the need for ethical guidelines and regulations increases. Companies should be required to consider the negative consequences of developing new tech. They should also be able to produce effective methods that work to identify the misuse of their technology and provide protection in those cases.

At the moment, the only one of these products that is available on the market, though not in its complete version, is Google’s WaveNet. A portion of the software is available through Google’s Cloud Text-to-Speech site, where users can enter text and have the software convert it to speech in a multitude of languages, at varying speeds and tones.

Lyrebird and Adobe’s Project Voco are both not yet available to the public, and pricing has yet to be discussed. According to the article by Vincent, though, “de Brébisson says more than 6,000 individuals have signed up for early access to its APIs, and Lyrebird is working to improve its algorithms, including adding support for different languages like French. ‘This technology is going to happen,’ says de Brébisson. ‘If it’s not us it’s going to be someone else.’”

An Adobe spokesperson has mentioned that no official ship date has been announced for Project Voco, but in the meantime, Adobe is focusing on researching the digital watermarking tech that will allow individuals to know when its voice editing software is being used.

So at this moment, since none of this tech is available for regular use, there are no protection tips that I can give you to implement.

Once this software is released, I intend to revisit the topic and update you all on how to better protect yourself.

I hope that you found this interesting, and if you want to know more you can head over to Lshompole.com where I have a whole bunch of links to articles and other information about voice editing software.

As always, if you have any stories about your interactions with tech companies, apps or gadgets that you want to share with me or Creepy Tech listeners, please send your story or audio clip over to WYN@Lshompole.com and I will feature it on an upcoming episode.

If you enjoyed this episode, please head over to the iTunes podcasting app and rate, review and subscribe. I just got the first two ratings for Creepy Tech and it definitely made my day! Thank you!

Hope to have you back next Tuesday!