Photoshop for speech is becoming a reality with Adobe Audition’s VoCo plugin. What are the implications (and possibilities) for radio broadcasting?

Letting people say what you want them to say. VoCo makes it possible. The Adobe MAX 2016 presentation (of what looks like an early beta) is impressive. Adobe claims that, after analysing 20 minutes of someone’s voice material, VoCo not only turns those spoken words into written text, but also lets you edit that text and transform it into new speech in the same voice! Sandra Müller, co-founder of the German fair radio initiative for authenticity in radio broadcasting, is fascinated. And worried.



“A computer voice never gets ill, and won’t ask for a raise”



Sandra Müller expects that within 10 years, VoCo will be part of radio’s workflow (image: SWR 4)



Be open towards listeners

Sandra, what are your thoughts about VoCo for radio?

“Looking at Adobe’s presentation, I must admit: it’s fascinating to see how easily you can edit audio just by editing text. But the general perception of audio may change once VoCo is used on a larger scale. The first demo doesn’t sound perfect yet, but developments in speech synthesis and editing software have been huge over the last couple of years. I expect that it will evolve rapidly to a level where you can produce a simple information text with a synthetic voice. Soon, listeners won’t be able to distinguish between something you’ve said yourself, and something that is synthesised. For now, radio is still halfway on the ‘safe side’ of credibility. When people hear a radio sound bite, they usually feel like: yes, I believe this person said it like this. Yes, it’s already possible to manipulate speech through editing, but that’s a lot of work. However, once VoCo can quickly synthesise a voice, and everyone can use it so that fake audio clips start circulating, our audience could start questioning whether what they hear on the radio is authentic or not. Just like the credibility and meaningfulness of images have suffered since the introduction of Photoshop.”



Be society’s content watchdog

Zeyu Jin said in his presentation that artificial speech will have an audio watermark, so we’ll know whether a sound bite is authentic or has been manipulated. Nothing to worry about?

“Hahaha, it’s nice that they at least thought about this aspect. Nevertheless, this probably won’t help in our daily practice. Every day, I see fake pictures in my Facebook timeline. Some of those mock-ups were posted by people who knew they were fake, but when you look at the comments, others falsely believe them to be authentic. Of course, there are always experts who can prove that something is fake, but by then, it may have had a big impact anyway. The same could be true for audio. Someone would hear me say something I didn’t really say, and even though another user would comment below: ‘This is fake; Sandra would never say such a thing’, most people won’t read it. Therefore, I’m skeptical about whether such a watermark would really help.”



Consider VoCo service automation

Maybe there’s a role for (radio) journalism to uncover such fakes, in cases where it serves the public interest. In your blog of November 9, 2016, you took a bet that within 10 years, ‘vocoting’ will be an established technology in radio, like voice-tracking is today. What makes you think that VoCo will be used in radio?

“I have thought about where it could be convenient and economically useful for radio. Say I’m running several stations and need 4 people to read the news every day: it would be much cheaper to have 4 VoCo voices. I would just need one editor to type in the words, which would then be read by 4 different ‘voices’. A computer voice never gets ill, and won’t ask for a raise. Some station managers might say: let’s use technology to save budget.”

It might be happening already. I’ve heard of a German commercial radio operator who had a female presenter record all kinds of traffic-related voiceover parts, which are now being used for computer-generated traffic reports during non-stop or nighttime hours.

“Yes, this technology is already being used for traffic information, that’s correct. I believe this even happens at a public radio station in northern Germany. This method is relatively simple, because it doesn’t consist of complete sentences, so there’s no specific speech melody required. People seem to accept it, and are getting used to the ‘voice’ of their navigation device or smartphone.”



“In classic news bulletins, the human presenter may become expendable one day”



Information-oriented content, read with a neutral intonation, could technically be handled by an automated system based on speech synthesis one day (image: 123RF / Andrey Tsidvintsev)



Make radio sound personal

Do you believe that computers will ever be able to perfectly mimic our way of speaking with a human’s natural pause and emphasis, but especially with emotion?

“I expect that it will take a lot of time until they can speak with human emotion. For a long time, machines won’t be able to whisper, scream, laugh, or be sad. But for your iPhone, you can already buy several voice packages to replace Siri. I’ve found that another female voice, with a significantly larger voice database, sounds much better in terms of natural pauses and accentuations. In radio, informational content such as the news is presented without a lot of emotion. Within a few years, speech synthesis could be so advanced that you’ll feel like a human being is reading something in a factual, less emotional, information-oriented way.”



Employ expert news journalists

The example of newscasts is a good one, because they tend to sound more monotone than dynamic. But do you expect that a computer voice could ever convey to listeners an impression that this is a person who actually knows what he’s talking about?

“That’s a really good question. Maybe that’s the point where good radio presenters will differ from the outset, and set themselves apart from a synthetic voice. A bad news anchor will be more easily replaced by a machine than a good one. That’s an interesting point of view.”

It could be a new assessment for news anchors. “If you can beat the machine, you’ll have the job!”

“Haha. Who knows?”



Have personality news anchors

“Maybe there’s also a good side. The VoCo trend could make radio managers understand: my presenter is a really good one when he’s not just reading pre-written positioners or pre-produced wordings, but comes across as a human being; he will put emotion and personality into the words. That would work anywhere you want personality. But in classic news bulletins, the human presenter may become expendable one day.”

But isn’t it so that — especially with news & current affairs programming — a presenter really has to understand a topic? For example: when I’m typing in a Dutch sentence into Google Translate for rendition into English, even with basic statements, there are those odd-feeling differences. Because a machine is just not human…

“Ten years from now, I would love to say: ‘I was too optimistic about technological advancements, and too pessimistic about human implementation of those possibilities’. There’s nothing I’d like more than to lose my bet, hahaha.”



“Think before you’re cutting around in someone’s words”



Talents should know both technical and ethical principles of editing (image: Make a Website Hub)



Reflect your conversations honestly

From your experience as a radio reporter, editor, presenter & teacher, what are your do’s & don’ts for ethical radio journalism, especially in relation to VoCo technology?

“You always have an obligation to ask yourself while you’re editing: is my selection of sound bites reflecting the person interviewed? Does my montage actually represent what I have discussed and experienced with him during the conversation? That includes dealing with perfect actualities; a delicate aspect. Sometimes, a person delivers a 20-second statement, where you’re going like: ‘wow, that’s a really catchy one, and it also has the perfect duration; I’m using it’. But that’s exactly when you want to take a step back, and say: ‘okay, cool sound bite; really catchy; perfect duration, but does it really express what he said, as well as his attitude?’. Before you finish your editing, put yourself in the position of your conversation partner, and ask: ‘if this is being broadcast, would I feel that I have been thoroughly represented?’ That’s a great responsibility that we already have in this day and age.”



Avoid changing essential parts

Müller’s advice is to always be conscious about which edits you do and do not make, including cutting out ‘uhhs’ and other words that people are using to think. “When someone who is not used to being interviewed is a bit insecure, I find it okay to help him speak a bit more fluently. But I have to be careful in case of a politician, as it may also indicate that he was hesitant, or unsure about what he said. Whether that aspect is being expressed, or if I’m just letting him talk along without including those thought breaks, makes a difference! Before saving your final edit, you want to think: have I changed anything significant; added anything judgemental?”



Keep short stories focused

As a radio reporter working for various stations with different formats, from long-form features to 1’30” bits, do you feel like you can always tell a story like you’d want to?

“I’m not a fundamental opponent of format radio; listeners should know what to expect from a radio program and its content. I’m also not fundamentally against short form. I still find it possible to tell a good story within a short timeframe. But you have to realise that not every topic is suitable for that, and short form bits make it even more important to think: did I modify anything that I shouldn’t?” She adds that you want to limit short items to a single aspect. “Telling the complete story in 1’30” usually doesn’t work, so rather choose one aspect, and cover that well. By the way, I’ve noticed that new interns are often immediately trained in technical aspects of audio editing, but that they’re not always told: think before you’re cutting around in someone’s words. I think that should be explained more often.”



“VoCo is just another method for which I’d like to see the same standards”



Sandra Müller would like to avoid anything that can mislead listeners, so she has mixed feelings about Adobe’s VoCo technology, and apparently she’s not the only one (image: Twitter)



Stay true to reality

Looking at positive aspects, which ways of using VoCo could be beneficial, and ethically okay?

“I can imagine its use for traffic reports and service-related information. I’m currently undecided about whether and how I would use it myself. If I have a cold, VoCo could do the voiceovers in my report, but I doubt whether I’d want that. Correcting minor mistakes is a possibility, like when an unexpected background noise has made a few words in a recorded interview inaudible. But I would always stay away from everything that replaces or creates a complete statement, no matter if a reporter or an interviewee says it. Also because it may open a door. Should a station deliver news using VoCo, I would at least expect transparency, so listeners are aware that it’s a computer-generated voice. However, it may cause people to insinuate things like: ‘you guys are using this technology, so you’re probably using it in other ways as well’. To me, the easiest way of not being vulnerable to that would be not using this technology.”



Establish ethical work codes

German stations could produce great comedy bits, making Angela Merkel say whatever they like, haha.

“Yes, that’s similar to what’s already being done with voice actors who are mimicking politicians and celebrities. If it happens in a comedy context, and in a transparent setting, it’s something that I could accept. But I don’t know whether I would do it if I were running a station myself. An ethical ground rule for me is: we don’t pretend that things are true when they are not. That’s one of the fair radio principles [scroll down for English version, TG]. That’s why I have reservations about vocoding. If it’s not clear whether a person or a machine is speaking, those listeners who find out may feel deceived.”



Use only authentic soundscapes

“For me, this also applies to existing audio production methods. Personally, I would not use library sound effects underneath a radio report if the reporter hasn’t been at the location personally. VoCo is just another method for which I’d like to see the same standards. Everything that can mislead listeners is to be avoided. Of course, times are changing. 70 years ago, people might have assumed that all music heard on the radio was played live. Today, we all know that this is usually not the case. What matters is that we avoid disillusioning or deceiving the audience. Therefore, we should think about possible consequences of developments like VoCo for our daily workflow and ethical conduct to maintain radio’s credibility.”




Coincidentally, right before this interview (which we recorded about a week ago) was published, Adobe’s VP of Creativity, Mark Randall, published ‘Controversy and Opportunity in Innovation’ about Project VoCo. His point is that ‘technology is an extension of human ability and intent’, that therefore ‘every technology comes with positive and negative consequences’, and that altering speech the way VoCo does is already possible with existing tools; it just isn’t fast and easy to do yet.

Regarding concerns about possible misuse, he writes: ‘although we were already thinking of ways to add watermark technology to Project VoCo, we’ve gained renewed insight into the types of metadata management, watermarking or forensic technology some people desire to manage trust and authenticity in audio recording’. Randall adds that ‘Project VoCo may never become a product feature — many Sneaks never do. But now we have better insight into the benefits our customers find most useful. We also have new feedback to consider as we participate in discussions with professional organisations and standards bodies about the use of digital media.’



(Header image: Thomas Giger, source material: Adobe)