Across Acoustics

How Intelligible Are Cloned Voices?

ASA Publications' Office


You may have heard of deepfakes—AI clones of people used to create lifelike video and audio to manipulate an audience. AI cloning technology, however, has much broader applications than just subterfuge. In this episode, we talk with Patti Adank (University College London), who studied the intelligibility of voice clones compared to their natural counterparts and sheds light on some potential benefits of this technology.


Associated paper: Patti Adank and Han Wang. "Voice clones are easier to understand in noise than their human originals: The voice cloning intelligibility benefit." J. Acoust. Soc. Am. 159 (2026). https://doi.org/10.1121/10.0043094.


Read more from The Journal of the Acoustical Society of America (JASA).
Learn more about Acoustical Society of America Publications.

Music Credit: Min 2019 by minwbu from Pixabay. 

ASA Publications (00:26)

You may have heard of deepfakes, which are AI clones of people used to create lifelike video and audio to manipulate an audience. AI cloning technology, however, has much broader applications than just subterfuge, and today we're going to talk to a researcher who explores just that. With me is Patti Adank, who will be discussing her recent JASA article, “Voice clones are easier to understand in noise than their human originals: The voice cloning intelligibility benefit.” Thanks for taking the time to speak with me, Patti. How are you?

 

Patti (00:55)

I'm good, thanks, how are you?

 

ASA Publications (00:58)

Good. So first, tell us a bit about your research background.

 

Patti (01:01)

First, I want to say I'm really excited to be here, and thank you for the opportunity to talk about our research. My research background: I'm originally more of a linguist. I started off as an experimental phonetician and did my PhD on vowel acoustics. After that, I retrained as a cognitive neuroscientist. Most of my work has focused on how we understand other people in difficult or challenging listening conditions, such as background noise or a noisy café, something like that. And I've studied various degradations, as I call them, over the last 15 years or so. Anything that is sort of unfamiliar or more difficult, that you're just not used to, can be seen as a degradation. This could be an unfamiliar accent, like mine, for instance, or it could be just background noise, or someone who speaks really fast, or anything, really, that you're just not familiar with.

 

This study had two first authors. It was me and my former PhD student Han Wang. And we had a really nice distribution of work between us because he is quite technical and does lots of machine learning and very specialized statistical analyses. He did that part of the study. I mostly did the data collection, conceptualization and writing. And I also did most of the statistical analyses, but he did all the sort of quite complex regression analyses.

 

ASA Publications (02:24)

What is voice cloning?

 

Patti (02:26)

Voice cloning is a subset of synthetic speech generation effectively. So voice cloning is an application of AI-generated speech and refers to the process of creating a synthetic version of a person's voice that very closely mimics their speech characteristics based on a very short sample—it can be as little as three to ten seconds of recorded speech.

 

ASA Publications (02:48)

It’s kind of crazy that you can clone with just that little bit of speech. 

 

Patti (02:41)

Yeah. And it's pretty good too. 

 

ASA Publications (02:30)

Yeah. What are some benefits and drawbacks of this technology?

 

Patti (02:59)

I think when people hear about cloned speech and voice clones, most people will not immediately jump to the idea that this is necessarily something for the good of humanity, because people have seen this in the news, usually relating to scams or deepfakes. I'll start with the drawbacks, effectively. Because you can clone someone's voice very quickly, within three to ten seconds, I mean, you can do that with your own voice, but criminals can also do that with your child's voice or your mother's voice or someone else's voice. So this makes it very easy to build scams, like, for instance, telephone scams where the scammer may use the voice of someone who's familiar to you to say, “Oh, I'm in trouble, please help me.” And people in the media can also be cloned, and this can be used for misinformation, for instance. And there are lots of papers out there already warning us about the drawbacks and the dangers of cloning or deepfakes, as they're also called.

 

And I wanted to sort of move this towards looking more at the benefits, because initially, I mean, this technology was not designed for the drawbacks or to commit fraud. The benefits, for instance, are especially in assistive health technologies. One example is voice banking. Voice banking is essentially creating a voice for those who at some point may not be able to use their voice anymore, for instance, people who will lose their voice as a result of motor neuron disease. In these cases, people can record several hours of their voice and store that for later use, and maybe implement it in assistive devices so they can still speak with their own voice at that point.

 

The other thing you can do is maybe personalization of training materials for hearing aids and cochlear implants, personalized audiobooks, personal assistants having a specific voice. For instance, I could also think about something like reading machines for the blind in a personal or familiar voice. So anything essentially where you want to personalize something, you can use voice cloning.

 

You can also think about preservation of, say, accents or voices for cultural or historical archives. We have issues with accents dying out, and this would be a way of maybe preserving some of those accents. I mean, there are many things. You can think about educational applications, but a lot of it is anything where you want to personalize something having to do with speech. It could also be branding, say a brand wants to use a specific voice. It essentially means, to a degree, that someone gives a sample and then that sample can be used again and again. So those are the benefits. There are loads, essentially both in healthcare and commercially.

 

ASA Publications (05:29)

Wow, yeah, it does sound like there are so many applications that I had never even considered. The idea of recreating somebody's voice if they lose it for some reason and need a cloned version is really interesting. So how does synthetic speech typically compare to human-produced speech, particularly in terms of intelligibility?

 

Patti (05:49)

Modern speech synthesis originated after the Second World War, and it started more or less when researchers at Haskins Laboratories were given money to build a reading machine for the blind. They were the first to recreate speech without a speaker present, and they did that at the time by painting formant tracks onto, more or less, glass plates and sounding those out. Ever since those days, the techniques have become a lot more sophisticated, and the quality has improved a lot as well. Initially, when you go back to the 80s, you had things like the Klatt synthesizers, and you had other techniques like PSOLA, which was essentially a pitch-synchronous overlap-and-add technique. And these were mostly concatenation-based, where you just took little bits of speech and sort of concatenated them.

 

But now there's been a giant leap. There was a leap in the 90s, and then in the last 10-15 years there was another leap, and that's due to the implementation of deep neural networks and artificial intelligence in this technology. In standard speech synthesis, or text-to-speech synthesis, you can use a specific, newly created synthetic voice, and that's often still based on an existing person. I think people will be familiar with some synthetic voices, such as Alexa and Siri. These are essentially voice clones of one voice produced by standard text-to-speech systems. And these voices, Alexa and Siri, are actually based on hours and hours of recordings of actual people who have donated their voice and done lots and lots of recordings. Essentially, those are now used as the voices of Amazon and Apple.

 

Text-to-speech systems tend to have only a few voice options. Generally, if you have a standard text-to-speech system, you can choose a voice: maybe one female voice, one male voice, and sometimes voices for different languages. And these voices may be based on real humans, or they may be constructed using small modifications of a standard voice.

 

And to go back to your question about synthetic speech: these voices that have been used in Siri and Alexa are extremely intelligible. People have done research evaluating the intelligibility of those synthetic voices, and they seem to be optimized for intelligibility. They evaluated them automatically, but also with human listeners, and they find that these voices are just, you know, super intelligible. And that is generally because the speakers who modeled for these voices were already really intelligible. These tend to be voice actors, and they know how to speak so that their voice is intelligible. So the short answer is, yes, synthetic speech didn't use to be as intelligible as human speech, but these standard synthetic voices are now much more intelligible than most human speech.

 

ASA Publications (08:27)

Okay, okay. So then how do cloned voices compare to these previous examples of synthetic speech, and how would those differences affect intelligibility?

 

Patti (08:37)

The way voice cloning technology differs from standard speech synthesis is that cloned voices are copies of existing humans. Anyone's speech can be cloned. And that essentially means that, because individuals' voices differ in intelligibility, you sort of port that variation in intelligibility into the cloned voice. If you clone the voices of, say, 20 people, they will vary in intelligibility, because these voices have not been pre-selected for their intelligibility.

 

When we started this study, we expected that maybe these voice cloning systems would prioritize a likeness between the model, the original human, and the voice clone, and maybe put less priority on how intelligible these voices were. So we just didn't know. And also, when you clone a voice, you take on board all the accent features, you take on board people's age. You take in all this extra variability as well, which may interfere with how intelligible one of those voices is.

 

ASA Publications (09:35)

Okay, okay, so what was the goal of this study?

 

Patti (09:38)

The goal of the study was to evaluate how cloned voices relate in intelligibility to their human originals. I didn't really know that much about voice cloning systems at the time, and the system we used to clone voices is a commercial one, so we didn't have that much insight into exactly what it did and which features of the synthetic speech it produced were prioritized. So we didn't know: if we clone all these people, will these voices maybe be less intelligible because their human originals were also maybe less intelligible? We just wanted to know, if we clone some voices, embed them in different noise levels, and compare them directly with the human originals, how do they compare?

 

ASA Publications (10:27)

So then how did you evaluate the intelligibility of cloned versus human-produced speech?

 

Patti (10:28)

We ran an experiment. Generally in speech science, if people want to evaluate the intelligibility of a speaker or some type of speech, what they do is make recordings of a more or less standard set of sentences, also known as the Harvard sentences or the IEEE sentences. This is a set of 720 sentences that was designed, I think, back in the 1940s, and they have been used for decades. You take a selection of those sentences, which are more or less phonetically balanced, so most phonemes or sounds occur equally often, you embed them in noise, you give them to listeners, and then you get some sort of value out.

 

In this case, what we did is we embedded them in four noise levels, from very difficult to fairly easy, and asked listeners to just repeat what they heard. We played the sentence to them and asked them to type in what they heard. We did this in an online study. We had 80 listeners listen to 80 sentences each: 40 cloned sentences and 40 human sentences across the four noise levels. And they didn't hear the same sentence twice; they heard each sentence in either a cloned version or a human version.
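For readers who want a concrete picture of what embedding sentences in noise involves, here is a minimal sketch of mixing a recording with noise at a target signal-to-noise ratio. The file names, the white-noise masker, and the four SNR values are illustrative assumptions, not the exact materials or levels used in the study.

```python
# Minimal sketch: embedding a sentence in noise at a target SNR (illustrative only).
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech power / noise power equals the target SNR, then mix."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scaled_noise = noise * np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mix = speech + scaled_noise
    return mix / np.max(np.abs(mix))            # normalize to avoid clipping

speech, fs = sf.read("harvard_sentence_01.wav")  # hypothetical file name
noise = np.random.randn(len(speech))             # white noise as a stand-in masker
for snr in (-6, -3, 0, 3):                       # four illustrative noise levels
    sf.write(f"sentence_01_snr{snr:+d}.wav", mix_at_snr(speech, noise, snr), fs)
```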

 

And we didn't tell them that some of the voices were cloned, because we didn't want to bias them beforehand. At the end of the study we did tell them, okay, half of these sentences were cloned, and we asked them to do a little task to see how good they were at identifying the human voice: we presented them with a cloned version and a human version of the same sentence and asked them to say which one was human. They could do that in only 70% of cases. So the listeners were not that good at saying which ones were the clones. They were still better than I was, because I couldn't tell at all; I was completely at chance. My listeners were surprisingly good, I think, but still. None of my 80 listeners said, “Half of these were clones,” in the comments. I asked them, “Did you notice anything, you know, strange about this data?” and none of them said, “I knew these were clones.” No one knew that. They had no idea.

 

ASA Publications (12:33)

That's so interesting, yeah. Okay, so, sort of the big question, how did cloned speech end up comparing to human-produced speech?

 

Patti (12:41)

It was much more intelligible, across all levels. So we had those four noise levels. And, again, this was a surprise to us as well, because we thought, okay, because it is synthetic speech, it may take on some of those intelligibility advantages of synthetic speech, but maybe what works against that is the fact that we have different accents in there as well. So we didn't expect this. In some conditions, the cloned speech was up to 20% more intelligible. So if, for the human speech condition, listeners got, say, 65% correct, for the cloned speech they got 85% correct.

 

ASA Publications (13:15)

Interesting.

 

Patti (13:16)

Yeah, so a massive difference. And, yeah, without them noticing there was anything sort of weird.

 

ASA Publications (13:23)

Very interesting. How did accent affect intelligibility, since that's one of the things you said you expected would come through in the cloning, unlike standard synthetic speech? Did this affect listeners' perception of the intelligibility of the cloned speech?

 

Patti (13:37)

Well, we also asked them to rate how standard the accent was. Before we ran this bigger study, we ran a pilot with only 40 listeners, and we found the same result. When I found that initial pilot result, I thought, okay, where does this come from? One of the things we wanted to evaluate was whether the cloning maybe somehow strips away all these accent features and makes a more generic voice. And from our previous work, we know that voices with more standard accents are easier to understand and can show a similar intelligibility benefit.

 

So we thought, right, we're going to do this experiment properly now. We're going to ask people all these things afterwards: to rate the clarity and to rate how standard the accent of these sentences was. So in the real experiment that has now been published, 80 listeners were debriefed, had to say which one was the human sentence, and also judged both the clarity and the accentedness. First of all, we found that the cloned voices were rated as a little bit clearer than the human voices. But strangely enough, the cloned voices came out as more accented.

 

ASA Publications (14:50)

Interesting.

 

Patti (14:51)

And I have no idea why. Because, again, I'm supposed to be a trained listener, but I'm not that trained, not for British English. So if I listen to the sentences, I can't really hear whether, say, someone is from the Liverpool area, whether all the accent features are still present there.

 

One of my colleagues is a trained listener for British English, and she said, yes, there are clear accent differences here, and they are not fully reflected in the clones. I think the accent features may have just been a little bit muddled. This is nothing against ElevenLabs; I think it's just something that happened in the cloning process. We supplied training material to ElevenLabs to create those voices, something like three minutes of speech, and then we used that to generate 120 sentences in total, of which we used 80 in the experiment. And I think when you want to clone a voice perfectly, even three minutes may not be enough, because we cloned it with these very static Harvard sentences, these really boring sentences. Maybe we should have recorded the speakers in different kinds of conditions: some spontaneous speech, have them speak faster, slower, and so on. I think maybe the material we supplied to the cloning system wasn't enough to really reflect those accent features, and because it sounded a bit weird, people think, “Okay, this is weird,” and so weird becomes less standard.

 

A further thing that may explain why we didn't find the cloned sentences to be less accented is that maybe the accent synthesis of the voice cloning system wasn't as coherent. As a result, the non-standard accents in our voice data set were not replicated faithfully. We had some speakers with slightly northern English accents, slightly Liverpudlian or Scouse accents, and maybe not all of those features were replicated properly. And maybe people then interpret that as slightly less standard, because it just sounds slightly off, and maybe that slight off-ness makes it sound less standard. But again, I haven't done a very in-depth phonetic analysis by ear of these recordings, so I can't be sure of that.

 

 

ASA Publications (16:59)

Okay, okay, got it. You also looked at some acoustic features that may have impacted the perceived intelligibility of the cloned speech versus the human-produced speech. How did you do that and what did you find?

 

Patti (16:59)

Yeah, so now the acoustic analysis. You know, with a speech signal you can measure everything; you can do so many measurements. So I had to constrain a little bit what we did. I thought, okay, I'm going to look at some general acoustic features, things like harmonics-to-noise ratio, spectral slope, pitch, pitch variability, formants, vowel space, things like that, but I also added in something like 20 voice quality features. Because I had an inkling that what the voice cloning system did was smooth out the voice source in the final vocoding process, in which it creates the voice.

 

So in the end we measured something like 47 acoustic features, and we found some overall differences. Overall, the cloned speech had a bit more pitch variability, it had more stability in most of the voice source features, and it had less jitter and shimmer. So overall, the voice source was a bit smoother. The clones also had a better harmonics-to-noise ratio in the relevant frequency bands. Those were the main differences between the cloned speech and the human speech, and those are also features that have been associated with higher intelligibility in human listeners. However, we couldn't directly relate them to intelligibility. We did some correlations and found, more or less, that there was a bit of a different balance in which of those features were relevant for the intelligibility of human speech versus cloned speech. We couldn't really go any further than that.
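As an illustration of how features like these are commonly measured, here is a short sketch using parselmouth, a Python interface to Praat. The study doesn't specify this toolkit, and the file name and analysis settings below are assumptions; the snippet just shows one typical way to extract pitch variability, harmonics-to-noise ratio, jitter, and shimmer from a recording.

```python
# Illustrative sketch (not the authors' pipeline): measuring a few voice source
# features with parselmouth, the Python interface to Praat.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("cloned_sentence_01.wav")   # hypothetical file name

# Pitch variability: standard deviation of F0 over voiced frames
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                     # keep voiced frames only
pitch_sd = f0.std()

# Harmonics-to-noise ratio (overall mean, dB)
harmonicity = snd.to_harmonicity_cc()
hnr = call(harmonicity, "Get mean", 0, 0)

# Jitter and shimmer from a point process (standard Praat settings)
points = call(snd, "To PointProcess (periodic, cc)", 75, 500)
jitter = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, points], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

print(f"F0 SD: {pitch_sd:.1f} Hz, HNR: {hnr:.1f} dB, "
      f"jitter: {jitter:.4f}, shimmer: {shimmer:.4f}")
```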

 

We also did a principal component analysis followed by a linear discriminant analysis, to see if we could get the computer, the machine essentially, to identify which voice was human and which one was cloned based on those acoustic features. And that was fairly successful: where humans could identify the human speaker in 70% of cases, the machine could do it in 80% of cases. So we were beaten again there by the machine. So, yeah, that was what we found with the acoustic features.
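A rough sketch of that kind of classification analysis, using scikit-learn: PCA to reduce the roughly 47 features, followed by linear discriminant analysis to separate human from cloned recordings. The random feature matrix, the number of retained components, and the cross-validation scheme are placeholders, not the values from the paper.

```python
# Illustrative sketch: PCA followed by LDA to classify recordings as human vs. cloned.
# The feature matrix here is random placeholder data standing in for the ~47
# acoustic measurements per recording.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(160, 47))      # one row per recording, one column per feature
y = np.repeat([0, 1], 80)           # 0 = human original, 1 = voice clone

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=10),            # number of components is an assumption
                    LinearDiscriminantAnalysis())
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated human-vs-clone accuracy: {acc:.2f}")
```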

 

But I think until we have the capacity to directly manipulate features in that cloned speech, we don't really know what helps those cloned voices to be more intelligible. And this is because we worked with this commercial software, so we don't have any way of manipulating how those voices come out. We can't do anything to maybe make the voices even more intelligible.

 

We had 10 voices in this experiment, and the voices were intrinsically very different in intelligibility. For some of them, the clones were only about five percent more intelligible, and for others about 23 percent more intelligible. Some voices improved a lot. They were all more or less raised to the same level of intelligibility, so they were sort of equalized in terms of intelligibility. But how that was done, I think I need to do a lot more research to see what exactly it is that makes these cloned voices more intelligible.

 

ASA Publications (20:14)

Right. Well, good segue. What are the next steps for the research?

 

Patti (20:17)

Well, as I already mentioned, it would be great if we had a bit more control over the cloning system. So I'm hoping to collaborate with some people who are much better at doing speech synthesis. And what we're going to do then is look at some open-source cloning systems. I think there's a new system being published every day; there's a lot of proliferation in free, open-source cloning systems that should allow us to do a little bit more manipulation of the voice source and of the vocoding process.

 

So far we've thought, okay, we want to manipulate, but until we can do that, we've done some other things. The first thing we did was look at an older listener group. We said, okay, this intelligibility benefit could potentially be really interesting for older listeners. I replicated this experiment in people between 45 and 65, so more middle-aged really, because I did this online and there weren't that many people over 65 there. We tested 40 people, and we replicated the effects exactly in all the listeners.

 

The other thing we did: we wanted to know, since this cloning benefit seems to rely to a large degree on the smoothness of the voice source, on the harmonic structure, and on fine spectral detail, whether the benefit would still be there if we removed most of that. We did that by noise vocoding the cloned voices, which essentially replaces all harmonic and fine spectral information with noise bands. So we repeated this experiment with 80 listeners using six-band noise-vocoded speech, and we evaluated, first of all, whether the intelligibility benefit was still there afterwards, and second, whether people would be able to adapt to cloned vocoded speech. People are known to adapt fairly quickly to noise-vocoded speech and normally show something like 15 or 20 percent improvement over the course of, say, 40 trials. We did the experiment and found exactly the same intelligibility benefit again, with 14 percent higher accuracy for the cloned speech, and we also found that listeners adapt to the human and the cloned speech on equal terms. So again, that's really promising, because it essentially means that maybe you can noise-vocode cloned speech and use that to train people who are going to receive, say, a cochlear implant, so they are better prepared, without having to use all this speech material from real human speakers; you can use cloned speech instead.
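To make the vocoding step concrete, here is a minimal six-band noise vocoder sketch: the signal is split into frequency bands, the amplitude envelope of each band is extracted, and those envelopes are used to modulate band-limited noise. The band edges, filter order, and envelope extraction below are simplified illustrations, not the exact processing used in the follow-up experiment.

```python
# Minimal noise vocoder sketch (simplified; not the authors' exact processing).
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=6, f_lo=100.0, f_hi=7000.0):
    """Replace harmonic/fine spectral detail with noise while keeping band envelopes."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges (a simplification)
    noise = np.random.randn(len(x))
    out = np.zeros_like(x)
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                 # amplitude envelope of the speech band
        carrier = sosfiltfilt(sos, noise)           # noise carrier filtered to the same band
        out += env * carrier
    return out / np.max(np.abs(out))

x, fs = sf.read("cloned_sentence_01.wav")           # hypothetical file name
sf.write("cloned_sentence_01_vocoded.wav", noise_vocode(x, fs), fs)
```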

 

Oh, and the final thing we did is a pilot with 40 listeners using an open-source cloning system, and we replicated the effect again. So it seems to be fairly robust, this effect. However, I should say that there is one other paper that has looked at this as well, and they didn't really find the effect, probably because they didn't use enough training material, I think. And also, I think maybe we used speakers who were already really highly intelligible. What we need to do next is also look at more typical speakers, people who are not voice actors, and see if this effect still remains.

 

ASA Publications (23:17)

Yeah, yeah, that's interesting too. So it is kind of surprising that cloned speech ended up being more understandable than human speech. How do you think this finding might impact future uses of cloned speech?

 

Patti (23:28)

I think it may make all those potential applications I talked about before more feasible, right? Because before, voice cloning systems were tested by the people who make voice cloning systems. They test the voices they produce, but they tend to do that automatically, with automatic speech recognition systems and objective noise and intelligibility measures; they don't tend to do the kind of human benchmarking we do. So now we've done that. Now we know that cloned speech from both ElevenLabs and from the open-source F5-TTS system is intelligible, so it's not specific to one system. That essentially means you can start thinking about actual applications using this kind of speech. But the next step needs to be looking at a wider range of speakers before you can really do that. I think the future is looking a bit brighter in this sense, because those voices are more intelligible, and that effect remains across platforms and across different types of speech.

 

ASA Publications (24:30)

It is very exciting, you know, that these cloned voices could be more robust, I guess, for use in all the various tasks that you mentioned before.

 

What kind of ethical considerations do you have to make when using cloned voices in research?

 

Patti (24:48)

The dataset that we used for this study was a set of recordings that we took from an openly available database, so we didn't have to deal with ethics at that stage. But if we're recording new speakers, then we always give people several levels of consent to choose from. If they don't consent to their voice being cloned, we don't clone it; we never clone a voice without explicit consent. And we should be aware, especially when working with voice actors, that we're dealing with people's livelihoods, so we need to be very careful about that, and we try to be very sensitive to it when recording new voices.

 

ASA Publications (25:27)

How do you think advances in cloned voices and intelligibility could impact the research field?

 

Patti (25:32)

One of the things that would be really exciting, I think, is if we could generate essentially limitless speech material to use in our experiments. Whenever we're running an experiment now, we are limited by the databases we have. In London we have some databases and sets of recordings, but it's never exactly what we need. And sometimes, especially if you want to look at, say, the perception of different accents or different dialects, it would be useful if you had the same speaker speaking different dialects or accents, right? Then you can use what's known as a matched-guise design, where you have the same speaker produce two accents. There are people who can speak two accents, but they're very rare. So next time you run an experiment on accent perception, it would be great if you could use the same speaker speaking in a French accent, but also maybe in a Spanish accent, or a Glaswegian accent, and then compare the intelligibility of those accents across that one speaker without having to use different speakers. It would make a lot of things like that much easier.

 

I maybe have a very narrow view, where I think about the kinds of things I would do for accent perception and the barriers I've run into before. But it would be good if we could also do things like age a speaker, produce an older version of a speaker and see how that, again, affects intelligibility, and also run more social and sociolinguistic experiments. Like, you take the same speaker, you age the speaker, and then you see, okay, do stereotypical attitudes towards that speaker change? You can use that kind of thing as well. So you can ask more psychological questions if you're more flexible in what you can do with the voices. Again, I haven't really thought this through too much, but I think just the capacity to generate never-ending speech samples would be great.

 

A final thing where cloned voices could be useful to the field, especially in assistive technologies, is combining them with the finding that familiar voices are much more intelligible in challenging listening conditions. I think I've already touched on this: one advantage of using clones is that you can clone people who are familiar to you. And maybe we can embed that in future technology, for instance in personalized hearing tests, by calibrating a hearing aid using familiar cloned voices, or maybe even personalized therapy with voice avatars. I think there are lots of really cool options and new avenues we can take in the field.

 

 

ASA Publications (28:08)

Well, it is exciting to think about some of the positive applications of voice cloning technology. And hopefully it will open new avenues for enhancing accessibility, or as you mentioned, you know, giving you more options for research methods. I really appreciate you taking the time to speak with me today, and I wish you the best of luck in your future research.

 

Patti (28:27)

Thank you, it was great to talk to you. Really enjoyed it.