Voice cloning: anyone tried it?

Hello everyone!

So @cogdog and I were talking about potential uses of AI voice cloning and how foreign-language content can be dubbed into English and other languages using people’s own cloned voices, with separate tracks for host and guest just like the Zelenskyy–Fridman video (skip around and switch the audio track in the YouTube player settings). This tech is fascinating!

Has anyone tried ElevenLabs or similar tools to dub podcasts or lectures into English (or your local language) with cloned voices or use AI for any other voice stuff? If so, what did you build or do? What tools did you use? What were people’s reactions to what you created? If you have tech tips, pitfalls, or have discovered best practices, do share! :wink:
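
For the curious, ElevenLabs exposes its dubbing feature as an API as well as a web UI. Below is a rough Python sketch of what a call might look like; the endpoint and field names reflect my reading of their docs and may have changed, so treat them as assumptions and check the current API reference before relying on this.

```python
# Rough sketch only: dubbing a recording via ElevenLabs' REST API.
# Endpoint and field names are my reading of the docs at the time of
# writing and may have changed -- verify against the current API reference.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]

def start_dub(path: str, target_lang: str, num_speakers: int = 2) -> str:
    """Upload a recording for dubbing and return the dubbing project id."""
    with open(path, "rb") as audio:
        resp = requests.post(
            "https://api.elevenlabs.io/v1/dubbing",
            headers={"xi-api-key": API_KEY},
            data={
                "target_lang": target_lang,        # e.g. "en"
                "num_speakers": str(num_speakers), # host + guest
            },
            files={"file": audio},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["dubbing_id"]

# Hypothetical file name; you then poll the project until it finishes
# and download the dubbed track for the language you asked for.
dubbing_id = start_dub("episode.mp3", "en")
print("Dubbing project:", dubbing_id)
```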

1 Like

If I understand your question, Jan: I came across this use case last week as I was adding some resources to OER Commons related to the American Library Association’s conference last year. Michael Hanegan (cc’ed), one of the two authors of this paper, used ElevenLabs to clone his voice and narrate Artificial Intelligence and the Future of Theological Education (Version 2.0). The paper embeds the SoundCloud recording, which I thought was a useful use case and accessibility feature!

2 Likes

That’s a brilliant use of AI-generated voices, @Peter – thanks for sharing!

It’s funny how this flips the usual approach to accessibility on its head: we used to start with audio and transcribe it into text, and now we begin with text and generate audio from it. :smile: That’s because AI-generated human voices are finally good enough to listen to, rather than too annoying.

I really like how Michael Hanegan also uses Google’s NotebookLM – super handy for people who prefer listening over reading. Some commuters just drop in documents, turn them into audio, and listen while they drive or ride the train, so I used to think of NotebookLM as a personal tool (something you’d use just for yourself). But its use in education (e.g. sharing AI-generated podcasts) totally makes sense.

Do you think universities and schools are catching on to the potential of AI-generated voice, or is it still early days? What do you see at ISKME/OER Commons? :slight_smile:

Hello Jan

I’m looking forward to trying Resemble AI’s speech-to-speech “Professional Voice Cloning” product. Apparently, ten minutes of recorded speech will generate a voice clone virtually indistinguishable from the original speaker, and the end product (your voice clone) supports 149+ languages. I’m interested to see (or ‘hear’) how the product handles syntactic structure, but it sounds intriguing. Good luck, and do let us know what you discover! Best, Meg

Oh, your experiment sounds like a lot of fun, Megan! :star_struck:

What are you going to use Resemble AI for? Is that a research / curiosity thing or do you have something else in mind?

Let us know what you learn along the way!

PS: This is a pretty cool case study I found. I’m a bit ambivalent about it because if this tech becomes too ubiquitous, it could eliminate the need to learn languages for most people, but I surely do like the idea of being able to watch (local) news from anywhere in the world.

Nothing as exalted as research, Jan. My little organization provides low- or no-cost assistance to struggling nonprofits and social enterprises. We often use “pitch decks”, voiced by us, to provide training presentations on best practices, grant application guidelines, basic accountancy, etc. It would be nice to be able to “attach” (?) a voice-over that is warm and “human” in another language, thus making our services available to a broader demographic.

When I do this, I plan to test the first “translation” in a language I already speak fluently, just to see what the result is. If it’s good, really good, it would make our services exportable much more quickly. As it stands, with the exception of French and Spanish (which I speak), we need to enlist (at a price) the assistance of voice talent.

I’ll keep you posted, and thank you for the encouraging words.

1 Like

We’ve seen a good amount of chatter about the potential of AI tools for accessibility, and text-to-speech seems like low-hanging fruit there: a lot of the models that have come out in the past year or two do a much better job of making audio versions of readings palatable compared with the machine-generated speech tools we’ve seen up to now. I’ve also seen one or two NotebookLM-generated podcasts get added to OER Commons as OER themselves, although I’ve had to reach out to some of the submitters to ask them to also include the source material they used when generating the podcasts. It’s been a while since the last one came in, though, and I think that’s partly because creating these podcasts has lost its novelty (at least here in North America) for some folks.

1 Like

Thanks for opening this up, Jan. If you listen to Lex Fridman’s intro segment of the video, he explains that the recording was done mostly in Russian, a language in which both he and Zelenskyy are fluent, with parts in English and Ukrainian mixed in. To him, the time lag of live translation got in the way.

If I understand this, then, they took the recording, used AI/LLM tools to transcribe the audio, and then did voice generation from the transcripts. I am not clear whether the transcript fed to ElevenLabs was first compiled into a single language.
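
If I were to reconstruct that pipeline with open tools, the transcription half might look something like this, using OpenAI’s open-source Whisper model (a guess at the general approach, not necessarily what was actually done; the file name is made up):

```python
# Sketch of the "transcribe" half of the pipeline with the open-source
# whisper package (pip install openai-whisper; requires ffmpeg).
# This reconstructs the general approach only; the Fridman team's
# actual workflow is unknown.
import whisper

model = whisper.load_model("medium")  # bigger models handle non-English better

# Whisper auto-detects the language of the recording. Note: it detects
# one language per file, so heavily mixed-language audio may need to be
# chunked per speaker or per segment first.
result = model.transcribe("interview.mp3")
print("Detected language:", result["language"])

# Segment timestamps matter downstream, for lining the generated
# speech back up with the original video.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}s - {seg['end']:7.2f}s] {seg['text']}")
```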

Jan was responding to my team message about an upcoming podcast recording with one of our colleagues in Japan. I had suggested to our guest that I could send the questions ahead and ask them in English, and allow them to respond in Japanese, where ideally they would be best able to communicate their ideas.

This was something we did previously in OEGlobal Voices Episode 68 where María Soledad Ramírez Montoya listened to my questions in English and responded in Spanish.

Our published audio is in mixed languages, but I had the Descript software we use produce separate transcripts in Spanish and English, which were then combined back into full English and Spanish transcripts. I would like to think the mixed-language version is worth listening to if one is fluent in both English and Spanish, and for someone like me, I’d be happy to listen to Marisol’s voice as I follow along with a translated transcript.

I am curious to explore Resemble AI, thanks @megolosk! It still looks like extra steps are needed to extract a single-language transcript from a mixed-language recording, right? Or are there transcription tools that can generate single-language transcripts from mixed sources?

I will have to look for another tool, as I note that Descript does not transcribe Japanese.
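
That said, the open-source Whisper model mentioned above may partly answer my own question: its translate task emits an English transcript no matter which language is being spoken, and it does handle Japanese. A minimal sketch (my untested suggestion, not a Descript feature; the file names are made up):

```python
# Sketch: a single English transcript from a mixed-language recording,
# using open-source Whisper (pip install openai-whisper; requires ffmpeg).
# task="translate" makes Whisper output English regardless of the
# spoken language -- including Japanese, which Descript lacks.
import whisper

model = whisper.load_model("medium")

# One pass, one English transcript, even if the audio alternates
# between English questions and Japanese answers.
result = model.transcribe("podcast_episode.mp3", task="translate")

with open("episode_en.txt", "w", encoding="utf-8") as out:
    out.write(result["text"])
```

One limitation: Whisper’s translate task only targets English, so going the other way (say, a single Japanese transcript) would need a separate translation step.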

Indeed, there is much interest and value in providing content in multiple languages, especially regionally specific ones, as @danmcguire has been doing for his projects in Africa.

Are we nearing the Universal Translator? Beam one to me

1 Like

I could not help but explore. I did a voice training with Resemble.ai, which yielded this clip of AI Alan.

Actually, this is rather Alan sounding! Of course I can’t do more until I $$ up.

Just for fun, I copied the first two paragraphs of my reply above (including typos, sigh) and put them into my Descript editor, which has been trained on my voice. I can’t say I like it much; at this link you can first hear my Descript-generated voice, followed by the real me recording the same text as if I had spoken it.

But I can see the potential here!

1 Like

Thanks for doing that, cogdog! Full marks! But agreed, since the $$ tiers are priced by the second, one has to think carefully before moving forward… :pensive_face:

And now we’ll have voice cloning in ~real time! :slight_smile:

Over the years, we’ve also been creating much more immersive experiences in Google Meet. That includes technology that’s helping people break down language barriers with speech translation, coming to Google Meet. In near real time, it can match the speaker’s voice and tone, and even their expressions — bringing us closer to natural and free-flowing conversation across languages.

(From the Google I/O 2025 keynote, emphasis mine)