Voice cloning: anyone tried it?

jan · May 19, 2025, 6:30pm

Hello everyone!

So @cogdog and I were talking about potential uses of AI voice cloning and how foreign-language content can be dubbed into English and other languages using people’s own cloned voices, with separate tracks for host and guest just like the Zelensky–Fridman video (skip around and switch the audio track in the YouTube player settings). This tech is fascinating!

Has anyone tried ElevenLabs or similar tools to dub podcasts or lectures into English (or your local language) with cloned voices or use AI for any other voice stuff? If so, what did you build or do? What tools did you use? What were people’s reactions to what you created? If you have tech tips, pitfalls, or have discovered best practices, do share!

Peter · May 19, 2025, 7:36pm

If I understand your question Jan, I came across this use case last week as I was adding some resources to OER Commons related to The American Library Association’s conference last year. Michael Hanegan (cc’ed), one of the two authors of this paper used ElevenLabs to clone his voice and narrate Artificial Intelligence and the Future of Theological Education (Version 2.0) . The paper has an embedding to the recording on soundcloud, which I thought was a useful use case and accessibility feature!

jan · May 20, 2025, 3:30am

That’s a brilliant use of AI-generated voices, @Peter – thanks for sharing!

It’s funny how this flips the usual approach to accessibility on its head: we used to start with audio and transcribe it into text and now we begin with text and generate audio from it. Because AI-generated human voices are finally good enough, rather than too annoying.

I really like how Michael Hanegan also uses Google’s NotebookLM – super handy for people who prefer listening over reading. Some commuters just drop in documents, turn them into audio, and listen while they drive / ride a train, so I used to think of NotebookLM as a personal tool (something you’d use just for yourself). But its use in education (e.g. sharing AI-generated podcasts) totally makes sense.

Do you think universities and schools are catching on to the potential of AI-generated voice, or is it still early days? What do you see at ISKME/OER Commons?

megolosk · May 20, 2025, 10:35am

Hello Jan

I’m looking forward to trying Resemble AI speech-to-speech " Professional Voice Cloning" product. Apparently, ten minutes of recorded speech will generate a voice clone virtually indistinguishable from the original speaker, and the end product (your voice clone) includes 149+ supported languages. I’m interested to see (or ‘hear’) how the product addresses syntactic structure, but it sounds intriguing. Good luck, and do let us know what you discover! Best, Meg

jan · May 20, 2025, 12:27pm

Oh, your experiment sounds like a lot of fun, Megan!

What are you going to use Resemble AI for? Is that a research / curiosity thing or do you have something else in mind?

Let us know what you learn along the way!

PS: This is a pretty cool case study I found. I’m a bit ambivalent about it because if this tech becomes too ubiquitous, it could eliminate the need to learn languages for most people, but I surely do like the idea of being able to watch (local) news from anywhere in the world.

megolosk · May 20, 2025, 1:21pm

Nothing as exalted as research, Jan. My little organization provides low- or no-cost assistance to struggling nonprofits and social enterprises. We often use “pitch decks”, voiced by us, to provide training presentations in best practices, grant application guidelines, basic accountancy, etc. It would be nice to be able to “attach”(?) a voice over that is warm and “human” in another language, thus making our services available to a broader demographic.

When I do this, I plan to test the first “translation” in a language I already speak fluently, just to see what the result is. If it’s good, really good, it would make our services exportable much more quickly. As it stands, with the exceptions of French and Spanish (which I speak), we need to enlist (at a price) the assistance of voice talent.

I’ll keep you posted, and thank you for the encouraging words.

Peter · May 20, 2025, 2:54pm

We’ve seen a good amount of chatter about the potential of AI tools for accessibility, and text to speech seems like a low-hanging fruit for that – a lot of the models that have come out in the past year or two do a much better job of making the audio versions readings a lot more palatable versus a lot of the machine generated speech tools we’ve seen up to now. I’ve also seen one or two NotebookLM-generated podcasts get added to OER Commons as OER themselves, although I’ve had to reach out to some of the submitters to ask them to also include the source material they used when generating the podcasts. It’s been a while though, and I think that’s partly to do with the fact that creating the podcast has lost its novelty (at least here in North America) for some folks.

cogdog · May 20, 2025, 4:22pm

Thanks for opening this up, Jan. If you listen to the video segment of Lex Fridman’s intro, he explains that teh recording was done mostly in Russian languages for which he and Zelenskyy are fluent, with mixed in parts in English and Ukranian. The time lag for live translations, to him, got in the way.

If I understand this then, they took the recording, and used AI/LLM to transcribe the audio, and then dod voice generation on the transcripts. I am not clear if the transcription source fed to ElevenLabs was a compilation into all one language.

Jan was responding to me team message about an upcoming podcast recording with one of our colleagues in Japan. I had suggested to our guest that I can send the questions ahead and ask in English, and allow them to respond in Japanese, where ideally they would be best able to communicate their ideas.

This was something we did previously in OEGlobal Voices Episode 68 where María Soledad Ramírez Montoya listened to my questions in English and responded in Spanish.

Our published audio us in mixed language, but I had the Descript software we use do separate transcripts in Spanish and English, which were then combined back into English and Spanish transcripts. I would like to think the mixed language version is worth listening if one is fluent in English and Spanish, and for someone like me, I’d be happy to listen to Marisols voice as I follow along with a translated transcript.

I am curious to explore Resemble AI, thanks @megolosk! It still looks like steps are needed to extract from a mixed language recording a single language transcript, right? Or are there transcription tools that can generate single language transcripts from mixed sources?

I will have to look for some tool, as I note that Descript does not transcribe from Japanese.

Indeed there is much of interest and value to provide content in multi languages, especially for regionally specific ones, as @danmcguire has been doing for his projects in Africa.

Are we nearing the Universal Translator? Beam one to me

cogdog · May 20, 2025, 4:53pm

I could not help but to explore. I did a voice training with Resemble.ai, which yielded this clip of AI Alan.

Actually, this is rather Alan sounding! Of course I can’t do more until I $$ up.

Just for fun, I copied the first two paragraphs of my reply above (including typos, sigh), and put into my Descript editor that has been trained in my voice. I can’t say I like it much, in this link you can first hear my Desript generated voice followed by real me recording the same as if I had spoken.

But I can see the potential here!

megolosk · May 20, 2025, 11:27pm

Thanks for doing that, cogdog! Full marks! But agreed, as the $$ tiers are by the second, one has to think carefully before moving forward…

jan · May 22, 2025, 5:07am

And now we’ll have voice cloning in ~real time!

Over the years, we’ve also been creating much more immersive experiences in Google Meet. That includes technology that’s helping people break down language barriers with speech translation, coming to Google Meet. In near real time, it can match the speaker’s voice and tone, and even their expressions — bringing us closer to natural and free-flowing conversation across languages.

(From the Google I/O 2025 keynote, emphasis mine)

vahidm · May 27, 2025, 2:06pm

My son thoroughly enjoys the Google’s Notebooks LM podcast generation tool where 2 voices summarize/comment uploaded materials, and i’ve been very surprised at how human they sound and how relevant the chatter they have is to the content that was uploaded.

It opens the doors to multimodal learning like i’ve never seen before, in the sense that you start with your run-of-the-mill PDF, but then it changes it to voice. And i guess we’re maybe 1-2 years to generating pertinent videos? (especially looking at the all-too-credible Veo3 Google AI-generated videos that are taking over youtube for those paying attention).

vahidm · May 27, 2025, 2:11pm

But back to your original post, @jan i personally wouldn’t promote “giving” your voice to a voice cloning corporation, for the same motive that you wouldn’t want to give your picture to an AI company that might use it (or resell it) for less-than-virtuous purposes.

But then i remember that i’ve posted my pictures all over social networks, and shared videos of me talking on youtube. Oh well.

cogdog · May 28, 2025, 9:33pm

Agreeing there Vahid, it’s always important to ask those questions before diving into the chat box. But really, I wonder how often people really take that step? The allure of the machine is hard to step back from.

And the problems of scraping were there long before LLMs. My photos shared in flickr got quite a bit of runs in catfishing scams and I found out my voice from recorded videos was used as well in those crude scams.

Yes, like you, back in the day of sharing was absolutely naive. Would I change anything if I knew? Impossible to say. What I gained was always more than some perceived loss of my digital stuff. The digits are not of value to me, it’s the experiences and memories.

Also, my years of publishing like 70k flickr photos in public domain means I see them all the time in other web sites offering photos (and ad clicks). I have seen them for sale on Alamy. I actually do not care. I never did take or share photos to get money for it. Because for almost every bad thing, I have had the most glorious joyful things happen when I hear from an author, musician, business owner who thanks me for a photo and lets me know where it has been used.

Okay, I am afar outlier of most people.

Frankly that will always be there. In any pool of human society, you are a bit off to think that its not possible for someone to do bad with your stuff. The only real protection is keeping stuff offline. The ying-yang of good/evil is always there.

And oops (see @Delmar I cannot be succinct!) I also think of the way we think of our photos and voices being used inside the machines as copies of the original. They are all completely disaggregated from the whole, and are just matrices of vectors and numbers. You will never know if your own photo was somehow used to create a fake news story, because it is just a single bit of data in a massive vat of slop.

It’s just gonna get more wild, eh?