re:MARS revisited: Advance diversity and inclusion in voice AI with speech disentanglement

Learn how Amazon uses machine-learning techniques to modify different aspects of speech — tone, phrasing, intonation, expressiveness, and accent — to create unique Alexa responses.

By Staff writer

March 13, 2023

In June 2022, Amazon re:MARS, the company’s in-person event that explores advancements and practical applications within machine learning, automation, robotics, and space (MARS), took place in Las Vegas. The event brought together thought leaders and technical experts building the future of artificial intelligence and machine learning, and included keynote talks, innovation spotlights, and a series of breakout-session talks.

Now, in our re:MARS revisited series, Amazon Science is taking a look back at some of the keynotes and breakout-session talks from the conference. We asked presenters three questions about their talks, and we provide the full video of each presentation.

On June 24, Ewa Kolczyk, senior software development manager with Amazon Web Services (AWS), and Kayoko Yanagisawa, senior speech scientist in Alexa, presented their talk, "Advance diversity and inclusion in voice AI with speech disentanglement." Their presentation focused on speech disentanglement and how Amazon uses this technique to influence different aspects of speech (tone, phrasing, intonation, expressiveness, and accent) to create unique Alexa responses.

What was the central theme of your presentation?

In this presentation, we talked about how we use machine learning (ML) techniques in text-to-speech (TTS) to improve diversity, equity, and inclusion (DEI), so that Alexa’s responses work optimally for everyone. We use speech disentanglement techniques to separate the different aspects of speech, such as language, accent, age, gender, and emotion, so that we can modify them independently. This lets us create voices that speak multiple languages or accents, or new voices of any gender, age, or accent. We also talked about Alexa’s preferred speaking rate feature and whisper mode, which help customers with various needs.

In what applications do you expect this work to have the biggest impact?

Customers of speech products such as voice AI (Alexa), interactive voice response, or IVR (Amazon Connect), and Amazon Polly will be able to easily enhance their portfolios with a diverse range of TTS voices that speak different accents or languages, have different speaker characteristics (gender, age), or use different styles, to suit the needs of their global customer base.

What are the key points you hope audiences take away from your talk?

We can use ML techniques to modify various aspects of speech and to improve the diversity and style of TTS voices, thereby addressing the needs of various customers.