Amazon at SLT: The fusion of speech and language understanding

Amazon principal applied scientist Yang Liu on the frontiers of speech and dialogue.

January 21, 2021

The 2020 installment of IEEE’s Spoken Language Technologies (SLT) workshop is being held this week, after postponement from its original date. A biennial conference, SLT has convened only seven times previously, beginning in 2006.

Yang Liu, a principal applied scientist in the Alexa AI organization, published her first papers at the conference in 2008, when she was an assistant professor at the University of Texas at Dallas. In 2012, she was one of the conference’s general chairs, and this year, she’s the chair for special sessions and demos.

Yang Liu, a principal applied scientist in the Alexa AI organization.

“When this workshop was created — from the name you probably can tell — it was meant to bring two communities — the pure speech and signal-processing community and the traditional natural-language-processing community — together to discuss some applications of spoken-language understanding or processing,” Liu says. “There are a lot of these, like summarization of speech, retrieval of speech, and speech translation. It's not just speech recognition or speech synthesis. Once you have the speech recognition output, you probably will perform some kind of language understanding.”

At the time, such applications were largely speculative, but with the launch of the Amazon Echo in 2014, they became mainstream. As voice agents have proliferated, and research on natural-language understanding has grown, SLT has become more notable for its emphasis on speech technologies.

At this year’s SLT, Liu led the selection of two special sessions — special research tracks organized around particular themes. Both concern speech technologies.

Frontiers of speech

One session is on a topic that’s important to Alexa: more-natural conversational speech interactions. Last fall, Alexa announced its forthcoming natural-turn-taking feature, which will enable customers to engage in longer, multiturn interactions with Alexa, without repeating the wake word “Alexa”. The feature will also support conversations with multiple customers simultaneously, distinguishing remarks they direct to each other from instructions directed to Alexa.

The SLT special session titled “Integration of Speech Separation, Recognition, and Diarization towards Real Conversation Processing” will investigate a related set of topics.

“It's trying to integrate different technologies, including speech separation, speech recognition, and speaker diarization,” Liu says. (Speaker diarization is grouping together utterances from the same speakers in multi-speaker interactions.) “When you try to deal with multiparty conversations, then you need all of the relevant technology. It's not like each speech segment is from just one speaker. In real-world applications, you need to separate the different speakers, and you don't know how many speakers there are. And there are different background noises. There are different challenges for all of these tasks.”

The other special session is titled “Anti-Spoofing in Speaker Recognition”.

“This is speaker recognition and verification for security applications,” Liu explains. “You also need to consider adversarial attacks. You probably have seen ‘deep fakes’ — generated images or video or people that look pretty real. In speech, it’s the same thing. When you work on speaker verification, you think this is a real speaker, but maybe it's generated by machines.”

Frontiers of dialogue

While the special sessions that Liu helped choose for SLT concern speech, her work at Amazon concentrates on the other half of the spoken-language-understanding equation, natural-language understanding. In particular, she works on dialogue.

“We can put dialogue in two broad categories,” Liu says. “One is task-oriented conversations. These are where users have some goal: they want to make a hotel reservation or book a flight or make restaurant reservations. You need to detect the users’ intent and find some relevant slots, entities. The other is open-domain conversation. The Alexa Prize is the competition we organize for university teams to build these so-called socialbots. You want users to engage in conversation with these bots for hopefully up to 20 minutes.

“The traditional way to do that is to prepare templates for different domains — movie, music, book, fashion. Then we can design different dialogue flows and provide different responses based on what the user says. In our new approach, we use neural networks to generate the responses, trying to avoid those handwritten or predefined template-based responses.

“We also work on combining these task-oriented and open-domain conversations. Say the user wants to book a flight. Typically, these systems are going to ask, What's your destination? Where are you departing from? What day? What time? But in the middle of the conversation, maybe users have some additional questions. ‘Do I need to wear a mask right now on flights?’ Such questions are not covered by those predefined questions from the agent. We want the system to be able to answer these questions. You probably can find the answers from some external FAQ page or from other external resources. So we're trying to enrich the task-oriented conversation with the ability to answer any kind of question with answers based on external knowledge sources.”

Even in the dialogue context, however, information about the acoustic speech signal is crucial, Liu says.

“We want to build an empathetic dialogue system, and to understand the users’ sentiment. Acoustic information is important for such tasks. Rather than just looking at what the user has said, you also want to pay attention to the user’s tone,” Liu explains. “Even for Alexa to decide when to take the turn: has the user finished that utterance? This is very challenging in socialbot open-domain conversations. We can use different cues based on both what the person is saying as well as intonation — when I’m using a rising tone, probably indicating that I haven't finished my sentence, that I'm trying to think of what the next word is. Or sometimes the person says ‘um’ just trying to hold the floor.

“People do these things very naturally and do them well. For machines, we still have a long way to go.”

About the Author

Larry Hardesty

Larry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.
