Conversational AI

Amazon Unveils Novel Alexa Dialog Modeling for Natural, Cross-Skill Conversations

By Alexa Science Team

June 5, 2019


Today, customer exchanges with Alexa are generally either one-shot requests, like “Alexa, what’s the weather?”, or interactions that require multiple requests to complete more complex tasks.

An Alexa customer planning a family movie night out, for example, must interact independently with multiple skills to find a list of local theaters playing a particular movie, identify a restaurant near one of them, and then purchase movie tickets, book a table, and perhaps order a ride. The cognitive burden of carrying information across skills — such as time, number of people, and location — rests with the customer.

“We envision a world where customers will converse more naturally with Alexa: seamlessly transitioning between skills, asking questions, making choices, and speaking the same way they would with a friend, family member, or co-worker,” says Rohit Prasad, Alexa vice president and head scientist. “Our objective is to shift the cognitive burden from the customer to Alexa.”

At Amazon’s re:MARS conference in Las Vegas, tomorrow’s vision arrived today when Prasad previewed a conversational night-out-planning experience. In the coming months, this and other multi-skill experiences will roll out to Alexa customers, initially in the U.S.

The night-out-planning experience requires Alexa to, among other things, resolve ambiguous references (“Are there any Italian restaurants nearby?”) and dynamically transition from one skill to another while preserving context (remembering the location of the movie theater in order to find close-by restaurants).

Enabling this new experience is a set of AI modules that work together to generate responses to customers’ questions and requests. With every round of dialog, the system produces a vector — a fixed-length string of numbers — that represents the context and the semantic content of the conversation.

With each update of the vector, an end-to-end, deep-learned conversational model generates a list of candidate actions that Alexa could take in response. Then, where necessary, the system fills in the candidate actions with specific values, such as movie times or restaurant names. The system then scores the actions and executes the one with the highest score.
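The propose, fill, score, and execute loop described above can be sketched in a few lines of Python. This is a toy illustration only: the action names, the hand-set weights, and the dot-product scoring are assumptions made for the example and stand in for the deep-learned model, whose internals the article does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateAction:
    name: str                                   # hypothetical action label
    slots: dict = field(default_factory=dict)   # slot name -> value (None if unfilled)
    score: float = 0.0


def propose_actions():
    """Stand-in for the conversational model: return unfilled candidate actions."""
    return [
        CandidateAction("list_showtimes", {"movie": None}),
        CandidateAction("find_restaurants", {"near": None}),
    ]


def fill_slots(action, known_values):
    """Fill each slot with a value already known from the conversation, if any."""
    filled = {slot: known_values.get(slot) for slot in action.slots}
    return CandidateAction(action.name, filled)


def score_action(action, context, weights):
    """Toy score: dot product with the context vector, minus one per empty slot."""
    relevance = sum(c * w for c, w in zip(context, weights[action.name]))
    missing = sum(1 for value in action.slots.values() if value is None)
    return relevance - missing


def choose_action(context, known_values, weights):
    """Fill and score every candidate, then return the highest-scoring one."""
    candidates = [fill_slots(a, known_values) for a in propose_actions()]
    for candidate in candidates:
        candidate.score = score_action(candidate, context, weights)
    return max(candidates, key=lambda c: c.score)


# Assumed two-dimensional context vector and per-action weights.
context = [1.0, 0.0]
weights = {"list_showtimes": [1.0, 0.0], "find_restaurants": [0.0, 1.0]}
best = choose_action(context, {"movie": "Example Movie"}, weights)
print(best.name, best.slots, best.score)
```

In this sketch the restaurant action loses both because the context vector points away from it and because its only slot is still empty.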

This new approach to multi-turn dialog also includes a separate AI module whose task is to decide when to switch between different skills — which questions to pass to the restaurant search skill, for instance, and which to pass to the movie skill. Those switches can be reactive — as when a customer says, “Get me an Uber” after hearing a list of movie times — or proactive — as when Alexa follows up a movie ticket purchase request with the question “Should I book the tickets?”
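To make the reactive/proactive distinction concrete, here is a minimal rule-based sketch. The article describes a learned module, so the keyword table and the follow-up table below are purely hypothetical placeholders for what that module would predict.

```python
import re

# Hypothetical keyword table; a learned model would replace this lookup.
SKILL_KEYWORDS = {
    "ride": {"uber", "ride"},
    "restaurant": {"restaurant", "restaurants", "table"},
    "movie": {"movie", "showtimes", "tickets"},
}

# Hypothetical follow-up table: which skill to offer once a task is done.
FOLLOW_UPS = {"movie": "restaurant", "restaurant": "ride"}


def reactive_switch(current_skill, utterance):
    """Switch skills only because the customer's request points elsewhere."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for skill, keywords in SKILL_KEYWORDS.items():
        if words & keywords:
            return skill
    return current_skill  # nothing matched, so stay in the current skill


def proactive_switch(current_skill, task_done):
    """Offer the next skill without being asked, once the current task is done."""
    if task_done:
        return FOLLOW_UPS.get(current_skill)
    return None


print(reactive_switch("movie", "Get me an Uber"))
print(proactive_switch("movie", task_done=True))
```

Matching on whole words rather than substrings avoids accidental hits, such as a short keyword appearing inside a longer, unrelated word.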

Figure: In his talk at re:MARS, Rohit Prasad, Alexa VP and head scientist, said machine learning capabilities are advancing such that Alexa can predict a customer’s true goal from the direction of the dialog and proactively enable the conversation flow across skills.

“With this new approach, Alexa will predict a customer’s latent goal from the direction of the dialog and proactively enable the conversation flow across topics and skills,” Prasad says. “This is a big leap for conversational AI.”

At re:MARS, Prasad also announced the developer preview of Alexa Conversations, a new deep-learning-based approach for skill developers to create more-natural voice experiences with less effort, fewer lines of code, and less training data than before. The preview allows skill developers to create natural, flexible dialogs within a single skill; upcoming releases will allow developers to incorporate multiple skills into a single conversation.

With Alexa Conversations, developers provide (1) application programming interfaces, or APIs, that provide access to their skills’ functionality; (2) a list of entities that the APIs can take as inputs, such as restaurant names or movie times; and (3) a handful of sample dialogs annotated to identify entities and actions and mapped to API calls. Alexa Conversations’ AI technology handles the rest.
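The three developer inputs listed above can be pictured as plain data. The sketch below is not the Alexa Conversations format, which this article does not show; the class names, field names, and the `validate` helper are all invented here to illustrate how APIs, entity types, and annotated sample dialogs relate to one another.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApiSpec:
    name: str          # hypothetical API name exposed by the skill
    arguments: tuple   # entity types the API accepts as inputs


@dataclass
class AnnotatedTurn:
    speaker: str
    text: str
    entities: dict = field(default_factory=dict)  # entity type -> value in this turn
    api_call: Optional[str] = None                # API this turn maps to, if any


def validate(apis, entity_types, dialog):
    """Check that the annotations only mention declared APIs and entity types."""
    problems = []
    known_apis = {api.name for api in apis}
    for api in apis:
        for argument in api.arguments:
            if argument not in entity_types:
                problems.append(f"API {api.name!r}: unknown entity type {argument!r}")
    for index, turn in enumerate(dialog):
        for entity_type in turn.entities:
            if entity_type not in entity_types:
                problems.append(f"turn {index}: unknown entity type {entity_type!r}")
        if turn.api_call is not None and turn.api_call not in known_apis:
            problems.append(f"turn {index}: unknown API {turn.api_call!r}")
    return problems


apis = [ApiSpec("find_showtimes", ("movie_title", "date"))]
entity_types = {"movie_title", "date"}
dialog = [
    AnnotatedTurn(
        "user",
        "What time is Movie A playing tonight?",
        {"movie_title": "Movie A", "date": "tonight"},
        api_call="find_showtimes",
    ),
    AnnotatedTurn("alexa", "Movie A is playing at 7 p.m.", {"movie_title": "Movie A"}),
]
problems = validate(apis, entity_types, dialog)
print(problems)
```

A check like this catches annotations that reference an API or entity type the developer never declared, before any model is trained on them.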

Figure: In his talk at re:MARS, Prasad said hand coding of the dialog flow is replaced by a recurrent neural network that automatically models the dialog flow from developer-provided input, making it easier for developers to construct dialog flows for their skills.

“It’s way easier to build a complex voice experience with Alexa Conversations due to its underlying deep-learning-based dialog modeling,” Prasad says.

That’s largely because of AI technology that automatically generates simulated dialogs from the examples provided by the developer. The system represents the developer’s sample dialogs in a formal language that specifies syntactic and semantic relationships between the words of a dialog. These representations can be converted back into natural language in many different ways, automatically producing a set of dialog variations that is one to two orders of magnitude larger than the developer-provided data. The variations are then used to train a recurrent neural network that models dialog flow.
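A rough sense of how a few examples multiply into many variations comes from simple template expansion. The sketch below uses string templates with slots as a stand-in for the formal language, which the article does not specify; the templates, entity names, and values are made up for illustration.

```python
from itertools import product

# Hypothetical paraphrase templates derived from one developer example.
TEMPLATES = [
    "What time is {movie} playing {day}?",
    "When is {movie} showing {day}?",
    "Show me {day} showtimes for {movie}.",
]

# Hypothetical entity values that can fill the slots.
ENTITY_VALUES = {
    "movie": ["Movie A", "Movie B"],
    "day": ["tonight", "tomorrow"],
}


def expand(templates, entity_values):
    """Render every template with every combination of entity values."""
    names = sorted(entity_values)
    combos = product(*(entity_values[name] for name in names))
    bindings = [dict(zip(names, combo)) for combo in combos]
    return [template.format(**binding) for template in templates for binding in bindings]


variations = expand(TEMPLATES, ENTITY_VALUES)
print(len(variations))
for line in variations[:3]:
    print(line)
```

Here three phrasings and two values per entity yield twelve utterances from a single example, which is the kind of multiplication that lets a handful of sample dialogs grow into a much larger training set.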

More information about the Alexa Conversations developer preview can be found on the Alexa developers’ blog.