The Scalable Neural Architecture behind Alexa’s Ability to Select Skills
Alexa is a cloud-based service with natural-language-understanding capabilities that powers devices like Amazon Echo, Echo Show, Echo Plus, Echo Spot, Echo Dot, and more. Alexa-like voice services traditionally have supported small numbers of well-separated domains, such as calendar or weather. In an effort to extend the capabilities of Alexa, Amazon in 2015 released the Alexa Skills Kit, so third-party developers could add to Alexa’s voice-driven capabilities. We refer to new third-party capabilities as skills, and Alexa currently has more than 40,000.
Four out of five Alexa customers with an Echo device have used a third-party skill, but we are always looking for ways to make it easier for customers to find and engage with skills. For example, we recently announced we are moving toward skill invocation that doesn’t require mentioning a skill by name.
Finding the most relevant skill to handle a natural utterance is an open scientific and engineering challenge, for two reasons:
1. The sheer number of potential skills makes the task difficult. Unlike traditional digital assistants that have on the order of 10 to 20 built-in domains, Alexa must navigate more than 40,000. And that number increases each week.
2. Unlike traditional built-in domains that are carefully designed to stay in their swim lanes, Alexa skills can cover overlapping functionalities. For instance, there are dozens of skills that can respond to recipe-related utterances.
The problem here is essentially a large-scale domain classification problem over tens of thousands of skills. It is one of the many exciting challenges Alexa scientists and engineers are addressing with deep-learning technologies, so customer interaction with Alexa can be more natural and friction-free.
Alexa uses a two-step, scalable, and efficient neural shortlisting-reranking approach to find the most relevant skill for a given utterance. This post describes the first of those two steps, which relies on a neural model we call Shortlister. (I’ll describe the second step in a follow-up post tomorrow.) Shortlister is a scalable and efficient architecture with a shared encoder, a personalized skill attention mechanism, and skill-specific classification networks. We outline this architecture in our paper “Efficient Large-Scale Neural Domain Classification with Personalized Attention”, which we will present at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in July.
The shared encoder network is hierarchical: Its lower layers are character-based and orthography-sensitive, learning to represent each word in terms of character structure or shape; its middle layers are word-based, and with the outputs from the lower layers, they learn to represent an entire utterance. The skill attention mechanism is a separate network that is personalized per user. It computes a summary vector that describes which skills are enabled in a given user’s profile and how relevant they are to the utterance representation. Both the utterance representation vector and the personalized skill-summary vector feed into a battery of skill-specific classification networks, one network for each skill.
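To make the data flow concrete, here is a minimal NumPy sketch of the personalized skill attention step. All names, dimensions, and the dot-product scoring function are illustrative assumptions, not details from the paper, and the shared encoder itself is not shown — we just assume it has already produced an utterance vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- illustrative assumptions, not the paper's values.
HIDDEN = 8        # size of the utterance representation from the shared encoder
NUM_SKILLS = 5    # stand-in for the tens of thousands of real skills

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed: one learned embedding vector per skill, trained jointly
# with the rest of the model.
skill_embeddings = rng.normal(size=(NUM_SKILLS, HIDDEN))

def personalized_attention(utterance_vec, enabled_mask):
    """Summarize the user's enabled skills, weighted by relevance to the utterance."""
    scores = skill_embeddings @ utterance_vec          # dot-product relevance (assumption)
    scores = np.where(enabled_mask, scores, -np.inf)   # attend only to enabled skills
    weights = softmax(scores)                          # disabled skills get weight 0
    return weights @ skill_embeddings                  # personalized skill-summary vector

# Example: an utterance vector from the (not shown) shared encoder, and a
# hypothetical user profile with skills 0 and 3 enabled.
utterance_vec = rng.normal(size=HIDDEN)
enabled = np.array([True, False, False, True, False])
summary = personalized_attention(utterance_vec, enabled)

# Each skill-specific classification network would see both vectors together.
features = np.concatenate([utterance_vec, summary])
print(features.shape)  # (16,)
```

The key property the sketch illustrates: the summary vector has a fixed size regardless of how many skills exist or how many the user has enabled, which is what lets the downstream classifiers stay small.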
During training, the system as a whole is evaluated on the basis of the skill classification networks’ outputs. Consequently, the shared encoder learns to represent utterances in a way that is useful for skill classification, and the personalized skill attention mechanism learns to attend to the most relevant skills.
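As a rough illustration of "evaluated on the basis of the skill classification networks' outputs," one can think of a cross-entropy loss over the per-skill scores, through which gradients reach the heads, the attention mechanism, and the shared encoder alike. The specific loss below is an assumption for illustration, not a claim about the paper's exact objective.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One score per skill-specific head (toy values), and the index of the
# skill that actually handled the utterance.
scores = np.array([2.0, 0.5, -1.0])
target = 0

# Assumed objective: cross-entropy over the combined head outputs.
loss = -np.log(softmax(scores)[target])
print(loss)  # a small positive loss, since the correct skill scores highest
```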
In our experiments, the system performed significantly better when it used the skill attention mechanism than when it simply relied on a vector representing user-enabled skills, with one bit for each skill. And it performed better still when it used both in tandem than when it used either in isolation.
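The three input variants compared above differ only in what is appended to the utterance representation, as in this sketch (dimensions and the use of plain concatenation are assumptions for illustration):

```python
import numpy as np

HIDDEN = 8
NUM_SKILLS = 5

utterance_vec = np.zeros(HIDDEN)               # from the shared encoder (placeholder)
attention_summary = np.zeros(HIDDEN)           # from the skill attention mechanism (placeholder)
enabled_bits = np.array([1., 0., 0., 1., 0.])  # one bit per user-enabled skill

# Variant 1: utterance + personalized attention summary
variant_attention = np.concatenate([utterance_vec, attention_summary])
# Variant 2: utterance + raw enablement bit vector
variant_bits = np.concatenate([utterance_vec, enabled_bits])
# Variant 3: both signals in tandem, which performed best in our experiments
variant_both = np.concatenate([utterance_vec, attention_summary, enabled_bits])

print(variant_both.shape)  # (21,)
```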
In making our architecture scalable to tens of thousands of skills, we kept practical constraints in mind, focusing on minimizing memory footprint and runtime latency, which are critical to the performance of high-scale production systems such as Alexa. Currently, inference consumes 50 megabytes of memory, and the p99 latency is 15 milliseconds. Moreover, our architecture is designed to efficiently accommodate new skills that become available between our full-model retraining cycles.
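One way a per-skill-head design can accommodate new skills between retraining cycles is sketched below: because each skill has its own small classification head, a newly launched skill can get a head of its own without touching the shared parameters. This is an assumption about how such a mechanism could work, not a description of the production system; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN = 8  # illustrative feature size seen by each head

class SkillHead:
    """A tiny skill-specific classifier: one linear score per skill (assumption)."""
    def __init__(self):
        self.w = rng.normal(size=HIDDEN) * 0.01
        self.b = 0.0

    def score(self, features):
        return features @ self.w + self.b

# Existing skills, each with its own head; the shared encoder (not shown)
# would stay frozen between full retraining cycles.
heads = {"recipes": SkillHead(), "weather": SkillHead()}

# A new third-party skill launches: only its head is created and trained,
# leaving every other parameter unchanged.
heads["trivia_night"] = SkillHead()  # hypothetical new skill name
print(len(heads))  # 3
```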
Acknowledgments: Sunghyun Park, Ameen Patel, Jihwan Lee, Anjishnu Kumar, Joo-Kyung Kim, Dongchan Kim, Hammil Kerry, Ruhi Sarikaya, and all the engineers in Fan Sun’s and Yan Weng’s teams.