Who’s on First? How Alexa Is Learning to Resolve Referring Terms
This year, at the Association for Computational Linguistics’ Workshop on Natural-Language Processing for Conversational AI, my colleagues and I won one of two best-paper awards for our work on slot carryover.
Slot carryover is a method for solving the reference resolution problem that arises in the context of conversations with AI systems. For instance, if an Alexa customer asks, “When is Lion King playing at the Bijou?” and then follows up with the question “Is there a good Mexican restaurant near there?”, Alexa needs to know that “there” refers to the Bijou.
One of the things that makes reference resolution especially complicated for a large AI system like Alexa is that different Alexa services use different names — or slots — for the same data. A movie-finding service, for instance, might tag location data with the slot name Theater_Location, while a restaurant-finding service might use the slot name Landmark_Address. Over the course of a conversation, Alexa has to determine which slots used by one service should inherit data from which slots used by another.
Last year at Interspeech, we presented a machine learning system that learned to carry over slots from previous turns of dialogue to the current turn. That system made independent judgments about whether to carry over each slot value from one turn to the next. Even though it significantly outperformed a rule-based baseline system, its independent decision-making was a limitation.
In many Alexa services, slot values are highly correlated, and a strong likelihood of carrying over one slot value implies a strong likelihood of carrying over another. To take a simple example, some U.S. services have slots for both city and state; if one of those slots is carried over from one dialogue turn to another, it’s very likely that the other should be as well.
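To make the intuition concrete, here is a toy sketch (not the production system) contrasting independent per-slot decisions with a joint decision that couples correlated slots. The slot names, confidence scores, threshold, and boost rule are all hypothetical:

```python
# Illustrative only: hypothetical carryover confidences for two correlated slots.
scores = {"City": 0.9, "State": 0.45}
THRESHOLD = 0.5

# Independent decisions: State falls just below threshold and is dropped.
independent = {slot: s >= THRESHOLD for slot, s in scores.items()}

# Joint decision with a simple pairwise boost between correlated slots.
CORRELATED = {("City", "State"), ("State", "City")}
BOOST = 0.2

def joint_decisions(scores):
    decisions = {}
    for slot, s in scores.items():
        # Boost a slot's score when a correlated slot is confidently carried over.
        boosted = any(
            (other, slot) in CORRELATED and s_other >= THRESHOLD
            for other, s_other in scores.items()
        )
        decisions[slot] = (s + (BOOST if boosted else 0.0)) >= THRESHOLD
    return decisions

print(independent)              # {'City': True, 'State': False}
print(joint_decisions(scores))  # {'City': True, 'State': True}
```

In the independent case, State is (probably wrongly) left behind; the joint decider lets the confident City carryover pull State along with it.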
Exploiting these correlations should improve the accuracy of a slot carryover system. The decision about whether to carry over a given slot value should reinforce decisions about carrying over its correlates, and vice versa. In our new work, which was spearheaded by Tongfei Chen, a Johns Hopkins graduate student who was interning with our group, we evaluate two different machine learning architectures designed to explicitly model such slot correlations. We find that both outperform the system we reported last year.
Also at the Workshop on Natural-Language Processing for Conversational AI, Ruhi Sarikaya, director of applied science for the Alexa AI team, delivered a keynote talk on enabling scalable, natural, self-learning, contextual conversational systems. Although Sarikaya's talk ranged across the whole history of conversational AI, he concentrated on three recent innovations from the Alexa team: self-learning, contextual carryover, and skill arbitration. Contextual carryover is the topic of the main post on this page, and skill arbitration is a topic that Y. B. Kim has discussed at length on this blog, but self-learning has received comparatively little attention.
Self-learning is a process whereby Alexa improves performance with no human beings in the loop. It depends on implicit signals that a response is unsatisfactory, such as a "barge-in", in which a customer cuts off Alexa's response with a new request, or a rephrase of an earlier request. As Sarikaya explained, Alexa uses five different machine learning models to gauge whether customer reactions indicate dissatisfaction and, if so, whether they suggest what Alexa's responses should have been.
From such data, Sarikaya explained, the self-learning system produces a graphical model – an absorbing Markov chain – that spans multiple interactions between Alexa and multiple customers. This reinterprets the problem of correcting misunderstood queries as one of collaborative filtering, the technology underlying Amazon.com's recommendation engine. Rules learned by Alexa's self-learning system currently correct millions of misinterpretations every week, in Alexa's music, general-media, books, and video services. This year, Sarikaya explained, his team will introduce more efficient algorithms for exploring the graphical model of customer interactions, to increase the likelihood of finding the optimal correction for a misinterpreted request.
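The key property of an absorbing Markov chain is that, from any transient state, one can compute the probability of eventually landing in each absorbing (final) state via the fundamental-matrix identity B = (I − Q)⁻¹R. The sketch below is purely illustrative (the states, transition probabilities, and two-state dimensions are invented, not Alexa's actual model), but it shows the computation:

```python
# Illustrative sketch of an absorbing Markov chain over customer interactions.
# Transient states: 0 = "played wrong song", 1 = "asked for clarification"
# Absorbing states: 0 = "customer satisfied", 1 = "customer gave up"
# All probabilities below are made up for the example.

Q = [[0.0, 0.5],   # transitions among transient states
     [0.1, 0.0]]
R = [[0.3, 0.2],   # transitions from transient states to absorbing states
     [0.8, 0.1]]

def absorption_probabilities(Q, R):
    """Solve B = (I - Q)^-1 R for a chain with two transient states."""
    a, b = 1 - Q[0][0], -Q[0][1]
    c, d = -Q[1][0], 1 - Q[1][1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # (I - Q)^-1
    return [
        [sum(inv[i][k] * R[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

B = absorption_probabilities(Q, R)
for i, row in enumerate(B):
    print(f"from transient state {i}: P(satisfied)={row[0]:.2f}, P(gave up)={row[1]:.2f}")
```

Each row of B sums to one: starting from any transient state, the customer eventually ends up in exactly one of the final outcomes.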
The first architecture we used to model slot interdependencies was a pointer network based on the long short-term memory (LSTM) architecture. Both the inputs and the outputs of LSTMs are sequences of data. A network configuration that uses pairs of LSTMs is common in natural-language understanding (NLU), automatic speech recognition, and machine translation. The first LSTM — the encoder — produces a vector representation of the input sequence, and the second LSTM — the decoder — converts it back into a data sequence. In machine translation, for instance, the vector representation would capture the semantic content of a sentence, regardless of what language it’s expressed in.
In our architecture, we used a bidirectional LSTM (bi-LSTM) encoder, which processes the input data sequence both forward and backward. The decoder was a pointer network, which outputs a subset of slots that should be carried over from previous turns of dialogue.
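The defining trait of a pointer network is that, rather than generating outputs from a fixed vocabulary, each decoding step scores the encoder states of the candidate slots and "points" at one of them. The following minimal sketch assumes random vectors as stand-ins for trained bi-LSTM encoder states, a dot-product score, and greedy selection; none of these details are claimed to match the paper's exact configuration:

```python
import math
import random

random.seed(0)

DIM = 4
candidate_slots = ["Theater_Location", "Song_Name", "Artist_Name", "<END>"]

def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

# Stand-ins for bi-LSTM encoder states, one per candidate slot.
encoder_states = {slot: rand_vec(DIM) for slot in candidate_slots}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def pointer_step(decoder_state, encoder_states):
    """One decoding step: an attention distribution over encoder states is the pointer."""
    slots = list(encoder_states)
    scores = [
        sum(d * e for d, e in zip(decoder_state, encoder_states[s]))
        for s in slots
    ]
    probs = softmax(scores)
    best = max(range(len(slots)), key=lambda i: probs[i])
    return slots[best], probs[best]

decoder_state = rand_vec(DIM)
slot, prob = pointer_step(decoder_state, encoder_states)
print(f"pointer selects {slot} with probability {prob:.2f}")
```

Repeating the step (feeding each selection back into the decoder) until the `<END>` marker is chosen yields the subset of slots to carry over.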
The other architecture we considered uses the same encoder as the first one, but it replaces the pointer-network decoder with a self-attention decoder based on the transformer architecture. The transformer has recently become very popular for large-scale natural-language processing because of its efficient training and high accuracy. Its self-attention mechanism enables it to learn what additional data to emphasize when deciding how to handle a given input.
Our transformer-based network explicitly compares each input slot to all the other slots that have been identified in several preceding turns of dialogue, which are referred to as the dialogue context. During training, it learns which slot types from the context are most relevant when deciding what to do with any given type of input slot.
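At its core, this comparison is scaled dot-product attention: the input slot acts as a query against the context slots' keys, and the resulting weights mix the context slots' values. The sketch below uses toy two-dimensional embeddings and a single query (real models use learned, higher-dimensional projections and many heads):

```python
import math

def scaled_dot_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context-weighted mixture of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Toy embeddings: one current-turn slot attending over three context slots.
query = [1.0, 0.0]
context_keys = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
context_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = scaled_dot_attention(query, context_keys, context_values)
print("attention weights:", [round(w, 2) for w in weights])
```

The context slots whose keys align with the query receive the largest weights, which is how the model learns that, say, a `Theater_Location` in the context matters most when resolving a `Landmark_Address` in the current turn.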
We tested our system using two different data sets, one a standard public benchmark and the other an internal data set. Each dialogue in the data sets consisted of several turns, alternating between customer utterances and system responses. A single turn might involve values for several slots. On both data sets, the new architectures, which are capable of modeling slot interdependencies, outperformed the systems we published last year, which make decisions independently.
Overall, the transformer system performed better than the pointer network, but the pointer network did exhibit some advantages in recognizing slot interdependencies across longer spans of dialogue. With the pointer network architecture, we also found that ordering its slot inputs by turn improved performance relative to a random ordering, but further ordering slots within each turn lowered performance.
Suppose, for instance, that a given turn of dialogue consisted of the customer instruction “Play ‘Misty’ by Erroll Garner”, which the NLU system interpreted as having two slots, Song_Name (“Misty”) and Artist_Name (“Erroll Garner”). The bi-LSTM fared better if we didn’t consistently follow the order imposed by the utterance (Song_Name first, then Artist_Name) but instead varied the order randomly. This may be because random variation helped the system generalize better to alternate phrasings of the same content (“Play the Erroll Garner song ‘Misty’”).
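The ordering scheme that worked best can be sketched as a small preprocessing step: keep the dialogue turns in temporal order, but shuffle the slot order within each turn. The slot names and values below are made up for illustration:

```python
import random

# Hypothetical dialogue: a list of turns, each holding (slot, value) pairs.
dialogue = [
    [("Song_Name", "Misty"), ("Artist_Name", "Erroll Garner")],  # turn 1
    [("Device_Name", "kitchen speaker")],                        # turn 2
]

def ordered_with_intra_turn_shuffle(dialogue, rng):
    """Flatten turns in temporal order; randomize slot order inside each turn."""
    sequence = []
    for turn in dialogue:
        slots = list(turn)  # copy so the original dialogue is untouched
        rng.shuffle(slots)
        sequence.extend(slots)
    return sequence

print(ordered_with_intra_turn_shuffle(dialogue, random.Random(42)))
```

Applied during training, the shuffle acts as a mild form of data augmentation: the model cannot rely on a fixed within-turn slot position, so it must attend to the slots themselves.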
Going forward, we will investigate further means of improving our slot carryover methodology, such as transfer learning and the addition of more data from the dialogue context, in order to improve Alexa’s ability to resolve references and deliver a better experience to our customers.
Acknowledgments: Tongfei Chen, Hua He, Lambert Mathias