How Alexa is learning to converse more naturally
To handle more-natural spoken interactions, Alexa must track references through several rounds of conversation. If, for instance, a customer says, “How far is it to Redmond?” and after the answer follows up by saying, “Find good Indian restaurants there,” Alexa should be able to infer that “there” refers to Redmond.
We call the task of reference tracking “context carryover,” and it’s a capability that is currently being phased in to the Alexa experience. At this year’s Interspeech, the largest conference on spoken-language understanding, my colleagues and I will present a paper titled “Contextual Slot Carryover for Disparate Schemas,” which describes our solution to the problem of slot carryover, a crucial aspect of context carryover.
Today, Alexa analyzes the semantic content of utterances according to the categories domain, intent, and slot. “Domain” describes the type of application — or “skill” — that the utterance should invoke; for instance, mapping skills should answer questions about geographic distance. “Intent” describes the particular function the skill should execute, such as measuring driving distance. And “slots” are variables that the function acts upon, such as point of origin and destination.
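To make the three categories concrete, here is a minimal sketch of how the analysis of one utterance might be represented. The class name `Interpretation`, its field names, and the labels "Maps" and "GetDistance" are hypothetical illustrations, not Alexa's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """Hypothetical container for the semantic analysis of one utterance."""
    domain: str   # the type of skill the utterance should invoke
    intent: str   # the particular function the skill should execute
    slots: dict = field(default_factory=dict)  # variables the function acts upon


# Assumed labels for the utterance "How far is it to Redmond?"
first_turn = Interpretation(
    domain="Maps",
    intent="GetDistance",
    slots={"Town": "Redmond"},
)
```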
Answering successive questions in a natural conversation will often require Alexa to invoke different skills, which makes context carryover extremely hard. The mapping skill, for instance, might use the slot “Town” to describe travel destinations, whereas the Restaurants skill might use the slot “City” to describe the geographic area in which it is performing a search. In our Interspeech paper, we describe a neural-network approach that automatically learns how to map the slots used by one skill to those used by another.
Our approach differs in an important way from conventional dialogue state tracking, which maintains a probability distribution across all possible values that a given target slot can take on. Our system, by contrast, (i) considers only slot values mentioned in the conversational context and (ii) scores each candidate carryover independently, estimating the probability that carrying it over is correct. This helps us scale to the large number of slots across all Alexa skills.
Our system uses an encoder-decoder model, which means that it divides the neural network into two components. The first component — the encoder — receives vectors representing various features of the input data and outputs a single summary vector. The second component — the decoder — receives the summary vector and outputs a confidence score, which represents the likelihood that a candidate slot is the correct one. Both components of the system are trained together, so the encoder learns to produce summary vectors that are particularly useful for candidate scoring. Below, we describe this architecture in more detail.
During training, our system uses a set of slot names (such as “Town”) and all their associated values (such as “Redmond”) to create an “embedding” for each slot name. Embedding is a technique for representing strings of words as points in a geometric space, such that semantically related strings are grouped together. Typically, it’s based on the frequency with which words co-occur with other words.
When the system is in use, we use proximity in embedding space to generate a list of candidate mappings between every slot encountered in the conversation so far and the slots available in the currently invoked skill. Each of these candidates is then fed into the encoder, along with other features, such as the recent history of the customer’s utterances, the recent history of Alexa’s responses, and the inferred intent of the customer’s most recent utterance.
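The candidate-generation step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the system's implementation: the function names, the use of cosine similarity with a fixed `threshold`, and the three-dimensional toy embeddings are all invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_mappings(context_slots, target_slots, embeddings, threshold=0.8):
    """Pair each context slot with target-skill slots that lie close to it
    in embedding space. The threshold is an assumed cutoff."""
    candidates = []
    for src_name, value in context_slots.items():
        for tgt_name in target_slots:
            sim = cosine(embeddings[src_name], embeddings[tgt_name])
            if sim >= threshold:
                candidates.append((src_name, tgt_name, value, sim))
    return candidates


# Toy slot-name embeddings, invented for illustration only.
embeddings = {
    "Town": [0.9, 0.1, 0.0],
    "City": [0.8, 0.2, 0.1],
    "Cuisine": [0.0, 0.1, 0.9],
}
context = {"Town": "Redmond"}           # slot seen earlier in the conversation
restaurant_slots = ["City", "Cuisine"]  # slots of the currently invoked skill
result = candidate_mappings(context, restaurant_slots, embeddings)
```

With these toy vectors, only the pairing of "Town" with "City" clears the threshold. In the system described here, each such candidate would then be scored by the encoder-decoder model rather than accepted outright.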
The utterance histories pass through layers of the encoder known as long short-term memory (LSTM) encoders. LSTMs are neural-network layers that can account for the sequencing of data, so they preserve the information inherent in utterances’ word order. Each LSTM also has an associated word attention mechanism. During training, the word attention mechanisms learn which words in an utterance are particularly useful for assessing a candidate slot-mapping. We also add a separate attention mechanism to help the decoder decide whether to focus on user utterances or Alexa utterances.
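As a rough illustration of what a word attention mechanism computes, the sketch below turns per-word relevance scores into weights with a softmax and uses them to form a weighted sum of per-word vectors. It is generic soft attention, not necessarily the exact formulation in the paper. In a trained model the scores and the per-word vectors (the LSTM outputs) are learned; here they are hand-set toy values.

```python
import math


def softmax(scores):
    """Convert raw scores to positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, scores):
    """Return attention weights and the weighted sum of per-word vectors."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    summary = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, summary


# Toy per-word vectors and relevance scores, invented for illustration.
words = ["find", "restaurants", "there"]
hidden = [[0.1, 0.0], [0.3, 0.2], [0.9, 0.7]]
scores = [0.2, 0.5, 2.0]  # "there" is assumed most relevant to the location slot
weights, summary = attend(hidden, scores)
```

Because "there" receives the highest score, its vector dominates the summary, which is the behavior one would hope the trained mechanism learns for location slots.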
In the encoder, the utterance histories are combined into a single output vector, but the candidate slot-mapping and the intent inference are encoded separately (as is a “recency encoding” that describes the conversational distance between the slot candidate and the slot whose value it’s inheriting). The outputs of the encoder then pass to the decoder, which consists of several densely interconnected network layers. Finally, the decoder outputs a decision about whether to carry over the slot or not.
In the paper, we compare the performance of the network, with and without the attention mechanisms, to that of a strong rule-based system for slot mapping, which consists of hand-coded rules such as “If the initial slot type is ‘City’ and the follow-up utterance includes the word ‘there,’ then the value of City should be carried over.”
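The example rule could be written as a few lines of code. This is a hypothetical rendering for illustration; the function name and the simple tokenization are assumptions, and the actual baseline contains many such rules.

```python
import re


def carry_over_city(context_slots, follow_up):
    """Hand-coded rule: carry over City when the follow-up contains "there"."""
    tokens = re.findall(r"[a-z']+", follow_up.lower())
    if "City" in context_slots and "there" in tokens:
        return {"City": context_slots["City"]}
    return {}


carried = carry_over_city({"City": "Redmond"}, "Find good Indian restaurants there")
missed = carry_over_city({"Town": "Redmond"}, "Find good Indian restaurants there")
```

In this sketch the rule fires for a context slot named "City" but not for one named "Town", which hints at why hand-coded rules can miss carryovers between skills that use different slot names.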
Generally, the attention mechanisms offered slight improvements in system performance. The rule-based system had significantly lower recall than our model, meaning that it missed many carryovers that our model correctly deduced. Overall, according to the F1 score, which is the harmonic mean of recall and precision (the fraction of predicted carryovers that are correct), our system outperformed the rule-based system by roughly 9 percent.
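For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The counts are invented for illustration and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among actual
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Invented counts: a system with high precision but low recall.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=60)
```

A system that rarely makes a wrong carryover but misses many correct ones, as in these invented counts, is penalized by F1 through its low recall.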
Acknowledgments: Hancheng Ge, Lambert Mathias, Ruhi Sarikaya