Alexa at Interspeech 2018: How interaction histories can improve speech understanding
Alexa’s ability to act on spoken requests depends on statistical models that translate speech to text and text to actions. Historically, the models’ decisions were one-size-fits-all: the same utterance would produce the same action, regardless of context.
That’s changing, however. Some context awareness is reaching Alexa’s production models. And at this year’s Interspeech — the largest conference on spoken-language processing — Amazon researchers are presenting several papers with a common theme: using interaction history to improve customers’ Alexa experiences.
“Capturing and modeling context accurately is critical,” says Spyros Matsoukas, a senior principal scientist in the Alexa AI group and coauthor of a paper accepted to this year’s Interspeech. “Even seemingly simple requests, such as asking for the temperature, can have multiple valid responses depending on surrounding context — for instance, communicating the reading of a connected temperature sensor versus reporting the weather temperature in the user’s current location.”
“Besides handling ambiguous requests,” he continues, “context modeling facilitates a more efficient dialogue by enabling users to reference previous entities and/or intents, such as asking for the weather in Boston and then following up with ‘How about in Athens?’ There are several efforts within the Alexa team to analyze, extract, and model contextual information at different levels — user, device, dialogue — and employ it in all components of the spoken-language-understanding pipeline.”
One of the Amazon papers at Interspeech takes the long view of customers’ interaction histories. Alexa machine learning scientist Bo Xiao and his colleagues use collaborative filtering — the same technique that generates product recommendations on Amazon.com — to help resolve ambiguous requests. For instance, if a customer says, “Alexa, play ‘Hello,’” his or her music-listening history should indicate whether the song intended is the one by Adele or the one by Lionel Richie.
A simple record of songs played, however, can give a deceptive picture of a customer’s tastes. A customer might, for instance, frequently sample music recommended by friends or critics but find it unappealing, or cycle through several songs with similar titles in search of one heard briefly on the radio.
On the other hand, Alexa customers rarely provide explicit ratings for the content they consume, and ratings are the basis for classical collaborative filtering. So Xiao and his colleagues use playback duration — whether a customer cuts a song off quickly or lets it play through — as a proxy for ratings. In tests, the technique showed promise as a means of personalizing the results of Alexa requests.
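The idea of substituting playback duration for explicit ratings can be sketched with a toy item-based collaborative filter. Everything here is illustrative: the playback fractions, the customers, and the scoring function are invented for the example, not taken from the paper.

```python
import numpy as np

# Hypothetical playback-fraction matrix: rows are customers, columns are
# songs. 1.0 means played to completion, 0.1 means cut off almost
# immediately; these fractions stand in for the explicit ratings that
# classical collaborative filtering would use.
plays = np.array([
    [1.0, 0.1, 0.9],   # customer 0
    [0.9, 0.2, 1.0],   # customer 1
    [0.1, 1.0, 0.0],   # customer 2
])

def item_similarity(matrix):
    """Cosine similarity between song (column) vectors."""
    norms = np.linalg.norm(matrix, axis=0, keepdims=True)
    normalized = matrix / np.clip(norms, 1e-9, None)
    return normalized.T @ normalized

def score_for_user(matrix, user, candidates):
    """Score candidate songs by similarity to songs the user let play through."""
    sim = item_similarity(matrix)
    prefs = matrix[user]  # the user's implicit "ratings"
    return {c: float(prefs @ sim[:, c]) for c in candidates}

# Ambiguous request: song 0 and song 1 share a title ("Hello").
scores = score_for_user(plays, user=0, candidates=[0, 1])
best = max(scores, key=scores.get)  # song 0, which this customer lets play
```

Customer 0 tends to play songs 0 and 2 to completion and cut off song 1, so the filter resolves the ambiguous title in favor of song 0.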
Two other Interspeech papers with Amazon coauthors, however, concentrate on short-term history: what the last few interactions with Alexa suggest about how current requests should be processed.
One of those papers deals with what Alexa scientists call “context carryover”: the ability to track references through several rounds of conversation. For instance, if an Alexa customer says, “How far is it to Redmond?”, then asks, “What’s the best Indian restaurant there?”, Alexa should infer that “there” refers to Redmond.
Frequently, sequential Alexa requests require the invocation of different skills — first, say, a trip-planning skill, and then a restaurant-rating skill. But different skills may use different terms to designate the same information. The trip-planning skill might use the term “destination” to designate a trip endpoint, and the restaurant-rating skill might use the term “city” to designate a search area.
At Interspeech, Amazon scientists Chetan Naik, Arpit Gupta, and their colleagues will present a system for mapping content from its “slot” in one skill to the corresponding slot in another. The system uses a neural network, which takes as input both contextual information about the customer’s recent interactions with Alexa and a list of candidate mappings, produced from “embeddings” that capture information about words’ semantic similarity. In tests, the system demonstrated a 9% improvement over a slot-mapping system that used meticulously hand-coded rules.
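The embedding step can be illustrated with a minimal sketch. The slot names and their vectors below are invented for the example; a real system would learn embeddings from data and, as the paper describes, feed the candidates together with dialogue context into a neural network rather than relying on similarity alone.

```python
import numpy as np

# Toy slot-name embeddings (hypothetical). Slots that denote similar
# information -- a trip's destination, a restaurant search's city --
# should end up with nearby vectors.
slot_embeddings = {
    "Trips.destination":   np.array([0.9, 0.1, 0.0]),
    "Restaurants.city":    np.array([0.8, 0.2, 0.1]),
    "Restaurants.cuisine": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_carryover(source_slot, candidate_slots):
    """Rank candidate target slots for carrying a value over from source_slot."""
    src = slot_embeddings[source_slot]
    return sorted(candidate_slots,
                  key=lambda s: cosine(src, slot_embeddings[s]),
                  reverse=True)

# "How far is it to Redmond?" followed by
# "What's the best Indian restaurant there?"
ranking = rank_carryover("Trips.destination",
                         ["Restaurants.city", "Restaurants.cuisine"])
```

Here the value “Redmond” carries over to the restaurant skill’s city slot, the candidate whose embedding lies closest to the trip skill’s destination slot.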
Context carryover makes interactions with Alexa feel more natural, and that’s particularly true when it’s used in conjunction with Follow-Up Mode, which lets customers issue series of requests without repeating the wake word “Alexa.” Follow-Up Mode depends on distinguishing real follow-up requests from noise such as background conversations or TV audio. Applied scientist Harish Mallidi and his colleagues (including Matsoukas) have an Interspeech paper in which they describe a deep neural network that combines acoustic features with features from Alexa’s speech understanding system to identify follow-up utterances intended for Alexa. Their system does not use context itself, but it complements those that do.
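The feature-combination idea can be sketched as follows. The features, weights, and single-layer classifier below are hypothetical stand-ins for the deep network described in the paper; a real system would learn its parameters from labeled utterances.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def directedness_score(acoustic, decoder, weights, bias):
    """Score how likely an utterance is directed at the device by passing
    the concatenation of acoustic features (e.g., energy and pitch
    statistics) and speech-recognizer features (e.g., decoding confidence)
    through a single sigmoid layer -- a stand-in for a deep network."""
    features = np.concatenate([acoustic, decoder])
    return float(sigmoid(features @ weights + bias))

# Hypothetical learned parameters: [energy, pitch variance, ASR confidence].
weights = np.array([1.5, 0.8, 2.0])
bias = -2.0

# A genuine follow-up request: strong, close-talking speech the recognizer
# decodes confidently.
follow_up = directedness_score(np.array([0.9, 0.6]), np.array([0.95]),
                               weights, bias)
# Background TV audio: weaker signal, low recognizer confidence.
tv_audio = directedness_score(np.array([0.4, 0.3]), np.array([0.20]),
                              weights, bias)
```

Thresholding the score separates utterances meant for Alexa from background audio; the follow-up example scores well above the TV-audio example.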
A fourth Interspeech paper, from Alexa speech scientist Anirudh Raju and colleagues, uses contextual information to improve the quality of Alexa’s automatic speech recognition systems. Central to such systems are statistical language models, which assign probabilities to sequences of words. As such, they can help adjudicate competing interpretations of the same sound.
In the past, Alexa’s speech recognizers have used general-purpose statistical language models. But the likelihood of particular word sequences can vary widely according to context. If an Alexa customer says, “Alexa, get me a … ”, the probabilities of the next word being “sub” or “cab” are very different if the customer’s recent utterances concerned foods’ nutritional content or local traffic conditions.
Raju and his colleagues built several different language models from scratch, each tailored to a different Alexa skill “domain” — such as music playing or weather reporting — or to a different topic of conversation. Then they trained a machine-learning system to create ad hoc combinations of language models on the fly, on the basis of a customer’s recent utterances. In tests, the system reduced transcription errors by as much as 6% overall, and by as much as 15% on the crucial task of transcribing names.
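A minimal sketch of on-the-fly interpolation: per-domain unigram models (real systems use far richer n-gram or neural language models) are mixed with weights chosen from the customer’s recent utterances. All probabilities and weights here are illustrative, not from the paper.

```python
# Toy per-domain unigram language models: P(next word | domain) for the
# request "Alexa, get me a ...". Values are invented for illustration.
food_lm    = {"sub": 0.30, "cab": 0.01, "pizza": 0.40}
traffic_lm = {"sub": 0.02, "cab": 0.45, "route": 0.30}

def interpolate(lms, weights, word):
    """P(word) under a weighted mixture of domain language models.
    Unseen words get a tiny floor probability instead of zero."""
    return sum(w * lm.get(word, 1e-6) for lm, w in zip(lms, weights))

# Recent utterances concerned nutrition, so a context model would weight
# the food-domain LM heavily (weights are hypothetical).
weights_after_food_talk = [0.9, 0.1]
p_sub = interpolate([food_lm, traffic_lm], weights_after_food_talk, "sub")
p_cab = interpolate([food_lm, traffic_lm], weights_after_food_talk, "cab")
```

With food-oriented context, the mixture makes “sub” far likelier than “cab”; weights shifted toward the traffic model would reverse that, which is how context steers the recognizer between acoustically similar hypotheses.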