Conversational AI

Building software and systems that help people communicate with computers naturally, as if communicating with family and friends.

A Simple Model for Detection of Rare Sound Events

Weiran Wang, Chieh-Chi Kao, Chao Wang

Interspeech 2018

2018

We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vectorial representation

Play Duration Based User-entity Affinity Modeling in Spoken Dialog System

Bo Xiao, Nicholas Monath, Shankar Ananthakrishnan

Interspeech 2018

2018

Multimedia streaming services over spoken dialog systems have become ubiquitous. User-entity affinity modeling is critical for the system to understand and disambiguate user intents and personalize user experiences. However, fully voice-based interaction demands quantification of novel behavioral cues to determine user affinities. In this work, we propose using play duration cues to learn a matrix factorization

Parameter Generation Algorithms for Text-to-speech Synthesis With Recurrent Neural Networks

Viacheslav Klimkov, Alexis Moinet, Adam Nadolski, Thomas Drugman

SLT 2018

2018

Recurrent Neural Networks (RNNs) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their

Neural network based time-frequency masking and steering vector estimation for two-channel MVDR beamforming

Trausti Kristjansson

ICASSP 2018

2018

We present a neural network based approach to two-channel beamforming. First, single- and cross-channel spectral features are extracted to form a feature map for each utterance. A large neural network that is the concatenation of a convolutional neural network (CNN), long short-term memory recurrent neural network (LSTM-RNN) and deep neural network (DNN) is then employed to estimate frame-level speech and

Monophone-based Background Modeling for Two-stage On-device Wake Word Detection

Minhua Wu, Sankaran Panchapagesan, Ming Sun, Jiacheng Gu, Ian Thomas, Shiv Naga Prasad Vitaladevuni, Björn Hoffmeister, Arindam Mandal

ICASSP 2018

2018

Accurate on-device wake word detection is crucial to products with far-field voice control such as the Amazon Echo. It is quite challenging to build a wake word system with both low False Reject Rate (FRR) and low False Alarm Rate (FAR) in real scenarios where there are various types of background speech, music or noise, especially when computational resources on the device are limited. In this paper, we

Amazon Unveils Novel Alexa Dialog Modeling for Natural, Cross-Skill Conversations

Alexa Science Team

June 5, 2019

Today, customer exchanges with Alexa are generally either one-shot requests, like “Alexa, what’s the weather?”, or interactions that require multiple requests to complete more complex tasks.

Using adversarial training to recognize speakers’ emotions

Viktor Rozgic

May 21, 2019

A person’s tone of voice can tell you a lot about how they’re feeling. Not surprisingly, emotion recognition is an increasingly popular conversational-AI research topic.

Should Alexa read “2/3” as “two-thirds” or “February Third”?: The science of text normalization

Ming Sun

May 16, 2019

Text normalization is an important process in conversational AI. If an Alexa customer says, “book me a table at 5:00 p.m.”, the automatic speech recognizer will transcribe the time as “five p m”. Before a skill can handle this request, “five p m” will need to be converted to “5:00PM”. Once Alexa has processed the request, it needs to synthesize the response — say, “Is 6:30 p.m. okay?” Here, 6:30PM will be converted to “six thirty p m” for the text-to-speech synthesizer. We call the process of converting “5:00PM” to “five p m” text normalization and its counterpart — converting “five p m” to “5:00PM” — inverse text normalization.
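
The two directions described above can be illustrated with a toy Python sketch. It handles only 12-hour clock times written like "5:00PM". The function names, the word tables, and the "oh" reading for single-digit minutes are assumptions made for illustration; the sketch is not a description of how Alexa performs either step.

```python
import re

# Toy illustration only: the names and word tables below are assumptions,
# not the grammar or models of any production system.
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

TIME_PATTERN = re.compile(r"(\d{1,2}):(\d{2})(AM|PM)")


def number_to_words(n):
    """Spell out an integer in the range 0..59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize_time(written):
    """Text normalization: written form to spoken form."""
    match = TIME_PATTERN.fullmatch(written)
    if match is None:
        raise ValueError(f"unsupported time format: {written!r}")
    hour, minute, meridiem = int(match.group(1)), int(match.group(2)), match.group(3)
    if not (1 <= hour <= 12 and minute < 60):
        raise ValueError(f"not a 12-hour clock time: {written!r}")
    words = [number_to_words(hour)]
    if 0 < minute < 10:
        words.append("oh")  # assumed reading for minutes such as ":05"
    if minute:
        words.append(number_to_words(minute))
    words.extend(meridiem.lower())  # "PM" becomes the two tokens "p" and "m"
    return " ".join(words)


# Inverse text normalization by table lookup over every 12-hour clock time.
SPOKEN_TO_WRITTEN = {
    normalize_time(f"{h}:{m:02d}{mer}"): f"{h}:{m:02d}{mer}"
    for h in range(1, 13) for m in range(60) for mer in ("AM", "PM")
}


def denormalize_time(spoken):
    """Inverse text normalization: spoken form to written form."""
    return SPOKEN_TO_WRITTEN[spoken]
```

Here normalize_time("6:30PM") yields "six thirty p m" and denormalize_time("five p m") yields "5:00PM", matching the examples in the teaser. A lookup table works only because the set of clock times is small and closed; open-ended inputs such as "2/3" need context to choose between readings.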

Training a Machine Learning Model in English Improves Its Performance in Japanese

Judith Gaspers

May 13, 2019

Recently, we published a paper showing that training a neural network to do language processing in English, then retraining it in German, drastically reduces the amount of German-language training data required to achieve a given level of performance.

How we add new skills to Alexa’s name-free skill selector

Young-Bum Kim

May 3, 2019

Using cosine similarity rather than dot product to compare vectors helps prevent "catastrophic forgetting".
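
One difference between the two measures is that cosine similarity ignores vector length, while the dot product grows with it. The Python sketch below shows only that property; the vectors and their names are made up for illustration, and the sketch does not reproduce the skill selector or the forgetting experiments.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings, chosen only to make the contrast visible.
request = [1.0, 1.0]
skill_a = [1.0, 1.0]    # same direction as the request, small norm
skill_b = [10.0, 0.0]   # different direction, large norm

dot_prefers_b = dot(request, skill_b) > dot(request, skill_a)
cosine_prefers_a = cosine_similarity(request, skill_a) > cosine_similarity(request, skill_b)
```

With these values the dot product ranks skill_b higher purely because of its length, whereas cosine similarity ranks skill_a higher because it points the same way as the request.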

“Alexa, Turn Down the Lights and Play Music”: The Science of Handling Compound Requests

Rahul Goel

May 2, 2019

Traditionally, Alexa has interpreted customer requests according to their intents and slots. If you say, “Alexa, play ‘What’s Going On?’ by Marvin Gaye,” the intent should be PlayMusic, and “‘What’s Going On?’” and “Marvin Gaye” should fill the slots SongName and ArtistName.
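
As a rough picture of the intent-and-slot representation, the example request could be held in a structure like the following Python sketch. The Interpretation class is hypothetical; only the intent name, the slot names, and the slot values come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Interpretation:
    """Hypothetical container for one interpreted request."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# "Alexa, play 'What's Going On?' by Marvin Gaye"
interpretation = Interpretation(
    intent="PlayMusic",
    slots={"SongName": "What's Going On?", "ArtistName": "Marvin Gaye"},
)
```

A single intent with a flat set of slots fits a one-action request like this one; a compound request such as the one in the title asks for two actions at once, which a single such structure cannot express directly.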
