Monophone-based Background Modeling for Two-stage On-device Wake Word Detection

Minhua Wu, Sankaran Panchapagesan, Ming Sun, Jiacheng Gu, Ian Thomas, Shiv Naga Prasad Vitaladevuni, Björn Hoffmeister, Arindam Mandal

ICASSP 2018

2018

Accurate on-device wake word detection is crucial to products with far-field voice control such as the Amazon Echo. It is quite challenging to build a wake word system with both low False Reject Rate (FRR) and low False Alarm Rate (FAR) in real scenarios where there are various types of background speech, music or noise, especially when computational resources on the device are limited. In this paper, we

Conversational AI

Information measures for microphone arrays

Mohamed Mansour

ICASSP 2018

2018

We propose a novel information-theoretic approach for evaluating microphone arrays that relies on the array physics and geometry rather than the underlying beamforming algorithm. The analogy between Multiple-Input-Multiple-Output (MIMO) wireless communication channel and the acoustic channel of microphone arrays is exploited to define information measures of microphone arrays, which provide upper bounds

Conversational AI

A Simple Model for Detection of Rare Sound Events

Weiran Wang, Chieh-Chi Kao, Chao Wang

Interspeech 2018

2018

We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vectorial representation

Conversational AI

Dynamics and periodicity based multirate fast transient-sound detection

Jun Yang, Philip Hilmes

EUSIPCO 2018

2018

This paper proposes an efficient real-time multirate fast transient-sound detection algorithm based on an emerging microphone array configuration intended for multimedia signal processing application systems such as the digital smart home. The proposed detection algorithm first extracts the dynamics and periodicity features, then trains the model parameters of these features on the Amazon machine learning platform

Conversational AI

Play Duration Based User-entity Affinity Modeling in Spoken Dialog System

Bo Xiao, Nicholas Monath, Shankar Ananthakrishnan

Interspeech 2018

2018

Multimedia streaming services over spoken dialog systems have become ubiquitous. User-entity affinity modeling is critical for the system to understand and disambiguate user intents and personalize user experiences. However, fully voice-based interaction demands quantification of novel behavioral cues to determine user affinities. In this work, we propose using play duration cues to learn a matrix factorization

Conversational AI

R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection

Weiran Wang, Chieh-Chi Kao, Ming Sun, Chao Wang

Interspeech 2018

2018

This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN [1], a well-known region-based convolutional network framework for visual object detection. Different from the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture the long-term temporal context from

Conversational AI

Parameter Generation Algorithms for Text-to-speech Synthesis With Recurrent Neural Networks

Viacheslav Klimkov, Alexis Moinet, Adam Nadolski, Thomas Drugman

SLT 2018

2018

Recurrent Neural Networks (RNN) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their

Conversational AI

Learning noise-invariant representations for robust speech recognition

Davis Liang, Zhiheng Huang, Zachary Lipton

SLT 2018

2018

Despite rapid advances in speech recognition, current models remain brittle to superficial perturbations to their inputs. Small amounts of noise can destroy the performance of an otherwise state-of-the-art model. To harden models against background noise, practitioners often perform data augmentation, adding artificially-noised examples to the training set, carrying over the original label. In this paper

Conversational AI

Context Aware Conversational Understanding for Intelligent Agents with a Screen

Vishal Naik, Angeliki Metallinou, Rahul Goel

AAAI 2018

2018

We describe an intelligent context-aware conversational system that incorporates screen context information to service multimodal user requests. Screen content is used for disambiguation of utterances that refer to screen objects and for enabling the user to act upon screen objects using voice commands. We propose a deep learning architecture that jointly models the user utterance and the screen and incorporates

Conversational AI

CRAFT: Complementary recommendation by adversarial feature transform

Cong Phuoc Huynh, Arridhana Ciptadi, Ambrish Tyagi, Amit Agrawal

ECCV 2018

2018

We propose a framework that harnesses visual cues in an unsupervised manner to learn the co-occurrence distribution of items in real-world images for complementary recommendation. Our model learns a non-linear transformation between the two manifolds of source and target item categories (e.g., tops and bottoms in outfits). Given a large dataset of images containing instances of co-occurring items, we train

Computer vision

Learning fashion by simulated human supervision

Eli Alshan, Sharon Alpert, Assaf Neuberger, Nathaniel Bubis, Eduard Oks

CVPR 2018

2018

We consider the task of predicting subjective fashion traits from images using neural networks. Specifically, we are interested in training a network for ranking outfits according to how well they fit the user. In order to capture the variability induced by human subjective considerations, each training example is annotated by a panel of fashion experts. Similarly to previous works on subjective data, the

Computer vision

Statistical Model Compression for Small-Footprint Natural Language Understanding

Grant Strimel, Kanthashree Mysore Sathyendra, Stanislav Peshterliev

Interspeech 2018

2018

In this paper, we investigate statistical model compression applied to natural language understanding (NLU) models. Small-footprint NLU models are important for enabling offline systems on hardware-restricted devices, and for decreasing on-demand model loading latency in cloud-based systems. To compress NLU models, we present two main techniques, parameter quantization and perfect feature hashing. These

Conversational AI

Contextual Language Model Adaptation for Conversational Agents

Anirudh Raju, Behnam Hedayatnia, Linda Liu, Ankur Gandhe, Chandra Khatri, Angeliki Metallinou, Anushree Venkatesh, Ariya Rastrow

Interspeech 2018

2018

Statistical language models (LM) play a key role in Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide high accuracy under a variety of speaking styles, domains, vocabulary and argots. In this paper, we present a DNN-based method to adapt the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal,

Conversational AI

Contextual multi-armed bandits for causal marketing

Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, Houssam Nassif

ICML 2018

2018

This work explores the idea of a causal contextual multi-armed bandit approach to automated marketing, where we estimate and optimize the causal (incremental) effects. Focusing on causal effect leads to better return on investment (ROI) by targeting only the persuadable customers who wouldn’t have taken the action organically. Our approach draws on strengths of causal inference, uplift modeling, and multi-armed

Machine learning

The effectiveness of a two-layer neural network for recommendations

Oleg Rybakov, Vijai Mohan, Avishkar Misra, Scott LeGrand, Rejith Joseph, Kiuk Chung, Siddharth Singh, Qian You, Eric Nalisnick, Runfei Luo

ICLR 2018

2018

We present a personalized recommender system using a neural network for recommending products, such as eBooks, audio-books, Mobile Apps, Video and Music. It produces recommendations based on a customer’s implicit feedback history such as purchases, listens or watches. Our key contribution is to formulate the recommendation problem as a model that encodes historical behavior to predict the future behavior using

Search and information retrieval

Can 3D pose be learned from 2D projections alone?

Dylan Drover, Rohith MV, Ching-Hang Chen, Amit Agrawal, Ambrish Tyagi, Cong Phuoc Huynh

ECCV 2018

2018

3D pose estimation from a single image is a challenging task in computer vision. We present a weakly supervised approach to estimate 3D pose points, given only 2D pose landmarks. Our method does not require correspondences between 2D and 3D points to build explicit 3D priors. We utilize an adversarial framework to impose a prior on the 3D structure, learned solely from their random 2D projections. Given

Computer vision

Question type guided attention in visual question answering

Yang Shi, Tommaso Furlanello, Sheng Zha, Animashree Anandkumar

ECCV 2018

2018

Visual Question Answering (VQA) requires integration of feature maps with drastically different structures. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. A lot of previous works use complex models to extract feature representations but neglect to use high-level information

Computer vision

Compressed video action recognition

Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alex Smola, Philipp Krähenbühl

CVPR 2018

2018

Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that the superfluous information can be reduced by up to two orders of magnitude by video compression

Computer vision

Flexible and scalable state tracking framework for goal-oriented dialogue systems

Rahul Goel, Shachi Paul, Tagyoung Chung, Jeremie Lecomte, Arindam Mandal, Dilek Hakkani-Tür

NeurIPS 2018

2018

Goal-oriented dialogue systems typically rely on components specifically developed for a single task or domain. This limits such systems in two different ways: If there is an update in the task domain, the dialogue system usually needs to be updated or completely re-trained. It is also harder to extend such dialogue systems to different and multiple domains. The dialogue state tracker in conventional dialogue

Conversational AI

Object-Oriented Security Proofs

Ernie Cohen

FM 2018

2018

We use standard program transformations to construct formal security proofs.

Automated reasoning
