Amazon's 36 ICASSP papers touch on everything audio

Topics range from the predictable, such as speech recognition and noise cancellation, to singing separation and automatic video dubbing.

June 4, 2021

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) starts next week, and as Alexa principal research scientist Ariya Rastrow explained last year, it casts a wide net. The topics of the 36 Amazon research papers at this year’s ICASSP range from the classic signal-processing problems of noise and echo cancellation to such far-flung problems as separating song vocals from instrumental tracks and regulating translation length.

A plurality of the papers, however, concentrate on the core technology of automatic speech recognition (ASR), or converting an acoustic speech signal into text:

ASR n-best fusion nets
Xinyue Liu, Mingda Li, Luoxin Chen, Prashan Wanigasekara, Weitong Ruan, Haidar Khan, Wael Hamza, Chengwei Su
Bifocal neural ASR: Exploiting keyword spotting for inference optimization
Jon Macoskey, Grant P. Strimel, Ariya Rastrow
Domain-aware neural language models for speech recognition
Linda Liu, Yile Gu, Aditya Gourav, Ankur Gandhe, Shashank Kalmane, Denis Filimonov, Ariya Rastrow, Ivan Bulyko
End-to-end multi-channel transformer for speech recognition
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann
Improved robustness to disfluencies in RNN-transducer-based speech recognition
Valentin Mendelev, Tina Raissi, Guglielmo Camporese, Manuel Giollo
Personalization strategies for end-to-end speech recognition systems
Aditya Gourav, Linda Liu, Ankur Gandhe, Yile Gu, Guitang Lan, Xiangyang Huang, Shashank Kalmane, Gautam Tiwari, Denis Filimonov, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko
reDAT: Accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling
Hu Hu, Xuesong Yang, Zeynab Raeesy, Jinxi Guo, Gokce Keskin, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Roland Maas
Sparsification via compressed sensing for automatic speech recognition
Kai Zhen, Hieu Duy Nguyen, Feng-Ju Chang, Athanasios Mouchtaris, Ariya Rastrow
Streaming multi-speaker ASR with RNN-T
Ilya Sklyar, Anna Piunova, Yulan Liu
Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems
Xianrui Zheng, Yulan Liu, Deniz Gunceler, Daniel Willett

To enable personalization of end-to-end automatic-speech-recognition systems, Linda Liu, Aditya Gourav and their colleagues use a word-level biasing finite state transducer, or FST (left). A subword-level FST preserves the weights of the word-level FST. For instance, the weight between states 0 and 5 of the subword-level FST (representing the word “player”) is (-1.6) + (-1.6) + (-4.8) = -8.
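The arithmetic in the caption can be checked with a few lines of Python. This is only a sketch of the weight-preservation idea, not the authors' implementation: the three arc weights come from the caption, while the subword labels and the word-level weight of -8.0 are assumptions inferred from it.

```python
import math

# Arc weights quoted in the caption for the subword path that spells "player".
# The subword labels are hypothetical; the caption gives only the weights.
subword_arcs = [("pl", -1.6), ("ay", -1.6), ("er", -4.8)]

# Assumed word-level weight for "player", inferred from the caption's total.
word_level_weight = -8.0


def path_weight(arcs):
    """Sum the arc weights along a path (additive, log-style weights)."""
    return sum(weight for _, weight in arcs)


total = path_weight(subword_arcs)

# The subword path should carry the same total weight as the word-level arc.
print(math.isclose(total, word_level_weight))
```

However the word is split into subwords, the per-arc weights only need to sum to the word-level weight for the biasing to carry over unchanged.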

Two of the papers address language (or code) switching, a more complicated version of ASR in which the speech recognizer must also determine which of several possible languages is being spoken:

Joint ASR and language identification using RNN-T: An efficient approach to dynamic language switching
Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Markus Mueller, Sergio Murillo, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann
Transformer-transducers for code-switched speech recognition
Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

The acoustic speech signal contains more information than just the speaker’s words; how the words are said can change their meaning. Such paralinguistic signals can be useful for a voice agent trying to determine how to interpret the raw text. Two of Amazon’s ICASSP papers focus on such signals:

Contrastive unsupervised learning for speech emotion recognition
Mao Li, Bo Yang, Joshua Levy, Andreas Stolcke, Viktor Rozgic, Spyros Matsoukas, Constantinos Papayiannis, Daniel Bone, Chao Wang
Disentanglement for audiovisual emotion recognition using multitask setup
Raghuveer Peri, Srinivas Parthasarathy, Charles Bradshaw, Shiva Sundaram

Several papers address other extensions of ASR, such as speaker diarization, or tracking which of several speakers issues each utterance; inverse text normalization, or converting the raw ASR output into a format useful to downstream applications; and acoustic event classification, or recognizing sounds other than human voices:

The structure of a joint echo control and noise suppression system from Amazon. A microphone *(mic)* captures the output of a loudspeaker, along with noise and echo. The echo is partially cancelled by an adaptive filter *(ĥ_f)*, which uses the signal to the speaker. The microphone signal then passes to a residual-echo-suppression *(RES)* algorithm.

Speech enhancement, or removing noise and echo from the speech signal, has been a prominent topic at ICASSP since the conference began in 1976. But more recent work on the topic — including Amazon’s two papers this year — uses deep-learning methods:

Enhancing into the codec: Noise robust speech coding with vector-quantized autoencoders
Jonah Casebeer, Vinjai Vale, Umut Isik, Jean-Marc Valin, Ritwik Giri, Arvindh Krishnaswamy
Low-complexity, real-time joint neural echo control and speech enhancement based on Percepnet
Jean-Marc Valin, Srikanth V. Tenneti, Karim Helwani, Umut Isik, Arvindh Krishnaswamy
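The echo-control figure caption earlier in this section describes an adaptive filter that estimates the echo from the loudspeaker signal and subtracts it from the microphone signal. As a rough illustration only, the sketch below uses a time-domain normalized least-mean-squares (NLMS) filter. NLMS is a standard textbook technique swapped in here for clarity, not the method of the Amazon paper. The echo path, tap count, and step size are made-up values, there is no near-end speech or noise in the simulation, and the residual-echo-suppression stage is omitted.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `far_end` from `mic`.

    Returns the residual signal and the final filter weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after echo removal
        norm = sum(h * h for h in history) + eps
        # NLMS update: step along the input, scaled by the input energy.
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


rng = random.Random(0)
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # signal sent to the speaker
echo_path = [0.6, 0.3, 0.1]  # hypothetical loudspeaker-to-mic response

# Simulated microphone signal: the far-end signal filtered by the echo path.
mic = [
    sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]

residual, weights = nlms_echo_cancel(far_end, mic)
mic_energy = sum(s * s for s in mic[-200:])
residual_energy = sum(e * e for e in residual[-200:])
print(residual_energy < 1e-6 * mic_energy)
```

In a real device the microphone also picks up the talker and background noise, so the filter cancels the echo only partially, which is why a residual-echo-suppression stage follows it.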

Every interaction with Alexa begins with a wake word — usually “Alexa”, but sometimes “computer” or “Echo”. So at ICASSP, Amazon usually presents work on wake word detection — or keyword spotting, as it’s more generally known:

Exploring the application of synthetic audio in training keyword spotters
Andrew Werchniak, Roberto Barra-Chicote, Yuriy Mishchenko, Jasha Droppo, Jeff Condal, Peng Liu, Anish Shah

In many spoken-language systems, the next step after ASR is natural-language understanding (NLU), or making sense of the text output from the ASR system:

Introducing deep reinforcement learning to NLU ranking tasks
Ge Yu, Chengwei Su, Emre Barut
Language model is all you need: Natural language understanding as question answering
Mahdi Namazifar, Alexandros Papangelis, Gokhan Tur, Dilek Hakkani-Tür

In some contexts, however, it’s possible to perform both ASR and NLU with a single model, in a task known as spoken-language understanding:

A spoken-language-understanding system combines automatic speech recognition *(ASR)* and natural-language understanding *(NLU)* in a single model.

An interaction with a voice service, which begins with keyword spotting, ASR, and NLU, often culminates with the agent’s use of synthesized speech to relay a response. The agent’s text-to-speech model converts the textual outputs of various NLU and dialogue systems into speech:

CAMP: A two-stage approach to modelling prosody in context
Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman
Low-resource expressive text-to-speech using data augmentation
Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba
Prosodic representation learning and contextual sampling for neural text-to-speech
Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman
Universal neural vocoding with Parallel WaveNet
Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

All of the preceding research topics have implications for voice services like Alexa, but Amazon has a range of other products and services that rely on audio-signal processing. Three of Amazon’s papers at this year’s ICASSP relate to audio-video synchronization: two deal with dubbing audio in one language onto video shot in another, and one describes how to detect synchronization errors in video — as when, for example, the sound of a tennis ball being struck and the shot of the racquet hitting the ball are misaligned:

Amazon’s Text-to-Speech team has an ICASSP paper on the unusual topic of computer-assisted pronunciation training, a feature of some language learning applications. The researchers’ method would enable language learning apps to accept a wider range of word pronunciations, to score pronunciations more accurately, and to provide more reliable feedback:

Mispronunciation detection in non-native (L2) English with uncertainty modeling
Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

The architecture of a new Amazon model for separating a recording's vocal tracks and instrumental tracks.

Another paper investigates the topic of singing voice separation, or separating vocal tracks from instrumental tracks in song recordings:

Semi-supervised singing voice separation with noise self-training
Zhepei Wang, Ritwik Giri, Umut Isik, Jean-Marc Valin, Arvindh Krishnaswamy

Finally, two of Amazon’s ICASSP papers, although they do evaluate applications in speech recognition and audio classification, present general machine learning methodologies that could apply to a range of problems. One paper investigates federated learning, a distributed-learning technique in which multiple servers, each with a different, local store of training data, collectively build a machine learning model without exchanging data. The other presents a new loss function for training classification models on synthetic data created by transforming real data — for instance, training a sound classification model with samples that have noise added to them artificially.
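To make the federated-learning idea concrete, here is a minimal sketch of federated averaging, a common baseline for this setting. It is not taken from the Amazon paper, whose method is not described here. The one-parameter linear model, the three local datasets, the learning rate, and the function names are all assumptions chosen for illustration.

```python
def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent passes for y ≈ w * x on one site's local data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, local_stores):
    """One round of federated averaging.

    Each site trains on its own data and returns only its updated weight;
    the raw (x, y) pairs never leave the site. The server then averages the
    weights, weighting each site by its number of local examples.
    """
    updates = [(local_update(global_w, data), len(data)) for data in local_stores]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical local datasets, all generated from y = 2 * x.
local_stores = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, local_stores)

print(abs(w - 2.0) < 1e-6)
```

Only model parameters cross the network in this scheme, which is what lets the sites build a shared model without exchanging their training data.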

Also at ICASSP, on June 8, seven Amazon scientists will be participating in a half-hour live Q&A. Conference registrants may submit questions to the panelists online.

About the Author

Larry Hardesty

Larry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.
