A quick guide to Amazon's 20+ papers at ICASSP 2024

This year’s papers address topics such as speech enhancement, spoken-language understanding, dialogue, paralinguistics, and pitch estimation.

By Staff writer

April 11, 2024

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) takes place April 14–19 in Seoul, South Korea. Amazon is a bronze sponsor of “the world’s largest and most comprehensive technical conference focused on signal processing and its applications.”

Amazon’s presence includes a workshop (Trustworthy Speech Processing), two of whose organizers are researchers with Amazon's Artificial General Intelligence (AGI) Foundations organization: Anil Ramakrishna, senior applied scientist, and Rahul Gupta, senior manager of applied science. In addition, Wontak Kim, senior manager of research science with Amazon Devices, will present a spotlight talk titled “Synthetic data for algorithm development: Real-world examples and lessons learned.”

As in previous years, many of Amazon’s accepted papers focus on automatic speech recognition. Topics such as speech enhancement, spoken-language understanding, and wake word recognition are all well represented. This year’s publications also touch on dialogue, paralinguistics, pitch estimation, and responsible AI. Below is a quick guide to Amazon’s more than 20 papers at the conference.

Addressee detection

Long-term social interaction context: The key to egocentric addressee detection
Deqian Kong, Furqan Khan, Xu Zhang, Prateek Singhal, Ying Nian Wu

Audio event detection

Cross-triggering issue in audio event detection and mitigation
Huy Phan, Byeonggeun Kim, Vu Nguyen, Andrew Bydlon, Qingming Tang, Chieh-Chi Kao, Chao Wang

Automatic speech recognition (ASR)

Max-margin transducer loss: Improving sequence-discriminative training using a large-margin learning strategy
Rupak Vignesh Swaminathan, Grant Strimel, Ariya Rastrow, Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Thanasis Mouchtaris

Graphic shows three 4 x 3 grids of green, yellow, and red dots connected by arrows; each has a black dot under the bottom dot in the first column and a white dot above the third column. The text <sos> turn on light appears to the left of each grid; the first grid has a title of utterance score (hyp1), the second has a title of utterance score (hyp2), and the third has a title of utterance score (hyp3). There are dotted line columns between the first and second grids with a two-way arrow marked margin connecting those grids. — A novel sequence-discriminative training criterion for automatic speech recognition (ASR) separates “good” and “bad” hypotheses in an N-best list produced from a pretrained transducer model. In this example, the three best hypotheses are fed back into the prediction network, after which the joint network lattice is computed to produce an utterance score for each hypothesis. From "Max-margin transducer loss: Improving sequence-discriminative training using a large-margin learning strategy".

Promptformer: Prompted conformer transducer for ASR
Sergio Duarte Torres, Arunasish Sen, Aman Rana, Lukas Drude, Alejandro Gomez Alanis, Andreas Schwarz, Leif Rädel, Volker Leutnant

Significant ASR error detection for conversational voice assistants
John Harvill, Rinat Khaziev, Scarlett Li, Randy Cogill, Lidan Wang, Gopinath Chennupati, Hari Thadakamalla

An overview of contrastive-learning-for-conversations (CLC) approaches, shown as a flow chart in two sections titled "past/future contrastive learning" and "n-best contrastive learning". The past-future loss maximizes agreement between current, past, and future embeddings. The N-best loss minimizes agreement between current embeddings and top predictions of rephrases, while maximizing agreement otherwise. From "Task oriented dialogue as a catalyst for self-supervised automatic speech recognition".

Task oriented dialogue as a catalyst for self-supervised automatic speech recognition
David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister

Computer vision

Skin tone disentanglement in 2D makeup transfer with graph neural networks
Masoud Mokhtari, Fatima Taheri Dezaki, Timo Bolkart, Betty Mohler Tesch, Rahul Suresh, Amin Banitalebi

Dialogue

Turn-taking and backchannel prediction with acoustic and large language model fusion
Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran

Paralinguistics

Paralinguistics-enhanced large language modeling of spoken dialogue
Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yi Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

Pitch estimation

Noise-robust DSP-assisted neural pitch estimation with very low complexity
Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin

Responsible AI

Leveraging confidence models for identifying challenging data subgroups in speech models
Alkis Koudounas, Eliana Pastor, Vittorio Mazzia, Manuel Giollo, Thomas Gueudre, Elisa Reale, Giuseppe Attanasio, Luca Cagliero, Sandro Cumani, Luca de Alfaro, Elena Baralis, Daniele Amberti

Speaker recognition

Post-training embedding alignment for decoupling enrollment and runtime speaker recognition models
Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke

Speech enhancement

NoLACE: Improving low-complexity speech codec enhancement through adaptive temporal shaping
Jan Buethe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Mike Goodwin

Real-time stereo speech enhancement with spatial-cue preservation based on dual-path structure
Masahito Togami, Jean-Marc Valin, Karim Helwani, Ritwik Giri, Umut Isik, Mike Goodwin

Scalable and efficient speech enhancement using modified cold diffusion: A residual learning approach
Minje Kim, Trausti Kristjansson

Spoken-language understanding

S2E: Towards an end-to-end entity resolution solution from acoustic signal
Kangrui Ruan, Cynthia He, Jiyang Wang, Xiaozhou Joey Zhou, Helian Feng, Ali Kebarighotbi

The architecture of the method proposed in "S2E: Towards an end-to-end entity resolution solution from acoustic signal", which resolves entity mentions in queries to actionable entities in textual catalogues directly from audio. The graphic is a top-to-bottom flow chart split into three sections.

Towards ASR robust spoken language understanding through in-context learning with word confusion networks
Kevin Everson, Yi Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

Text-to-speech

Mapache: Masked parallel transformer for advanced speech editing and synthesis
Guillermo Cambara Ruiz, Patrick Tobing, Mikolaj Babianski, Ravichander Vipperla, Duo Wang, Ron Shmelkin, Giuseppe Coccia, Orazio Angelini, Arnaud Joly, Mateusz Lajszczak, Vincent Pollet

Wake word recognition

Hot-fixing wake word recognition for end-to-end ASR via neural model reprogramming
Pin-Jui Ku, I-Fan Chen, Huck Yang, Anirudh Raju, Pranav Dheram, Pegah Ghahremani, Brian King, Jing Liu, Roger Ren, Phani Nidadavolu

Convolutional-neural-net architecture used in the experiments reported in "Maximum-entropy adversarial audio augmentation for keyword spotting", shown as a left-to-right flow chart.

Maximum-entropy adversarial audio augmentation for keyword spotting
Zuzhao Ye, Gregory Ciccarelli, Brian Kulis

On-device constrained self-supervised learning for keyword spotting via quantization aware pre-training and fine-tuning
Gene-Ping Yang, Yue Gu, Sashank Macha, Qingming Tang, Yuzong Liu
