Conversational AI

Building software and systems that help people communicate with computers naturally, as if communicating with family and friends.

Generative audio language modeling with continuous-valued tokens and masked next-token prediction

Shu-wen Yang, Byeonggeun Kim, Kuan Po Huang, Huy Phan, Bo-Ru (Roy) Lu, Harsha Sundar, Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang

ICML 2025

2025

Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise

Conversational AI
AutoChunker: Structured text chunking and its evaluation

Arihant Jain, Purav Aggarwal, Anoop S V K K Saladi

ACL 2025

2025

Text chunking is fundamental to modern retrieval-augmented systems, yet existing methods often struggle with maintaining semantic coherence, both within and across chunks, while dealing with document structure and noise. We present AutoChunker, a bottom-up approach for text chunking that combines document structure awareness with noise elimination. AutoChunker leverages language models to identify and segregate

Conversational AI
VADE: Visual attention guided hallucination detection and elimination

Vishnu Prabhakaran, Purav Aggarwal, Vinay Kumar Verma, Gokul Swamy, Anoop S V K K Saladi

ACL 2025

2025

Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations—generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the

Conversational AI
SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning

Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant Strimel, Andreas Schwarz

ACL 2025

2025

We introduce SIFT (Speech Instruction FineTuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech

Conversational AI
Towards safety reasoning in LLMs: AI-agentic deliberation for policy-embedded CoT data creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

ACL 2025

2025

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains

Related: Multiagent AI for generating chain-of-thought training data

Conversational AI

A quick guide to Amazon's 20+ papers at ICASSP 2024

Staff writer

April 11, 2024

This year’s papers address topics such as speech enhancement, spoken-language understanding, dialogue, paralinguistics, and pitch estimation.

Conversational AI
Updating large language models by directly editing network layers

Tamer Soliman

March 25, 2024

Automated method that uses gradients to identify salient layers prevents regression on previously seen data.

Machine learning
New pretraining tasks enable better document understanding

Srikar Appalaraju

March 7, 2024

DocFormerV2 makes sense of documents using local features, outperforming much bigger models.

Computer vision
Knowledge distillation method for better vision-language models

Tianyang Zhao, Yash Singh

February 22, 2024

Method preserves knowledge encoded in teacher model’s attention heads even when student model has fewer of them.

Computer vision
Do large language models understand the world?

Matthew Trager, Stefano Soatto

February 15, 2024

In addition to its practical implications, recent work on “meaning representations” could shed light on some old philosophical questions.

Conversational AI
Amazon and IIT Bombay announce inaugural award recipients

Staff writer

January 25, 2024

Amazon IIT–Bombay AI-ML Initiative seeks to advance artificial intelligence and machine learning research within speech, language, and multimodal-AI domains.

Machine learning

Conversational AI

Publications

Related content

Work with us