Conversational AI

Building software and systems that help people communicate with computers naturally, as if communicating with family and friends.

OrchDAG: Complex tool orchestration in multi-turn interactions with plan DAGs

Yifu Lu, Shengjie Liu, Li Dong

NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models

2025

Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR

Conversational AI
SQLENS: An end-to-end framework for error detection and correction in text-to-SQL

Yue Gong, Chuan Lei, Xiao Qin, Kapil Eknath Vaidya, Balakrishnan (Murali) Narayanaswamy, Tim Kraska

NeurIPS 2025

2025

Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLENS, an end-to-end framework for fine-grained

Conversational AI
SABER: Small actions, big errors — Safe-guarding mutating steps in LLM agents

Alex Cuadron Lafuente, Pengfei Yu, Yang Liu, Arpit Gupta

arXiv

2025

Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations: earliest

Conversational AI
Where did it all go wrong? A hierarchical look into multi-agent error attribution

Adi Banerjee, Anirudh Nair, Tarik Borogovac

NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle

2025

Error attribution in Large Language Model (LLM) multi-agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent- and step-level failures in multi-agent interaction traces (whether using all-at-once evaluation, step-by-step analysis, or binary search) fall short when analyzing complex patterns, struggling with both accuracy and

Conversational AI
Structuring the unstructured: A multi-agent LLM framework for transforming ambiguous SOPs into code

Sachin Kumar Giroh, Pushpendu Ghosh, Aryan Jain, Harshal Paunikar, Anish Nediyanchath, Aditi Rastogi, Promod Yenigalla

EMNLP 2025

2025

This paper introduces a three-stage multi-agent LLM framework designed to transform unstructured and ambiguous Standard Operating Procedures (SOPs) into a structured plan and an executable code template. Unstructured SOPs, common across industries such as finance, retail, and logistics, frequently suffer from ambiguity, missing information, and inconsistency, all of which hinder automation. We address this

Conversational AI

Detoxification of large language models via regularized fine-tuning

Charith Peris

November 21, 2024

Attribute-controlled fine-tuning can produce LLMs that adhere to policy while achieving competitive performance on general benchmarks.

Conversational AI
A quick guide to Amazon’s 50-plus papers at EMNLP 2024

Staff writer

November 14, 2024

Large language models predominate, both as a research subject themselves and as tools for researching topics of particular interest to Amazon, such as speech, recommendations, and information retrieval.

Conversational AI
Enhancing repository-level code completion with selective retrieval

Di Wu

October 17, 2024

A self-supervised method for learning when to retrieve contextual information from a code repository speeds up code completion by 70% while increasing accuracy.

Conversational AI
The life of a prescription at Amazon Pharmacy

Alexandre Alves, Anita Vila

September 30, 2024

From pricing estimation and regulatory compliance to inventory management and chatbot assistants, machine learning models help Amazon Pharmacy customers stay healthy and save time and money.

Conversational AI
How task decomposition and smaller LLMs can make AI more affordable

Burak Gozluklu

September 19, 2024

“Agentic workflows” that use multiple, fine-tuned smaller LLMs — rather than one large one — can improve efficiency.

Machine learning
Accounting for cognitive bias in human evaluation of large language models

Aparna Elangovan

September 16, 2024

A position paper presented at ACL proposes a framework for more-accurate human evaluation of LLMs.
