Reinforcement learning

LLMEvalRec: An agentic framework for simulating users to evaluate news recommendation systems

Yao Ma, Samuel Louvan, Abhishek Tripathi, Wei Liu, Murat Sensoy

AAMAS 2026

2026

Evaluating news recommendation systems (NRS) presents unique challenges due to their dynamic and interactive nature coupled with evolving user interests. In the early stages of development, when user bases and historical data are scarce, it is difficult to conduct meaningful offline and online evaluations. This cold-start evaluation challenge hinders data-driven decision-making for product development and

Conversational AI

CodeV: Code with images for faithful visual reasoning via tool-aware policy optimization

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia (Kevin) Liu, Todd C. Hollon, Bryan Wang

CVPR 2026

2026

Agentic vision–language models are increasingly trained to 'think with images' by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate

Computer vision

Align to structure: Aligning large language models with structural information

Zae Kim, Anand Ramachandran, Farideh Tavazoee, JK Kim, Oleg Rokhlenko, Dongyeop Kang

AAAI 2026

2026

Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our

Conversational AI

Self-aligned reward: Towards effective and efficient reasoners

Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris (Luyang) Kong

ICLR 2026

2026

Reinforcement learning with verifiable rewards has significantly advanced reasoning with large language models (LLMs) in domains such as mathematics and logic. However, verifiable signals provide only coarse-grained or binary correctness feedback. This limitation results in inefficiencies like overly verbose or repetitive reasoning. Existing length-based solutions (e.g., length penalty) compromise accuracy

Conversational AI

R-WOM: Retrieval-augmented world model for computer-use agents

Kai Mei, Jiang Guo, Shuaichen Chang, Marvin Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang

ICLR 2026

2026

Large language models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which could lead to compounding errors that

Search and information retrieval

Turn-PPO: Turn-level advantage estimation with PPO for improved multi-turn RL in agentic LLMs

Junbo Li, Peng Zhou, Rui Meng, Meet Vadera, Lihong Li, Laurence (Yang) Li

EACL 2026

2026

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage

Machine learning

Self-refining vision language model for robotic failure detection and reasoning

Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan, Mani Nambi, Amy Zhang, Yesh Dattatreya

ICLR 2026

2026

Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing

Automated reasoning

Confidence-calibrated small-large language model collaboration for cost-efficient reasoning

Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu, Henan Wang, Xavier Wang, Yaxiao Liu

AAAI 2026

2026

Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized

Machine learning

MAPRO: Recasting multi-agent prompt optimization as maximum a posteriori inference

Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, Yanfang Ye

EACL 2026

2026

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates

Conversational AI

SALT: Step-level advantage assignment for long-horizon agents via trajectory graph

Jiazheng Li, Yawei Wang, David Yan, Yijun Tian, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong

EACL 2026

2026

Large language models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards, a limitation

Conversational AI

