LLMEvalRec: An agentic framework for simulating users to evaluate news recommendation systems
2026
Evaluating news recommendation systems (NRS) presents unique challenges due to their dynamic and interactive nature and users' evolving interests. In the early stages of development, when user bases and historical data are scarce, it is difficult to conduct meaningful offline or online evaluations. This cold-start evaluation challenge hinders data-driven decision-making for product development and deployment. To address this, we propose LLMEvalRec, a framework that leverages Large Language Model (LLM) agents to simulate user behavior for NRS evaluation. Our approach features generative agents that automatically construct user profiles from a small number of user reading histories and perform realistic actions, and introduces the Guided Episodic Search (GUES) algorithm, which guides automated prompt optimization by drawing on human prompt engineering practices. Experiments demonstrate that LLMEvalRec-generated data achieves a Spearman correlation of 0.97 with real evaluation rankings, significantly outperforming baseline simulators (0.40 and -0.05), and successfully predicts relative performance trends on both the MIND benchmark and real customer datasets. Validation in a production environment shows consistent alignment between simulated metrics and real click-through rate (CTR) improvements.
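To illustrate the evaluation metric used above: the quality of a simulator can be judged by how well its ranking of NRS variants agrees with the ranking from a real evaluation, measured by Spearman rank correlation. A minimal sketch, with hypothetical rankings (the numbers below are illustrative, not from the paper):

```python
def spearman_rho(real_ranks, sim_ranks):
    """Spearman rank correlation for rankings without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(real_ranks)
    d_squared = sum((r - s) ** 2 for r, s in zip(real_ranks, sim_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical example: 5 NRS variants ranked by a real evaluation
# versus by a user simulator (ranks 1 = best).
real = [1, 2, 3, 4, 5]
simulated = [1, 2, 4, 3, 5]   # simulator swaps variants 3 and 4
print(spearman_rho(real, simulated))  # -> 0.9
```

A correlation near 1 (such as the 0.97 reported for LLMEvalRec) means the simulator preserves the real ordering of systems almost exactly, while values near 0 or below (the baselines' 0.40 and -0.05) mean its rankings carry little or no signal.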