From metrics to meaning: Estimating user feedback using LLM-based evaluation
2026
Large language models (LLMs) are increasingly deployed in real-world applications such as chatbots, writing assistants, and text summarization tools. As these applications become central to user-facing tasks, robust evaluation of their performance is critical, not only for ensuring quality but also for guiding continuous improvement. Traditional evaluation approaches rely on intrinsic metrics computed from model outputs, such as groundedness and relevance (Es et al. 2025; Ru et al. 2024). While these metrics are scalable and generalizable, their relationship to extrinsic business outcomes, such as user satisfaction, remains unclear.

In-product user feedback, such as thumbs-up/thumbs-down ratings, provides direct insight into user experience, but this data is often sparse and potentially biased. In our production chatbot, for instance, fewer than 5% of responses receive any user feedback. Moreover, self-selection bias may arise because users are more likely to provide feedback when they have strong reactions. This creates a fundamental tension: intrinsic metrics are abundant but hard to interpret in practical terms, while user feedback is meaningful but scarce.

To address this challenge, we propose applying the surrogate index framework (Athey et al. 2019), which leverages rich intrinsic metrics as surrogate variables to estimate the effect of model changes on user feedback outcomes. The framework treats user feedback as a downstream outcome and employs LLM-based evaluation metrics as intermediate surrogates, enabling us to estimate the treatment effect of model updates on user satisfaction even when direct feedback is limited. We validate our approach using observational data from a production chatbot. Our analysis demonstrates that surrogate index estimates align directionally with direct estimates of user satisfaction while showing more conservative magnitudes, suggesting potential mitigation of the self-selection bias inherent in voluntary feedback.
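The surrogate index idea described above can be sketched in a few lines: fit a model predicting user satisfaction from intrinsic metrics on the small labeled subset, then use that model to score all responses and compare treatment arms. This is an illustrative outline only, not the paper's implementation; the function names (`fit_surrogate_index`, `surrogate_effect`) and the linear-model choice are hypothetical simplifications.

```python
import numpy as np

def fit_surrogate_index(metrics, feedback):
    """Fit a linear surrogate index on the labeled subset.

    metrics:  (n, k) array of intrinsic LLM-metric scores
              (e.g. groundedness, relevance) for responses with feedback.
    feedback: (n,) observed satisfaction signal (e.g. thumbs-up = 1).
    Returns least-squares coefficients, intercept first.
    """
    X = np.column_stack([np.ones(len(metrics)), metrics])  # add intercept
    beta, *_ = np.linalg.lstsq(X, feedback, rcond=None)
    return beta

def surrogate_effect(beta, metrics_treatment, metrics_control):
    """Estimated treatment effect of a model update: difference in mean
    predicted satisfaction between treatment and control responses,
    computed from intrinsic metrics alone (no feedback needed here)."""
    def predict(m):
        X = np.column_stack([np.ones(len(m)), m])
        return X @ beta
    return predict(metrics_treatment).mean() - predict(metrics_control).mean()
```

In practice the surrogate model could be any supervised learner; a linear fit is used here only to keep the sketch self-contained.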
Through thematic analysis of user questions, we examine heterogeneous treatment effects across question categories, providing actionable insights for targeted improvements. Additionally, we demonstrate how this framework enables prospective offline evaluation of model updates by comparing new responses against existing ones using only LLM metrics and the trained surrogate model. Our work contributes a practical methodology for bridging intrinsic model metrics with extrinsic business outcomes, offering a scalable solution to the evaluation challenge in production AI systems. This research also enhances the interpretability of LLM-based metrics, elevating them from internal diagnostics to impact estimation tools.
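The heterogeneous-effects analysis above amounts to splitting the surrogate estimate by question category. The sketch below shows one minimal way to do this, assuming responses have already been scored by a trained surrogate model; the function name `per_category_effect` and the example categories are hypothetical.

```python
import numpy as np

def per_category_effect(categories, predicted, treated):
    """Per-category treatment effect estimates.

    categories: list of question-category labels, one per response.
    predicted:  (n,) predicted satisfaction from the trained surrogate model.
    treated:    (n,) boolean array, True for responses from the updated model.
    Returns {category: mean(treated) - mean(control)} for categories
    observed in both arms.
    """
    treated = np.asarray(treated, dtype=bool)
    effects = {}
    for cat in set(categories):
        mask = np.array([c == cat for c in categories])
        t = predicted[mask & treated]
        c = predicted[mask & ~treated]
        if len(t) and len(c):  # skip categories missing from either arm
            effects[cat] = t.mean() - c.mean()
    return effects
```

The same scored responses support the prospective use case: new candidate responses can be run through the LLM metrics and the surrogate model offline, then compared against existing responses category by category before any deployment.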