Transforming expert insight into scalable AI assessment: A framework for LLM-generated metrics and user-calibrated evaluation
2025
Effectively assessing AI systems, particularly those operating in specialized domains or producing dynamic outputs, requires translating nuanced human expertise into scalable, quantitative measures. Traditional metrics often fall short in capturing qualitative requirements that domain experts intuitively grasp. This paper presents a novel framework that systematically transforms qualitative expert feedback into quantitative metrics for assessing the output quality of AI systems. Our methodology leverages Large Language Models (LLMs): first to articulate and formalize these metrics from expert input, and then as 'judges' that apply them automatically. As validation, we present initial results from calibration against expert ratings, demonstrating that automated assessments align with human judgment and can evolve with changing requirements. Learning content creation serves as our illustrative specialized domain. Its reliance on learning design frameworks, coupled with the need for nuanced expert evaluation of pedagogical quality, makes it an ideal test case for our framework. Results show that our LLM-generated, expert-calibrated metrics achieve promising alignment with expert evaluations, enabling robust, scalable, and adaptable assessment.
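To make the workflow concrete, the sketch below illustrates, under stated assumptions, the two pieces the abstract describes: an LLM 'judge' scoring content against a rubric, and a calibration step comparing those scores with expert ratings. The `llm_judge` callable, the rubric wording, and the toy scores are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch (not the authors' implementation) of LLM-as-judge scoring
# plus calibration against expert ratings via rank correlation.

from typing import Callable, Sequence
from scipy.stats import spearmanr

# Hypothetical rubric distilled from expert feedback on learning content.
RUBRIC = """Rate the learning content from 1 (poor) to 5 (excellent) on:
- alignment with stated learning objectives
- clarity of explanations for the target audience
- soundness of the assessment items
Return a single integer score."""


def judge_scores(
    items: Sequence[str],
    llm_judge: Callable[[str, str], int],  # (rubric, content) -> score; placeholder
) -> list[int]:
    """Apply the LLM judge with the rubric to each content item."""
    return [llm_judge(RUBRIC, item) for item in items]


def calibrate(automated: Sequence[int], expert: Sequence[int]) -> float:
    """Spearman rank correlation between automated and expert ratings.

    A high positive value suggests the LLM-generated metric tracks expert
    judgment; a low value signals the rubric needs revision.
    """
    rho, _p_value = spearmanr(automated, expert)
    return rho


if __name__ == "__main__":
    # Toy, made-up scores, shown only to demonstrate the calibration call.
    automated_scores = [4, 2, 5, 3, 4, 1]
    expert_scores = [5, 2, 4, 3, 4, 2]
    print(f"Spearman rho = {calibrate(automated_scores, expert_scores):.2f}")
```

In this reading, calibration is iterative: when agreement with experts is low, the rubric (the LLM-generated metric) is revised and re-evaluated, which is what allows the metrics to evolve with changing requirements.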