Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents REFCHECKER, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In REFCHECKER, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy, and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. REFCHECKER supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection compared to other granularities, such as response-, sentence-, and sub-sentence-level claims. REFCHECKER outperforms prior methods by 18.2 to 27.2 points on our benchmark, and its checking results are strongly aligned with human judgments.
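To make the claim-triplet idea concrete, here is a minimal sketch of the extract-then-check pipeline. This is purely illustrative: REFCHECKER's actual extractor and checker are LLM-based, and the `ClaimTriplet` class, the `naive_check` function, and its string-matching logic are assumptions for demonstration, not the paper's method.

```python
from dataclasses import dataclass

# Hypothetical representation of a claim-triplet: (subject, predicate, object).
@dataclass(frozen=True)
class ClaimTriplet:
    subject: str
    predicate: str
    obj: str

def naive_check(triplet: ClaimTriplet, reference: str) -> str:
    """Toy stand-in for an LLM checker: label one triplet against a reference.

    A real checker would reason over meaning; this sketch only tests whether
    each part of the triplet literally appears in the reference text.
    """
    ref = reference.lower()
    parts = (triplet.subject, triplet.predicate, triplet.obj)
    if all(part.lower() in ref for part in parts):
        return "Entailment"  # every part is grounded in the reference
    return "Neutral"         # cannot be verified from the reference alone

# Triplets that an extractor might produce from an LLM response.
triplets = [
    ClaimTriplet("Paris", "is the capital of", "France"),
    ClaimTriplet("Paris", "has a population of", "10 million"),
]
reference = "Paris is the capital of France."
labels = [naive_check(t, reference) for t in triplets]
```

Checking at the triplet level localizes hallucinations to individual facts, which is what lets the framework report finer-grained results than response- or sentence-level judgments.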
RefChecker: Reference-based fine-grained hallucination checker and benchmark for large language models
2024
Last updated October 29, 2024