RefChecker provides a standardized assessment framework to identify subtle hallucinations present in the outputs of large language models (LLMs).
Highlighted Features
- Finer granularity — RefChecker breaks the claims in the LLM's response down into knowledge triplets rather than paragraphs, sentences, or sub-sentences. Checking at the knowledge-triplet level tests the truthfulness of individual facts. Importantly, this finer granularity subsumes the coarser ones and is therefore more informative and precise: triplet-level results can be rolled up the granularity ladder to derive coarser-level metrics whenever needed (see the roll-up sketch after this list).
- Wider coverage — RefChecker distinguishes three settings based on the quality and quantity of context provided for the LLM's response:
  - Zero context: the prompt is a factual question without any context (e.g., open QA).
  - Noisy context: the prompt is a question together with a list of retrieved documents (e.g., RAG).
  - Accurate context: the prompt is a question together with one document (e.g., summarization).
- Human evaluation — RefChecker includes 2.1k human-annotated LLM responses covering 300 test samples, each with responses from seven popular LLMs: GPT-4, GPT-3.5-Turbo, InstructGPT, Falcon (Falcon-40B-Instruct), Alpaca (Alpaca-7B), LLaMA 2 (70B-Chat), and Claude 2. We will release the data and results upon approval.
- Modular architecture — RefChecker is a three-stage pipeline consisting of a claim extractor E, a hallucination checker C, and aggregation rules τ (see the pipeline sketch after this list). Each stage can be invoked and configured individually from the command line. Beyond the three core stages, there are three auxiliary components:
  - a human-labeling tool (coming soon) for labeling claims;
  - a call to a search engine to retrieve references in the zero-context setting; and
  - a localization model to map each knowledge triplet back to the corresponding snippets of the reference.
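
To make the roll-up idea concrete, here is a minimal Python sketch of a knowledge-triplet record and one possible aggregation rule. The `Triplet` dataclass, the `rollup` function, and the three-way labels are hypothetical illustrations for this explanation, not RefChecker's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """One (subject, predicate, object) claim with its checking label."""
    subject: str
    predicate: str
    obj: str
    label: str  # hypothetical labels: "Entailment", "Neutral", or "Contradiction"


def rollup(triplets: List[Triplet]) -> str:
    """Roll triplet-level labels up into one response-level verdict.

    A strict rule: any contradicted triplet marks the whole response as
    "Contradiction"; otherwise any unverifiable triplet marks it "Neutral";
    only a response whose triplets are all entailed is "Entailment".
    """
    labels = {t.label for t in triplets}
    if "Contradiction" in labels:
        return "Contradiction"
    if "Neutral" in labels:
        return "Neutral"
    return "Entailment"


# Example: two supported facts and one contradicted one.
claims = [
    Triplet("Paris", "is the capital of", "France", "Entailment"),
    Triplet("the Eiffel Tower", "is located in", "Paris", "Entailment"),
    Triplet("the Eiffel Tower", "was completed in", "1891", "Contradiction"),
]
print(rollup(claims))  # -> Contradiction
```

The same idea extends to sentence- or paragraph-level metrics: group the triplets by the unit they were extracted from and aggregate within each group.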
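
The end-to-end flow of the three stages can be pictured as a simple function composition. The sketch below is an assumption about the pipeline's shape, not RefChecker's actual Python API; `check_response` and the callable signatures are hypothetical, and in practice the reference text would come from the prompt's context (accurate or noisy settings) or from the search-engine call (zero-context setting).

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def check_response(
    response: str,
    reference: str,
    extractor: Callable[[str], List[Triplet]],   # E: response -> claim triplets
    checker: Callable[[Triplet, str], str],      # C: (triplet, reference) -> label
    aggregator: Callable[[List[str]], str],      # tau: triplet labels -> verdict
) -> str:
    """Run the extractor, checker, and aggregation rules in sequence."""
    triplets = extractor(response)                      # stage 1: claim extraction
    labels = [checker(t, reference) for t in triplets]  # stage 2: per-triplet checking
    return aggregator(labels)                           # stage 3: aggregation
```

Because the stages only communicate through triplets and labels, each one can be swapped or configured independently, which is what the command-line interface exposes.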