New tool, dataset help detect hallucinations in large language models

Representing facts using knowledge triplets rather than natural language enables finer-grained judgments.

For all their remarkable abilities, large language models (LLMs) have an Achilles’ heel: their tendency to hallucinate, or to make assertions that sound plausible but are factually inaccurate. Sometimes these hallucinations can be quite subtle: an LLM might, for instance, make an assertion that’s mostly accurate but gets a date wrong by just a year or two.


To help detect such subtle hallucinations, Amazon has released RefChecker (the “Ref” stands for “reference”), a combination of a new framework for hallucination detection and a benchmark dataset for assessing hallucinations in various contexts.

Where previous hallucination detection frameworks used sentences or short phrases to characterize the factual assertions in LLM-generated texts, RefChecker instead uses knowledge triplets with a <subject, predicate, object> structure — the same structure used to represent data in knowledge graphs. This enables a finer-grained evaluation of an LLM’s output, which should be more precise and more informative.

The benchmark dataset covers three distinct settings: zero context, in which LLMs generate texts to answer a question without any reference texts; noisy context, in which the LLMs are provided with a list of retrieved documents that may or may not contain accurate information (the retrieval-augmented generation, or RAG, setting); and accurate context, in which LLMs are provided with one accurate document. The dataset includes 100 examples for each setting.
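To make the three settings concrete, the sketch below shows roughly how one example per setting might be structured; the field names are illustrative assumptions, not the released dataset’s actual schema.

```python
# Illustrative only: rough shapes of one benchmark example per setting.
# Field names are assumptions for exposition, not the released schema.

zero_context_example = {
    "setting": "zero_context",            # the LLM answers with no reference text
    "question": "<open-domain question>",
    "reference": "<annotated long answer used for checking>",
}

noisy_context_example = {
    "setting": "noisy_context",           # RAG: retrieved passages, possibly irrelevant
    "question": "<question>",
    "retrieved_passages": ["<passage 1>", "<passage 2>", "<passage 3>"],
}

accurate_context_example = {
    "setting": "accurate_context",        # one accurate document, e.g., for summarization
    "instruction": "Summarize the following document.",
    "context": "<accurate input document>",
}
```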

A demo of the RefChecker framework, showing its four stages: (1) extract triplets representing factual claims; (2) gather references; (3) check facts; and (4) localize claims to sentences.

Hallucination detection

The goal of hallucination detection is to check the factuality of LLM-generated responses against a set of references. The problem setting raises three chief questions: (1) How and where do we find the references? (2) At what level of detail will we check the responses? And (3) how do we categorize the claims in the responses?

1. Finding references

RefChecker can accommodate three different ways of answering the question of where to find references, corresponding to the three types of data in the benchmark dataset: (1) zero context (e.g., open question answering); (2) noisy context (e.g., retrieval-augmented generation); and (3) accurate context (e.g., summarization).

Comparison of the three task settings: zero context, noisy context, and accurate context.

The examples in the benchmark dataset are randomly sampled from the following data sources:

| Setting | Data source | Task | References |
|---|---|---|---|
| Zero context | NaturalQuestions (development set) | Closed-book question answering (QA) | Annotated long answer |
| Noisy context | MS MARCO (development set) | Retrieval-augmented generation (RAG) | Retrieved passages |
| Accurate context | databricks-dolly-15k | Summarization, closed QA, information extraction | Input context |

2. Evaluation granularity

Unlike existing methods that analyze paragraphs or sentences, RefChecker decomposes LLM responses into knowledge triplets. This not only allows us to test the factuality of individual knowledge points but also provides more informative and precise insights.

Informally, the claim is the unit to be checked. Previous works used sentences or short phrases excerpted from the LLM-generated text as the claims. RefChecker instead explores representing claims with knowledge triplets. This approach is inspired by knowledge graphs, which employ triplets with a <subject, predicate, object> structure to encapsulate factual knowledge. Knowledge triplets capture finer-grained information about the content of LLM-generated texts than sentences or sub-sentences do. The following is an example of a sentence and the corresponding fine-grained triplets.

“Richard Mulligan played Mr. Kincaid on The Partridge Family.”

| Subject | Predicate | Object |
|---|---|---|
| Richard Mulligan | played role of | Mr. Kincaid |
| Mr. Kincaid | character on | The Partridge Family |
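In code, the same decomposition could be represented with a minimal triplet type like the one below (a sketch for illustration, not RefChecker’s internal representation):

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    """One factual claim in <subject, predicate, object> form."""
    subject: str
    predicate: str
    object_: str

# The sentence above decomposes into two independently checkable claims:
claims = [
    Triplet("Richard Mulligan", "played role of", "Mr. Kincaid"),
    Triplet("Mr. Kincaid", "character on", "The Partridge Family"),
]
```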

3. Claim categorization

Rather than declaring the entire response hallucinatory or not, RefChecker inspects the claims embedded in an LLM-generated text. The basic relationship between an LLM’s response to a prompt and the corresponding references can be visualized as a Venn diagram.

Venn diagram of the possible relationships between an LLM’s response to a prompt and the corresponding references; the intersection contains both entailments (green check marks) and contradictions (red crosses).

The intersection between the response and the references denotes claims that can be directly verified, which are categorized as either entailments (green check marks) or contradictions (red crosses), depending on whether they are supported or refuted by the references.

In practical applications, the references may not always provide sufficient evidence to verify all claims. In such cases, assessing the claims’ truthfulness requires additional evidence (orange question marks); we refer to such claims as neutral.

These three categories align closely with the categories support, refute, and not enough information within the fact-checking literature, and they are commonly used in natural-language inference (NLI). RefChecker uses this three-way classification, rather than conventional binary labels, to precisely model the relationship between responses and references.
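To illustrate how such a three-way judgment can be produced with an off-the-shelf NLI model, here is a minimal sketch; the model choice, the naive verbalization of the triplet, and the label mapping are assumptions for illustration, not RefChecker’s own checker.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"   # any NLI model with three-way labels would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def label_claim(reference: str, triplet: tuple) -> str:
    """Verbalize a <subject, predicate, object> triplet and classify it against
    the reference as entailment, contradiction, or neutral."""
    claim = " ".join(triplet)  # naive verbalization of the triplet
    inputs = tokenizer(reference, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))].lower()
```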

RefChecker pipeline

RefChecker consists of two configurable modules: a claim triplet extractor, E, and a hallucination checker, C. You can also configure how the results are tallied, to translate between detection at the triplet level and hallucination reports at the response level. The modules can be extended and improved individually.

The RefChecker pipeline: LLM-generated text passes to the extractor, which sorts its factual claims into triplets, and a reference text passes to the checker.
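Schematically, the two-module design and a simple response-level tally might look like the sketch below; the function names and the aggregation rule are assumptions for illustration, since both modules and the tallying are configurable.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]   # <subject, predicate, object>
Label = str                      # "entailment" | "contradiction" | "neutral"

Extractor = Callable[[str], List[Triplet]]   # E: response text -> claim triplets
Checker = Callable[[Triplet, str], Label]    # C: (triplet, reference) -> label

def check_response(response: str, reference: str,
                   extract: Extractor, check: Checker) -> List[Tuple[Triplet, Label]]:
    """Stage 1: extract claim triplets from the response; stage 2: label each one."""
    return [(t, check(t, reference)) for t in extract(response)]

def tally(labels: List[Label]) -> str:
    """One possible response-level report (an assumption; tallying is configurable):
    flag the response if any claim is contradicted by the references."""
    if "contradiction" in labels:
        return "hallucination detected"
    if all(label == "entailment" for label in labels):
        return "fully supported"
    return "contains unverifiable claims"
```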

We found that LLMs are generally good at extracting claim triplets from input texts. In the initial RefChecker release, we use both GPT-4 and Claude 2. We will provide a Mixtral-8x7B open-source extractor in our next release.
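As a rough sketch of what prompt-based extraction can look like (the prompt wording and the `call_llm` stand-in are illustrative assumptions, not RefChecker’s actual prompts or clients):

```python
# Hypothetical prompt-based extractor: ask an LLM (GPT-4, Claude 2, etc.) for one
# (subject, predicate, object) triplet per line, then parse its output.
EXTRACTION_PROMPT = """Decompose the following text into factual claims.
Output one claim per line, formatted as (subject, predicate, object).

Text:
{text}
"""

def extract_triplets(text: str, call_llm) -> list:
    """`call_llm` is a stand-in for any LLM client that returns a completion string."""
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split(",")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets
```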

The degree of agreement between the claim triplets from the response and reference texts can be assessed either manually or automatically. We will soon be releasing an annotation tool that can be used for manual assessment. In the initial RefChecker release, we also offer automatic checkers based on GPT-4, Claude 2, and RoBERTa-NLI. More open-source checkers such as AlignScore and our own Mistral-based checker will be available soon. We have found that majority voting among the automatic checkers provides the best agreement with human annotation.
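A minimal sketch of the voting step (the actual ensemble logic lives in the package; this just illustrates taking the most common verdict, with arbitrary tie-breaking):

```python
from collections import Counter
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]
Checker = Callable[[Triplet, str], str]   # returns "entailment" | "contradiction" | "neutral"

def majority_vote(triplet: Triplet, reference: str, checkers: List[Checker]) -> str:
    """Label a claim with each checker and keep the most common verdict."""
    votes = [check(triplet, reference) for check in checkers]
    return Counter(votes).most_common(1)[0][0]
```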

The evaluation process in the zero-context setting: the reference text and the extracted triplets are passed to a human annotator.

Get started with RefChecker

RefChecker is now accessible on our GitHub repo. The package can also be installed using pip. To get started, refer to the QuickStart section in our README. There, you'll find detailed instructions on how to use RefChecker for extracting knowledge triplets, detecting hallucinations at the triplet level, and evaluating your own LLM.
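If you want to try it immediately, installation should look roughly like this (assuming the package is published on PyPI under the project’s name; the README’s QuickStart is authoritative):

```bash
# Assumption: the PyPI package name matches the project name.
pip install refchecker
```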

We believe that detecting and pinpointing subtle, fine-grained hallucinations is the first step toward effective mitigation strategies. For feedback, feel free to reach out via GitHub issues. We welcome and look forward to your contributions and improvements through pull requests.

Acknowledgements: Lin Qiu, Zheng Zhang
