New tool, dataset help detect hallucinations in large language models

Representing facts using knowledge triplets rather than natural language enables finer-grained judgments.

For all their remarkable abilities, large language models (LLMs) have an Achilles’ heel: their tendency to hallucinate, or to make assertions that sound plausible but are factually inaccurate. Sometimes these hallucinations can be quite subtle: an LLM might, for instance, make an assertion that’s mostly accurate but gets a date wrong by just a year or two.


To help detect such subtle hallucinations, Amazon has released RefChecker (the “Ref” stands for “reference”), a combination of a new framework for hallucination detection and a benchmark dataset for assessing hallucinations in various contexts.

Where previous hallucination detection frameworks used sentences or short phrases to characterize the factual assertions in LLM-generated texts, RefChecker instead uses knowledge triplets with a <subject, predicate, object> structure — the same structure used to represent data in knowledge graphs. This enables a finer-grained evaluation of an LLM’s output, which should be more precise and more informative.

The benchmark dataset covers three distinct settings: zero context, in which LLMs generate texts to answer a question without any reference texts; noisy context, in which the LLMs are provided with a list of retrieved documents that may or may not contain accurate information (the retrieval-augmented generation, or RAG, setting); and accurate context, in which LLMs are provided with one accurate document. The dataset includes 100 examples for each setting.
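To make the three settings concrete, the sketch below shows roughly how one example per setting might be structured; the field names are illustrative assumptions, not the released dataset’s actual schema.

```python
# Illustrative only: rough shapes of one benchmark example per setting.
# Field names are assumptions for exposition, not the released schema.

zero_context_example = {
    "setting": "zero_context",            # the LLM answers with no reference text
    "question": "<open-domain question>",
    "reference": "<annotated long answer used for checking>",
}

noisy_context_example = {
    "setting": "noisy_context",           # RAG: retrieved passages, possibly irrelevant
    "question": "<question>",
    "retrieved_passages": ["<passage 1>", "<passage 2>", "<passage 3>"],
}

accurate_context_example = {
    "setting": "accurate_context",        # one accurate document, e.g., for summarization
    "instruction": "Summarize the following document.",
    "context": "<accurate input document>",
}
```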

A demo of the RefChecker framework, showing its four stages: (1) extract triplets representing factual claims; (2) gather references; (3) check facts; and (4) localize claims to sentences.

Hallucination detection

The goal of hallucination detection is to check the factuality of LLM-generated responses against a set of references. The problem setting raises three chief questions: (1) How and where do we find the references? (2) At what level of detail will we check the responses? And (3) how do we categorize the claims in the responses?

1. Finding references

RefChecker can accommodate three different ways of answering the question of where to find references, corresponding to the three types of data in the benchmark dataset: (1) zero context (e.g., open question answering); (2) noisy context (e.g., retrieval-augmented generation); and (3) accurate context (e.g., summarization).

Comparison of the three task settings: zero context, noisy context, and accurate context.

The examples in the benchmark dataset are randomly sampled from the following data sources:

| Setting | Data source | Task | References |
|---|---|---|---|
| Zero context | NaturalQuestions (development set) | Closed-book question answering (QA) | Annotated long answer |
| Noisy context | MS MARCO (development set) | Retrieval-augmented generation (RAG) | Retrieved passages |
| Accurate context | databricks-dolly-15k | Summarization, closed QA, information extraction | Input context |

2. Evaluation granularity

Unlike existing methods that analyze paragraphs or sentences, RefChecker decomposes LLM responses into knowledge triplets. This not only allows us to test the factuality of individual knowledge points but also provides more informative and precise insights.

Informally, the claim is the unit to be checked. Previous works used sentences or short phrases excerpted from the LLM-generated text as the claims. RefChecker instead explores representing claims with knowledge triplets. This approach is inspired by knowledge graphs, which employ triplets with a <subject, predicate, object> structure to encapsulate factual knowledge. Knowledge triplets capture finer-grained information about the content of LLM-generated texts than sentences or sub-sentences do. The following is an example of a sentence and the corresponding fine-grained triplets.

“Richard Mulligan played Mr. Kincaid on The Partridge Family.”

| Subject | Predicate | Object |
|---|---|---|
| Richard Mulligan | played role of | Mr. Kincaid |
| Mr. Kincaid | character on | The Partridge Family |
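In code, the same decomposition could be represented with a minimal triplet type like the one below (a sketch for illustration, not RefChecker’s internal representation):

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    """One factual claim in <subject, predicate, object> form."""
    subject: str
    predicate: str
    object_: str

# The sentence above decomposes into two independently checkable claims:
claims = [
    Triplet("Richard Mulligan", "played role of", "Mr. Kincaid"),
    Triplet("Mr. Kincaid", "character on", "The Partridge Family"),
]
```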

3. Claim categorization

Rather than declaring the entire response hallucinatory or not, RefChecker inspects the claims embedded in an LLM-generated text. The basic relationship between an LLM’s response to a prompt and the corresponding references can be visualized as a Venn diagram.

Venn diagram of the possible relationships between an LLM’s response to a prompt and the corresponding references; the intersection contains both entailments (green check marks) and contradictions (red crosses).

The intersection between the response and the references denotes claims that can be directly verified, which are categorized as either entailments (green check marks) or contradictions (red crosses), depending on whether they are supported or refuted by the references.

In practical applications, the references may not always provide sufficient evidence to verify all claims. In such cases, assessing the claims’ truthfulness requires additional evidence (orange question marks); we refer to such claims as neutral.

These three categories align closely with the categories support, refute, and not enough information within the fact-checking literature, and they are commonly used in natural-language inference (NLI). RefChecker uses this three-way classification, rather than conventional binary labels, to precisely model the relationship between responses and references.
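To illustrate how such a three-way judgment can be produced with an off-the-shelf NLI model, here is a minimal sketch; the model choice, the naive verbalization of the triplet, and the label mapping are assumptions for illustration, not RefChecker’s own checker.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"   # any NLI model with three-way labels would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def label_claim(reference: str, triplet: tuple) -> str:
    """Verbalize a <subject, predicate, object> triplet and classify it against
    the reference as entailment, contradiction, or neutral."""
    claim = " ".join(triplet)  # naive verbalization of the triplet
    inputs = tokenizer(reference, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))].lower()
```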

RefChecker pipeline

RefChecker consists of two configurable modules: a claim triplet extractor, E, and a hallucination checker, C. You can also configure how the results are tallied, to translate between detection at the triplet level and hallucination reports at the response level. The modules can be extended and improved individually.

The RefChecker pipeline: LLM-generated text passes to the extractor, which sorts its factual claims into triplets, and a reference text passes to the checker.
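Schematically, the two-module design and a simple response-level tally might look like the sketch below; the function names and the aggregation rule are assumptions for illustration, since both modules and the tallying are configurable.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]   # <subject, predicate, object>
Label = str                      # "entailment" | "contradiction" | "neutral"

Extractor = Callable[[str], List[Triplet]]   # E: response text -> claim triplets
Checker = Callable[[Triplet, str], Label]    # C: (triplet, reference) -> label

def check_response(response: str, reference: str,
                   extract: Extractor, check: Checker) -> List[Tuple[Triplet, Label]]:
    """Stage 1: extract claim triplets from the response; stage 2: label each one."""
    return [(t, check(t, reference)) for t in extract(response)]

def tally(labels: List[Label]) -> str:
    """One possible response-level report (an assumption; tallying is configurable):
    flag the response if any claim is contradicted by the references."""
    if "contradiction" in labels:
        return "hallucination detected"
    if all(label == "entailment" for label in labels):
        return "fully supported"
    return "contains unverifiable claims"
```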

We found that LLMs are generally good at extracting claim triplets from input texts. In the initial RefChecker release, we use both GPT-4 and Claude 2. We will provide a Mixtral-8x7B open-source extractor in our next release.
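As a rough sketch of what prompt-based extraction can look like (the prompt wording and the `call_llm` stand-in are illustrative assumptions, not RefChecker’s actual prompts or clients):

```python
# Hypothetical prompt-based extractor: ask an LLM (GPT-4, Claude 2, etc.) for one
# (subject, predicate, object) triplet per line, then parse its output.
EXTRACTION_PROMPT = """Decompose the following text into factual claims.
Output one claim per line, formatted as (subject, predicate, object).

Text:
{text}
"""

def extract_triplets(text: str, call_llm) -> list:
    """`call_llm` is a stand-in for any LLM client that returns a completion string."""
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split(",")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets
```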

The degree of agreement between the claim triplets from the response and reference texts can be assessed either manually or automatically. We will soon be releasing an annotation tool that can be used for manual assessment. In the initial RefChecker release, we also offer automatic checkers based on GPT-4, Claude 2, and RoBERTa-NLI. More open-source checkers such as AlignScore and our own Mistral-based checker will be available soon. We have found that majority voting among the automatic checkers provides the best agreement with human annotation.
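A minimal sketch of the voting step (the actual ensemble logic lives in the package; this just illustrates taking the most common verdict, with arbitrary tie-breaking):

```python
from collections import Counter
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]
Checker = Callable[[Triplet, str], str]   # returns "entailment" | "contradiction" | "neutral"

def majority_vote(triplet: Triplet, reference: str, checkers: List[Checker]) -> str:
    """Label a claim with each checker and keep the most common verdict."""
    votes = [check(triplet, reference) for check in checkers]
    return Counter(votes).most_common(1)[0][0]
```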

The evaluation process in the zero-context setting: the reference text and the extracted triplets are passed to a human annotator.

Get started with RefChecker

RefChecker is now accessible on our GitHub repo. The package can also be installed using pip. To get started, refer to the QuickStart section in our README. There, you'll find detailed instructions on how to use RefChecker for extracting knowledge triplets, detecting hallucinations at the triplet level, and evaluating your own LLM.
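If you want to try it immediately, installation should look roughly like this (assuming the package is published on PyPI under the project’s name; the README’s QuickStart is authoritative):

```bash
# Assumption: the PyPI package name matches the project name.
pip install refchecker
```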

We believe that detecting and pinpointing subtle, fine-grained hallucinations is the first step toward effective mitigation strategies. For feedback, feel free to reach out via GitHub issues. We welcome and look forward to your contributions and improvements through pull requests.

Acknowledgements: Lin Qiu, Zheng Zhang
