Fact-Score (FActScore)

What is FActScore?

FActScore is an automated evaluation metric quantifying factual accuracy in AI-generated long-form text. It decomposes text into minimal “atomic facts,” comparing each against authoritative sources like Wikipedia, calculating what percentage of facts are supported.

In a nutshell: “An automated grading machine checking each article sentence against encyclopedias.”

Key points:

What it does: Verifies whether AI-generated text contains truths
Why it matters: AI confidently writes falsehoods (hallucination), requiring accuracy monitoring
Who uses it: LLM developers, AI quality assurance teams, medical/legal fields requiring high accuracy

Calculation Method

FActScore uses this formula:

FActScore = (Supported Facts / Total Facts) × 100%

Example: If 45 of 50 extracted facts verify against reliable sources: FActScore = (45 / 50) × 100% = 90%

Implementation flow:

Decompose text into atomic facts (via LLM or rule processing)
Search Wikipedia for relevant passages per fact
Human experts or AI judge “is this fact supported?”
Calculate score

Benchmarks

Model FActScores:

GPT-4 — ~68% (high-performance)
ChatGPT — ~58% (general purpose)
Alpaca 65B — ~65% (open model)
Human writing — ~88% (gold standard)

Score interpretation:

80%+ — Trustworthy. Suitable for medical information and high-accuracy applications
70–80% — Practical, though important information needs human verification
Below 60% — Poor. Using directly is risky

Why it matters

Hallucination (AI generating false information) is a critical weakness. Traditional metrics (BLEU, ROUGE) don’t catch subtle factual errors—FActScore does.

Example: “Einstein was born in Germany in 1879” is fully true. But “Einstein won Nobel Prizes in both physics and chemistry” is partially wrong (physics only). FActScore catches such errors.

Medical information, legal documents, journalism, and scientific communication critically need rigorous evaluation. User decisions based on false information cause serious harm.

Real-World Use Cases

Healthcare AI chatbot quality management Medical companies evaluate AI using FActScore. Below-80% scores undergo physician review, preventing patient misinformation.

Academic publisher AI draft verification Academic journals verify AI-generated summaries with FActScore. Sub-50% scores skip adoption, using manual abstracts maintaining credibility.

Multi-language LLM evaluation Research teams compare factual accuracy across Japanese, Chinese, Arabic and other languages objectively.

Benefits and Considerations

Benefits: Automated FActScore rapidly evaluates massive AI-generated text. More efficient than manual full checking, providing objective standards. Teams identify accuracy weak points, improving models.

Considerations: FActScore accuracy depends heavily on reference source quality. Niche domains and recent information lack Wikipedia coverage, risking unfairly low scores. Multiple phrasings of same facts may trigger over-strict scoring.

Hallucination — AI-generated false information FActScore detects
Large Language Model (LLM) — FActScore evaluation targets
Fact-Checking — Human process FActScore automates
AI Quality Assurance — FActScore is critical evaluation tool
Natural Language Processing (NLP) — FActScore’s foundational technology

Frequently Asked Questions

Q: Is 90%+ FActScore fully trustworthy? A: Nearly, but not completely. Reference sources may contain errors; complex nuances may elude detection. Important information needs human final verification.

Q: Can FActScore evaluate Japanese text? A: Yes, but Japanese Wikipedia coverage is lower than English, risking unfairly low scores.

Q: How do we improve FActScore? A: Training data curation, fine-tuning, and stronger fact-verification prompts help improve scores.

Fact-Score (FActScore)

What is FActScore?

Calculation Method

Benchmarks

Why it matters

Real-World Use Cases

Benefits and Considerations

Frequently Asked Questions

Related Terms

Jagged Intelligence

Hallucination Detection

MHaluBench

What is FActScore?

Calculation Method

Benchmarks

Why it matters

Real-World Use Cases

Benefits and Considerations

Related Terms

Frequently Asked Questions

Related Terms

Jagged Intelligence

Hallucination Detection

MHaluBench

Cookie Settings

Necessary Cookies

Analytics Cookies