technical · fact-check · 2026-04-20 · 7 min read

How to Fact-Check with AI Without Hallucinated Citations

Ask ChatGPT to cite its sources. Watch what happens.

You paste a claim into ChatGPT. You ask: "Is this accurate? Can you find sources?"

It answers confidently. It produces URLs. They look real — proper domains, plausible article titles.

Half of them don't exist.

This is citation hallucination. It's not a bug that will be patched in the next model release. It's a fundamental property of how language models work: they're trained to produce plausible-sounding outputs, not verified ones. When you ask an LLM to "find sources," it generates what a source citation looks like. That's different from actually retrieving sources.

This is why most AI fact-checkers are worse than useless. They give you the feeling of verification without the verification.

CheckApp's fact-check skill is built on a different premise. It doesn't ask the LLM whether a claim is true. It retrieves real sources first, then asks the LLM to reason about them. That's retrieve-before-reason, and it's the only approach that actually works.

Why LLMs fail at fact-checking

LLMs have two distinct failure modes when you ask them to verify claims.

Mode 1: Confident and wrong. The model answers from training data. If the claim is about something in its training set, it might get it right. If the claim is about something recent, obscure, or contested, it answers confidently from the closest pattern match in its weights. No caveat, no "I'm not sure." Just a confident-sounding wrong answer.

Mode 2: Hallucinated citations. You push back. "Can you show me a source?" The model produces a citation. It formats it correctly — volume numbers, DOI-style identifiers, plausible author names. The paper doesn't exist. The URL 404s. The journal is real but the article title is fabricated.

Neither failure mode is unique to any particular model. Gemini does this. GPT-4o does this. Claude does it less often, but still does it. The problem isn't the specific model — it's using a model's internal memory for a problem that requires external retrieval.

Fact-checking is not a reasoning problem. It's a retrieval problem wrapped in a reasoning problem. You have to find real sources before you can reason about whether they support the claim. LLMs are bad at step one. They're actually good at step two.

CheckApp separates the steps and assigns each to the right tool.

How the pipeline works

Here's the actual three-stage pipeline, pulled directly from src/skills/factcheck.ts.

Stage 1: Extract claims

const claimsText = await llm.call(extractClaimsPrompt(text), 1024);

The LLM reads the article and extracts up to 4 specific, verifiable factual claims. The prompt is explicit about what counts:

Focus on claims about statistics, dates, scientific facts, or named entities — not opinions.

Opinions, hedges, and general assertions are filtered out. What's left is claims the system can actually check. A sentence like "AI is transforming how companies work" doesn't get extracted. A sentence like "the average B2B buyer engages with 13 pieces of content before purchasing" does.

Cost: ~$0.001 per article for this step.
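To make the extraction step concrete, here is a hypothetical sketch of a prompt builder in the spirit of the extractClaimsPrompt call above. The function name matches the snippet, but the wording and structure are illustrative, not copied from src/skills/factcheck.ts:

```typescript
// Hypothetical sketch of extractClaimsPrompt; the real builder in
// src/skills/factcheck.ts may phrase this differently.
function extractClaimsPrompt(text: string, maxClaims: number = 4): string {
  return [
    `Extract up to ${maxClaims} specific, verifiable factual claims from the article below.`,
    "Focus on claims about statistics, dates, scientific facts, or named entities, not opinions.",
    "Return one claim per line.",
    "",
    "Article:",
    text,
  ].join("\n");
}

const examplePrompt = extractClaimsPrompt(
  "The average B2B buyer engages with 13 pieces of content before purchasing."
);
```

The cap on claims keeps the downstream retrieval cost bounded: at most 4 searches per article, regardless of article length.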

Stage 2: Retrieve real sources

For each extracted claim, CheckApp queries Exa:

const result = await exa.search(claim, {
  type: "auto",           // neural search, not keyword matching
  numResults: 3,
  contents: {
    highlights: { maxCharacters: 1500, query: claim }
  }
});

type: "auto" uses Exa's neural search, which finds semantically similar content rather than keyword matches. This matters when the claim is phrased differently than the source — which is most of the time. A source might describe the same study with different wording. Keyword search misses it. Neural search finds it.

The results come back as real URLs with extracted highlights — actual text from the page, not summaries generated by a model. This is the ground truth the next stage reasons over.
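Those highlights have to be flattened into a single evidence block before the next stage can read them. A minimal sketch, assuming Exa's documented result fields (url, title, highlights); how CheckApp actually formats the block is an assumption:

```typescript
// Sketch: flatten Exa results into the plain-text evidence Stage 3 reads.
// Field names follow Exa's result shape; the formatting is illustrative.
interface ExaResult {
  url: string;
  title: string;
  highlights: string[];
}

function buildEvidence(results: ExaResult[]): string {
  return results
    .map((r, i) => `[${i + 1}] ${r.title} (${r.url})\n${r.highlights.join("\n")}`)
    .join("\n\n");
}
```

Numbering each source lets the assessment note refer back to a specific page ("source [2] contradicts the figure") without restating the URL.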

Cost depends on which provider you pick — they are mutually exclusive:

  • Exa Search (type: "auto", 3 results): ~$0.007 per claim
  • Exa Deep Reasoning (type: "deep-reasoning", 5 results, slower, deeper retrieval): ~$0.025 per claim
  • Parallel Task (research-grade multi-hop reasoning, slowest): ~$0.03 per claim

You configure one provider for the fact-check skill. You don't mix them per run.
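Under those published per-claim rates, retrieval cost scales linearly with the number of extracted claims. A quick sketch (the object keys are illustrative labels, not CheckApp config names):

```typescript
// Per-claim retrieval rates from the list above; keys are illustrative
// labels, not actual CheckApp configuration values.
const retrievalRates = {
  exaSearch: 0.007,
  exaDeepReasoning: 0.025,
  parallelTask: 0.03,
} as const;

function retrievalCost(provider: keyof typeof retrievalRates, claims: number): number {
  return retrievalRates[provider] * claims;
}
```

A 4-claim article costs about $0.028 with Exa Search and $0.10 with Deep Reasoning, which is where the totals in the cost section below come from.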

Stage 3: Assess whether sources support the claim

const assessPrompt = `Is this claim supported by the evidence below?
Claim: "${claim}"

Evidence:
${evidence}

Reply with JSON:
{ "supported": true|false|null, "note": "one sentence", "claimType": "scientific"|"medical"|"financial"|"general" }

- supported: true if evidence supports the claim, false if contradicts, null if inconclusive.`;

The LLM now has real text from real pages. It's not being asked to recall facts. It's being asked to read and reason. That's what LLMs are actually good at.

The response is structured: supported (true/false/null), a one-sentence note, and a claim type classification. The claim type matters — medical and scientific claims trigger academic citation enrichment via Semantic Scholar in a separate pass.

Cost: ~$0.001 per claim for this reasoning step.
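The model's JSON reply still has to be parsed defensively: a malformed reply should degrade to "inconclusive," not crash the run. A sketch of that parse step, where the fallback behavior is my assumption rather than something pulled from factcheck.ts:

```typescript
interface Assessment {
  supported: boolean | null;
  note: string;
  claimType: "scientific" | "medical" | "financial" | "general";
}

// Parse the model's JSON reply; fall back to an inconclusive assessment
// when the reply is not valid JSON. The fallback behavior is assumed.
function parseAssessment(raw: string): Assessment {
  try {
    const parsed = JSON.parse(raw);
    return {
      supported: typeof parsed.supported === "boolean" ? parsed.supported : null,
      note: typeof parsed.note === "string" ? parsed.note : "",
      claimType: ["scientific", "medical", "financial"].includes(parsed.claimType)
        ? parsed.claimType
        : "general",
    };
  } catch {
    return { supported: null, note: "unparseable model reply", claimType: "general" };
  }
}
```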

How severities roll up to a verdict. Each claim becomes a finding with severity mapped from supported:

  • supported: false → severity error (claim contradicted)
  • supported: null → severity warn (inconclusive — not enough evidence)
  • supported: true → severity info (verified)

The skill's overall verdict:

  • Any error → fail
  • 2 or more warn → warn
  • Otherwise → pass

So a single contradicted claim fails the skill. Inconclusive claims warn without failing. Verified claims contribute info findings that keep the verdict clean.
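That rollup is simple enough to sketch directly. The function names here are mine, not the skill's, but the mapping follows the lists above:

```typescript
type Severity = "error" | "warn" | "info";

// Map a claim's `supported` value to a finding severity, per the list above.
function severityFor(supported: boolean | null): Severity {
  if (supported === false) return "error"; // contradicted
  if (supported === null) return "warn"; // inconclusive
  return "info"; // verified
}

// Roll findings up to the skill verdict: any error fails, 2+ warns warn.
function verdict(severities: Severity[]): "fail" | "warn" | "pass" {
  if (severities.includes("error")) return "fail";
  if (severities.filter((s) => s === "warn").length >= 2) return "warn";
  return "pass";
}
```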

A real claim, traced end-to-end

Here's what the pipeline looks like on a real article claim.

The claim: "The average adult loses 50 to 100 strands of hair per day."

This is the kind of claim that sounds authoritative, appears in dozens of wellness articles, and is almost never sourced.

Stage 1 output: Extracted as a verifiable factual claim about human biology. Classified as medical.

Stage 2: Exa returns 3 results. One is from the American Academy of Dermatology. One is from a PubMed overview of telogen effluvium. One is a Mayo Clinic FAQ page. Each result includes extracted highlights — paragraph-level text from the actual page.

Stage 3 assessment:

{
  "supported": true,
  "note": "Multiple clinical sources confirm 50–100 strands/day is the normal range.",
  "claimType": "medical"
}

Final finding:

Verified (high confidence): "The average adult loses 50 to 100 strands of hair per day."
Cite: mayoclinic.org, aad.org
Sources: [AAD article], [Mayo Clinic FAQ], [PubMed overview]

Confidence is high because we got 3+ supporting sources and supported === true. If we'd gotten 1 source and an inconclusive assessment, it would be medium. If the evidence contradicted the claim, it would be low, with an error-severity finding that fails the skill.

The finding includes the actual source URLs with titles and published dates. Not fabricated. Not paraphrased from model memory. Real pages the model read.
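The confidence tiers in that finding can be read as a small decision rule. Here is a sketch consistent with the behavior described in this walkthrough; the exact thresholds CheckApp uses are an assumption:

```typescript
// Confidence tiers as described above: supported === true with 3+ sources
// reads as high; contradicted or zero-source claims read as low; everything
// else is medium. The exact thresholds are assumed, not taken from the code.
function confidence(
  supported: boolean | null,
  sourceCount: number
): "high" | "medium" | "low" {
  if (supported === false || sourceCount === 0) return "low";
  if (supported === true && sourceCount >= 3) return "high";
  return "medium";
}
```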

Why Exa over Google Search

The obvious question is: why not just use Google?

The Google Search API returns 10 blue links — page titles and snippets. You get the URL and a meta description. To get the actual text, you'd need to fetch and parse each page. For a pipeline that runs 3–4 searches per article, that's 12–16 HTTP requests per check, with scraping risk on each.

Exa returns extracted content. The highlights are the actual text from the page, chunked around your query. You can reason over it directly without a second fetch step.

Exa is also trained for semantic retrieval. Its "auto" mode uses a neural embedding approach to find conceptually similar content, not just keyword overlap. This matters because real sources rarely use the exact phrasing of the claim you're checking. The B2B buyer study that debunks "13 pieces of content" might describe the same research as "decision-making touchpoints" or "pre-purchase research stages." Neural search finds those matches. BM25 keyword search doesn't.

The other practical reason: Exa provides structured results with publishedDate, title, and per-result highlights. The pipeline uses those fields directly — published date appears in the finding, title appears in the source list. Google's API would require additional processing to get the same structure.

Cost transparency

Per 1,000-word article with 4 claims:

Step                                  Cost
Claim extraction (LLM)                ~$0.001
Exa search × 4 claims (standard)      ~$0.028
Assessment reasoning (LLM) × 4        ~$0.004
Total (standard mode)                 ~$0.033
Total (Exa Deep Reasoning)            ~$0.105

At 1,000 articles/month: standard mode is ~$33. Deep reasoning is ~$105.

For most use cases, standard mode is the right choice. Deep reasoning makes sense for high-stakes content — medical, legal, financial — where you want the additional retrieval depth.

Budget-sensitive users pick Exa Search ($0.007/claim) over Exa Deep Reasoning ($0.025/claim); Parallel Task ($0.03/claim) is the deepest and most expensive of the three. Those three are the only providers the fact-check skill actually runs — other search providers (Tavily, Parallel Search) appear in the registry but are rejected at runtime because they don't return the deep-reasoning schema this skill relies on.

What happens when sources aren't there

Three failure modes worth understanding.

No sources found. Exa returns 0 results for the claim. This happens when the claim is too vague to retrieve against, or when it's about something genuinely obscure. The assessment comes back supported: null — inconclusive, not verified. Verdict: warn. The finding reads: "Unverified (low confidence): [claim] — insufficient evidence to evaluate."

CheckApp does not make up a verdict when it can't find sources. It surfaces the gap. That's the point.

Sources contradict the claim. Exa returns results, but the evidence says the opposite. supported: false, severity error. The finding reads: "Unsupported (low confidence): [claim] — [one-sentence note from the model explaining the contradiction]." Score impact: -25 points per contradicted claim.

Sources exist but don't confirm. Exa finds relevant pages but the highlights don't directly address the claim. supported: null, severity warn. This is the "inconclusive" case — there's evidence nearby but not specifically about this claim. Score impact: -10 points per unverified claim.
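Those per-claim penalties compose into the overall score impact. A sketch using the numbers above, with severities standing in for the three outcomes:

```typescript
// Score impact per the failure modes above: -25 per contradicted claim
// (error), -10 per unverified claim (warn), verified claims cost nothing.
function scoreImpact(findings: Array<"error" | "warn" | "info">): number {
  return findings.reduce(
    (total, f) => total + (f === "error" ? -25 : f === "warn" ? -10 : 0),
    0
  );
}
```

An article with one contradicted and one unverified claim loses 35 points on this skill alone.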

The verdict math:

  • Any fail (contradicted claim) → overall verdict fail
  • More than 1 warn (unverified claim) → overall verdict warn
  • All verified → pass

You always see what was checked, what sources were found, and why the verdict landed where it did. No black-box confidence scores. No "your content needs improvement" without showing you the improvement.

The deeper problem this solves

The issue with AI-assisted content isn't that writers are sloppy. It's that AI-generated first drafts produce confident-sounding text regardless of whether the underlying claims are defensible. The writer reads the draft, it sounds right, it goes to review, the reviewer reads it, it still sounds right. The fabricated statistic never triggers a flag because it sounds exactly like a real statistic.

CheckApp's fact-check skill runs after the draft exists and before it publishes. It doesn't care how confidently the claim was written. It goes and checks.

That's the whole idea: retrieve first, reason second, surface what's missing so you can fix it.

Install and run

npm install -g checkapp
checkapp --setup
checkapp article.md

--setup walks you through provider configuration. For fact-checking, you'll need an Exa API key (Exa Search for budget, Exa Deep Reasoning for depth) or a Parallel Task key. For claim reasoning, Claude, MiniMax, or any OpenRouter model.

checkapp --estimate-cost article.md

Shows estimated cost before you run. You see exactly what you'll spend on Exa queries and LLM calls before touching your keys.

The dashboard (checkapp --ui, opens at localhost:3000) shows the full fact-check report with clickable source links, per-claim confidence scores, and the full source list for each finding.

For Claude Code: the MCP server returns SkillResult[] with the complete sources[] array — URL, title, published date, and extracted quote — for every claim. Your agent can draft, check, and reference the sources it found, all in one workflow.
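As a rough sketch, the payload an agent receives might look like the types below. The field names are inferred from this article's description, not copied from CheckApp's actual type definitions:

```typescript
// Hypothetical shape of the MCP payload described above; field names are
// inferred from the article, not from CheckApp's real interfaces.
interface SourceRef {
  url: string;
  title: string;
  publishedDate?: string;
  quote: string; // extracted highlight the assessment reasoned over
}

interface Finding {
  claim: string;
  supported: boolean | null;
  note: string;
  sources: SourceRef[];
}

interface SkillResult {
  skill: "fact-check";
  verdict: "pass" | "warn" | "fail";
  findings: Finding[];
}
```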

GitHub →