HotpotQA — Multi-Hop QA Benchmark Card
Multi-hop question answering: combine evidence from two gold paragraphs hidden among eight distractors, and identify the supporting facts.
HotpotQA (Yang et al., EMNLP 2018) is a standard benchmark for multi-hop retrieval and reasoning. This page documents how we run it and what we have measured so far: protocol first, numbers second, and clearly caveated.
What it tests
A question that cannot be answered from a single document. The system has to chain evidence across two passages and produce a short-phrase answer, then justify it with the right supporting sentences. The dataset splits questions into two reasoning types:
- bridge — chain facts across two documents (e.g. "Who directed the film starring X?").
- comparison — compare a property across two entities (e.g. "Which city is larger?").
Questions span easy / medium / hard difficulty. (benchmark-content.ts: fullDescription / categories)
Dataset
- 113k Wikipedia-based QA pairs total. (arXiv:1809.09600 abstract)
- We use the distractor dev setting: each question ships with 10 context paragraphs (2 gold + 8 distractors). The full dev split contains 7,405 questions. (The 7,405 dev count comes from the structured content, not the abstract; it is flagged in the card's _CITATION_AUDIT.md.)
- Dataset home: https://hotpotqa.github.io/
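For orientation, a distractor-setting record can be modelled roughly as below. The field names follow the public HotpotQA JSON release (question, answer, type, level, supporting_facts, context); the record contents and the gold_titles / check_distractor_record helpers are hypothetical illustrations, not part of our harness.

```python
# A made-up record in the shape of the public HotpotQA distractor files.
# Titles, sentences, and the question are illustrative placeholders.
example = {
    "_id": "example-0",
    "question": "Which city is larger, A-town or B-ville?",
    "answer": "A-town",
    "type": "comparison",   # "bridge" or "comparison"
    "level": "medium",      # "easy", "medium", or "hard"
    # Each supporting fact is [paragraph title, sentence index].
    "supporting_facts": [["A-town", 0], ["B-ville", 0]],
    # Each context entry is [paragraph title, list of sentences].
    "context": [
        ["A-town", ["A-town has 900,000 residents."]],
        ["B-ville", ["B-ville has 300,000 residents."]],
    ] + [[f"Distractor {i}", [f"Unrelated sentence {i}."]] for i in range(8)],
}


def gold_titles(record):
    """Titles of the paragraphs that contain at least one supporting fact."""
    return {title for title, _sent_idx in record["supporting_facts"]}


def check_distractor_record(record):
    """Return (n_paragraphs, n_gold, n_distractors) for one record."""
    titles = [title for title, _sentences in record["context"]]
    gold = gold_titles(record) & set(titles)
    return len(titles), len(gold), len(titles) - len(gold)


print(check_distractor_record(example))  # (10, 2, 8)
```

A check like this is a cheap guard against malformed records before building the per-question pool.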
Our protocol
Each question is run in an isolated engine — fresh, empty storage, no cross-question state — so there is no leakage between questions. The protocol is:
- Build an isolated atom pool from exactly the 10 context paragraphs (2 gold + 8 distractors).
- Retrieve with top_k ≤ 10 (the pool is only 10 atoms, so there is no trivial over-retrieval).
- Generate a short-phrase answer from the retrieved context.
- Score the answer with token-level F1 (primary) and Exact Match.
- Score supporting facts with recall and F1 against the gold paragraphs.
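The answer metrics in the steps above can be sketched as follows. This uses the SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace) that the official HotpotQA evaluation script also applies. That our harness normalizes identically is an assumption, and the official script adds special handling for yes/no answers that this sketch omits.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
print(round(token_f1("Christopher Nolan", "Nolan"), 3))  # 0.667
```

The second call shows why F1 is the primary metric: a correct but over-long answer scores 0 on Exact Match yet still earns partial credit.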
Because the retrieval pool is only 10 paragraphs and top_k = 10, support recall is trivially 1.0 — every paragraph is always retrievable. The meaningful signal here is Answer F1, not support recall.
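To see why recall saturates, consider a minimal sketch in which retrieval returns the first top_k titles of a ranked 10-paragraph pool. The ranking and titles are made up, support_recall is a hypothetical helper, and recall is assumed to be computed at paragraph level.

```python
def support_recall(retrieved, gold):
    """Fraction of gold paragraphs present in the retrieved list."""
    return len(gold & set(retrieved)) / len(gold)


# Hypothetical worst case: the retriever ranks both gold paragraphs last.
ranking = [f"distractor-{i}" for i in range(8)] + ["gold-a", "gold-b"]
gold = {"gold-a", "gold-b"}

for top_k in (2, 5, 9, 10):
    print(top_k, support_recall(ranking[:top_k], gold))
# 2 0.0
# 5 0.0
# 9 0.5
# 10 1.0
```

Even the worst possible ranking reaches 1.0 once top_k equals the pool size, so the metric cannot separate good retrievers from bad ones in this setting.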
Note on the canonical harness. Our standard cross-system protocol fixes reader = gpt-5 with an LLM-as-judge, run over the full 7,405-question dev set, to make providers apples-to-apples. The measured run below predates that harness: it uses an NVIDIA-hosted small reader and use_judge=false. So it is not a canonical leaderboard row; it is the closest measured run for our own engine.
Our results — measured (not yet in the live matrix)
Measured, with provenance gaps — read the caveats
These numbers come only from a raw run JSON, not a committed matrix cell. HotpotQA runs predate run_id discipline (no run_id field), so the join key is the source_json file. Treat this as a measured data point, not a headline leaderboard claim.
From hotpotqa_20260321_142326.json (best run, n=500, dev distractor):
| Metric | Value | Notes |
|---|---|---|
| Answer F1 (overall) | ≈0.779 | diagnostics.overall_answer_f1 = 0.7789 |
| → bridge | 0.758 | per_category.bridge.avg_f1, n=404 |
| → comparison | 0.869 | per_category.comparison.avg_f1, n=96 |
| Answer EM | ≈63% | diagnostics.overall_answer_em = 0.628 |
| Support recall | 1.0 | trivial — pool=10, top_k=10 (not a quality signal) |
Run config: strategy auto, two_pass=false, top_k 10, embedding Qwen3-Embedding-0.6B (1024d), use_judge=false. (source: experiments/results/hotpotqa_20260321_142326.json)
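As a consistency check, the overall Answer F1 should equal the question-weighted mean of the per-category scores. A minimal sketch using the rounded table values, so agreement is expected only to about three decimal places:

```python
# Per-category Answer F1 and question counts, copied from the table (rounded).
per_category = {
    "bridge": {"avg_f1": 0.758, "n": 404},
    "comparison": {"avg_f1": 0.869, "n": 96},
}
reported_overall = 0.7789  # diagnostics.overall_answer_f1

total_n = sum(cat["n"] for cat in per_category.values())
weighted = sum(cat["avg_f1"] * cat["n"] for cat in per_category.values()) / total_n

print(total_n)                                   # 500
print(round(weighted, 4))                        # 0.7793
print(abs(weighted - reported_overall) < 1e-3)   # True
```

The small residual comes from rounding the per-category averages to three decimals before recombining them.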
Caveats
- Reader-model provenance gap. The run records llm_provider: nvidia but has no llm_model field; we verified it is absent from run_config. The public disclaimer text says "Qwen3-80B"; the protocol's example command uses meta/llama-3.1-8b-instruct. These conflict and neither is confirmed by the artifact, so the exact reader model is unrecorded. Tracked in _CITATION_AUDIT.md.
- Not apples-to-apples with our gpt-5 harness. This run used an NVIDIA small reader and no judge, so it cannot be ranked directly against systems run under the canonical reader = gpt-5 protocol.
- n=500, not full dev. Measured on a 500-question subset of the 7,405-question dev set.
- Support recall is not informative here. With a 10-paragraph pool and top_k = 10, recall is 1.0 by construction.
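A provenance gap like the first caveat can be caught mechanically. The sketch below is hypothetical: REQUIRED_FIELDS and missing_provenance are illustrative names, and the inline run_config mirrors only the settings quoted on this card, not the full artifact or its exact key layout.

```python
# Fields we would want every run to record (an illustrative list, not a spec).
REQUIRED_FIELDS = ("llm_provider", "llm_model", "top_k", "use_judge")


def missing_provenance(run_config):
    """Return the required fields that are absent or empty in run_config."""
    return [name for name in REQUIRED_FIELDS
            if run_config.get(name) in (None, "")]


# Mirrors only the settings quoted on this card; the real JSON has more keys.
run_config = {
    "llm_provider": "nvidia",
    "strategy": "auto",
    "two_pass": False,
    "top_k": 10,
    "use_judge": False,
}

print(missing_provenance(run_config))  # ['llm_model']
```

Note that use_judge=False counts as recorded: only None or an empty string is treated as missing.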
Known asymmetries
External systems publish HotpotQA numbers under their own backbones and protocols, so cross-row comparison is confounded:
- Trained, task-specific systems sit at the top of the official leaderboard (e.g. Beam Retrieval, ~85.0 Answer F1) but use task-specific graph networks — not RAG-comparable.
- RAG systems (HopRAG ~76.1, SiReRAG ~76.5, HippoRAG ~74.3, RAPTOR ~73.1 Answer F1) mostly run GPT-4o readers; HippoRAG uses GPT-3.5-turbo. Those values are migrated from the HopRAG comparison table (arXiv:2502.12442) and are self-reported, not re-run by us.
- The backbone confound is the whole point of fixing reader = gpt-5: until every system runs under the same reader, an Answer-F1 gap mixes engine quality with reader choice. The figures above are not on one axis with the GPT-4o RAG numbers.
These self-reported numbers are reference-only (comparable: false) until re-run under our protocol.
Links
- Interactive dashboard: https://benchmarks.mnemoverse.com
- HotpotQA paper: arXiv:1809.09600 · Dataset: https://hotpotqa.github.io/
- HopRAG (RAG comparison table): arXiv:2502.12442
Related
- Benchmarks overview
- THG — Topological Holographic Graph
- SLoD — Semantic Level of Detail
- AI memory landscape 2026
Edward Izgorodin — last updated 2026-06-21. Numbers traced to hotpotqa_20260321_142326.json; under-verified claims tracked in the card's _CITATION_AUDIT.md.