HotpotQA — Multi-Hop QA Benchmark Card

Multi-hop question answering: combine evidence from two gold paragraphs hidden among eight distractors, and identify the supporting facts.

HotpotQA (Yang et al., EMNLP 2018) is a standard benchmark for multi-hop retrieval and reasoning. This page documents how we run it and what we have measured so far: protocol first, numbers second, and clearly caveated.

What it tests

A question that cannot be answered from a single document. The system has to chain evidence across two passages and produce a short-phrase answer, then justify it with the right supporting sentences. The dataset splits questions into two reasoning types:

bridge — chain facts across two documents (e.g. "Who directed the film starring X?").
comparison — compare a property across two entities (e.g. "Which city is larger?").

Questions span easy / medium / hard difficulty. (benchmark-content.ts: fullDescription / categories)

Dataset

113k Wikipedia-based QA pairs total. (arXiv:1809.09600 abstract)
We use the distractor dev setting: each question ships with 10 context paragraphs (2 gold + 8 distractors). The full dev split contains 7,405 questions. (The 7,405 dev count comes from the structured content, not the abstract; it is flagged in the card's _CITATION_AUDIT.md.)
Dataset home: https://hotpotqa.github.io/
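To make the 2 gold + 8 distractors layout concrete, here is a minimal sketch of how one distractor-setting record can be split into gold and distractor paragraphs. The field names (`_id`, `question`, `answer`, `type`, `level`, `supporting_facts`, `context`) follow the public HotpotQA JSON release; the record content is invented for illustration, and `split_context` is a hypothetical helper, not part of our harness.

```python
# Toy record in the HotpotQA distractor JSON layout (all content is invented).
record = {
    "_id": "toy-0001",
    "question": "Who directed the film starring X?",
    "answer": "Director B",
    "type": "bridge",   # "bridge" or "comparison"
    "level": "hard",    # "easy", "medium", or "hard"
    # Each supporting fact is [paragraph title, sentence index].
    "supporting_facts": [["Film A", 0], ["Director B", 1]],
    # Each context entry is [paragraph title, list of sentences].
    "context": [
        ["Film A", ["Film A is a film starring X.", "It premiered in a toy year."]],
        ["Director B", ["Director B is a filmmaker.", "Director B directed Film A."]],
    ] + [[f"Distractor {i}", [f"Unrelated sentence {i}."]] for i in range(8)],
}


def split_context(rec):
    """Split context paragraphs into gold and distractor lists by title."""
    gold_titles = {title for title, _sent_idx in rec["supporting_facts"]}
    gold = [p for p in rec["context"] if p[0] in gold_titles]
    distractors = [p for p in rec["context"] if p[0] not in gold_titles]
    return gold, distractors


gold, distractors = split_context(record)
print(len(gold), len(distractors))  # prints: 2 8
```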

Our protocol

Each question is run in an isolated engine — fresh, empty storage, no cross-question state — so there is no leakage between questions. The protocol is:

Build an isolated atom pool from exactly the 10 context paragraphs (2 gold + 8 distractors).
Retrieve with top_k ≤ 10 (the pool is only 10 atoms, so there is no trivial over-retrieval).
Generate a short-phrase answer from the retrieved context.
Score the answer with token-level F1 (primary) and Exact Match.
Score supporting facts with recall and F1 against the gold paragraphs.
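The answer-scoring step can be sketched as follows. This is a minimal illustration of token-level F1 and Exact Match using the SQuAD-style normalization (lowercase, strip punctuation, drop articles, collapse whitespace). It assumes our harness normalizes the same way, which this page does not confirm. The official HotpotQA evaluation script additionally special-cases yes/no answers, which this sketch omits.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose but correct answer: EM fails, F1 gives partial credit.
em = exact_match("the Eiffel Tower in Paris", "Eiffel Tower")   # 0.0
f1 = token_f1("the Eiffel Tower in Paris", "Eiffel Tower")      # 2/3
```

The example shows why F1 is the primary metric for short-phrase answers: a prediction that contains the gold phrase plus extra tokens scores 0 on EM but keeps full recall and partial precision under F1.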

Because the retrieval pool is only 10 paragraphs and top_k = 10, support recall is trivially 1.0 — every paragraph is always retrievable. The meaningful signal here is Answer F1, not support recall.
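A small sketch makes the "trivially 1.0" point concrete. It assumes support recall is computed at the paragraph level as the fraction of gold paragraphs among the retrieved ones; `support_recall` and the paragraph ids are hypothetical names for illustration. Even with the worst possible ranking (both gold paragraphs ranked last), retrieving all 10 paragraphs recovers both.

```python
def support_recall(retrieved, gold):
    """Fraction of gold paragraphs that appear among the retrieved ones."""
    gold = set(gold)
    return len(gold & set(retrieved)) / len(gold)


# Isolated pool for one question: 10 paragraphs, 2 of them gold.
pool = [f"para_{i}" for i in range(10)]
gold = {"para_8", "para_9"}

# Worst-case ranking: the two gold paragraphs come last.
worst_ranking = list(pool)

recall_at_10 = support_recall(worst_ranking[:10], gold)  # 1.0 regardless of ranking
recall_at_2 = support_recall(worst_ranking[:2], gold)    # 0.0 for this ranking
```

Recall only becomes sensitive to ranking quality once top_k is smaller than the pool, as the top_k = 2 case shows.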

Note on the canonical harness. Our standard cross-system protocol fixes reader = gpt-5 with an LLM-as-judge, run over the full 7,405-question dev set, to make providers apples-to-apples. The measured run below predates that harness: it uses an NVIDIA-hosted small reader and use_judge=false. So it is not a canonical leaderboard row — it is the closest measured run for our own engine.

Our results — measured (not yet in the live matrix)

Measured, with provenance gaps — read the caveats

These numbers come only from a raw run JSON, not a committed matrix cell. HotpotQA runs predate run_id discipline (no run_id field), so the join key is the source_json file. Treat this as a measured data point, not a headline leaderboard claim.

From hotpotqa_20260321_142326.json (best run, n=500, dev distractor):

| Metric | Value | Notes |
| --- | --- | --- |
| Answer F1 (overall) | ≈0.779 | `diagnostics.overall_answer_f1 = 0.7789` |
| → bridge | 0.758 | `per_category.bridge.avg_f1`, n=404 |
| → comparison | 0.869 | `per_category.comparison.avg_f1`, n=96 |
| Answer EM | ≈63% | `diagnostics.overall_answer_em = 0.628` |
| Support recall | 1.0 | trivial: pool=10, `top_k=10` (not a quality signal) |

Run config: strategy auto, two_pass=false, top_k 10, embedding Qwen3-Embedding-0.6B (1024d), use_judge=false. (source: experiments/results/hotpotqa_20260321_142326.json)
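As a quick consistency check on the table above, the per-category F1 values weighted by their question counts should reproduce the overall figure to within rounding. The sketch below uses only the numbers reported on this page; the tolerance is an assumption that accounts for the category values being rounded to three decimals and the overall value to four.

```python
# (avg_f1, n) per reasoning type, as reported in the results table.
per_category = {
    "bridge": (0.758, 404),
    "comparison": (0.869, 96),
}
reported_overall_f1 = 0.7789  # diagnostics.overall_answer_f1

total_n = sum(n for _f1, n in per_category.values())
weighted_f1 = sum(f1 * n for f1, n in per_category.values()) / total_n

# Worst-case rounding error: 0.0005 from the categories + 0.00005 from the overall.
tolerance = 0.0005 + 0.00005
consistent = abs(weighted_f1 - reported_overall_f1) <= tolerance

print(total_n, round(weighted_f1, 4), consistent)  # prints: 500 0.7793 True
```

The category counts sum to the stated n=500, and the weighted mean agrees with the reported overall F1 within rounding.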

Caveats

Reader-model provenance gap. The run records llm_provider: nvidia but has no llm_model field — we verified it is absent from run_config. The public disclaimer text says "Qwen3-80B"; the protocol's example command uses meta/llama-3.1-8b-instruct. These conflict and neither is confirmed by the artifact, so the exact reader model is unrecorded. Tracked in _CITATION_AUDIT.md.
Not apples-to-apples with our gpt-5 harness. This run used an NVIDIA small reader and no judge, so it cannot be ranked directly against systems run under the canonical reader = gpt-5 protocol.
n=500, not full dev. Measured on a 500-question subset of the 7,405-question dev set.
Support recall is not informative here. With a 10-paragraph pool and top_k = 10, recall is 1.0 by construction.
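To give a rough sense of how much the n=500 caveat matters, the sketch below estimates the sampling uncertainty on Answer EM. It assumes the 500 questions are a simple random sample of the dev set (the run artifact does not say how they were chosen) and uses a normal approximation to the binomial. The same calculation is not possible for Answer F1 from this page alone, since it needs the per-question variance.

```python
import math


def binomial_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


em = 0.628  # diagnostics.overall_answer_em
n = 500     # subset size for this run

half_width = binomial_half_width(em, n)
print(f"EM = {em:.3f} +/- {half_width:.3f}")  # prints: EM = 0.628 +/- 0.042
```

Under these assumptions, EM differences of a few points against other systems are within sampling noise for a 500-question subset.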

Known asymmetries

External systems publish HotpotQA numbers under their own backbones and protocols, so cross-row comparison is confounded:

Trained, task-specific systems sit at the top of the official leaderboard (e.g. Beam Retrieval, ~85.0 Answer F1) but use task-specific graph networks — not RAG-comparable.
RAG systems (HopRAG ~76.1, SiReRAG ~76.5, HippoRAG ~74.3, RAPTOR ~73.1 Answer F1) mostly run GPT-4o readers; HippoRAG uses GPT-3.5-turbo. Those values are migrated from the HopRAG comparison table (arXiv:2502.12442) and are self-reported, not re-run by us.
The backbone confound is the whole point of fixing reader = gpt-5: until every system runs under the same reader, an Answer-F1 gap mixes engine quality with reader choice. Our measured figures above are not on the same axis as the GPT-4o RAG numbers.

These self-reported numbers are reference-only (comparable: false) until re-run under our protocol.
