MuSiQue — Compositional Multi-Hop QA

MuSiQue (Trivedi et al., TACL 2022) is a multi-hop question-answering benchmark built by composing single-hop questions into longer reasoning chains. Where HotpotQA stops at two hops, MuSiQue goes to 2–4 hops, and each intermediate answer feeds into the next hop — so a single broken hop breaks the whole chain. That makes it a stress test for retrieval that has to assemble several pieces of evidence in the right order, which is what a memory engine is supposed to do.

This is a short card. The full run catalog, per-category scores, and the cross-provider protocol live in the interactive dashboard. Below is what the benchmark tests, how we run it, what we have measured so far, and the caveats on those numbers.

What it tests

MuSiQue requires compositional reasoning over 2–4 supporting paragraphs hidden among ~20 distractors per question. Because the questions are built by chaining single-hop questions, each hop's answer is a prerequisite for the next:

  • 2-hop / 3-hop / 4-hop — questions are categorized by chain length; longer chains are harder because more hops must all succeed.
  • Decomposition chains — every question ships with an explicit reasoning chain, which is what enables a per-hop coverage metric.
  • Answerable and unanswerable — the dataset has a MuSiQue-Ans (answerable) version and a MuSiQue-Full version that adds unanswerable contrast questions.

The benchmark's distinctive metric is chain coverage: the fraction of reasoning hops whose supporting evidence was actually retrieved. It isolates the retrieval failure mode that token-F1 alone hides — you can retrieve most paragraphs and still miss the one hop that completes the chain.
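
As a rough illustration of the definition above, chain coverage for one question can be sketched as the share of hops whose supporting paragraph appears in the retrieved set. The function names and paragraph IDs below are hypothetical, and this is one plausible reading of the metric, not necessarily how our harness computes it.

```python
from typing import Iterable, Sequence


def chain_coverage(hop_support_ids: Sequence[str], retrieved_ids: Iterable[str]) -> float:
    """Fraction of reasoning hops whose supporting paragraph was retrieved."""
    retrieved = set(retrieved_ids)
    if not hop_support_ids:
        return 0.0
    hits = sum(1 for pid in hop_support_ids if pid in retrieved)
    return hits / len(hop_support_ids)


def chain_complete(hop_support_ids: Sequence[str], retrieved_ids: Iterable[str]) -> bool:
    """True only when every hop in the chain has its evidence retrieved."""
    return chain_coverage(hop_support_ids, retrieved_ids) == 1.0


# Hypothetical 3-hop question: one supporting paragraph ID per hop.
hops = ["p4", "p11", "p17"]
# Hypothetical retrieval result that misses the middle hop.
retrieved = ["p4", "p17", "p2", "p9"]

coverage = chain_coverage(hops, retrieved)
complete = chain_complete(hops, retrieved)
print(f"coverage={coverage:.3f} complete={complete}")
```

Here two of three hops are covered, yet the chain is incomplete, which is the failure the metric is meant to expose.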

Dataset

  • Paper: arXiv:2108.00573 · Dataset: StonyBrookNLP/musique.
  • Scale: 25,000 total questions spanning 2–4 hops, split into MuSiQue-Ans (answerable) and MuSiQue-Full (answerable + unanswerable). The standard answerable dev set is 2,417 questions. That dev count is a public-surface figure: the paper abstract confirms the 25k total and the 2–4-hop / Ans-Full split but does not itself state the 2,417 figure.
  • Context: ~20 paragraphs per question — 2–4 supporting plus distractors.

Our protocol

For an apples-to-apples comparison across providers, the canonical protocol fixes:

  • reader = gpt-5, judge = gpt-5 with a fixed binary judge rubric, so every system is scored by the same model on the same prompt.
  • full MuSiQue-Ans dev set, all hops (2/3/4) — no toy subset.
  • μ = 0 ingest for our engine, with per-question isolation (a fresh engine per question — no cross-question leakage).
  • scoring = binary LLM-as-judge, with token-F1/EM and chain coverage as secondary metrics.
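
The per-question isolation rule in the list above can be sketched as a loop that builds a new engine for every question, so nothing ingested for one question is visible to the next. `ToyEngine`, `run_isolated`, and the sample questions are made up for illustration; the real engine's API is not shown in this card.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a memory engine; not a real API."""
    store: list = field(default_factory=list)

    def ingest(self, paragraphs):
        self.store.extend(paragraphs)

    def retrieve(self, query, top_k=10):
        # Rank stored paragraphs by naive word overlap with the query.
        q_words = set(query.lower().split())
        ranked = sorted(
            self.store,
            key=lambda p: len(q_words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def run_isolated(questions, engine_factory=ToyEngine, top_k=10):
    """Evaluate each question against a fresh engine (no cross-question state)."""
    results = []
    for q in questions:
        engine = engine_factory()  # fresh engine per question
        engine.ingest(q["paragraphs"])
        retrieved = engine.retrieve(q["question"], top_k=top_k)
        results.append({"id": q["id"], "retrieved": retrieved})
    return results


# Made-up questions with disjoint contexts.
questions = [
    {
        "id": "q1",
        "question": "Which province contains the city where Orchid Bridge stands?",
        "paragraphs": ["Orchid Bridge stands in Lumen City.", "Lumen City is in Varn Province."],
    },
    {
        "id": "q2",
        "question": "Who wrote Glass Tide?",
        "paragraphs": ["Glass Tide was written by Ila Moor."],
    },
]

results = run_isolated(questions, top_k=2)
```

Because each engine only ever sees its own question's paragraphs, anything retrieved for `q2` must come from `q2`'s context.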

The cross-provider leaderboard under this protocol is still being filled in (it depends on the multi-provider harness). None of our existing MuSiQue runs matches this protocol yet: they use other readers (Llama-3.1-8B in the reference run), no judge, and only the 2-hop subset. The run below is the closest measured number we have, not a protocol row.

Our results — measured, not yet in the live matrix

Caveated numbers — read before quoting

MuSiQue has no committed matrix cell in our benchmark system. The numbers below live only in a raw run JSON and carry known provenance gaps tracked in our citation audit: the run is the 2-hop subset only, n=200 — not the full 2,417-question dev set and not the 3-hop/4-hop categories — and its run_id is null, so source_json is the join key. Treat them as measured, not adjudicated. They are not a headline leaderboard claim.

Our reference run (musique_20260318_205914.json):

Metric              Value
Answer F1           0.4571
Answer EM           0.365
Retrieval recall    0.8375
Chain coverage      0.8375

  • Population: 200 questions, all 2-hop (hop distribution {2: 200}); answerable only.
  • Reader: meta/llama-3.1-8b-instruct (provider nvidia), no LLM judge (use_judge=false) — F1/EM are token-overlap scores, not judge verdicts.
  • Retrieval config: strategy=auto, two_pass=false, top_k=10, embeddings Qwen3-Embedding-0.6B (1024-d, local).
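
For readers unfamiliar with the token-overlap scores mentioned in the reader bullet, here is a minimal sketch of exact match and token F1. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the scorer used for this run may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A correct but wordy answer: EM is 0 and F1 is only partial.
em = exact_match("The city of Paris", "Paris")
f1 = token_f1("The city of Paris", "Paris")
print(em, f1)
```

The example shows how a reader that finds the right entity but phrases it loosely loses token-overlap credit, which is one reason these scores are not interchangeable with judge verdicts.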

The notable shape here is that retrieval recall and chain coverage (both 0.8375) are far higher than answer F1 (0.4571). Across our MuSiQue runs the recall/chain figures stay near 0.8375 regardless of reader, while answer F1 moves with the reader model (a DeepSeek-v3.2 reader scored 0.389 on the same retrieval). On this run the bottleneck is the reader extracting the answer, not the retrieval finding the evidence.

Known asymmetries

  • Subset, not the full dev set. Our 0.4571 is 2-hop-only, n=200. The 3-hop and 4-hop categories — the harder, longer chains MuSiQue exists to test — are not in this run. Reading 0.4571 as a "MuSiQue score" would overstate coverage.
  • Backbone gap. Published systems use frontier backbones — HopRAG reports 54.9 F1 and SiReRAG 53.1 F1 on GPT-4o; RAPTOR 49.1, HippoRAG 43.8 (GPT-3.5-turbo), PRISM 41.8, IRCoT 36.5, BGE dense 30.1. Our run used Llama-3.1-8B. The fixed-judge protocol exists to remove this confound; until it runs, cross-system comparison is not apples-to-apples.
  • Different population than the competitor numbers. The competitor figures above are taken from the HopRAG comparison table (arXiv:2502.12442), which evaluated on a 1,000-question subset of the MuSiQue-Ans validation set with GPT-4o backbones. Our number is a different 200-question, 2-hop-only subset with a Llama reader. Do not rank 0.4571 against 54.9 as if they share a population: they differ in size, hop distribution, and backbone.
  • No judge on this run. The F1/EM here are token-overlap metrics, not LLM-as-judge verdicts. The canonical protocol scores by judge; these numbers are therefore not directly comparable to a judge-scored row.

By Edward Izgorodin · last updated 2026-06-21.