BEAM — Long-Term Memory at 10M-Token Scale
BEAM (Tavakoli et al., 2025 — Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs) pushes conversational memory past the million-token mark: a single conversation can run up to 10M tokens, far longer than the multi-session histories most memory benchmarks use. It is the scale-frontier test in our suite — it stresses tokens per single conversation, not the number of stored items.
This is a short card. The full run catalog, per-type breakdown, and the cross-provider protocol live in the interactive dashboard. Below is what the benchmark tests, how we run it, what we have measured, and where the headline number means less than it appears to.
What it tests
BEAM is a long-term agentic-memory benchmark. The full dataset has 100 conversations and 2,000 validated questions across four scale buckets (100K / 500K / 1M / 10M tokens). We report the 10M bucket: 10 conversations × 20 probing questions = 200 questions, spread across 10 question types that stress different memory skills:
- abstention — refusing to answer when the memory holds no answer.
- contradiction_resolution — reconciling statements that conflict across sessions.
- event_ordering — placing events in sequence.
- information_extraction — pulling a specific fact out of a long history.
- instruction_following — honoring an instruction or constraint stated earlier.
- knowledge_update — using the latest value, not a stale earlier one.
- multi_session_reasoning — combining evidence from several sessions.
- preference_following — applying a stated user preference consistently.
- summarization — summarizing a discussion or topic thread.
- temporal_reasoning — reasoning about dates, durations, and time spans.
Scoring is rubric-based nugget LLM-as-judge: each atomic claim in the answer scores 0.0 / 0.5 / 1.0, and the mean is the question score — not a substring match.
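The scoring rule above can be illustrated with a minimal sketch. It assumes the per-nugget scores have already been produced by the judge; the function names and the example scores are hypothetical, and the 0.5 pass threshold is the one this card uses in its results section (correct = judge score ≥ 0.5).

```python
from statistics import mean

# The three per-nugget values named in the card.
ALLOWED_NUGGET_SCORES = {0.0, 0.5, 1.0}
# Pass threshold used in the results section of this card.
PASS_THRESHOLD = 0.5


def question_score(nugget_scores):
    """Return the mean of the per-nugget scores for one question."""
    if not nugget_scores:
        raise ValueError("a question needs at least one rubric nugget")
    invalid = [s for s in nugget_scores if s not in ALLOWED_NUGGET_SCORES]
    if invalid:
        raise ValueError(f"unexpected nugget scores: {invalid}")
    return mean(nugget_scores)


def is_correct(score):
    """A question counts as correct when its score reaches the threshold."""
    return score >= PASS_THRESHOLD


# Hypothetical judge output for one answer with four rubric nuggets.
example_scores = [1.0, 0.5, 0.0, 1.0]
score = question_score(example_scores)
print(score, is_correct(score))
```

Because the question score is a mean, an answer can pass with several nuggets missed, which is one reason judge accuracy and the average nugget score are reported separately.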
Dataset
- Paper: arXiv:2510.27246 (Tavakoli, Salemi, Ye, Abdalla, Zamani, Mitchell). BEAM has no public external leaderboard yet.
- Scale buckets: 100K (20 conv), 500K (35), 1M (35), 10M (10), each with 20 questions per conversation. We run the 10M bucket (200 questions) — the hardest scale tier.
- Conversations we ingest: one 10M-token cohort (beam-10m-conv0-9). After chunking and consolidation, the engine stored 97,729 chunks (104,349 total, 6,620 filtered, 989 consolidations).
Our protocol
For an apples-to-apples comparison across providers, the canonical protocol fixes:
- reader = gpt-5, judge = gpt-5 with the rubric nugget-judge prompt, so every system is scored by the same model on the same prompt.
- full 10M bucket (200 questions) — no toy subset.
- μ = 0 ingest: zero LLM calls during ingestion (llm_calls_during_ingest_total == 0, asserted per conversation); the engine builds memory without an LLM in the loop.
- scoring = rubric nugget LLM-as-judge.
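The per-conversation μ = 0 check can be pictured roughly as follows. This is a sketch under assumptions, not the harness code: only the counter name llm_calls_during_ingest_total comes from this card, while the dictionary layout, the conversation ids, and the function name are hypothetical.

```python
# Counter name taken from the card; everything else here is illustrative.
COUNTER = "llm_calls_during_ingest_total"


def assert_llm_free_ingest(counters_by_conversation):
    """Raise AssertionError if any conversation used an LLM during ingest."""
    for conv_id, counters in counters_by_conversation.items():
        calls = counters[COUNTER]  # a missing counter surfaces as KeyError
        if calls != 0:
            raise AssertionError(
                f"{conv_id}: {calls} LLM calls during ingest, expected 0"
            )


# Hypothetical counters for the ten conversations of the 10M bucket.
ingest_counters = {f"conv{i}": {COUNTER: 0} for i in range(10)}
assert_llm_free_ingest(ingest_counters)
print("ingest was LLM-free for", len(ingest_counters), "conversations")
```

Checking each conversation separately, rather than a single total, means one stray LLM call cannot hide behind an aggregate.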
The cross-provider leaderboard under this protocol is still being filled in (it depends on the multi-provider harness). Our committed 10M run below uses a claude-sonnet-4-6 reader (the judge is already gpt-5), so it is close to this protocol but not exact — the gpt-5-reader run is still pending. It is the closest measured number we have, and it carries asymmetries that block a direct vs-competitor comparison.
Our results — committed 10M run
The headline below traces to a committed matrix cell: cell_mnemoverse_engine_beam_10m_conv0-9_n200_k200.
- Judge accuracy: 0.61 (122/200 questions correct at the top-100 retrieval cutoff; correct = judge score ≥ 0.5).
- Average nugget score: 0.5378 across all 200 questions.
- 0 errors, 0 judge parse-errors.
- Setup: reader claude-sonnet-4-6 · judge gpt-5 · embeddings Qwen3-Embedding-0.6B · μ = 0 ingest · top-k 200, scored at top-100 cutoff.
Per-question-type accuracy (each type n = 20):
| Question type | Judge acc % | Avg nugget score |
|---|---|---|
| instruction_following | 95.0 | 0.825 |
| preference_following | 75.0 | 0.6625 |
| knowledge_update | 75.0 | 0.7125 |
| contradiction_resolution | 75.0 | 0.5437 |
| summarization | 75.0 | 0.6117 |
| information_extraction | 55.0 | 0.55 |
| abstention | 55.0 | 0.55 |
| temporal_reasoning | 45.0 | 0.40 |
| event_ordering | 35.0 | 0.3513 |
| multi_session_reasoning | 25.0 | 0.1717 |
The pattern is consistent with the difficulty of each task at this scale: instruction- and preference-following hold up, while multi-session reasoning and event ordering — the types that need many scattered atoms retrieved together — fall off sharply.
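As a consistency check, the headline figures can be recomputed from the per-type rows. The sketch below copies the values from the table in this card; the variable names are illustrative. Because every type has n = 20, the unweighted mean over types equals the overall figure.

```python
# (judge accuracy %, average nugget score) per question type, from the table.
PER_TYPE = {
    "instruction_following": (95.0, 0.825),
    "preference_following": (75.0, 0.6625),
    "knowledge_update": (75.0, 0.7125),
    "contradiction_resolution": (75.0, 0.5437),
    "summarization": (75.0, 0.6117),
    "information_extraction": (55.0, 0.55),
    "abstention": (55.0, 0.55),
    "temporal_reasoning": (45.0, 0.40),
    "event_ordering": (35.0, 0.3513),
    "multi_session_reasoning": (25.0, 0.1717),
}
N_PER_TYPE = 20  # each type has 20 questions

# Convert each percentage back to a count of correct questions.
correct = sum(round(acc * N_PER_TYPE / 100) for acc, _ in PER_TYPE.values())
total = N_PER_TYPE * len(PER_TYPE)
accuracy = correct / total

# Equal n per type, so the mean of per-type means is the overall mean.
avg_nugget = sum(nugget for _, nugget in PER_TYPE.values()) / len(PER_TYPE)

print(f"{correct}/{total} = {accuracy:.2f}, avg nugget = {avg_nugget:.4f}")
```

The recomputed values agree with the reported 122/200 = 0.61 and 0.5378.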
Read this before quoting the 0.61
The 0.61 judge accuracy over-states true retrieval grounding. On this run, ground-truth-atom recall at the top-100 cutoff was only ≈34–36% (RUN_REGISTRY reports 36.0% = 246/684 resolved GT atoms; a later recompute gives 34.1% = 246/721 — both pre-bugfix approximations). The judge awards partial credit for plausible general phrasing even when the specific evidence atoms were not retrieved — e.g. summarization scored 75% judge-pass on ~12% GT-recall@100, instruction_following 95% on ~31%. The headline reflects answer quality as scored by the judge, not how well the engine actually retrieved the right evidence.
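The two recall figures, and their distance from the judge accuracy, follow directly from the counts quoted above. A small sketch, with a hypothetical helper name and only the numbers taken from this card:

```python
def recall_pct(retrieved_atoms, total_atoms):
    """Ground-truth-atom recall as a percentage."""
    if total_atoms <= 0:
        raise ValueError("total_atoms must be positive")
    return 100.0 * retrieved_atoms / total_atoms


# Counts quoted in the card (both described there as pre-bugfix approximations).
registry_recall = recall_pct(246, 684)    # RUN_REGISTRY denominator
recomputed_recall = recall_pct(246, 721)  # later recompute denominator

JUDGE_ACCURACY_PCT = 61.0  # headline judge accuracy, 122/200

# Gap between what the judge passed and what retrieval actually surfaced.
gap_low = JUDGE_ACCURACY_PCT - registry_recall
gap_high = JUDGE_ACCURACY_PCT - recomputed_recall

print(f"recall: {registry_recall:.1f}% / {recomputed_recall:.1f}%")
print(f"judge-vs-recall gap: {gap_low:.1f} to {gap_high:.1f} points")
```

Under either denominator the judge pass rate sits roughly 25 to 27 points above GT-atom recall, which is the gap this section warns about.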
Known asymmetries
These are why the 0.61 is not a head-to-head leaderboard claim against other systems:
- Reader-model mismatch. This run uses a claude-sonnet-4-6 reader, not the gpt-5 reader the canonical protocol (and Mem0's published numbers) use. It is the closest measured run, not a protocol-exact one. Do not read it head-to-head against a gpt-5-reader system.
- Reader-input-size asymmetry. Our reader was fed a mean ~132,166 tokens per question (median ~120,370, p95 ~204,422). Mem0's published BEAM runs feed a gpt-5 reader ~6,914 mean tokens, roughly 19× less context. A larger context budget can lift judge accuracy independent of the memory engine, so the two numbers are confounded and must not be ranked directly.
- Single seed. This is a single-draw point estimate. The protocol calls for 3 seeds to bound variance; treat 0.61 as one observation, not a converged mean.
- Retrieval grounding gap. As in the warning above, judge accuracy and true GT-atom recall diverge here — a real signal that the engine's retrieval at 10M scale is the bottleneck, not the answer-generation step.
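The "roughly 19×" in the reader-input-size item is plain arithmetic on the two mean token counts quoted in this card. A minimal sketch, with illustrative constant names:

```python
# Mean reader-input tokens per question, as quoted in this card.
OUR_MEAN_READER_TOKENS = 132_166
MEM0_MEAN_READER_TOKENS = 6_914

# How many times more context our reader saw per question.
context_ratio = OUR_MEAN_READER_TOKENS / MEM0_MEAN_READER_TOKENS

print(f"our reader saw about {context_ratio:.1f}x more context per question")
```

The ratio compares means only; the median and p95 figures for our run have no published Mem0 counterpart in this card, so they are left out.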
Mem0's self-reported numbers — reference only, not comparable
Mem0 has published BEAM figures from its own cloud stack (gpt-5 reader + gpt-5 judge): 64.1 at BEAM-1M and 48.6 at BEAM-10M, at ~6.7–6.9K mean reader tokens with LLM-extraction ingest. We mark these as comparable: false for three reasons: a different reader model, a ~19× smaller context budget, and a different (LLM-in-the-loop) ingest path than our μ = 0 ingest. They are a reference point for what others report at this scale, not a row to rank our 0.61 against. The shared-protocol harness exists to remove exactly these confounds; until both systems run under it, no head-to-head ranking is honest.
Links
- Interactive dashboard: benchmarks.mnemoverse.com — full run catalog, per-type breakdowns, and the cross-provider protocol.
- Paper: arXiv:2510.27246 — Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs (Tavakoli et al., 2025).
Related
- Benchmarks — the overview and results hub.
- LongMemEval — Long-Term Interactive Memory — the multi-session memory card at smaller scale.
- The Judge Says Yes Too Easily — why an LLM judge's leniency can lift a headline above true retrieval grounding, the mechanism behind the gap noted above.
- Building Memory That Scales — the engine work behind these numbers.
- Interactive dashboard — explore the benchmark data visually.
By Edward Izgorodin · last updated 2026-06-21.