BEAM — Long-Term Memory at 10M-Token Scale

BEAM (Tavakoli et al., 2025 — Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs) pushes conversational memory past the million-token mark: a single conversation can run up to 10M tokens, far longer than the multi-session histories most memory benchmarks use. It is the scale-frontier test in our suite — it stresses tokens per single conversation, not the number of stored items.

This is a short card. The full run catalog, per-type breakdown, and the cross-provider protocol live in the interactive dashboard. Below is what the benchmark tests, how we run it, what we have measured, and where the headline number means less than it appears to.

What it tests

BEAM is a long-term agentic-memory benchmark. The full dataset has 100 conversations and 2,000 validated questions across four scale buckets (100K / 500K / 1M / 10M tokens). We report the 10M bucket: 10 conversations × 20 probing questions = 200 questions, spread across 10 question types that stress different memory skills:

abstention — refusing to answer when the memory holds no answer.
contradiction_resolution — reconciling statements that conflict across sessions.
event_ordering — placing events in sequence.
information_extraction — pulling a specific fact out of a long history.
instruction_following — honoring an instruction or constraint stated earlier.
knowledge_update — using the latest value, not a stale earlier one.
multi_session_reasoning — combining evidence from several sessions.
preference_following — applying a stated user preference consistently.
summarization — summarizing a discussion or topic thread.
temporal_reasoning — reasoning about dates, durations, and time spans.

Scoring uses a rubric-based, nugget-level LLM judge: each atomic claim in the answer is graded 0.0, 0.5, or 1.0, and the question score is the mean of those grades. It is not a substring match.
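A minimal sketch of that scoring rule, assuming the judge returns one grade per nugget. The helper name `question_score` and the example grades are illustrative and are not taken from the benchmark's judge code or from our run.

```python
from statistics import fmean

# The three per-nugget grades named in the text.
ALLOWED_GRADES = {0.0, 0.5, 1.0}

def question_score(nugget_grades: list[float]) -> float:
    """Mean of the per-nugget judge grades for one question (hypothetical helper)."""
    if not nugget_grades:
        raise ValueError("a question needs at least one rubric nugget")
    unexpected = [g for g in nugget_grades if g not in ALLOWED_GRADES]
    if unexpected:
        raise ValueError(f"unexpected grades: {unexpected}")
    return fmean(nugget_grades)

# Illustrative grades only: two full hits, one partial, one miss.
example = question_score([1.0, 0.5, 0.0, 1.0])
print(example)  # prints 0.625
```

Because the score is an average over graded claims, an answer can earn partial credit without containing any exact string from the reference.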

Dataset

Paper: arXiv:2510.27246 (Tavakoli, Salemi, Ye, Abdalla, Zamani, Mitchell). BEAM has no public external leaderboard yet.
Scale buckets: 100K (20 conv), 500K (35), 1M (35), 10M (10), each with 20 questions per conversation. We run the 10M bucket (200 questions) — the hardest scale tier.
What we ingest: the 10M cohort (beam-10m-conv0-9), i.e. the ten conversations of the 10M bucket. After chunking and consolidation, the engine stored 97,729 chunks (104,349 total, 6,620 filtered, 989 consolidations).
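The counts above can be cross-checked with a few lines of arithmetic. This is only a consistency sketch: reading "stored = total minus filtered" is our interpretation of the chunk figures (it matches the numbers), and the consolidation count does not enter the check.

```python
# Conversations per scale bucket, as listed above.
conversations = {"100K": 20, "500K": 35, "1M": 35, "10M": 10}
QUESTIONS_PER_CONVERSATION = 20

total_conversations = sum(conversations.values())
total_questions = total_conversations * QUESTIONS_PER_CONVERSATION
questions_10m = conversations["10M"] * QUESTIONS_PER_CONVERSATION

# Chunk accounting for the 10M cohort (assumed: stored = total - filtered).
chunks_total, chunks_filtered = 104_349, 6_620
chunks_stored = chunks_total - chunks_filtered

print(total_conversations, total_questions, questions_10m, chunks_stored)
# prints: 100 2000 200 97729
```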

Our protocol

For an apples-to-apples comparison across providers, the canonical protocol fixes:

reader = gpt-5, judge = gpt-5 with the rubric nugget-judge prompt, so every system is scored by the same model on the same prompt.
full 10M bucket (200 questions) — no toy subset.
μ = 0 ingest — zero LLM calls during ingestion (llm_calls_during_ingest_total == 0, asserted per conversation); the engine builds memory without an LLM in the loop.
scoring = rubric nugget LLM-as-judge.
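The fixed settings above could be pinned in code roughly as follows. This is a sketch under assumptions: `ProtocolConfig`, `assert_mu_zero_ingest`, and the shape of the metrics dictionary are hypothetical; only the counter name `llm_calls_during_ingest_total` and the setting values come from the protocol description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtocolConfig:
    """Hypothetical container for the canonical protocol settings."""
    reader: str = "gpt-5"
    judge: str = "gpt-5"
    bucket: str = "10M"
    n_questions: int = 200
    scoring: str = "rubric nugget LLM-as-judge"

def assert_mu_zero_ingest(metrics_by_conversation: dict[str, dict[str, int]]) -> None:
    """Fail if any conversation recorded an LLM call during ingestion."""
    for conv_id, metrics in metrics_by_conversation.items():
        calls = metrics["llm_calls_during_ingest_total"]
        if calls != 0:
            raise AssertionError(
                f"{conv_id}: expected 0 LLM calls during ingest, got {calls}"
            )

# Illustrative metrics for two conversations; real values come from the engine.
assert_mu_zero_ingest({
    "conv0": {"llm_calls_during_ingest_total": 0},
    "conv1": {"llm_calls_during_ingest_total": 0},
})
```

Checking the counter per conversation, rather than once over the whole cohort, means a single conversation that used an LLM cannot be hidden by an aggregate.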

The cross-provider leaderboard under this protocol is still being filled in (it depends on the multi-provider harness). Our committed 10M run below uses a claude-sonnet-4-6 reader with the gpt-5 judge, so it is close to this protocol but not exact; the gpt-5-reader run is still pending. It is the closest measured number we have, and it carries asymmetries that rule out a direct comparison against competitors.

Our results — committed 10M run

The headline below traces to a committed matrix cell: cell_mnemoverse_engine_beam_10m_conv0-9_n200_k200.

Judge accuracy: 0.61 (122/200 questions correct at the top-100 retrieval cutoff; correct = judge score ≥ 0.5).
Average nugget score: 0.5378 across all 200 questions.
0 errors, 0 judge parse-errors.
Setup: reader claude-sonnet-4-6 · judge gpt-5 · embeddings Qwen3-Embedding-0.6B · μ = 0 ingest · top-k 200, scored at top-100 cutoff.
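The two headline metrics are different aggregations of the same per-question scores. A minimal sketch, assuming each question already has a score in [0, 1]; the function names and the four example scores are illustrative, not taken from the run.

```python
from statistics import fmean

PASS_THRESHOLD = 0.5  # "correct = judge score >= 0.5", as stated above

def judge_accuracy(question_scores: list[float]) -> float:
    """Fraction of questions whose score reaches the pass threshold."""
    correct = sum(1 for s in question_scores if s >= PASS_THRESHOLD)
    return correct / len(question_scores)

def average_nugget_score(question_scores: list[float]) -> float:
    """Plain mean of the per-question scores."""
    return fmean(question_scores)

# Illustrative scores only: two pass (1.0 and 0.5), two fail.
scores = [1.0, 0.5, 0.25, 0.0]
print(judge_accuracy(scores))        # prints 0.5
print(average_nugget_score(scores))  # prints 0.4375

# The reported headline: 122 passing questions out of 200.
print(122 / 200)  # prints 0.61
```

This also shows why judge accuracy can sit above the average nugget score: a question scored exactly 0.5 counts as fully correct for accuracy but contributes only 0.5 to the mean.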

Per-question-type accuracy (each type n = 20):

Question type	Judge acc (%)	Avg nugget score
instruction_following	95.0	0.825
preference_following	75.0	0.6625
knowledge_update	75.0	0.7125
contradiction_resolution	75.0	0.5437
summarization	75.0	0.6117
information_extraction	55.0	0.55
abstention	55.0	0.55
temporal_reasoning	45.0	0.40
event_ordering	35.0	0.3513
multi_session_reasoning	25.0	0.1717
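Because every type has the same n = 20, the unweighted mean of the per-type rows should reproduce the headline figures. The check below uses only the numbers from the table; the variable names are ours.

```python
# (judge accuracy %, average nugget score) per question type, from the table.
per_type = {
    "instruction_following": (95.0, 0.825),
    "preference_following": (75.0, 0.6625),
    "knowledge_update": (75.0, 0.7125),
    "contradiction_resolution": (75.0, 0.5437),
    "summarization": (75.0, 0.6117),
    "information_extraction": (55.0, 0.55),
    "abstention": (55.0, 0.55),
    "temporal_reasoning": (45.0, 0.40),
    "event_ordering": (35.0, 0.3513),
    "multi_session_reasoning": (25.0, 0.1717),
}
N_PER_TYPE = 20

mean_accuracy_pct = sum(acc for acc, _ in per_type.values()) / len(per_type)
mean_nugget = sum(nug for _, nug in per_type.values()) / len(per_type)
correct_total = sum(round(acc * N_PER_TYPE / 100) for acc, _ in per_type.values())

print(mean_accuracy_pct)      # prints 61.0
print(round(mean_nugget, 4))  # prints 0.5378
print(correct_total)          # prints 122
```

The per-type rows therefore agree with the reported 0.61 judge accuracy (122/200) and the 0.5378 average nugget score.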

The pattern is consistent with the difficulty of each task at this scale: instruction- and preference-following hold up, while multi-session reasoning and event ordering — the types that need many scattered atoms retrieved together — fall off sharply.

Read this before quoting the 0.61

The 0.61 judge accuracy overstates true retrieval grounding. On this run, ground-truth-atom recall at the top-100 cutoff was only ≈34–36% (RUN_REGISTRY reports 36.0% = 246/684 resolved GT atoms; a later recompute gives 34.1% = 246/721; both are pre-bugfix approximations). The judge awards partial credit for plausible general phrasing even when the specific evidence atoms were not retrieved. For example, summarization scored 75% judge-pass on ~12% GT-recall@100, and instruction_following scored 95% on ~31%. The headline reflects answer quality as scored by the judge, not how well the engine actually retrieved the right evidence.
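For reference, a pooled GT-atom recall at a cutoff k can be sketched as below. The function and the toy data are illustrative and are not the engine's evaluation code. The reported fractions (246/684 and 246/721) read as pooled counts over all questions, which is what this sketch computes.

```python
def pooled_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Share of ground-truth atoms found in the top-k retrieved ids, pooled over questions.

    Each run is (retrieved ids in rank order, set of ground-truth atom ids).
    """
    hits = 0
    total = 0
    for retrieved, gt_atoms in runs:
        top_k = set(retrieved[:k])
        hits += sum(1 for atom in gt_atoms if atom in top_k)
        total += len(gt_atoms)
    return hits / total if total else 0.0

# Toy data: question 1 finds 1 of 2 atoms, question 2 finds 1 of 1.
toy_runs = [(["a", "b", "c"], {"a", "z"}), (["d", "e"], {"e"})]
toy_recall = pooled_recall_at_k(toy_runs, k=2)  # 2 hits out of 3 atoms

# The two reported approximations for this run.
print(round(100 * 246 / 684, 1))  # prints 36.0
print(round(100 * 246 / 721, 1))  # prints 34.1
```

The two figures share the numerator (246 retrieved atoms) and differ only in how many ground-truth atoms are counted in the denominator.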

Known asymmetries

These are why the 0.61 is not a head-to-head leaderboard claim against other systems:

Reader-model mismatch. This run uses a claude-sonnet-4-6 reader, not the gpt-5 reader the canonical protocol (and Mem0's published numbers) use. It is the closest measured run, not a protocol-exact one. Do not read it head-to-head against a gpt-5-reader system.
Reader-input-size asymmetry. Our reader was fed a mean of ~132,166 tokens per question (median ~120,370, p95 ~204,422). Mem0's published BEAM runs feed a gpt-5 reader a mean of ~6,914 tokens, roughly 19× less. A larger context budget can lift judge accuracy independently of the memory engine, so the two numbers are confounded and must not be ranked directly.
Single seed. This is a single-draw point estimate. The protocol calls for 3 seeds to bound variance; treat 0.61 as one observation, not a converged mean.
Retrieval grounding gap. As in the warning above, judge accuracy and true GT-atom recall diverge here — a real signal that the engine's retrieval at 10M scale is the bottleneck, not the answer-generation step.
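Two of these asymmetries are easy to make concrete. The sketch below recomputes the context-budget ratio from the two mean token counts quoted above, and shows a hypothetical guard (`summarize_seeds`, our name) that refuses to report a mean until the 3 seeds the protocol calls for are available.

```python
from statistics import fmean, stdev

OUR_MEAN_READER_TOKENS = 132_166
MEM0_MEAN_READER_TOKENS = 6_914

context_ratio = OUR_MEAN_READER_TOKENS / MEM0_MEAN_READER_TOKENS
print(f"context ratio: {context_ratio:.1f}x")  # prints "context ratio: 19.1x"

REQUIRED_SEEDS = 3  # the protocol calls for 3 seeds

def summarize_seeds(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) once enough seeds exist."""
    if len(accuracies) < REQUIRED_SEEDS:
        raise ValueError(
            f"need {REQUIRED_SEEDS} seeds to bound variance, got {len(accuracies)}"
        )
    return fmean(accuracies), stdev(accuracies)

# With the single committed run, the guard refuses to produce a mean.
try:
    summarize_seeds([0.61])
except ValueError as err:
    print(err)  # prints "need 3 seeds to bound variance, got 1"
```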

Mem0's self-reported numbers — reference only, not comparable

Mem0 has published BEAM figures from its own cloud stack (gpt-5 reader + gpt-5 judge): 64.1 at BEAM-1M and 48.6 at BEAM-10M, at ~6.7–6.9K mean reader tokens with LLM-extraction ingest. We mark these as comparable: false for three reasons: a different reader model, a ~19× smaller context budget, and a different (LLM-in-the-loop) ingest path than our μ = 0 ingest. They are a reference point for what others report at this scale, not a row to rank our 0.61 against. The shared-protocol harness exists to remove exactly these confounds; until both systems run under it, no head-to-head ranking is honest.
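One way to picture the comparable flag is as a check over the setup fields that differ between the two runs. This is a hypothetical sketch: `RunSetup`, `comparability_issues`, and the `MAX_TOKEN_RATIO` tolerance are our own illustrative choices, not part of the harness. The field values are the ones quoted in this card.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunSetup:
    """Hypothetical description of how a run was produced."""
    system: str
    reader: str
    judge: str
    llm_in_ingest: bool
    mean_reader_tokens: float

MAX_TOKEN_RATIO = 2.0  # arbitrary illustrative tolerance, not a protocol value

def comparability_issues(a: RunSetup, b: RunSetup) -> list[str]:
    """List the setup differences that block a head-to-head ranking."""
    issues = []
    if a.reader != b.reader:
        issues.append(f"reader mismatch: {a.reader} vs {b.reader}")
    if a.judge != b.judge:
        issues.append(f"judge mismatch: {a.judge} vs {b.judge}")
    if a.llm_in_ingest != b.llm_in_ingest:
        issues.append("ingest path differs (LLM-in-the-loop vs no LLM)")
    high = max(a.mean_reader_tokens, b.mean_reader_tokens)
    low = min(a.mean_reader_tokens, b.mean_reader_tokens)
    if low <= 0 or high / low > MAX_TOKEN_RATIO:
        issues.append("reader context budgets differ too much")
    return issues

ours = RunSetup("mnemoverse_engine", "claude-sonnet-4-6", "gpt-5", False, 132_166)
mem0 = RunSetup("mem0", "gpt-5", "gpt-5", True, 6_914)

issues = comparability_issues(ours, mem0)
comparable = not issues
print(comparable)  # prints False
for issue in issues:
    print("-", issue)
```

Under these assumptions the check reports three blocking differences (reader, ingest path, context budget), matching the three reasons given above.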
