AI Memory Benchmarks: A Field Guide

Q: What is the best AI memory benchmark?

There is no single trustworthy, scale-honest memory benchmark yet. For cross-session conversational memory, LongMemEval is the cleanest decomposition of memory abilities (extraction, multi-session reasoning, temporal reasoning, knowledge updates, abstention) and LoCoMo is the most-cited — though LoCoMo's answer key is about 6.4% wrong and its lenient judge accepts roughly 63% of intentionally wrong answers. For extreme scale, BEAM (up to 10M tokens, arXiv:2510.27246) is the current public frontier and the only one where the token budget far exceeds any context window, forcing the memory system rather than the LLM to do the work. The honest practice is to measure on several benchmarks and read recall@k alongside any judge score, not to crown one.

Q: Why do memory benchmark scores disagree?

Because most memory benchmarks score free-form answers with an LLM judge, and the judge's grading prompt, not the memory system, often decides the headline. In a controlled experiment, holding the answers fixed and swapping only the grading prompt moved a LoCoMo score by about 40 points. Different papers also use different reader models, top-k retrieval depths, dataset subsets, and answer-key versions, so two numbers on the same named benchmark are frequently not measuring the same thing. We cover this in detail in our judge-variance piece.

Q: Which benchmarks test long-term memory at scale?

Most cap well below the 10M-token regime where production agent memory lives. BEAM runs conversations from 128K up to 10M tokens; LongMemEval's M variant reaches about 1.5M tokens; LongMemEval-V2 uses agent-trajectory histories up to about 115M tokens; and BABILong embeds reasoning tasks in documents up to 10M tokens (though it tests long-context reasoning, not a persistent store). Among true store-and-recall memory benchmarks, BEAM at 10M tokens is the current scale frontier.

An AI agent's memory is the part of it that survives the end of a conversation. Ask whether a given agent "remembers well" and you immediately need a test — and the field has produced dozens of them, each measuring something slightly different and most of them disagreeing about what counts as memory at all. Some test whether a model can find one fact buried in a long document. Some test whether an assistant can recall what a user told it forty sessions ago. A few test whether a system can hold contradictory facts and resolve them. They are not interchangeable, and a high score on one tells you almost nothing about another.

This is a map of that territory. It groups the major benchmarks by family, assesses each family against one question — does this actually test an agent's memory, or just a long context window? — and lays out the recurring problems that make the numbers hard to trust. It is the companion to two narrower pieces: one on why memory scores swing forty points depending on who grades them (judge variance), and one on the discipline we use to keep our own numbers re-derivable (our methodology). This page is the wider view: the lay of the land.

TL;DR

Memory benchmarks split into two families: strict memory tests (multi-session, store-then-recall-later — LoCoMo, LongMemEval, BEAM, MemoryAgentBench) and long-context tests (single-pass over a long input — NIAH, RULER, BABILong, InfiniteBench, the multi-hop QA sets). Only the first family tests a persistent store; the second can be "passed" by a big enough context window.
The strict-memory leader for scale is BEAM (up to 10M tokens, arXiv:2510.27246) — the only public benchmark where the token budget far exceeds any context window.
Four problems recur across the field: judge variance and wrong answer keys (LoCoMo's key is about 6.4% wrong), saturation and context-window leakage (many "memory" tests fit inside a modern window), non-comparability (every paper uses a different judge, reader, top-k, and subset), and a scale ceiling (most cap far below the 10M-to-1B-token regime where production agent memory lives).
There is no single trustworthy, scale-honest memory benchmark yet. The honest reader checks the recipe and reads recall alongside any judge score.

How to read a memory benchmark: two families

Before the table, one distinction does most of the work. A benchmark tests memory in the strict sense only if the evidence is presented earlier and the system must store it and recall it later — across turns, across sessions, across a task history. If instead the whole haystack is dropped into the model's context in a single pass, then a model with a large enough window can answer without any persistent store at all. That is a long-context test, not a memory test.

The distinction matters because the two families have opposite failure modes. Long-context tests get easier every time context windows grow; several are already saturated. Strict memory tests get harder as histories lengthen, because no window is large enough and the system is forced to choose what to keep. When a vendor reports a number, the first question is which family the benchmark belongs to — because only one of them isolates the memory system from the underlying LLM.

Family 1: conversational and cross-session memory

This is the family that actually tests agent memory in the strict sense. Evidence is spread across sessions; the system must store it, recall it later, handle updates and contradictions, and know when something was never said.

LoCoMo (arXiv:2402.17753, ACL 2024) is the de-facto standard and the headline benchmark vendors compete on. It runs 10 long conversations — about 300 turns and roughly 9K-26K tokens each, up to 35 sessions — with around 1,540 non-adversarial QA pairs spanning single-hop, multi-hop, temporal, and open-domain questions. It is also the most critiqued benchmark in the space. An independent audit (Penfield Labs, one team, April 2026, not peer-reviewed) found about 6.4% of the answer keys (99 of 1,540) are score-corrupting errors — hallucinated facts, bad temporal reasoning, speaker mis-attribution — and that the commonly used GPT-4o-mini judge accepted roughly 63% of intentionally-wrong-but-topically-adjacent answers. The conversations also fit inside a modern context window, so LoCoMo does not cleanly separate a memory system from a long-context LLM. Current vendor headlines (in the low-90s) now sit within a couple of points of that answer-key ceiling, which means the field is increasingly measuring judge and key noise rather than memory quality. Treat any published LoCoMo number skeptically and check what was actually measured — retrieval recall or end-to-end QA, which judge, which subset.
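The "answer-key ceiling" mentioned above is simple arithmetic on the audit's figures (99 flawed keys out of 1,540). The sketch below works it through under one simplifying assumption that is ours, not the audit's: a strict judge marks a genuinely correct answer wrong every time the key itself is wrong. The function name is illustrative.

```python
# Figures quoted from the audit discussed above.
TOTAL_QUESTIONS = 1540
CORRUPTED_KEYS = 99


def answer_key_ceiling(total: int, corrupted: int) -> float:
    """Best accuracy a fully correct system can earn, assuming every
    corrupted key causes its correct answer to be graded as wrong."""
    return (total - corrupted) / total


error_rate = CORRUPTED_KEYS / TOTAL_QUESTIONS
ceiling = answer_key_ceiling(TOTAL_QUESTIONS, CORRUPTED_KEYS)

print(f"key error rate: {error_rate:.1%}")  # key error rate: 6.4%
print(f"score ceiling:  {ceiling:.1%}")     # score ceiling:  93.6%
```

Under that assumption, a reported score in the low 90s leaves only a point or two of headroom, which is why gaps between vendors at that level are hard to distinguish from key and judge noise.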

LongMemEval (arXiv:2410.10813, ICLR 2025) is the cleanest decomposition of memory abilities and the most rigorous of the cross-session set. It uses 500 manually curated questions across five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention) over chat histories that scale from about 115K tokens (the S variant) to about 1.5M tokens (the M variant) by padding with distractor sessions. Its abstention and knowledge-update axes test exactly what production memory systems get wrong, and the paper reports an accuracy drop of about 30% for commercial assistants on sustained interactions. The caveats: only 500 questions, synthetically padded histories, and the fact that some systems can score near 100% through verbatim, cheating-style retrieval, a reminder that a high number does not always mean good memory.

BEAM (arXiv:2510.27246, the "Beyond a Million Tokens" benchmark) is the scale frontier. It runs narratively coherent conversations from 128K up to 10M tokens (about 100 conversations, 2,000 validated probing questions) across 10 abilities, and it adds contradiction resolution and event ordering as first-class tasks for the first time. Crucially, it is designed so it cannot be solved by expanding the context window — at 10M tokens no reader holds the conversation, so the memory engine, not the window, is doing the work. That makes it the best public proxy for whether a system actually has memory at production scale. It is new, with limited third-party replication, and its single-user single-narrative conversations are less realistic than true multi-party agent trajectories. (A note for our own readers: a separate internal Mnemoverse "BEAM" exists in our workspace; the benchmark discussed here is the public arXiv:2510.27246 one.)

MemoryAgentBench (arXiv:2507.05257, 2025; ICLR 2026) frames evaluation around agent competencies rather than dialogue artifacts: accurate retrieval, test-time learning, long-range understanding, and conflict resolution, over incremental "inject once, query many times" streaming in roughly 100K-1.4M-token settings. Its finding — that no current method masters all four — is itself informative. It inherits biases from the existing datasets it reformulates, and its chunked streaming simulates rather than reproduces real multi-turn interaction.

Around these four sit several more: MSC (arXiv:2107.07567, the 2021-2022 benchmark that launched multi-session memory research, now superseded for quantitative eval); ConvoMem (arXiv:2511.10523, the largest QA set in the space at about 75K pairs, studying when external memory beats raw context); LongMemEval-V2 (arXiv:2605.12493, which shifts from chat history to agent-trajectory memory at 100M-plus tokens); and newer entries from 2024 onward (LifeBench, AMA-Bench, MemBench, PerLTQA, DialSim) that push toward procedural memory, tool-using agent trajectories, latency-aware recall, and efficiency-and-capacity scoring. (One frequently cited name, MemoryBank / SiliconFriend, arXiv:2305.10250, is a memory mechanism and companion demo, not a standardized benchmark; cite it as an architecture, not an eval.)

Verdict for this family: these are the benchmarks that genuinely test agent memory. Within it, LongMemEval is the cleanest, LoCoMo is table-stakes but heavily caveated, and BEAM owns the scale frontier.

Family 2: long-context and multi-hop QA, repurposed for memory

This family was built to test long-context LLMs and multi-hop retrieval, not persistent memory. Each one presents its evidence in a single pass, so a model with a large enough window can pass without any store. They remain useful as components of a memory evaluation — for stress-testing recall against distractors, or checking that a retriever surfaces a full evidence chain — but a high score here is not evidence of memory.

The needle family is the simplest. NIAH (Needle-in-a-Haystack, 2023) plants one fact in long filler text and asks the model to retrieve it; it is saturated — frontier models score near 100% — which is precisely why harder variants exist. RULER (arXiv:2404.06654, COLM 2024) extends it to 13 synthetic tasks including multi-needle retrieval, multi-hop tracing, and aggregation, and is the standard "real context size" probe; its vanilla single-needle portion is near-saturated. NoLiMa (arXiv:2502.05167, ICML 2025) removes the lexical overlap, forcing semantic rather than string-match recall — and shows that effective context for meaning is far shorter than advertised windows (10 of the tested models drop below half their short-context baseline by 32K). BABILong (arXiv:2406.10149, NeurIPS 2024) embeds 20 reasoning tasks in book text out to 10M tokens, and found models effectively use only 10-20% of available context — the most scale-stressing long-context test, though it is reasoning-over-static-text, not memory.

The long-document and multi-hop sets round it out. InfiniteBench (arXiv:2402.13718, ACL 2024) was the first to average over 100K tokens; its synthetic passkey/retrieval subtasks are saturated on frontier models, and its novel-based subsets risk contamination. HotpotQA (arXiv:1809.09600, EMNLP 2018) is the canonical 2-hop QA set but is cheatable — roughly 35% of bridge questions are solvable with single-hop reasoning — and largely superseded by MuSiQue (arXiv:2108.00573, TACL 2022), which composes hops bottom-up so each is necessary and is the least-gameable of the multi-hop trio. 2WikiMultihopQA (arXiv:2011.01060, COLING 2020) sits between them, with explicit reasoning-path supervision but template artificiality. These appear repeatedly inside long-context harnesses like ZeroSCROLLS, LongBench, HELMET, and LV-Eval.

Verdict for this family: valuable as diagnostic components — MuSiQue to check the retriever surfaces a full evidence chain, RULER and NoLiMa and BABILong to test recall against distractors at scale — but none of them is a memory benchmark. They present the evidence in-context, in a single pass.

The navigable table

Twelve load-bearing benchmarks, grouped by family. "Strict memory?" answers the throughline question: does it actually test a persistent store, or just a long context window?

Benchmark	What it tests	Scale	Scoring	Year	Strict memory?	What it misses
LoCoMo	Cross-session conversational QA, temporal	~10 convs, ~9K-26K tok	F1 / LLM-judge	2024	Yes	~6.4% wrong keys; lenient judge; fits in a context window
LongMemEval	5 memory abilities incl. updates, abstention	500 Q, 115K-1.5M tok	LLM-judge accuracy	2024	Yes	Only 500 Q; synthetic padding; verbatim-retrieval can game it
BEAM	10 abilities incl. contradiction, ordering	128K-10M tok, 2K Q	Nugget rubric 0/0.5/1	2025	Yes	New; single-user single-narrative, not true agent trajectories
MemoryAgentBench	Retrieval, test-time learning, conflict	~100K-1.4M tok	Task accuracy	2025	Yes	Reformulated datasets; chunked streaming simulates multi-turn
MSC	Multi-session persona consistency	~5 sessions, ~9K tok	Perplexity / human eval	2021	Yes	Too short; generation metrics; superseded for QA eval
ConvoMem	When memory beats raw context	~75K Q, ~150 convs	QA accuracy	2025	Yes	New; central claim is vendor-adjacent; accuracy-only
NIAH	Single-fact long-context retrieval	up to 128K+ tok	Binary match	2023	No	Saturated; lexical-only; no distractors or reasoning
RULER	Multi-needle, tracing, aggregation	4K-128K+ tok	Per-task accuracy	2024	No	Synthetic; single-needle portion near-saturated; gameable
NoLiMa	Semantic (non-lexical) recall	up to 32K+ tok	Accuracy vs baseline	2025	No	Synthetic single-needle form; very hard, low signal on weak systems
BABILong	Multi-fact reasoning in a haystack	up to 10M tok	Per-task accuracy	2024	No	bAbI artificiality; static document, no session dynamics
HotpotQA	2-hop Wikipedia QA	10-para pool	EM / F1	2018	No	Cheatable (~35% single-hop); contamination risk
MuSiQue	Genuine 2-4 hop, anti-shortcut	~20 paras	F1 / chain coverage	2022	No	Template phrasing; static corpus, not memory

Read the table top-to-bottom and the throughline is visible: the strict-memory rows are where agent memory actually lives, and the long-context rows are diagnostic tools that test the LLM, not the store. The scale column tells the rest of the story — almost everything caps below the regime where production memory operates, with BEAM and BABILong the lone outliers at 10M tokens (and only BEAM is a memory benchmark).

The four recurring problems

Across both families, four problems recur often enough to be the honest meta-layer of the field. Any number you read is shaped by at least one of them.

1. Judge variance and wrong answer keys

Most memory benchmarks score free-form answers with an LLM judge, and the judge's grading prompt, not the memory system, often decides the headline. In a controlled experiment, holding the answer set fixed and swapping only the grading prompt moved a LoCoMo score by about 40 points, while swapping the judge model moved it by only 1-2 points. On top of that, LoCoMo's own answer key is about 6.4% wrong, and its common judge accepts roughly 63% of intentionally wrong answers. The mechanism is detailed in our judge-variance article; the short version is that a judge score without its full recipe is a rumour.
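To make the mechanism concrete, here is a toy sketch in which the answers stay fixed and only the grading rule changes. It does not reproduce the experiment above: both graders are rule-based stand-ins for a strict and a lenient judge prompt, and the four answer pairs are invented for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period."""
    return text.lower().strip().rstrip(".")


def strict_grader(answer: str, gold: str) -> bool:
    # Stand-in for a strict prompt: normalized exact match only.
    return normalize(answer) == normalize(gold)


def lenient_grader(answer: str, gold: str) -> bool:
    # Stand-in for a lenient prompt: any shared word counts as "on topic".
    return bool(set(normalize(answer).split()) & set(normalize(gold).split()))


def score(grader, pairs) -> float:
    """Fraction of (answer, gold) pairs the grader accepts."""
    return sum(grader(answer, gold) for answer, gold in pairs) / len(pairs)


# Invented (system answer, gold answer) pairs; the answers never change.
FIXED_ANSWERS = [
    ("Paris", "Paris"),
    ("March 2023", "march 2023."),
    ("a dog named Rex", "a cat named Rex"),  # wrong but topically adjacent
    ("I don't know", "Seattle"),
]

strict_score = score(strict_grader, FIXED_ANSWERS)
lenient_score = score(lenient_grader, FIXED_ANSWERS)
print(strict_score, lenient_score)  # 0.5 0.75
```

The only difference between the two numbers is the grader accepting the wrong-but-adjacent answer. The size of the gap here is arbitrary; the point is that it comes entirely from the grading rule.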

2. Saturation and context-window leakage

Many benchmarks marketed as memory tests fit inside a modern context window, so they measure the LLM's attention, not a persistent store. NIAH and the synthetic retrieval subtasks of RULER and InfiniteBench are saturated on frontier models; HotpotQA is largely cheatable; LoCoMo's conversations are short enough to load wholesale. When a "memory" benchmark fits in a window, a high score may just mean the model has a big window — which is why the 10M-token regime of BEAM is the interesting frontier: there, no window is large enough.
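A quick way to screen for leakage is to compare a benchmark's largest history against the reader's context window. The sketch below uses the upper token scales quoted in this guide; the 1M-token window is an assumed example value, not a claim about any particular model, and the function name is illustrative.

```python
# Upper token scales as quoted in this guide (approximate).
BENCHMARK_MAX_TOKENS = {
    "LoCoMo": 26_000,
    "LongMemEval-S": 115_000,
    "LongMemEval-M": 1_500_000,
    "MemoryAgentBench": 1_400_000,
    "BEAM": 10_000_000,
}

# Assumption for illustration only: a reader with a 1M-token window.
ASSUMED_WINDOW_TOKENS = 1_000_000


def fits_in_window(max_tokens: int, window_tokens: int) -> bool:
    """True if the whole history could be loaded in a single pass."""
    return max_tokens <= window_tokens


leaky = [
    name
    for name, max_tokens in BENCHMARK_MAX_TOKENS.items()
    if fits_in_window(max_tokens, ASSUMED_WINDOW_TOKENS)
]
print(leaky)  # ['LoCoMo', 'LongMemEval-S']
```

A benchmark on the `leaky` list is not necessarily saturated; the check only says a high score there cannot, by itself, be attributed to the memory system rather than the window.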

3. Non-comparability across papers

There is no standard evaluation protocol for LoCoMo, LongMemEval, or BEAM. Every paper picks its own judge model and prompt, its own reader model, its own top-k retrieval depth, its own dataset subset, and sometimes its own answer-key version — and each of those moves the score, often more than the inter-system gaps being advertised. Two numbers on the same named benchmark are frequently not measuring the same thing. Vendor self-claims and independent evaluations routinely diverge by 18-24 points on LoCoMo. This is why any cross-vendor comparison must be marked comparable: false unless the judge, reader, top-k, metric, and subset all match.
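The comparability rule above can be written down as a small check. This is a minimal sketch, not our production schema: the `EvalRecipe` type, its field names, and the example values (judge, reader, top-k) are hypothetical placeholders.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalRecipe:
    """The settings that must match before two scores can be compared."""
    benchmark: str
    judge: str    # judge model plus grading-prompt version
    reader: str   # model that writes the final answer
    top_k: int    # retrieval depth
    metric: str
    subset: str


def mismatched_fields(a: EvalRecipe, b: EvalRecipe) -> list[str]:
    """Names of the recipe fields on which the two runs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def comparable(a: EvalRecipe, b: EvalRecipe) -> bool:
    """Two numbers are comparable only if every recipe field matches."""
    return not mismatched_fields(a, b)


# Hypothetical runs on the same named benchmark.
vendor_claim = EvalRecipe(
    "LoCoMo", "judge-A/prompt-v1", "reader-X", 20, "llm_judge_accuracy", "non-adversarial"
)
independent_run = EvalRecipe(
    "LoCoMo", "judge-A/prompt-v2", "reader-X", 5, "llm_judge_accuracy", "non-adversarial"
)

print(comparable(vendor_claim, independent_run))         # False
print(mismatched_fields(vendor_claim, independent_run))  # ['judge', 'top_k']
```

Returning the list of mismatched fields, rather than a bare boolean, tells the reader which part of the recipe to question first.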

4. The scale ceiling

Production agent memory lives in the 10M-to-1B-token regime — a long-running assistant accumulates years of history. Most benchmarks cap far below it: LoCoMo at tens of thousands of tokens, LongMemEval at about 1.5M, MemoryAgentBench at about 1.4M. BEAM at 10M tokens is the current public frontier, and even that is the floor of where deployed memory operates, not the ceiling. The honest reading is that the field's measurement apparatus trails the field's deployment reality by roughly two orders of magnitude. The bar has risen a long way since goldfish memory; it has not yet reached the building.
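The "two orders of magnitude" figure is a base-10 logarithm of the ratio between the top of the production regime and a benchmark's cap. A short sketch, using the approximate caps quoted above and taking 1B tokens as the upper end of the production range:

```python
import math

# Upper end of the production regime described above.
PRODUCTION_UPPER_TOKENS = 1_000_000_000

# Approximate benchmark caps as quoted in this guide.
BENCHMARK_CAPS = {
    "LoCoMo": 26_000,
    "MemoryAgentBench": 1_400_000,
    "LongMemEval-M": 1_500_000,
    "BEAM": 10_000_000,
}

# Orders of magnitude between each cap and the production upper end.
gap = {
    name: math.log10(PRODUCTION_UPPER_TOKENS / cap)
    for name, cap in BENCHMARK_CAPS.items()
}

for name, orders in gap.items():
    print(f"{name}: {orders:.1f} orders of magnitude below 1B tokens")
```

BEAM comes out at 2.0, and the other benchmarks listed sit between roughly three and five orders of magnitude short.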

Where this leaves an honest reader

There is no single trustworthy, scale-honest memory benchmark yet. Each one tests a real slice of memory and misses others; the leaders for cross-session recall are caveated by judge and key noise, the leader for scale is new and narrow, and the long-context tests measure the LLM rather than the store. So the honest practice is not to crown one benchmark — it is to read every number with its recipe attached and to keep a judge-free metric in view.

That is our stance, stated plainly so you can hold us to it. We measure on several benchmarks rather than one; we lead with recall@k — answer-key-checked and judge-free — alongside any LLM-judge score, because recall is the harder number to game; and we plant our flag on the scale frontier, because that is where agent memory actually has to work and where a context window can no longer fake it. We do not claim to top any benchmark, and none of these benchmarks is "ours." The full discipline — regression-not-peak, the comparability key, provenance, recall-first — is in our methodology page. The map above is where it starts: knowing which test measures what, and which numbers to believe.
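For readers who have not met the metric, recall@k can be computed without any judge. The sketch below uses one common definition, assumed here rather than taken from our methodology page: the fraction of gold evidence items that appear among the top k retrieved items. The session ids are made up.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Fraction of gold evidence ids found in the first k retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("recall is undefined when there is no gold evidence")
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold) / len(gold)


# Hypothetical example: ranked retrieval results and the two sessions
# that actually contain the evidence for the question.
retrieved = ["s12", "s03", "s40", "s07", "s21"]
gold = {"s03", "s21"}

print(recall_at_k(retrieved, gold, k=3))  # 0.5
print(recall_at_k(retrieved, gold, k=5))  # 1.0
```

Because the score depends only on ids and the answer key, no grading prompt can move it. It does still inherit any errors in the key, which is why the key has to be checked too.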

Common questions

What is the best AI memory benchmark? There is no single trustworthy, scale-honest one. LongMemEval is the cleanest decomposition of cross-session memory abilities; LoCoMo is the most-cited but its key is about 6.4% wrong and its judge over-lenient; BEAM (up to 10M tokens) is the scale frontier and the only one where the token budget far exceeds any context window. Measure on several and read recall@k beside any judge score.

Why do memory benchmark scores disagree? Because most score free-form answers with an LLM judge whose grading prompt — not the memory system — often decides the headline, and because different papers use different reader models, top-k depths, subsets, and answer-key versions. Two numbers on the same benchmark frequently are not measuring the same thing. See our judge-variance piece.

Which benchmarks test long-term memory at scale? Most cap far below the 10M-token regime where production memory lives. BEAM runs to 10M tokens; LongMemEval-M to about 1.5M; LongMemEval-V2 uses agent trajectories to about 115M tokens; BABILong embeds reasoning out to 10M (but tests long-context, not a store). Among true store-and-recall benchmarks, BEAM at 10M is the frontier.

Sources

Strict conversational / agent-memory benchmarks:

LoCoMo — arXiv:2402.17753 (Maharana et al., ACL 2024)
LongMemEval — arXiv:2410.10813 (Wu et al., ICLR 2025)
BEAM ("Beyond a Million Tokens") — arXiv:2510.27246 (Tavakoli et al., 2025)
MemoryAgentBench — arXiv:2507.05257 (Hu et al., 2025; ICLR 2026)
MSC (Multi-Session Chat) — arXiv:2107.07567 (Xu, Szlam, Weston, ACL 2022)
ConvoMem — arXiv:2511.10523 (2025)
LongMemEval-V2 — arXiv:2605.12493 (Wu et al., 2026)
PerLTQA — arXiv:2402.16288 (Du et al., 2024)
DialSim — arXiv:2406.13144 (Kim et al., 2024)
MemoryBank / SiliconFriend (mechanism, not benchmark) — arXiv:2305.10250 (Zhong et al., AAAI 2024)

Long-context and multi-hop QA benchmarks:

NIAH (Needle-in-a-Haystack) — Kamradt, 2023 (github.com/gkamradt/LLMTest_NeedleInAHaystack)
RULER — arXiv:2404.06654 (Hsieh et al., NVIDIA, COLM 2024)
NoLiMa — arXiv:2502.05167 (Modarressi et al., Adobe, ICML 2025)
BABILong — arXiv:2406.10149 (Kuratov et al., NeurIPS 2024)
InfiniteBench — arXiv:2402.13718 (Zhang et al., OpenBMB, ACL 2024)
HotpotQA — arXiv:1809.09600 (Yang et al., EMNLP 2018)
MuSiQue — arXiv:2108.00573 (Trivedi et al., TACL 2022)
2WikiMultihopQA — arXiv:2011.01060 (Ho et al., COLING 2020)

Related

Judges, Good and Evil: Why Memory Benchmark Scores Swing 40 Points — the judge-variance problem in depth.
How We Measure AI Memory Honestly — the discipline that keeps our own numbers re-derivable.
The AI Memory Landscape 2026 — the companion map of memory systems (this piece maps the tests).

By Edward Izgorodin. Last updated 2026-06-23. A field guide to AI-memory benchmarks, grounded in the benchmark papers cited above; benchmarks are tests, not competitors, and cross-vendor results are not comparable unless judge, reader, top-k, metric, and subset all match.
