How to Evaluate AI Agent Memory: The Benchmark Map and the Multi-Axis Lens
Every memory vendor cites a near-perfect "LoCoMo 90+%." Open three of those posts and you will find three different numbers for the same systems — and the same benchmark name pointing at two different datasets. The full LoCoMo release is 50 conversations and 7,512 questions; the harness most vendors actually run scores a roughly 10-conversation, ~1,986-question subset. Same name, two numbers. No two harnesses agree. So the first thing to learn about evaluating agent memory is that you cannot do it by reading one number.
Memory evaluation is a framework, not a score: what to measure (the memory operations and abilities), with which instrument (the benchmark map), and on which axes (quality, latency, cost, governance). Every public number is harness-dependent.
TL;DR
- "LoCoMo 90+%" is not one fact: the full dataset is 50 conv / 7,512 Q, but the common vendor harness scores a ~10-conv / ~1,986-Q subset — the same name, two numbers.
- Memory evaluation is a framework, not a benchmark — what to measure (write → manage → read operations; direct vs indirect), with which instrument (the benchmark map below), on which axes.
- No single benchmark covers "memory". Each tests a different slice: LoCoMo tests multi-session recall (now saturating); LongMemEval adds knowledge updates and abstention; BEAM adds contradiction resolution and event ordering at up to 10M tokens; MemoryAgentBench adds test-time learning and conflict resolution; MemoryArena tests active, decision-relevant use (not passive recall); STATE-Bench tests state tracking.
- Accuracy is a trap. A full-context baseline can score ~6 points higher than a memory layer while costing ~14× the tokens and ~14× the latency — latency, cost, and governance are co-equal axes.
This article is the map. The two companions in this cluster already own the parts that are settled: why the output-evaluation tools don't measure memory (it is a different unit), and why the LLM judge under every score is lenient (it says yes too easily). Here we cover the gap they leave — the benchmark map, the framework it implies, and the axes beyond accuracy.
The framework: direct vs indirect, write → manage → read
Lay the benchmarks side by side and a structure falls out. They are not competing leaderboards; they test different parts of one pipeline, and they answer one of two fundamentally different questions.
Agent memory evaluation measures whether an agent's memory behaves correctly across many sessions (recall, consolidation, contradiction handling, recency), not whether any single response is good. That is its own measurement problem, distinct from response quality, which the companion article on output-evaluation tools covers.
The cleanest organizing split comes from the foundational "Survey on the Memory Mechanism of LLM-based Agents" (Zhang et al., 2024): direct vs indirect evaluation.
Direct (intrinsic) evaluation measures the memory module itself — does the store hold the right thing, and can it retrieve it? Indirect (extrinsic) evaluation measures whether having memory makes the agent better at its actual task. Most vendor numbers are direct conversational recall. The thing practitioners care about — does memory improve my agent's job — is indirect, and barely measured. That single split explains the map: LoCoMo and LongMemEval are direct recall; MemoryArena and STATE-Bench are indirect, task-level use.
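One way to make the indirect question concrete is an ablation: run the same tasks with memory switched on and off, then compare task success. The following is a minimal sketch under stated assumptions. The `EpisodeResult` record, the function names, and the toy data are all hypothetical and do not come from the survey or from any benchmark harness.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one task episode (hypothetical record shape)."""
    task_id: str
    succeeded: bool


def success_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes that succeeded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def indirect_lift(with_memory: list[EpisodeResult],
                  without_memory: list[EpisodeResult]) -> float:
    """Indirect (extrinsic) signal: task success with memory minus without."""
    return success_rate(with_memory) - success_rate(without_memory)


# Toy data, invented for illustration only.
with_mem = [EpisodeResult("t1", True), EpisodeResult("t2", True),
            EpisodeResult("t3", False), EpisodeResult("t4", True)]
without_mem = [EpisodeResult("t1", True), EpisodeResult("t2", False),
               EpisodeResult("t3", False), EpisodeResult("t4", False)]

lift = indirect_lift(with_mem, without_mem)
print(f"indirect lift: {lift:+.2f}")
```

A direct evaluation would instead inspect the store itself, for example by checking whether the right records come back for a query, without running the agent's task at all.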
Inside the direct path, a 2026 survey on memory for autonomous LLM agents (a single-author March-2026 preprint — read it as "a 2026 survey proposes," not consensus) frames memory as a write → manage → read loop, where each operation is separately evaluable and separately fails:
- Write — filter low-signal records, canonicalize (normalize dates, names, quantities).
- Manage — deduplicate and consolidate overlapping entries; detect and resolve contradictions; forget stale facts.
- Read — retrieve the relevant records and utilize them in the response.
Consolidation is the management operation that merges duplicate and overlapping memories into one canonical entry, instead of accumulating conflicting copies of the same fact. The same survey calls it an open challenge, observing that current systems "oscillate between hoarding and amnesia." Each stage can fail on its own, and each can be tested on its own. That is the whole point of a framework: you don't grade "memory," you grade the operation that broke.
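To show why the three operations can be tested separately, here is a toy version of the loop. It is a sketch, not the survey's design: the `Record` shape, the key naming, and the rule that the most recent statement wins a contradiction are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: str    # canonical fact key, e.g. "user.city" (hypothetical scheme)
    value: str
    turn: int   # when the fact was stated; higher means more recent


def write(store: list[Record], key: str, value: str, turn: int) -> None:
    """Write: canonicalize whitespace and case, and drop empty records."""
    value = re.sub(r"\s+", " ", value).strip()
    if not value:  # low-signal record, filtered out
        return
    store.append(Record(key.strip().lower(), value, turn))


def manage(store: list[Record]) -> list[Record]:
    """Manage: keep one entry per key; assume the most recent statement wins."""
    latest: dict[str, Record] = {}
    for rec in store:
        kept = latest.get(rec.key)
        if kept is None or rec.turn >= kept.turn:
            latest[rec.key] = rec
    return list(latest.values())


def read(store: list[Record], key: str) -> Optional[str]:
    """Read: return the stored value, or None (abstain) when nothing matches."""
    wanted = key.strip().lower()
    for rec in store:
        if rec.key == wanted:
            return rec.value
    return None


store: list[Record] = []
write(store, "User.City", "  Berlin ", turn=1)
write(store, "user.city", "Lisbon", turn=7)  # contradicts the earlier fact
write(store, "user.city", "   ", turn=8)     # empty, filtered at write time
store = manage(store)
print(read(store, "user.city"), read(store, "user.pet"))
```

Each function can be given its own test cases: malformed input for `write`, conflicting pairs for `manage`, and unanswerable queries for `read`.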
What to measure: dimensions mapped to instruments
Here is the dimension set, synthesized from the canonical lists — LongMemEval's five abilities, BEAM's ten, MemoryAgentBench's four — each mapped to the instrument that tests it and the standard metric.
| Dimension | What it asks | Instrument | Metric |
|---|---|---|---|
| Retrieval quality | Did the store surface the right records? | IR metrics, any harness | Precision@k / Recall@k |
| Information extraction | Recall a specific fact from a long history | LongMemEval, BEAM | LLM-judge / F1 |
| Multi-session / multi-hop reasoning | Synthesize facts across sessions | LongMemEval, BEAM | accuracy |
| Temporal / recency | Reason over "when" and timestamps | LongMemEval, BEAM | accuracy |
| Knowledge update | Revise a stored fact when it changes | LongMemEval, BEAM | accuracy |
| Contradiction resolution | Reconcile conflicting statements | BEAM, MemoryAgentBench | accuracy |
| Consolidation / dedup | Merge overlaps, drop stale copies | no standard public benchmark | contradiction rate, staleness |
| Abstention | Say "I don't know" on false premises | LongMemEval, BEAM | catch-rate |
| Test-time learning | Improve from in-context outcomes | MemoryAgentBench | competency score |
| Active / decision use | Use memory to guide a next action | MemoryArena, STATE-Bench | task success |
Recall@k is the fraction of all relevant records that appear in the top-k retrieved results: Recall@k = (relevant records in top-k) / (total relevant records). Precision@k is the fraction of the top-k that are relevant: Precision@k = (relevant records in top-k) / k. These grade the memory layer's retrieval directly — distinct from the judge that grades the final answer.
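The two formulas translate directly into code. This is a minimal sketch that assumes retrieved results are a ranked list of unique record IDs and that the relevant set comes from a hand-labelled gold set. The function names and the toy IDs are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved records that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for rec_id in retrieved[:k] if rec_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant records that appear in the top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen here; nothing to recall
    hits = sum(1 for rec_id in retrieved[:k] if rec_id in relevant)
    return hits / len(relevant)


# Toy example: ranked retrieval output and gold labels, invented for illustration.
retrieved = ["m7", "m2", "m9", "m4", "m1"]
relevant = {"m2", "m4", "m5"}

for k in (3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Note that precision divides by k even when fewer than k records are returned, which matches the definition above and penalizes a store that returns too little.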
The honest gap: the management dimensions — consolidation, dedup, forgetting — have no widely-adopted public benchmark that scores them as a standalone number, even though every survey names them. If those matter for your agent, you assemble your own gold set and adversarial cases. The dimensions are agreed; the instruments for the dynamic ones are immature.
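Because no standard instrument exists, any home-grown metric needs its definition stated. The sketch below assumes two simple definitions that are not taken from any benchmark: contradiction rate is the share of fact keys holding more than one distinct value in the store, and staleness rate is the share of gold facts for which the system returns an outdated or missing value. All data is invented.

```python
from collections import defaultdict


def contradiction_rate(store: list[tuple[str, str]]) -> float:
    """Share of keys that hold more than one distinct value (assumed definition)."""
    values: dict[str, set[str]] = defaultdict(set)
    for key, value in store:
        values[key].add(value)
    if not values:
        return 0.0
    return sum(len(v) > 1 for v in values.values()) / len(values)


def staleness_rate(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Share of gold facts answered with an outdated or missing value."""
    if not gold:
        return 0.0
    stale = sum(answers.get(key) != current for key, current in gold.items())
    return stale / len(gold)


# Toy store: "city" has conflicting copies, "diet" is a harmless duplicate.
store = [("city", "Berlin"), ("city", "Lisbon"),
         ("employer", "Acme"), ("diet", "vegan"), ("diet", "vegan")]

# Gold set: the current truth after all updates in the conversation history.
gold = {"city": "Lisbon", "employer": "Acme", "diet": "vegan", "pet": "cat"}
answers = {"city": "Berlin", "employer": "Acme", "diet": "vegan"}

print(f"contradiction rate: {contradiction_rate(store):.2f}")
print(f"staleness rate: {staleness_rate(answers, gold):.2f}")
```

Adversarial cases fit the same shape: add keys whose value changes several times, or near-duplicate phrasings of one fact, and check that both rates stay low.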
The benchmark map: what each agent-memory benchmark actually tests
The instrument you pick decides what "remembers" even means. Six benchmarks dominate the 2026 landscape. Read this as a lookup table: each row is a different question about memory, and the limitation column is where the number stops meaning what you think it means.
| Benchmark | What it tests | Scale / config | Headline finding | Known limitation |
|---|---|---|---|---|
| LoCoMo (Maharana et al., ACL 2024) | Multi-session dialogue recall: single/multi-hop, temporal, open-domain, adversarial | Full: 50 conv / 7,512 Q; vendor harness: ~10 conv / ~1,986 Q | A full-context baseline (~73%) can beat a memory system (~67%) | Saturating; original metrics lexical (F1/ROUGE/FActScore), no LLM judge; same name, two configs |
| LongMemEval (Wu et al., ICLR 2025) | 5 abilities: info extraction, multi-session reasoning, temporal, knowledge update, abstention | 500 Q across 7 question types | Commercial assistants drop ~30% (up to 30–60% on the -S variant) | Still chatbot-shaped recall, not agentic task use |
| BEAM (Tavakoli et al., arXiv, 2025) | 10 abilities, including contradiction resolution and event ordering (which the two earlier benchmarks omit) | 100 conv, 2,000 Q, up to 10M tokens (128K/500K/1M/10M tiers) | Even 1M-context models struggle as dialogues lengthen; a bigger window is not the fix | Venue unconfirmed; cite as arXiv, 2025, not ICLR |
| MemoryAgentBench (Hu, Wang, McAuley, 2025) | 4 competencies: accurate retrieval, test-time learning, long-range understanding, conflict resolution / selective forgetting | Incremental multi-turn, ~100K–1.44M-token depths | "All methods fall short of mastering all four" | 4th competency named differently in arXiv vs repo |
| MemoryArena (arXiv, 2026) | Active vs passive memory: decision-relevant use, not recall | 766 interdependent multi-session agentic tasks | Models near-perfect on LoCoMo drop to ~40–60% | Venue unconfirmed; the 40–60% band is a survey/secondary characterization |
| STATE-Bench (Microsoft, May 2026) | State tracking: does the agent improve with experience on stateful tasks? | Realistic stateful enterprise tasks | A relatively neutral source — Microsoft doesn't sell a memory product | Recent; independent replication still thin |
The most important lesson hides in the first row. "LoCoMo" names two different tests. The released dataset is 50 conversations and 7,512 questions across five reasoning types (Maharana et al.); the harness most vendors actually run is a ~10-conversation, ~1,986-question subset (ByteRover), and Penfield audited the 1,540 non-adversarial questions in it. If you read "92% on LoCoMo" without asking which LoCoMo, you have not read a result.
Two more caveats the table forces into the open. BEAM is a 2025 arXiv preprint (2510.27246); the "ICLR 2026" venue floating around is unconfirmed, so cite it as arXiv. And vendor self-reports are not on this map as neutral fact: Mem0's June-2026 blog claims LoCoMo 92.5 and LongMemEval 94.4, measured on its own harness as of that date. Flag and date such numbers; do not treat them as comparable results.
Now the map reads as a checklist. Pure recall? LoCoMo and LongMemEval's information-extraction ability. Contradiction resolution and event ordering? BEAM, the only one that covers both. Test-time learning and selective forgetting? MemoryAgentBench. Does memory change a decision (indirect)? MemoryArena and STATE-Bench. The point is not that one benchmark is best. It is that the benchmark you pick must match the memory operation you care about, and most omit the management operations (consolidation, forgetting, contradiction) entirely.
Accuracy is a trap: the multi-axis lens
The benchmark map gives you a quality number. A quality number alone is a trap, because the cheapest way to win it is to stop being a memory system at all — just stuff the entire history into the context window. On a saturating benchmark like LoCoMo, that brute-force baseline competes with, and often beats, purpose-built memory. So quality is one axis among four.
| Axis | What you measure | Why it's co-equal |
|---|---|---|
| Quality | accuracy, recall, contradiction rate, staleness | The headline — but the easiest to game with a big window |
| Latency | p50 / p95 per memory operation | A correct answer in 10s can be useless in a live agent |
| Token cost | prompt tokens consumed by injected memory; storage growth | Drives the bill and the context budget |
| Governance | privacy-leakage rate, deletion compliance, access-scope violations | Named by the survey; no standard metric yet — a dimension to watch |
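The latency row can be measured with very little machinery. The sketch below times a callable repeatedly and reports nearest-rank percentiles. The `time_operation` helper and the sample latencies are assumptions for illustration; a real harness would time the actual write, manage, and read calls of the system under test.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def time_operation(op: Callable[[], object], runs: int = 50) -> list[float]:
    """Wall-clock seconds for each of `runs` calls to one memory operation."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        op()
        timings.append(time.perf_counter() - start)
    return timings


# Invented latencies in seconds: mostly fast, with one slow outlier.
latencies = [0.12, 0.10, 0.11, 0.95, 0.13, 0.12, 0.11, 0.14, 0.10, 0.12]
print(f"p50={percentile(latencies, 50):.2f}s p95={percentile(latencies, 95):.2f}s")
```

Reporting p95 next to p50 matters because a single slow retrieval, like the outlier above, is invisible in the median but is exactly what a live agent's user feels.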
The trade-off is concrete. In the Mem0 paper (Chhikara et al., April 2025), a full-context baseline scored about 72.9% on LoCoMo but cost roughly 9.87s and ~26K tokens per conversation; a memory layer scored about 66.9% at roughly 0.71s and ~1.8K tokens. By those figures the memory layer gives up about six accuracy points in exchange for roughly 93% lower latency and 93% fewer tokens. (These figures originate with the vendor's own paper; treat them as self-reported, April 2025.) The Agent Memory Benchmark framing (Vectorize/Hindsight, March 2026) puts the rule plainly: "90% accuracy at $10/user/day is not better than 82% at $0.10." If your memory system can't beat a full-context baseline on quality, you don't have an accuracy story. You have a cost-and-latency story, and that is a legitimate story, but report it as what it is.
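The arithmetic behind that comparison is worth keeping as a small script, so the same report can be regenerated for your own runs. This sketch uses the rounded, vendor-reported figures quoted above as inputs; the `RunProfile` type and helper name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunProfile:
    name: str
    accuracy_pct: float
    latency_s: float
    tokens: int


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percent by which the candidate undercuts the baseline."""
    return (baseline - candidate) / baseline * 100


# Rounded, self-reported figures from the Mem0 paper as quoted in this article.
full_context = RunProfile("full-context", 72.9, 9.87, 26_000)
memory_layer = RunProfile("memory layer", 66.9, 0.71, 1_800)

accuracy_gap = full_context.accuracy_pct - memory_layer.accuracy_pct
latency_cut = relative_reduction(full_context.latency_s, memory_layer.latency_s)
token_cut = relative_reduction(full_context.tokens, memory_layer.tokens)

print(f"accuracy gap: {accuracy_gap:.1f} points")
print(f"latency reduction: {latency_cut:.1f}%")
print(f"token reduction: {token_cut:.1f}%")
```

Swapping in your own baseline and memory-layer measurements yields the three numbers that belong next to any accuracy claim.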
The workflow: a practitioner playbook
Once you have the framework, the map, and the axes, evaluating a real system is three disciplined steps.
1. Name the dimension, then pick the instrument that tests it, by its config and not its name. Cross-session recall, contradiction handling, and decision-relevant use are different problems with different benchmarks. A scheduling assistant lives on temporal reasoning and knowledge update; a research agent on retrieval recall and abstention. Then read the config: "LoCoMo" is two datasets, so confirm two scores ran the same one before you compare them.
2. Run a full-context baseline, and measure quality and cost. If your memory layer can't beat dumping everything into context on quality, your case is cost and latency, which is a legitimate result when reported as what it is. Pair every accuracy figure with p95 latency, tokens per query, and storage growth.
3. Evaluate indirectly on your own data, and disclose your config. A high LoCoMo number does not predict that memory makes your agent better at your task; only an indirect, task-level eval does. When you report a result, make its context travel with it: which benchmark and subset, the judge model and prompt, the date, the backbone model, the limits. A score without its harness is not a result. It is a claim.
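The disclosure habit in the steps above can be enforced in code by refusing to store a score without its harness. The following is a sketch of one possible record shape. Every field name, system name, and model label is a placeholder invented for illustration, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedScore:
    system: str
    score_pct: float
    benchmark: str
    subset: str            # which configuration of the benchmark was run
    judge_model: str
    judge_prompt_id: str
    backbone_model: str
    run_date: str          # ISO date of the run
    limits: str = ""       # known caveats, e.g. saturation

    def config(self) -> tuple[str, ...]:
        """Everything that must match before two scores can be compared."""
        return (self.benchmark, self.subset, self.judge_model,
                self.judge_prompt_id, self.backbone_model)


def comparable(a: ReportedScore, b: ReportedScore) -> bool:
    """True only when both scores came from the same harness configuration."""
    return a.config() == b.config()


# Placeholder entries; none of these are real results.
a = ReportedScore("system-a", 91.0, "LoCoMo", "vendor subset (~10 conv)",
                  "judge-x", "prompt-v1", "backbone-y", "2026-05-01")
b = ReportedScore("system-b", 74.0, "LoCoMo", "full release (50 conv)",
                  "judge-x", "prompt-v1", "backbone-y", "2026-05-03")
c = ReportedScore("system-c", 88.0, "LoCoMo", "vendor subset (~10 conv)",
                  "judge-x", "prompt-v1", "backbone-y", "2026-05-10")

print(comparable(a, b), comparable(a, c))
```

Here `a` and `b` share a benchmark name but not a subset, so the check rejects the comparison, while `a` and `c` pass because every config field matches.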
How to read a public memory score
Step 1 says read the config; this is why. There is no neutral leaderboard, and the score is a property of the test rig as much as the system. The same memory system moved from 65.99% to 75.14% — about nine points — purely by fixing role assignment, timestamps, and search parallelism in the harness (Zep re-run); Hindsight reports that judge-prompt and judge-model changes swing scores by double digits. That swing comes from the grader underneath the score, which is lenient by construction. We do not re-derive that here — the LLM-as-a-judge deep-dive takes apart the leniency, the "be generous" grader, and the bias catalogue. Inherit it, and read every score with three things attached: its config (which dataset, which subset, which judge), its date (vendor self-scores climb across blog posts), and its limits (saturation, answer-key issues, venue status).
That last discipline is the one the field mostly skips, and the one Mnemoverse — a persistent-memory engine whose job is recall, consolidation, and contradiction handling across sessions — tries to hold itself to: on the benchmarks page, every number travels with its config, date, and limits. A score without that context is not a result; it's a benchmark name with a number attached.
Common questions
How do you evaluate AI agent memory? Not with one benchmark and one number. Memory evaluation is a framework: read the benchmark map to pick the instrument that tests the memory operation you care about (write, manage, or read), then measure quality alongside latency, token cost, and governance. Every public memory score is harness-dependent, so read each one with its config, date, and limits attached.
What is the difference between LoCoMo and LongMemEval? LoCoMo (ACL 2024) is the original multi-session dialogue benchmark, now saturating — a plain full-context baseline can beat a purpose-built memory system. LongMemEval (ICLR 2025) is more disciplined: 500 curated questions across five abilities (information extraction, multi-session reasoning, temporal, knowledge update, abstention), on which commercial assistants drop about 30%.
Why do vendors report different scores on the same benchmark? Because there is no neutral leaderboard, and one benchmark name can map to two datasets. The full LoCoMo release is 50 conversations / 7,512 questions, but the widely-cited vendor harness scores a roughly 10-conversation / ~1,986-question subset. On top of that, the judge and harness choices alone can swing a score by double digits.
Is accuracy enough to judge an agent-memory system? No. Accuracy is one axis. In the Mem0 paper a full-context baseline scored ~72.9% but cost ~9.87s and ~26K tokens per conversation, while a memory layer scored ~66.9% at ~0.71s and ~1.8K tokens (vendor self-report, April 2025). Latency, token cost, storage growth, and governance are co-equal axes — 90% accuracy at $10/user/day is not better than 82% at $0.10.
Which benchmark tests contradiction resolution in agent memory? BEAM (arXiv, 2025) is the broadest, testing ten abilities including contradiction resolution and event ordering that earlier benchmarks omit, on dialogues up to 10M tokens — and even 1M-token-context models struggle, so a bigger window is not the fix. MemoryAgentBench adds a conflict-resolution / selective-forgetting competency.
Sources
- Benchmarks — LoCoMo, Maharana et al., ACL 2024 (arXiv:2402.17753); LongMemEval, Wu et al., ICLR 2025 (arXiv:2410.10813) (project page); BEAM, Tavakoli et al., arXiv, 2025 (2510.27246) (venue unconfirmed); MemoryAgentBench, Hu et al. (arXiv:2507.05257); MemoryArena (arXiv:2602.16313) (venue unconfirmed); STATE-Bench, Microsoft, May 2026
- Framework & surveys — A Survey on the Memory Mechanism of LLM-based Agents, Zhang et al., 2024 (arXiv:2404.13501) — direct vs indirect; Memory for Autonomous LLM Agents — write→manage→read loop, 2026 preprint (arXiv:2603.07670) (single-author preprint, not peer-reviewed)
- Multi-axis / efficiency — Mem0, Chhikara et al., April 2025 (arXiv:2504.19413) — full-context vs memory-layer latency/token figures, vendor-origin; Mem0 state-of-memory blog, June 2026 — higher self-reported scores, flagged; Agent Memory Benchmark framing, Vectorize/Hindsight, March 2026 — accuracy-vs-cost rule, vendor source
- Config nuance / retrieval metrics — LoCoMo vendor-subset counts, ByteRover (vendor blog); retrieval Precision@k / Recall@k; Zep harness re-run; harness-dependence detail in the LLM-as-judge companion
Related
- The Judge Says Yes Too Easily: LLM-as-a-Judge and Leniency — the measurement instrument under every memory number, and why it's lenient
- LangChain / LangSmith Evaluation: What It Measures — and the One Thing It Can't — why output-evaluation tools don't measure memory
- Benchmarks — how the Mnemoverse memory engine reports, config and limits attached
Edward Izgorodin, June 2026 — LinkedIn
