# Benchmarks
Evaluation results from the Mnemoverse memory engine on established benchmarks. All runs are automated, configurations are documented, and raw results are committed as JSON in the repository.
Interactive dashboard: graph.mnemoverse.com/#benchmarks — live leaderboard with detailed per-question breakdowns.
## LoCoMo (Primary Benchmark)
LoCoMo evaluates conversational memory across 10 multi-session dialogues (~5,900 turns, 1,986 QA items). Questions span four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge.
We use LLM-as-Judge scoring (J score: 0.0 / 0.5 / 1.0), plus token-level F1 and retrieval recall.
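Token-level F1 here is the standard SQuAD-style overlap metric. A minimal sketch of how it is typically computed (the engine's exact tokenization and normalization may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Standard SQuAD-style formulation; shown for illustration, not the
    repository's exact implementation.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

token_f1("the cat sat", "cat sat")  # 0.8: full recall, 2/3 precision
```

The multiset intersection is what makes verbose answers score partial credit instead of zero, which matters for interpreting the low overall F1 alongside a high J score.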
### Results (v0.7, February 2026)
| Category | N | J Score | Token F1 | Retrieval Recall |
|---|---|---|---|---|
| single-hop | 841 | 0.937 | — | — |
| temporal | 321 | 0.850 | — | — |
| multi-hop | 282 | 0.707 | — | — |
| open-domain | 96 | 0.703 | — | — |
| Overall | 1,986 | 0.862 | 0.236 | 0.897 |
### Leaderboard comparison (from the LoCoMo benchmark)
| System | J Score | single-hop | multi-hop | temporal | open-domain |
|---|---|---|---|---|---|
| MemMachine v0.2 | 0.912 | 0.979 | 0.897 | 0.912 | 0.780 |
| Mnemoverse v0.7 | 0.862 | 0.937 | 0.707 | 0.850 | 0.703 |
| Memobase | 0.758 | — | — | — | — |
| Zep | 0.751 | — | — | — | — |
| Mem0 | 0.669 | — | — | — | — |
Gap to MemMachine: -0.050 overall, largest in multi-hop (-0.190). Multi-hop retrieval is the primary bottleneck.
### Version progression
The engine improved from a retrieval-only baseline to J=0.862 across the v0.1–v0.7 series:
| Version | J Score | Key change |
|---|---|---|
| v0.1 | 0.116 | Retrieval only, no LLM reader |
| v0.2 | 0.702 | Added LLM-as-Judge, basic reader prompt |
| v0.3 | 0.865 | Better embeddings (+11%), iterative retrieval (+61% multi-hop) |
| v0.5 | 0.878 | Graph-based retrieval expansion, reranking |
| v0.7 | 0.862 | Memory decay, feedback learning, scaling instrumentation |
Each version is evaluated on the same 10-conversation dataset. The v0.5→v0.7 regression (0.878→0.862) is within run-to-run noise (std = 0.021).
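A minimal sketch of that noise check, using hypothetical per-conversation J scores (the real values are in the committed JSON results):

```python
import statistics

# Hypothetical per-conversation J scores for v0.7 (10 LoCoMo conversations);
# the real values are committed as JSON in the repository.
j_scores = [0.83, 0.88, 0.85, 0.89, 0.84, 0.87, 0.86, 0.84, 0.88, 0.88]

mean_j = statistics.mean(j_scores)   # 0.862
std_j = statistics.stdev(j_scores)   # sample std across conversations, ~0.021

# The v0.5 -> v0.7 drop is treated as noise when it is smaller than the
# conversation-to-conversation spread.
delta = 0.878 - 0.862                # 0.016
print(f"std={std_j:.3f}, delta={delta:.3f}, within_noise={delta < std_j}")
```

With a spread of ~0.021 across conversations, a 0.016 aggregate shift cannot be distinguished from which conversations happened to be hard.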
## HotpotQA
HotpotQA tests multi-step reasoning with questions requiring evidence from multiple documents. It has two categories: bridge questions, which follow a chain of facts across documents, and comparison questions, which contrast two entities.
### Results (March 2026, 500 questions)
| Category | N | Exact Match | F1 | Retrieval Recall |
|---|---|---|---|---|
| bridge | 404 | — | 75.8% | 100% |
| comparison | 96 | — | 86.9% | 100% |
| Overall | 500 | 63% | 78% | 100% |
Retrieval recall is perfect (100%): the memory engine finds supporting facts for every question. The 37% EM gap lies in answer extraction: the reader LLM has the right evidence but sometimes generates verbose or differently formatted answers.
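This gap is a property of exact-match scoring: a verbose but correct answer fails EM even after normalization. A sketch assuming SQuAD-style normalization rules (the harness's exact rules may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop articles, collapse whitespace. Assumed here for illustration."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

# A verbose but correct answer misses EM because extra words survive
# normalization, even though token F1 would still give partial credit:
exact_match("The answer is Paris.", "Paris")  # False
exact_match("Paris", "paris")                 # True
```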
### Failure analysis
| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 424 | 84.8% |
| Extraction fail (has evidence, wrong answer) | 76 | 15.2% |
| Retrieval miss | 0 | 0% |
Average search time: 740ms (API-bound).
## LongMemEval
LongMemEval evaluates long-term memory across six categories: single-session-user, single-session-assistant, single-session-preference, multi-session, knowledge-update, and temporal-reasoning.
### Results (March 2026, 500 questions)
| Category | N | Retrieval Recall | F1 |
|---|---|---|---|
| knowledge-update | 78 | 1.000 | 0.103 |
| multi-session | 133 | 0.993 | 0.180 |
| single-session-assistant | 56 | 1.000 | 0.142 |
| single-session-preference | 30 | 1.000 | 0.115 |
| single-session-user | 70 | 1.000 | 0.154 |
| temporal-reasoning | 133 | 0.981 | 0.164 |
| Overall | 500 | 0.993 | 0.153 |
Retrieval recall (99.3%) is strong. F1 is low (0.15) due to answer format mismatch — the reader generates verbose explanations where LongMemEval expects concise extractions.
## Scaling Analysis
The v0.7 benchmark includes the first systematic scaling study: per-conversation snapshots of graph topology, quality, and latency as memory grows from 419 to 5,880 atoms (14× growth).
### Graph growth dynamics
| Atoms | Edges | Concepts | Density | Hub ratio |
|---|---|---|---|---|
| 419 | 1,770 | 452 | 1.74% | 6.5× |
| 1,451 | 4,422 | 942 | 1.00% | 8.0× |
| 2,760 | 9,450 | 1,564 | 0.77% | 23.0× |
| 4,123 | 13,186 | 1,945 | 0.70% | 25.0× |
| 5,880 | 20,495 | 2,444 | 0.69% | 32.0× |
Power-law fits:
- Edges vs atoms: E ~ A^0.94 (near-linear, each atom adds ~3.5 edges)
- Concepts vs atoms: C ~ A^0.65 (Heaps' law — vocabulary saturates as new conversations reuse existing concepts)
Density decreases from 1.74% to 0.69% — the graph naturally sparsifies, which is healthy for both memory and query performance.
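Exponents like these can be recovered from the snapshot table with an ordinary least-squares fit on log-log axes. A sketch over the five snapshots above (the published fit presumably uses every per-conversation snapshot, so this coarser estimate lands slightly below 0.94):

```python
import math

# (atoms, edges) snapshots from the scaling table above
atoms = [419, 1451, 2760, 4123, 5880]
edges = [1770, 4422, 9450, 13186, 20495]

# Fit E ~ A^alpha by least squares on log E = alpha * log A + c
xs = [math.log(a) for a in atoms]
ys = [math.log(e) for e in edges]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"alpha = {alpha:.2f}")  # near-linear edge growth
```

An exponent near 1 means edges grow roughly in proportion to atoms (~3.5 per atom), while the sublinear concept exponent (0.65) is why density falls as the graph grows.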
### Quality does not degrade with scale
| Metric | At 419 atoms | At 5,880 atoms | Trend |
|---|---|---|---|
| J Score | 0.845 | 0.862 | Flat (0.865 ± 0.021) |
| Retrieval Recall | 0.885 | 0.897 | Flat (0.897 ± 0.035) |
| Search P90 | 706ms | 691ms | Flat (API-bound) |
The quality variance across conversations (std=0.021) is larger than any trend — conversation difficulty dominates, not memory size. At 14× growth, the engine maintains stable quality.
### Hub emergence
Hub ratio (P90 degree / P10 degree) grows from 6.5× to 32×, indicating scale-free network formation. The top 5.6% of concepts accumulate 50+ edges while 22.5% have only 1-2 connections — a power-law degree distribution consistent with real-world knowledge networks.
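A sketch of the hub-ratio computation on a hypothetical degree sample, using a nearest-rank percentile convention (the engine's percentile convention may differ):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile on sorted values (one common convention)."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
    return s[idx]

def hub_ratio(degrees: list[int]) -> float:
    """P90 / P10 of the concept degree distribution, as defined above."""
    return percentile(degrees, 90) / percentile(degrees, 10)

# Illustrative (hypothetical) degree sample: one hub, many leaves.
degrees = [1, 1, 2, 2, 2, 3, 3, 4, 6, 64]
print(hub_ratio(degrees))  # 6.0 on this sample
```

Because both P90 and P10 ignore the extreme tails, a rising ratio reflects a genuinely widening degree distribution rather than a single outlier hub.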
### Latency breakdown
| Percentile | Time | Bottleneck |
|---|---|---|
| P50 (episodic hit) | 0.04ms | Hash lookup |
| P90 (semantic search) | 691ms | API round-trip |
| P99 | 809ms | API + reranking |
Graph operations themselves take <1ms. Latency is dominated by external API calls for embeddings.
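The bimodal latencies suggest a two-tier lookup: exact episodic repeats are served from an in-memory hash, and everything else falls through to API-bound semantic search. A hypothetical sketch of that pattern (class and method names are illustrative, not the engine's actual API):

```python
from typing import Callable

class TwoTierMemory:
    """Illustrative two-tier lookup: an in-memory episodic map answers
    exact repeats in microseconds; misses fall through to a slow,
    API-bound semantic search. Structure is hypothetical."""

    def __init__(self, semantic_search: Callable[[str], list[str]]):
        self._episodic: dict[str, list[str]] = {}
        self._semantic_search = semantic_search

    def search(self, query: str) -> list[str]:
        if query in self._episodic:
            # P50 path: hash lookup, sub-millisecond
            return self._episodic[query]
        # P90 path: embedding API round-trip dominates latency
        results = self._semantic_search(query)
        self._episodic[query] = results
        return results

# Usage: the lambda stands in for the embedding + vector-search call
mem = TwoTierMemory(lambda q: [f"atom matching {q!r}"])
mem.search("trip to Kyoto")  # slow path, populates the episodic map
mem.search("trip to Kyoto")  # fast path, served from the hash
```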
## LoCoMo Multi-Hop Deep Dive
Multi-hop is the hardest category (J=0.707) and the primary gap to MemMachine (J=0.897). A focused run on 43 multi-hop questions reveals the failure modes:
| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 16 | 37.2% |
| Partial evidence (found some, not all) | 12 | 27.9% |
| Extraction fail (right evidence, wrong answer) | 8 | 18.6% |
| Retrieval miss (wrong documents) | 7 | 16.3% |
Retrieval recall: 62.2%. The multi-pass retrieval strategy finds relevant context most of the time, but the evidence is spread across multiple atoms and the reader struggles to synthesize it into a single answer.
## Ongoing work
- Local reranking to remove API latency dependency
- SLoD-enabled retrieval — coarse-to-fine search using prototype hierarchies
- Multi-run variance — 10+ runs per configuration for statistical rigor
- Reader prompt optimization — targeting the extraction gap (15-60% of errors are correct retrieval with wrong answer)
## Related
- Building Memory That Scales — the story behind these numbers
- Interactive dashboard — explore the benchmark data visually
- Design Language — how the 3D visualization maps to engine concepts