LoCoMo — Long-Term Conversational Memory
LoCoMo (Maharana et al., ACL 2024) is the de facto standard for cross-session conversational memory and the headline benchmark vendors compete on. It asks whether a system can remember and reason over facts spread across long multi-session dialogues: single-hop recall, multi-hop chains, temporal ordering, and open-domain questions.
This is a short card. The full run catalog, per-cell judge prompts, dataset hashes, and the cross-provider protocol live in the interactive dashboard. For where LoCoMo sits among all memory benchmarks — and its known answer-key problems — see the AI Memory Benchmarks Field Guide.
What it tests
LoCoMo is a conversational-memory benchmark over machine-generated, persona-grounded dialogues with temporal event graphs. Questions span five categories: single-hop (recall one fact), multi-hop (chain facts across sessions), temporal (when/order), open-domain (world knowledge plus the conversation), and an adversarial category-5 set (unanswerable / should-abstain). The original paper scored answers with token-overlap F1 and ROUGE/FActScore; most downstream work — including ours — now uses an LLM-as-judge for semantic correctness, which is exactly where scores become judge-dependent.
Dataset
LoCoMo has 10 conversations, each roughly 300 turns and 9K–26K tokens, spanning up to ~35 sessions. It has 1,986 QA pairs, of which 1,540 form the non-adversarial subset conventionally reported (the rest are category-5 adversarial). The conversations fit inside a modern context window, so LoCoMo does not cleanly separate a memory system from a long-context LLM, a caveat we carry on every LoCoMo number.
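As an illustration of how the non-adversarial subset is carved out, here is a minimal sketch. It assumes each QA record is a dict with an integer `category` field and that category 5 marks the adversarial set; the field name, the helper `split_adversarial`, and the toy records are assumptions for illustration, not the harness's actual code or real LoCoMo data.

```python
ADVERSARIAL_CATEGORY = 5  # category id treated as adversarial (should-abstain)


def split_adversarial(qa_pairs):
    """Split QA records into (kept, dropped), dropping category-5 questions."""
    kept, dropped = [], []
    for qa in qa_pairs:
        if qa["category"] == ADVERSARIAL_CATEGORY:
            dropped.append(qa)
        else:
            kept.append(qa)
    return kept, dropped


# Toy records standing in for one conversation's QA list (not real LoCoMo data).
toy_qa = [
    {"question": "q1", "category": 1},
    {"question": "q2", "category": 2},
    {"question": "q3", "category": 5},
    {"question": "q4", "category": 4},
    {"question": "q5", "category": 5},
]

kept, dropped = split_adversarial(toy_qa)
print(f"kept {len(kept)} of {len(toy_qa)} questions, dropped {len(dropped)}")
```

Applying the same filter to every system before scoring is what keeps the post-filter question count identical across cells.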
Our protocol
We score answers with LLM-as-judge prompts that return a binary CORRECT / WRONG verdict per question; a cell score is the fraction of questions a judge marked correct. Reader, embedding model, judge prompts, and conversation are held identical across every cell of a matrix, so any movement is attributable to the system or the k, not the harness.
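The scoring rule above reduces to a simple fraction. The sketch below assumes verdicts arrive as the literal strings `"CORRECT"` and `"WRONG"`; the function name `cell_score` is hypothetical and the verdict list is synthetic.

```python
def cell_score(verdicts):
    """Return the fraction of verdicts equal to "CORRECT".

    `verdicts` is one judge's output for one cell: a list with one
    "CORRECT" or "WRONG" string per question.
    """
    if not verdicts:
        raise ValueError("a cell needs at least one verdict")
    unknown = set(verdicts) - {"CORRECT", "WRONG"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


# Synthetic example: 105 CORRECT out of 152 questions.
example = ["CORRECT"] * 105 + ["WRONG"] * 47
print(round(cell_score(example), 3))  # 0.691
```

Rejecting unknown labels rather than counting them as wrong is a design choice of this sketch: it surfaces a malformed judge response instead of silently lowering the score.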
Our results — the cross-system matrix
One LoCoMo conversation (conv-26), 152 questions after symmetric removal of the adversarial category (originally 199), evaluated with four LLM judges. Scores below are on the strict in-house judge (our dedicated strict-grader is harsher still). k is the retrieval top-k passed to the reader; higher is better.
| System | k=10 | k=20 | k=50 | k=100 | k=200 |
|---|---|---|---|---|---|
| Mnemoverse (research prototype) | 0.691 | 0.684 | 0.711 | 0.664 | 0.638 |
| Mnemoverse (API) | 0.296 | 0.362 | 0.408 | 0.526 | 0.539 |
| mem0 (cloud) | 0.572 | 0.625 | 0.625 | 0.664 | 0.664 |
| supermemory (cloud) | 0.342 | 0.421 | 0.651 | 0.638 | 0.645 |
| zep (cloud) | 0.217 | 0.355 | 0.480 | — | — |
Two Mnemoverse rows appear because both configurations are real:
- Research prototype: the internal engine evaluation (in-process call into MemoryEngine), full algorithmic stack, strategy=None with engine-config defaults.
- API: the version available today at core.mnemoverse.com (thinner request/response contract), invoked with two_pass=True and strategy="auto", routing through StrategyClassifier on the query text only (no qa.category, evidence, or dia_id leakage; verified by grep on the adapter).
The spread between the two rows is engineering work in flight, not a fixed gap: as features move from the research path into the public API, the API row converges toward the prototype. The em-dashes on the zep row mark cells where every query returned empty after the service quota was exhausted mid-sweep; we publish a single contiguous sweep under identical conditions rather than splice in an earlier run.
conv-47 update pending. The live dashboard has moved to a fresher conv-47 matrix (190 questions, all systems graded on the full set including the adversarial category — the symmetric fix for conv-26's removal asymmetry). This card will adopt conv-47 once the presentation is settled; until then it shows the committed conv-26 matrix that matches the frozen 2026-06-08 baseline.
Multi-judge view
The same 152 questions were also scored by mem0, gpt-4o-as-judge, and a strict-grader variant:
| Judge | Tendency |
|---|---|
| Mnemoverse (in-house) | strict — off-by-one dates, partial enumerations, and lost qualifiers fail |
| mem0 | lenient — Mem0's "be generous, same topic = correct" rubric; accepts paraphrases and partial credit our judge rejects |
| gpt-4o | lenient — passes partial answers and date drift our judge rejects |
| strict-grader | very strict — penalises any token outside ground truth |
We publish the strict in-house judge as the headline; the others are recorded per cell for reproducibility. The two lenient judges (mem0 and gpt-4o) inflate scores by ~5–20 points on the same answers, mostly by accepting wrong dates or half-recalled lists. Why a single LoCoMo number is fragile, and how a prompt swap alone moves it ~40 points, is the subject of Judges, Good and Evil.
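To make the judge comparison concrete, here is a minimal sketch of how a per-judge gap in points could be computed from the same set of answers. The judge labels, the helper names, and the ten toy verdicts per judge are all invented for illustration; they are not values from the matrix.

```python
def judge_scores(verdicts_by_judge):
    """Map each judge to its fraction of CORRECT verdicts."""
    return {
        judge: sum(v == "CORRECT" for v in verdicts) / len(verdicts)
        for judge, verdicts in verdicts_by_judge.items()
    }


def gap_in_points(scores, headline):
    """Score difference of every other judge versus the headline judge, in points."""
    base = scores[headline]
    return {
        judge: round((score - base) * 100, 1)
        for judge, score in scores.items()
        if judge != headline
    }


# Toy verdicts on the same ten answers (synthetic, not benchmark data).
toy = {
    "strict": ["CORRECT"] * 6 + ["WRONG"] * 4,
    "lenient": ["CORRECT"] * 8 + ["WRONG"] * 2,
    "very_strict": ["CORRECT"] * 5 + ["WRONG"] * 5,
}

scores = judge_scores(toy)
print(gap_in_points(scores, headline="strict"))
```

A positive gap means the judge is more lenient than the headline judge on identical answers; a negative gap means it is harsher.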
Known asymmetries (disclosed)
Per our methodology, we disclose measurement asymmetries rather than strip them to manufacture parity. After the PR #290 ASYM-024 closure (HTTP adapter now sends two_pass + strategy symmetric with the in-process baseline), an asym=0.25 residual gap persists and its direction now favours the API (HTTP) row: the classifier preset (PPR / gap_filling / entity_chain) can engage on HTTP but not on the strategy=None in-process baseline. Documented algorithm advantages on the Mnemoverse side (Hebbian feedback, two-pass, reranker, μ=0 ingest, between-session consolidation, edge pruning) are kept and disclosed, not hidden. Read the matrix against this caveat.
Reproducibility
All 25 cells (five systems × five k) — with judge prompts, dataset hash, judge-prompt hashes, embedding model, reader model, post-filter question count, and the pinned config.git_sha — are committed as JSON under experiments/benchmarks/matrix/cells/ in mnemoverse-core, alongside BASELINE_FROZEN_2026-06-08.md. The HTTP (API) row requires a core deploy exposing POST /api/v1/memory/read-batch accepting two_pass / strategy (post-PR #290); older deploys reproduce the pre-closure asymmetry.
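One quick sanity check before reproducing a matrix is to confirm that every cell file was produced from the same pinned commit. The sketch below assumes each cell is a JSON object with a nested `config` object containing `git_sha`; that layout is inferred from the name config.git_sha and may not match the committed files, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def collect_git_shas(cells_dir):
    """Return {file name: git_sha} for every *.json cell in cells_dir."""
    shas = {}
    for path in sorted(Path(cells_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            cell = json.load(fh)
        # Assumed layout: {"config": {"git_sha": "..."}, ...}
        shas[path.name] = cell["config"]["git_sha"]
    return shas


def cells_share_one_sha(cells_dir):
    """True only if at least one cell exists and all cells pin the same git_sha."""
    return len(set(collect_git_shas(cells_dir).values())) == 1
```

Usage would look like `cells_share_one_sha("experiments/benchmarks/matrix/cells")` from a checkout of the repository; an empty directory returns False so that a wrong path is not mistaken for a consistent matrix.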
Links
- Live, cell-by-cell matrix: benchmarks.mnemoverse.com
- Benchmarks overview — the hub and the other benchmark cards.
Related
- AI Memory Benchmarks: A Field Guide — where LoCoMo sits among all memory benchmarks, and its answer-key problems.
- Judges, Good and Evil — why a single LoCoMo number depends on the grader.
- How We Measure AI Memory Honestly — the discipline behind these numbers.
- Building Memory That Scales — the engineering story behind the engine these numbers measure.