LoCoMo — Long-Term Conversational Memory

LoCoMo (Maharana et al., ACL 2024) is the de-facto standard for cross-session conversational memory and the headline benchmark vendors compete on. It asks whether a system can remember and reason over facts spread across long multi-session dialogues — single-hop recall, multi-hop chains, temporal ordering, and open-domain questions.

This is a short card. The full run catalog, per-cell judge prompts, dataset hashes, and the cross-provider protocol live in the interactive dashboard. For where LoCoMo sits among all memory benchmarks — and its known answer-key problems — see the AI Memory Benchmarks Field Guide.

What it tests

LoCoMo is a conversational-memory benchmark over machine-generated, persona-grounded dialogues with temporal event graphs. Questions span five categories: single-hop (recall one fact), multi-hop (chain facts across sessions), temporal (when/order), open-domain (world knowledge plus the conversation), and an adversarial category-5 set (unanswerable / should-abstain). The original paper scored QA answers with token-overlap F1 and used ROUGE and FActScore for its event-summarization task; most downstream work, including ours, now uses an LLM-as-judge for semantic correctness, which makes every score depend on the judge and its prompt.

Dataset

10 conversations, each roughly 300 turns and 9K–26K tokens across up to ~35 sessions, with 1,986 QA pairs in total. Of these, ~1,540 form the non-adversarial subset conventionally reported; the rest are category-5 adversarial. The conversations fit inside a modern context window, so LoCoMo does not cleanly separate a memory system from a long-context LLM. We carry that caveat on every LoCoMo number.
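The split between the full set and the conventionally reported subset comes down to one filter on the question category. The sketch below assumes each QA record is a dict with an integer `category` field where 5 marks the adversarial set; the record layout and the toy questions are illustrative, not the real dataset.

```python
from collections import Counter

ADVERSARIAL_CATEGORY = 5  # category-5 = unanswerable / should-abstain


def drop_adversarial(qa_items):
    """Return only the questions outside the adversarial category."""
    return [qa for qa in qa_items if qa["category"] != ADVERSARIAL_CATEGORY]


# Toy records for illustration only; real items carry more fields.
sample = [
    {"question": "Where did Ana move?", "category": 1},
    {"question": "Who met Ana after she moved?", "category": 2},
    {"question": "When did Ana change jobs?", "category": 3},
    {"question": "What is the capital of Ana's new country?", "category": 4},
    {"question": "What is the name of Ana's submarine?", "category": 5},
    {"question": "Which award did Ana never mention?", "category": 5},
]

kept = drop_adversarial(sample)
print(f"{len(sample)} questions -> {len(kept)} after removing category 5")
print(Counter(qa["category"] for qa in kept))
```

Applying the same filter to every system before scoring keeps the question count identical across the cells being compared.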

Our protocol

We score answers with LLM-as-judge prompts that return a binary CORRECT / WRONG verdict per question; a cell score is the fraction of questions a judge marked correct. Reader, embedding model, judge prompts, and conversation are held identical across every cell of a matrix, so any movement is attributable to the system or the k, not the harness.
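As a minimal sketch of the scoring rule above, a cell score is a plain fraction over binary verdicts. The function name and the verdict counts are illustrative assumptions; only the CORRECT / WRONG labels come from the protocol.

```python
VALID_VERDICTS = {"CORRECT", "WRONG"}


def cell_score(verdicts):
    """Fraction of questions the judge marked CORRECT."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        # Refuse to score silently if the judge returned anything else.
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


# Illustrative counts: 105 CORRECT out of 152 questions.
verdicts = ["CORRECT"] * 105 + ["WRONG"] * 47
print(round(cell_score(verdicts), 3))
```

Rejecting unknown labels instead of counting them as WRONG keeps a malformed judge response from quietly lowering a score.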

Our results — the cross-system matrix

One LoCoMo conversation (conv-26), 152 questions after symmetric removal of the adversarial category (originally 199), evaluated with four LLM judges. Scores below are on the strict in-house judge (our dedicated strict-grader is harsher still). k is the retrieval top-k passed to the reader; higher is better.

System                          | k=10  | k=20  | k=50  | k=100 | k=200
Mnemoverse (research prototype) | 0.691 | 0.684 | 0.711 | 0.664 | 0.638
Mnemoverse (API)                | 0.296 | 0.362 | 0.408 | 0.526 | 0.539
mem0 (cloud)                    | 0.572 | 0.625 | 0.625 | 0.664 | 0.664
supermemory (cloud)             | 0.342 | 0.421 | 0.651 | 0.638 | 0.645
zep (cloud)                     | 0.217 | 0.355 | 0.480 | —     | —

Two Mnemoverse rows appear because both configurations are real:

  • Research prototype — the internal engine evaluation (in-process call into MemoryEngine), full algorithmic stack, strategy=None with engine-config defaults.
  • API — the version available today at core.mnemoverse.com (thinner request/response contract), invoked with two_pass=True and strategy="auto", routing through StrategyClassifier on the query text only (no qa.category, evidence, or dia_id leakage — verified by grep on the adapter).
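The no-leakage property of the API configuration can be sketched as a request builder that copies only the question text, plus a key walk that fails if any gold-label field slips in. The body field names `queries` and `top_k`, and both helper functions, are hypothetical; only `two_pass`, `strategy`, and the three forbidden field names come from the description above. Nothing is sent over the network here.

```python
FORBIDDEN_KEYS = {"category", "evidence", "dia_id"}  # gold-label fields


def build_read_request(qa_items, top_k):
    """Build a request body that carries only query text and retrieval settings."""
    return {
        "queries": [qa["question"] for qa in qa_items],  # hypothetical field name
        "top_k": top_k,                                   # hypothetical field name
        "two_pass": True,
        "strategy": "auto",
    }


def iter_keys(obj):
    """Yield every dict key found anywhere inside a nested payload."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield key
            yield from iter_keys(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_keys(item)


def leaked_keys(payload):
    """Return the forbidden keys present in the payload, sorted."""
    return sorted(FORBIDDEN_KEYS & set(iter_keys(payload)))


qa_items = [
    {"question": "When did Ana change jobs?", "category": 3,
     "evidence": ["D4:12"], "dia_id": "D4:12"},
]
payload = build_read_request(qa_items, top_k=20)
print(payload)
print("leaked:", leaked_keys(payload))
```

A structural check like this complements a grep on the adapter: it inspects the payload that would actually be sent rather than the source text.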

The spread between the two rows is engineering work in flight, not a fixed gap: as features move from the research path into the public API, the API row converges toward the prototype. The em-dashes on the zep row mark cells where every query returned empty after the service quota was exhausted mid-sweep; we publish a single contiguous sweep under identical conditions rather than splice in an earlier run.
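The spread between the two Mnemoverse rows can be read off the matrix directly. This short sketch uses the published conv-26 scores on the strict in-house judge; the variable names are ours.

```python
K_VALUES = [10, 20, 50, 100, 200]

# Scores copied from the cross-system matrix (strict in-house judge).
prototype = [0.691, 0.684, 0.711, 0.664, 0.638]
api = [0.296, 0.362, 0.408, 0.526, 0.539]

# Per-k gap between the research prototype and the public API.
spread = {
    k: round(p - a, 3)
    for k, p, a in zip(K_VALUES, prototype, api)
}

for k, gap in spread.items():
    print(f"k={k:<3} prototype - API = {gap:.3f}")
```

On these numbers the gap narrows from 0.395 at k=10 to 0.099 at k=200, so the two configurations differ most at small k.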

conv-47 update pending. The live dashboard has moved to a fresher conv-47 matrix (190 questions, all systems graded on the full set including the adversarial category — the symmetric fix for conv-26's removal asymmetry). This card will adopt conv-47 once the presentation is settled; until then it shows the committed conv-26 matrix that matches the frozen 2026-06-08 baseline.

Multi-judge view

The same 152 questions were also scored with three other judges: Mem0's judge rubric, gpt-4o-as-judge, and a strict-grader variant:

Judge                 | Tendency
Mnemoverse (in-house) | strict: off-by-one dates, partial enumerations, and lost qualifiers fail
mem0                  | lenient: Mem0's "be generous, same topic = correct" rubric; accepts paraphrases and partial credit our judge rejects
gpt-4o                | lenient: passes partial answers and date drift our judge rejects
strict-grader         | very strict: penalises any token outside ground truth

We publish the strict in-house judge as the headline. The other verdicts are recorded per cell for reproducibility, but the lenient judges score ~5–20 points higher on the same answers, mostly by accepting wrong dates or half-recalled lists. Why a single LoCoMo number is fragile, and how a prompt swap alone moves it ~40 points, is the subject of Judges, Good and Evil.
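To make the judge dependence concrete, the sketch below scores the same answers under several judges and reports each judge's offset from the headline judge in percentage points. The judge keys and the five toy verdicts are illustrative assumptions, not recorded results.

```python
def score(verdicts):
    """Fraction of verdicts equal to CORRECT."""
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


def offset_vs_headline(verdicts_by_judge, headline):
    """Points (0-100 scale) each other judge sits above or below the headline judge."""
    base = score(verdicts_by_judge[headline])
    return {
        judge: round(100 * (score(verdicts) - base), 1)
        for judge, verdicts in verdicts_by_judge.items()
        if judge != headline
    }


# Toy verdicts for five answers, one list per judge (same question order).
toy = {
    "in_house":      ["CORRECT", "WRONG",   "WRONG", "CORRECT", "WRONG"],
    "lenient":       ["CORRECT", "CORRECT", "WRONG", "CORRECT", "WRONG"],
    "strict_grader": ["CORRECT", "WRONG",   "WRONG", "WRONG",   "WRONG"],
}

for judge, points in offset_vs_headline(toy, headline="in_house").items():
    print(f"{judge}: {points:+.1f} points vs in_house")
```

Because every judge sees identical answers, any non-zero offset here is produced by the rubric alone, not by the memory system.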

Known asymmetries (disclosed)

Per our methodology, we disclose measurement asymmetries rather than strip them to manufacture parity. After the PR #290 ASYM-024 closure (HTTP adapter now sends two_pass + strategy symmetric with the in-process baseline), an asym=0.25 residual gap persists and its direction now favours the API (HTTP) row: the classifier preset (PPR / gap_filling / entity_chain) can engage on HTTP but not on the strategy=None in-process baseline. Documented algorithm advantages on the Mnemoverse side (Hebbian feedback, two-pass, reranker, μ=0 ingest, between-session consolidation, edge pruning) are kept and disclosed, not hidden. Read the matrix against this caveat.

Reproducibility

All 25 cells (five systems × five k) — with judge prompts, dataset hash, judge-prompt hashes, embedding model, reader model, post-filter question count, and the pinned config.git_sha — are committed as JSON under experiments/benchmarks/matrix/cells/ in mnemoverse-core, alongside BASELINE_FROZEN_2026-06-08.md. The HTTP (API) row requires a core deploy exposing POST /api/v1/memory/read-batch accepting two_pass / strategy (post-PR #290); older deploys reproduce the pre-closure asymmetry.
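A reader reproducing the matrix may want to confirm that the committed cells really share the same harness settings. The sketch below assumes each cell is a JSON object; the key names in `INVARIANT_KEYS`, the toy cell contents, and both helper functions are hypothetical and should be adjusted to the actual cell schema.

```python
import json
from pathlib import Path

# Hypothetical key names for fields that should not vary between cells.
INVARIANT_KEYS = ("dataset_hash", "reader_model", "embedding_model", "question_count")


def load_cells(directory):
    """Load every *.json file in a directory as one cell record."""
    return [json.loads(p.read_text()) for p in sorted(Path(directory).glob("*.json"))]


def find_drift(cells, keys=INVARIANT_KEYS):
    """Return {key: distinct values} for each key that differs between cells."""
    drift = {}
    for key in keys:
        # Serialize so that unhashable values (lists, dicts) can be compared.
        values = {json.dumps(cell.get(key), sort_keys=True) for cell in cells}
        if len(values) > 1:
            drift[key] = sorted(values)
    return drift


# Toy cells for illustration; real cells would come from load_cells(...).
cells = [
    {"system": "a", "k": 10, "dataset_hash": "abc", "reader_model": "r1",
     "embedding_model": "e1", "question_count": 152},
    {"system": "b", "k": 10, "dataset_hash": "abc", "reader_model": "r1",
     "embedding_model": "e1", "question_count": 150},
]
print(find_drift(cells))
```

An empty result means the listed fields are identical across all cells; any entry names a field that would undermine a like-for-like comparison.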