LoCoMo — Long-Term Conversational Memory

LoCoMo (Maharana et al., ACL 2024) is the de-facto standard for cross-session conversational memory and the headline benchmark vendors compete on. It asks whether a system can remember and reason over facts spread across long multi-session dialogues — single-hop recall, multi-hop chains, temporal ordering, and open-domain questions.

This is a short card. The full run catalog, per-cell judge prompts, dataset hashes, and the cross-provider protocol live in the interactive dashboard. For where LoCoMo sits among all memory benchmarks — and its known answer-key problems — see the AI Memory Benchmarks Field Guide.

What it tests

LoCoMo is a conversational-memory benchmark over machine-generated, persona-grounded dialogues with temporal event graphs. Questions span five categories: single-hop (recall one fact), multi-hop (chain facts across sessions), temporal (when/order), open-domain (world knowledge plus the conversation), and an adversarial category-5 set (unanswerable / should-abstain). The original paper scored QA answers with token-overlap F1 and used ROUGE and FActScore for its event-summarization task; most downstream work, including ours, now uses an LLM-as-judge for semantic correctness, which is exactly where scores become judge-dependent.

Dataset

10 conversations, roughly 300 turns and 9K–26K tokens each, up to ~35 sessions; 1,986 QA pairs in total, of which 1,540 form the non-adversarial subset conventionally reported (the remaining 446 are category-5 adversarial). The conversations fit inside a modern context window, so LoCoMo does not cleanly separate a memory system from a long-context LLM. We carry that caveat on every LoCoMo number.
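The adversarial split can be sketched as a simple filter. This is a minimal illustration, assuming each QA item is a dict carrying an integer `category` field in which 5 marks the adversarial set; the helper name and the sample items are hypothetical and not taken from our harness.

```python
ADVERSARIAL_CATEGORY = 5  # category-5 = unanswerable / should-abstain


def split_adversarial(qa_items):
    """Return (kept, removed): non-adversarial items and category-5 items."""
    kept = [qa for qa in qa_items if qa["category"] != ADVERSARIAL_CATEGORY]
    removed = [qa for qa in qa_items if qa["category"] == ADVERSARIAL_CATEGORY]
    return kept, removed


# Made-up items, for illustration only.
sample = [
    {"question": "Where did A move to?", "category": 1},
    {"question": "When did B start the new job?", "category": 2},
    {"question": "What is C's dog called?", "category": 5},
]

kept, removed = split_adversarial(sample)
print(len(kept), len(removed))
```

Reporting the post-filter count next to every score makes it visible which subset a number was computed on.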

Our protocol

We score answers with LLM-as-judge prompts that return a binary CORRECT / WRONG verdict per question; a cell score is the fraction of questions a judge marked correct. Reader, embedding model, judge prompts, and conversation are held identical across every cell of a matrix, so any movement is attributable to the system or the k, not the harness.
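The cell score described above reduces to a single fraction. The sketch below assumes verdicts arrive as a list of the strings `CORRECT` and `WRONG`; the function name is illustrative, and the 105-of-152 split is a hypothetical count chosen only to show the rounding.

```python
def cell_score(verdicts):
    """Fraction of questions that one judge marked CORRECT."""
    if not verdicts:
        raise ValueError("cell has no graded questions")
    unknown = set(verdicts) - {"CORRECT", "WRONG"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


# Hypothetical cell: 105 of 152 questions judged correct.
verdicts = ["CORRECT"] * 105 + ["WRONG"] * 47
print(round(cell_score(verdicts), 3))  # 0.691
```

Rejecting anything other than the two expected verdicts keeps a malformed judge response from silently counting as a wrong answer.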

Our results — the cross-system matrix

One LoCoMo conversation (conv-26), 152 questions after symmetric removal of the adversarial category (originally 199), evaluated with four LLM judges. Scores below are on the strict in-house judge (our dedicated strict-grader is harsher still). k is the retrieval top-k passed to the reader; higher is better.

System	k=10	k=20	k=50	k=100	k=200
Mnemoverse (research prototype)	0.691	0.684	0.711	0.664	0.638
Mnemoverse (API)	0.296	0.362	0.408	0.526	0.539
mem0 (cloud)	0.572	0.625	0.625	0.664	0.664
supermemory (cloud)	0.342	0.421	0.651	0.638	0.645
zep (cloud)	0.217	0.355	0.480	—	—

Two Mnemoverse rows appear because both configurations are real:

Research prototype — the internal engine evaluation (in-process call into MemoryEngine), full algorithmic stack, strategy=None with engine-config defaults.
API — the version available today at core.mnemoverse.com (thinner request/response contract), invoked with two_pass=True and strategy="auto", routing through StrategyClassifier on the query text only (no qa.category, evidence, or dia_id leakage — verified by grep on the adapter).
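The no-leakage property of the API row can be illustrated with a request builder that only ever reads the question text. This is a sketch under assumptions: the payload keys `query` and `k` and both helper names are hypothetical, and only `two_pass` and `strategy` come from the description above. It is not the actual adapter.

```python
# Ground-truth fields that must never reach the system under test.
LEAKY_FIELDS = {"category", "evidence", "dia_id"}


def build_read_request(qa_item, k):
    """Build one read request from the question text only."""
    return {
        "query": qa_item["question"],  # hypothetical key name
        "k": k,                        # hypothetical key name
        "two_pass": True,
        "strategy": "auto",
    }


def check_no_leakage(request):
    """Raise if any ground-truth field made it into the request."""
    leaked = LEAKY_FIELDS & set(request)
    if leaked:
        raise ValueError(f"leaked ground-truth fields: {sorted(leaked)}")
    return request


# Made-up QA item, for illustration only.
qa_item = {
    "question": "When did A adopt the dog?",
    "category": 2,
    "evidence": ["D3:14"],
}
request = check_no_leakage(build_read_request(qa_item, k=10))
print(sorted(request))
```

A runtime check like this complements a grep on the adapter source, because it also catches fields added through dict merging.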

The spread between the two rows is engineering work in flight, not a fixed gap: as features move from the research path into the public API, the API row converges toward the prototype. The em-dashes on the zep row mark cells where every query returned empty after the service quota was exhausted mid-sweep; we publish a single contiguous sweep under identical conditions rather than splice in an earlier run.
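The size of that spread at each k can be read straight off the matrix. The snippet below only subtracts the two Mnemoverse rows as published above; the variable names are illustrative.

```python
K_VALUES = [10, 20, 50, 100, 200]
PROTOTYPE = [0.691, 0.684, 0.711, 0.664, 0.638]  # research prototype row
API = [0.296, 0.362, 0.408, 0.526, 0.539]        # API row

# Prototype minus API, rounded to the three decimals the matrix uses.
spread = {k: round(p - a, 3) for k, p, a in zip(K_VALUES, PROTOTYPE, API)}
print(spread)
# {10: 0.395, 20: 0.322, 50: 0.303, 100: 0.138, 200: 0.099}
```

On this sweep the gap narrows from about 40 points at k=10 to about 10 points at k=200.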

conv-47 update pending. The live dashboard has moved to a fresher conv-47 matrix (190 questions, all systems graded on the full set including the adversarial category — the symmetric fix for conv-26's removal asymmetry). This card will adopt conv-47 once the presentation is settled; until then it shows the committed conv-26 matrix that matches the frozen 2026-06-08 baseline.

Multi-judge view

The same 152 questions were also scored by mem0, gpt-4o-as-judge, and a strict-grader variant:

Judge	Tendency
Mnemoverse (in-house)	strict — off-by-one dates, partial enumerations, and lost qualifiers fail
mem0	lenient — Mem0's "be generous, same topic = correct" rubric; accepts paraphrases and partial credit our judge rejects
`gpt-4o`	lenient — passes partial answers and date drift our judge rejects
strict-grader	very strict — penalises any token outside ground truth

We publish the strict in-house judge as the headline; the others are recorded per cell for reproducibility but inflate by ~5–20 points on the same answers, mostly by accepting wrong dates or half-recalled lists. Why a single LoCoMo number is fragile — and how a prompt swap alone moves it ~40 points — is the subject of Judges, Good and Evil.
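Judge inflation is the difference, in points, between another judge's score and the headline judge's score on the same answers. A minimal sketch follows, with hypothetical verdicts over ten answers (not our data) and illustrative function names.

```python
def score(verdicts):
    """Fraction of answers marked CORRECT."""
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


def inflation_points(by_judge, headline):
    """Points each other judge sits above (+) or below (-) the headline judge."""
    base = score(by_judge[headline])
    return {
        judge: round(100 * (score(verdicts) - base), 1)
        for judge, verdicts in by_judge.items()
        if judge != headline
    }


# Hypothetical verdicts on the same ten answers.
by_judge = {
    "in-house": ["CORRECT"] * 6 + ["WRONG"] * 4,
    "mem0": ["CORRECT"] * 8 + ["WRONG"] * 2,
    "strict-grader": ["CORRECT"] * 5 + ["WRONG"] * 5,
}
print(inflation_points(by_judge, "in-house"))
# {'mem0': 20.0, 'strict-grader': -10.0}
```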

Known asymmetries (disclosed)

Per our methodology, we disclose measurement asymmetries rather than strip them to manufacture parity. After the PR #290 ASYM-024 closure (HTTP adapter now sends two_pass + strategy symmetric with the in-process baseline), an asym=0.25 residual gap persists and its direction now favours the API (HTTP) row: the classifier preset (PPR / gap_filling / entity_chain) can engage on HTTP but not on the strategy=None in-process baseline. Documented algorithm advantages on the Mnemoverse side (Hebbian feedback, two-pass, reranker, μ=0 ingest, between-session consolidation, edge pruning) are kept and disclosed, not hidden. Read the matrix against this caveat.

Reproducibility

All 25 cells (five systems × five k) — with judge prompts, dataset hash, judge-prompt hashes, embedding model, reader model, post-filter question count, and the pinned config.git_sha — are committed as JSON under experiments/benchmarks/matrix/cells/ in mnemoverse-core, alongside BASELINE_FROZEN_2026-06-08.md. The HTTP (API) row requires a core deploy exposing POST /api/v1/memory/read-batch accepting two_pass / strategy (post-PR #290); older deploys reproduce the pre-closure asymmetry.
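To show how a cell can pin its inputs, here is a sketch that hashes the dataset and judge prompts with SHA-256 and assembles a record. The JSON key names, the function names, and the placeholder values are assumptions for illustration; the committed cell files may use a different schema.

```python
import hashlib
import json


def sha256_hex(text):
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_cell_record(system, k, score, dataset_text, judge_prompts,
                     n_questions, git_sha):
    """Assemble one matrix cell with the hashes needed to reproduce it."""
    return {
        "system": system,
        "k": k,
        "score": score,
        "dataset_sha256": sha256_hex(dataset_text),
        "judge_prompt_sha256": {
            name: sha256_hex(prompt) for name, prompt in judge_prompts.items()
        },
        "post_filter_question_count": n_questions,
        "config": {"git_sha": git_sha},
    }


# Placeholder inputs, for illustration only.
record = make_cell_record(
    system="example-system",
    k=10,
    score=0.5,
    dataset_text="placeholder dataset contents",
    judge_prompts={"in-house": "placeholder judge prompt"},
    n_questions=152,
    git_sha="0000000",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Hashing the judge prompt alongside the dataset matters because a prompt change alone can move the score.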
