
LongMemEval — Long-Term Interactive Memory

LongMemEval (Wu et al., ICLR 2025) measures whether a chat assistant can remember information across many sessions. The paper reports that commercial assistants and long-context LLMs drop about 30% in accuracy when they have to recall facts spread across sustained interactions — which is exactly the gap a memory engine is supposed to close.

This is a short card. The full run catalog, per-category scores, and the cross-provider protocol live in the interactive dashboard. Below is what the benchmark tests, how we run it, what we have measured so far, and where our numbers are not yet settled.

What it tests

LongMemEval is a long-horizon conversational memory benchmark: 500 instances, each a question over a long history of prior chat sessions. The question types stress the parts of memory that break under length:

  • temporal-reasoning — ordering events across sessions.
  • multi-session — aggregating information that is scattered across several sessions.
  • knowledge-update — using the latest value when a later session contradicts an earlier one.
  • single-session-user / -assistant / -preference — recalling a fact, an assistant statement, or a stated preference from one session buried in the history.
  • abstention — refusing to answer false-premise questions (the protocol describes this as a 7th type, but the variant-s runs we cite contained no abstention instances; see Known asymmetries).

The retrieval task is hard because the answer evidence is a few turns inside tens of thousands of irrelevant ones.

Dataset

  • Paper: arXiv:2410.10813 · Dataset: xiaowu0162/longmemeval-cleaned.
  • Variants: oracle (1–6 sessions, ~2K tokens — easy, not paper-comparable), s (≈38–62 sessions, ~115K tokens — the paper-comparable variant), m (~500 sessions, ~1.5M tokens).
  • The variant we report (s): 500 instances, averaging 47.7 sessions and 493.5 turns per instance (246,750 turns total), measured from our own run-time view of the dataset.
  • Judge: the paper uses a type-specific LLM-as-Judge (gpt-4o), reported at 97% human agreement. We port that judge prompt verbatim. (We have not independently re-verified the 97% figure from the paper body.)

Our protocol

For an apples-to-apples comparison across providers, the canonical protocol fixes:

  • reader = gpt-5, judge = gpt-5 with a fixed binary judge rubric, so every system is scored by the same model on the same prompt.
  • full variant-s (500 instances) — no oracle or toy subset.
  • μ = 0 ingest for our engine, with per-instance isolation (a fresh engine per instance — no cross-instance leakage).
  • scoring = binary LLM-as-judge.
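
The fixed settings and the per-instance isolation rule can be sketched as follows. This is a minimal illustration, not our harness: `ProtocolConfig`, `ToyEngine`, and `run_instance` are hypothetical names, and the toy engine only stores turns in a list so the isolation property is visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolConfig:
    """Hypothetical container for the fixed protocol settings listed above."""
    reader: str = "gpt-5"
    judge: str = "gpt-5"
    variant: str = "s"
    n_instances: int = 500
    mu: float = 0.0          # the "μ = 0 ingest" setting
    scoring: str = "binary"  # binary LLM-as-judge


class ToyEngine:
    """Hypothetical stand-in for a memory engine: it just stores turns."""

    def __init__(self, mu: float) -> None:
        self.mu = mu
        self.turns: list[str] = []

    def ingest(self, sessions: list[list[str]]) -> None:
        for session in sessions:
            self.turns.extend(session)


def run_instance(instance: dict, config: ProtocolConfig) -> int:
    # A fresh engine per instance: nothing ingested for one instance
    # can be retrieved while answering another.
    engine = ToyEngine(mu=config.mu)
    engine.ingest(instance["sessions"])
    return len(engine.turns)


# Two made-up instances, for illustration only.
instances = [
    {"sessions": [["hi", "hello"], ["I moved to Oslo"]]},
    {"sessions": [["what is my cat's name?"]]},
]
config = ProtocolConfig()
turn_counts = [run_instance(inst, config) for inst in instances]
```

Because the engine is constructed inside `run_instance`, the second instance sees only its own single turn rather than the three turns ingested for the first.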

The cross-provider leaderboard under this protocol is still being filled in (it depends on the multi-provider harness). Our existing variant-s judge runs used gpt-5-mini as both reader and judge, not gpt-5, so they use the paper-comparable variant but do not follow this protocol exactly. They are the closest measured numbers we have, and they come with an unresolved discrepancy between the two runs.

Our results — measured, not yet in the live matrix

Caveated numbers — read before quoting

LongMemEval has no committed matrix cell in our benchmark system. The numbers below live only in raw run JSONs and have a known provenance gap: no frozen judge-prompt hash, so the two runs' judges cannot be reconciled. Treat them as measured, not adjudicated. They are not a headline leaderboard claim, and we do not present a single "LongMemEval number" for Mnemoverse.

We have two variant-s judge runs that share configuration (variant s, gpt-5-mini reader+judge, top_k 20, 1536-d API embeddings) yet disagree sharply:

| Run | Overall judge score (stored) | Per-category judge score (avg) |
| --- | --- | --- |
| 0327 (longmemeval_20260327_235944.json) | ≈0.62 | single-session-user 0.757 · knowledge-update 0.727 · single-session-assistant 0.661 · multi-session 0.649 · temporal-reasoning 0.534 · single-session-preference 0.20 |
| 0323 (longmemeval_20260323_130557.json) | ≈0.79 | single-session-user 0.942 · single-session-assistant 0.938 · knowledge-update 0.820 · temporal-reasoning 0.777 · multi-session 0.705 · single-session-preference 0.565 |

Each run's overall score is stored directly as diagnostics.overall_judge_score (0.6197 for 0327 and 0.792 for 0323); the per-category column lists the stored per-category judge scores.
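
As a quick sanity check on these figures, the sketch below compares each stored overall score with the unweighted mean of its six per-category scores. The unweighted mean is an assumption made for illustration; this card does not state how the stored overall is aggregated, and it may instead be weighted by the number of instances per category.

```python
# Per-category judge scores copied from the table above.
per_category = {
    "0327": {
        "single-session-user": 0.757,
        "knowledge-update": 0.727,
        "single-session-assistant": 0.661,
        "multi-session": 0.649,
        "temporal-reasoning": 0.534,
        "single-session-preference": 0.20,
    },
    "0323": {
        "single-session-user": 0.942,
        "single-session-assistant": 0.938,
        "knowledge-update": 0.820,
        "temporal-reasoning": 0.777,
        "multi-session": 0.705,
        "single-session-preference": 0.565,
    },
}

# Stored diagnostics.overall_judge_score values for each run.
stored_overall = {"0327": 0.6197, "0323": 0.792}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean across categories (every category counts equally)."""
    return sum(scores.values()) / len(scores)


macro = {run: macro_average(scores) for run, scores in per_category.items()}
gap = {run: stored_overall[run] - macro[run] for run in macro}

for run in macro:
    print(f"{run}: macro={macro[run]:.4f} stored={stored_overall[run]:.4f} gap={gap[run]:+.4f}")
```

The unweighted means come out at about 0.588 for 0327 and about 0.791 for 0323, so the stored overall sits close to the macro average on 0323 and roughly three points above it on 0327. A gap like that is what instance-weighted aggregation would produce when a low-scoring category is small, but that reading is a guess until the aggregation is confirmed.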

Why we show both and pick neither

The two runs use the same reader, the same retrieval depth, and the same embeddings on the structured config — yet the 0323 run reports much higher per-category judge scores while its token-F1 is lower (e.g. single-session-user F1 0.154 on 0323 vs 0.521 on 0327). Higher judge accuracy paired with lower token overlap is the signature of a more lenient judge pass: the judge accepted answers the stricter pass rejected, without the answers actually matching the references more closely.
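
The signature described above can be made concrete with a small sketch. The `token_f1` function here is the common whitespace-token F1 used in QA evaluation, which is an assumption: this card does not specify the tokenization or normalization used in our runs. `looks_more_lenient` is a hypothetical helper that only encodes the "higher judge score, lower token-F1" pattern.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def looks_more_lenient(judge_a: float, f1_a: float, judge_b: float, f1_b: float) -> bool:
    """True when run A has a higher judge score but a lower token-F1 than run B."""
    return judge_a > judge_b and f1_a < f1_b


# single-session-user: run 0323 (judge 0.942, F1 0.154) vs run 0327 (judge 0.757, F1 0.521).
flag = looks_more_lenient(0.942, 0.154, 0.757, 0.521)
example_f1 = token_f1("the cat sat", "the cat")
```

For the single-session-user figures quoted above, the check flags 0323 as the more lenient-looking pass. It is a heuristic: a run that produces longer, more verbose answers can also lower token-F1 without any change in the judge.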

This is the same leniency effect documented in The Judge Says Yes Too Easily — an LLM judge's strictness can swing the headline number by tens of points without the underlying retrieval changing at all. Without a frozen judge-prompt hash, we cannot say which pass is the "correct" one. So we present the ≈0.62 / ≈0.79 band and assert neither as canonical. When the gpt-5 protocol run lands with a frozen judge, it will replace both.
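
One way to freeze a judge is to store a digest of the judge model name and prompt alongside each run's scores, so that two runs can later be checked for an identical judge. The sketch below assumes SHA-256 over the model name plus the prompt template; the prompt strings are placeholders rather than the real judge prompt, and `judge_prompt_sha256` is a hypothetical field name, not one our run JSONs currently contain.

```python
import hashlib


def judge_prompt_hash(prompt_template: str, judge_model: str) -> str:
    """SHA-256 digest over the judge model name and the exact prompt text."""
    payload = f"{judge_model}\n{prompt_template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Placeholder prompts, for illustration only.
prompt_v1 = "Answer yes or no: does the response match the reference?"
prompt_v2 = prompt_v1 + " Be strict."

h1 = judge_prompt_hash(prompt_v1, "gpt-5")
h2 = judge_prompt_hash(prompt_v2, "gpt-5")

# Hypothetical run record: the digest sits next to the stored overall score.
run_record = {
    "diagnostics": {
        "overall_judge_score": None,  # filled in by the run
        "judge_prompt_sha256": h1,
    }
}

same_judge = h1 == h2  # False: even a small prompt edit changes the digest
```

With a digest like this recorded per run, two runs whose scores disagree can be separated into "same judge, different behavior" and "different judge" before anyone compares their headline numbers.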

Known asymmetries

  • Backbone gap. Top published systems use frontier models (Hindsight on Gemini-3 Pro, others on gpt-4o); our measured runs used gpt-5-mini. The fixed-judge protocol exists to remove this confound — until it runs, cross-system comparison is not apples-to-apples.
  • Self-reported vs measured-by-us. Published competitor numbers (Hindsight ≈91.4, EverMemOS ≈83.0, TiMem ≈76.9, GPT-4o RAG top-10 ≈71.4, GPT-4o full-context ≈60.6, Mem0 ≈65.0) come from each system's own judge and backbone. They are reference points, not a single ranking against our numbers. Several widely-cited figures were corrected against primary sources (e.g. the GPT-4o RAG "72.0" is not in the paper — Table 2's best GPT-4o @10 is 71.4; the Llama-3.1-8B "45.4" is not a cell in the paper; the MemGPT "15.8" is from a vendor blog, not Wu et al.).
  • Abstention coverage. The protocol describes 7 question types including abstention; the variant-s run we cite contained only the 6 non-abstention types (500 instances). So our numbers do not exercise the false-premise refusal case.
  • Judge identity matters more than the dataset here. As the discrepancy above shows, on this benchmark the judge configuration can move the headline more than the memory engine does. Any single number without a judge-prompt hash should be read skeptically — including ours.

By Edward Izgorodin · last updated 2026-06-21.