How We Measure AI Memory Honestly

In early 2026 a benchmark run scared us. A fresh run of our memory engine on LoCoMo came back about 16 points lower than an earlier run we trusted. Sixteen points is not noise — it's the kind of drop that means a recent code change broke something. We went looking for the bad commit.

There was no bad commit. The code was identical between the two runs. The earlier, higher number had been produced with retrieval-boosting flags on — a different two_pass setting, a different strategy, a deeper top-k. The new, lower number was the engine on its plain out-of-the-box defaults. We had stacked a tuned score next to a default score and read the gap as a regression. The 16 points were never in the engine. They were in the settings.

That false alarm is why this page exists. It is the constructive companion to our judge-variance article, which shows why most memory leaderboards can't be trusted. This page is the other half: the discipline that makes our numbers re-derivable — most of it learned the hard way, one mistake at a time.

TL;DR

  • Regression, not peak. Our canonical number is the out-of-the-box default, used as a regression tripwire — not a hand-tuned showpiece. The tuned config is kept, labelled a ceiling, never the headline. We never compare numbers across paradigms.
  • The comparability key. A score is a recipe: judge + grading prompt + retrieval depth (top-k) + dataset slice + paradigm. All must match before two numbers are comparable. A number without its key is a rumour.
  • Provenance. Every public figure is meant to be a four-link chain — frozen recipe card, raw result file, registry row, git commit — joined by a run_id, with an automated gate that fails the release on any untraceable number. This is a target, not yet fully enforced.
  • Recall first. Recall@k is answer-key-checked and judge-free — the hard-to-game floor. The LLM-judge score is a soft overlay shown beside it.
  • The asymmetry inventory. We catalogue every way our own comparisons could be unfair — and publish, candidly, that about 16 of ~30 known asymmetries tilt in our favour.

The settings trap: why we publish the everyday number, not the showcase

The scare taught us that there are two completely different questions you can ask a memory system, and that confusing them produces lies.

The regression paradigm. The canonical, headline number is the score the engine produces on its shipping defaults — what a real user gets out of the box. We treat the benchmark primarily as a regression test: a tripwire that fires if a code change quietly makes that default experience worse.

The other question — "what's the highest score this engine can reach with every knob tuned for this exact test?" — is fair, and we answer it too. But that number is a ceiling, kept separately and labelled as such, never the headline: a tuned ceiling tells you what the engine could do under ideal lab conditions, not what you'll actually get. Most vendor leaderboard numbers are ceilings dressed as headlines — the single best config the vendor could find for one test — and two ceilings from two vendors often aren't even measuring the same thing.

The rule that fell out of the incident is the single most-repeated discipline we hold: never compare numbers across paradigms. Internally a run is tagged with the question it answered — regression (defaults), tuned-frontier (the ceiling), competitor-mirror (apples-to-apples against a named competitor's exact setup), exploration (one knob varied), or smoke (a fast sanity check, never a headline). A default number set against a tuned number isn't a comparison. It's the 16-point false alarm, repeated.
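
One way to make that rule mechanical is to tag every run with its paradigm and have the comparison helper refuse mixed pairs. The following is a minimal sketch, not our actual tooling: the `Run` class, the `score_delta` helper, the run identifiers, and the scores are all hypothetical; only the five paradigm tags come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Paradigm(Enum):
    """The question a run answered (tags taken from the text above)."""
    REGRESSION = "regression"                # shipping defaults
    TUNED_FRONTIER = "tuned-frontier"        # the ceiling
    COMPETITOR_MIRROR = "competitor-mirror"  # a named competitor's exact setup
    EXPLORATION = "exploration"              # one knob varied
    SMOKE = "smoke"                          # fast sanity check


@dataclass(frozen=True)
class Run:
    run_id: str          # hypothetical identifier format
    paradigm: Paradigm
    score: float         # percentage points


def score_delta(baseline: Run, candidate: Run) -> float:
    """Return candidate minus baseline, refusing cross-paradigm pairs."""
    if baseline.paradigm is not candidate.paradigm:
        raise ValueError(
            f"not comparable: {baseline.run_id} is {baseline.paradigm.value}, "
            f"{candidate.run_id} is {candidate.paradigm.value}"
        )
    return candidate.score - baseline.score


# Made-up scores that reproduce the shape of the false alarm.
tuned = Run("run-a", Paradigm.TUNED_FRONTIER, 76.0)
default = Run("run-b", Paradigm.REGRESSION, 60.0)

try:
    score_delta(tuned, default)
except ValueError as err:
    print(err)  # the guard fires instead of reporting a 16-point "regression"
```

A matching paradigm is necessary but not sufficient: two regression runs can still differ in judge, depth, or dataset slice, which is what the comparability key in the next section covers.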

A score is a recipe, not a fact: the comparability key

Once you've been burned by a settings difference, you start seeing recipes everywhere. A score like "92%" is not one fact. It is the output of a recipe, and if any ingredient changes, the number changes — usually with no change to the memory system at all. The ingredients that move it most:

Ingredient | What it is | Why it moves the number
Judge | The model and the grading instructions used to mark answers right or wrong | The instructions are the single biggest lever (see below)
Retrieval depth (top-k) | How many candidate memories the system pulls back | Pull back more context and the score rises, regardless of memory quality
Dataset slice | One conversation, a 10-conversation subset, the full set, with or without trick questions | "LoCoMo" is not one fixed thing; each slice gives a different number
Paradigm | Default / tuned / competitor-mirror (the question from above) | Different recipes entirely

The comparability key is that bundle — judge + grading prompt + retrieval depth + dataset slice + paradigm. Two numbers are comparable only when all of it matches. We attach the key to every number we publish, because a number without its key is a rumour.
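
The key can be pictured as a small immutable record whose fields must all match before two scores are set side by side. This is a sketch under assumptions: the class name, field names, and example values (including the model and prompt identifiers) are illustrative, not the schema of our published records.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ComparabilityKey:
    judge_model: str      # which model graded the answers
    grading_prompt: str   # identifier of the grading instructions (assumed)
    top_k: int            # retrieval depth
    dataset_slice: str    # which part of the benchmark was run
    paradigm: str         # regression, tuned-frontier, ...


def mismatched_ingredients(a: ComparabilityKey, b: ComparabilityKey) -> list[str]:
    """Names of the ingredients on which the two recipes differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def comparable(a: ComparabilityKey, b: ComparabilityKey) -> bool:
    """Two scores are comparable only when every ingredient matches."""
    return not mismatched_ingredients(a, b)


# Hypothetical values, for illustration only.
base = ComparabilityKey(
    judge_model="judge-model-x",
    grading_prompt="strict-v1",
    top_k=10,
    dataset_slice="locomo-full",
    paradigm="regression",
)
deeper = replace(base, top_k=50)  # same recipe, deeper retrieval

print(comparable(base, base))                # True
print(mismatched_ingredients(base, deeper))  # ['top_k']
```

Returning the list of mismatched ingredients, rather than a bare yes or no, makes a refused comparison self-explaining: the caller sees which ingredient changed.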

The judge ingredient should frighten you most, and it's where our judge-variance article does the heavy lifting. The short version: on one fixed set of 152 answers — same answers, same grader model — swapping only the grading instructions moved the score from about 48% (a strict "did it actually get this right" rubric) to about 90% (a lenient rubric that accepts paraphrases, partial credit, and date wiggle-room). That is roughly a 42-point swing from instructions alone, with zero change to the memory system or even the answers. The grader model barely mattered — 1 to 2 points. (The full control, including the same swing reproduced on a competitor's own answers, is in the companion article.)

Lead with recall: the number that can't be graded into existence

Because the grader is the biggest lever on the score, our most-trusted number doesn't use a grader at all.

Recall@k asks a mechanical question: did the system actually retrieve the correct memory, checked against a known answer key? No AI grader, nothing to inflate.

A memory test can measure two different things, and we report both. Recall measures whether the engine found the right memory: it is answer-key-checked, judge-free, and hard to game. The LLM-judge score bundles the memory, the answer-writing, and the grader's leniency, so it is inflatable. "Lead with recall" is our first reporting rule; the judge score is a softer overlay shown beside it, never alone. And when we do show a judge score, the headline is the strict judge (the conservative floor), with more lenient graders shown next to it for contrast, never on their own. The incentive everywhere else is to report the most flattering number, which is exactly why we headline the strict one.

Why it matters: at shallow retrieval depth the judge score can sit above recall. The system earns credit for answers it could not actually have grounded in a retrieved memory — lucky guesses and partial-credit hedges a lenient grader waved through. A high "pass rate" can coexist with the engine having missed the relevant memory entirely. Recall exposes that gap; we publish both so you can see it. The full recall-versus-judge evidence is in the judges article.
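
The gap described above can be made concrete with a toy computation. The sketch below assumes one particular reading of recall@k (a question counts as a hit when at least one answer-key memory appears in the top k retrieved items, averaged over questions); the record layout, memory ids, and the four example rows are invented for illustration and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    question_id: str
    retrieved: tuple[str, ...]  # memory ids in rank order
    gold: frozenset[str]        # answer-key memory ids
    judge_pass: bool            # verdict from the LLM judge


def hit_at_k(r: QuestionResult, k: int) -> bool:
    """True when any answer-key memory is among the top k retrieved."""
    return any(m in r.gold for m in r.retrieved[:k])


def recall_at_k(results: list[QuestionResult], k: int) -> float:
    return sum(hit_at_k(r, k) for r in results) / len(results)


def pass_rate(results: list[QuestionResult]) -> float:
    return sum(r.judge_pass for r in results) / len(results)


def ungrounded_passes(results: list[QuestionResult], k: int) -> list[str]:
    """Questions the judge passed although no gold memory was in the top k."""
    return [r.question_id for r in results if r.judge_pass and not hit_at_k(r, k)]


# Toy rows, invented for the example.
results = [
    QuestionResult("q1", ("m1", "m7"), frozenset({"m1"}), True),
    QuestionResult("q2", ("m4", "m2"), frozenset({"m2"}), True),
    QuestionResult("q3", ("m9", "m8"), frozenset({"m3"}), True),
    QuestionResult("q4", ("m5", "m6"), frozenset({"m5"}), False),
]

print(recall_at_k(results, 1))        # 0.5
print(pass_rate(results))             # 0.75
print(ungrounded_passes(results, 1))  # ['q2', 'q3']
```

In this toy set the judge score (0.75) sits above recall at depth 1 (0.5), and the list of ungrounded passes names the questions that account for the difference.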

Provenance: a number you can re-derive (and where we fall short)

A 16-point ghost regression also teaches you to distrust any number you can't trace back to its source. So for every figure we publish (site, deck, chat message), the target is a chain of evidence with four links, joined by a single run_id:

  1. A frozen recipe card written before the run starts, capturing the exact command, every setting, and the grading instructions — so nothing can be quietly changed afterward (the precise failure mode of the 16-point incident).
  2. The raw result file, with every question's answer and score.
  3. A registry entry — one row in a ledger that indexes the run, tags its paradigm, and flags whether it's clean enough to publish.
  4. The exact git commit the run executed against.

Provenance is that four-link chain. If a link is missing, the number is untraceable and is not supposed to be published. An automated gate scans our public surfaces, pulls out every number, and fails the release if one can't be traced to a registered run.
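
As a rough illustration of the gate's core check, the sketch below takes a mapping from published figures to run_ids and reports every figure whose chain is incomplete. It is a simplification under stated assumptions: the real gate scans public surfaces to extract numbers, which this sketch skips, and the record layout, function names, paths, and commit hash are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    run_id: str
    recipe_card: Optional[str]   # frozen before the run starts
    result_file: Optional[str]   # raw per-question answers and scores
    registry_row: Optional[str]  # ledger entry for the run
    git_commit: Optional[str]    # commit the run executed against


LINKS = ("recipe_card", "result_file", "registry_row", "git_commit")


def missing_links(record: ProvenanceRecord) -> list[str]:
    """Names of the chain links that are absent for this run."""
    return [name for name in LINKS if not getattr(record, name)]


def release_gate(
    published: dict[str, Optional[str]],
    registry: dict[str, ProvenanceRecord],
) -> list[str]:
    """Return one failure message per untraceable figure (empty means pass)."""
    failures = []
    for figure, run_id in published.items():
        record = registry.get(run_id) if run_id else None
        if record is None:
            failures.append(f"{figure}: no registered run")
            continue
        gaps = missing_links(record)
        if gaps:
            failures.append(f"{figure}: run {run_id} missing {', '.join(gaps)}")
    return failures


# Hypothetical registry and figures.
full = ProvenanceRecord("run-1", "cards/run-1.yaml", "results/run-1.jsonl",
                        "registry#run-1", "abc1234")
partial = ProvenanceRecord("run-2", None, "results/run-2.jsonl",
                           "registry#run-2", None)
registry = {r.run_id: r for r in (full, partial)}
published = {"headline recall": "run-1", "deck slide 4": "run-2",
             "old blog figure": None}

failures = release_gate(published, registry)
for line in failures:
    print(line)
```

A release script could treat a non-empty failure list as a failed build, which matches the intent that an untraceable number blocks publication.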

Here is the candid part. This is the target, and it is not yet fully enforced. Dozens of historical numbers predate the discipline; they are flagged as untraceable and are still being back-filled or excluded. We document that gap rather than hide it — a methodology page about provenance has no business pretending its own provenance is complete when it isn't. The point of the gate isn't that we've already won; it's that an untraceable number is meant to fail our own release check, instead of quietly being grandfathered in.

And the comparability key is not just an internal promise — most of it is already public. Every cell behind the interactive benchmark matrix ships as a JSON record carrying its system, judge, reader model, retrieval depth, dataset scope, per-judge accuracy, recall provenance, and (where applicable) a link to its asymmetry inventory, readable directly in the matrix manifest. The deeper records — the run registry, the frozen recipe cards, the decision records — are internal today and being opened; but the recipe already travels with every published cell, where you can check it rather than take our word for it.

The asymmetry inventory: we audited our own fairness and published the ledger

The hardest discipline is being honest when honesty costs you the favourable number. When we compare ourselves against a competitor, dozens of small setup details can tilt the result — different question sets, different answer-formatting hints, one system's retrieval silently capped lower than another's. Instead of pretending these don't exist, we keep a written inventory of every known asymmetry: what it is, which direction it tilts (toward us, toward the competitor, or both), how big it is when we know, and how to fix it. Every comparison references it.

The inventory is candid even when it's unflattering. Of roughly 30 catalogued asymmetries, about 16 tilt in our favour. The majority of the known ways our head-to-heads could be unfair flatter us — and we state that plainly in the summary rather than bury it. That's the point of writing the ledger: we ran the adversarial audit on our own numbers. The catalogue is published, and every matrix cell links the asymmetry methodology that applies to it, so the tilt is auditable per number rather than asserted in prose.
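
The inventory's shape (what the asymmetry is, which way it tilts, how big it is when known, and how to fix it) lends itself to a simple ledger with a tally by direction. This is a sketch only: the class and field names are assumptions, and the three rows reuse the kinds of setup details named above with tilt directions and fixes invented for the example, not taken from the published catalogue.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tilt(Enum):
    US = "toward us"
    COMPETITOR = "toward the competitor"
    BOTH = "both"


@dataclass(frozen=True)
class Asymmetry:
    description: str
    tilt: Tilt
    size_points: Optional[float]  # None when the magnitude is unknown
    fix: str


def tilt_summary(inventory: list[Asymmetry]) -> Counter:
    """Count catalogued asymmetries by the direction they tilt."""
    return Counter(a.tilt for a in inventory)


# Illustrative rows; directions and fixes are invented for the example.
inventory = [
    Asymmetry("different question sets", Tilt.BOTH, None,
              "run both systems on one shared set"),
    Asymmetry("answer-formatting hints given to one system", Tilt.US, None,
              "give the same hints to both"),
    Asymmetry("retrieval capped lower for one system", Tilt.US, None,
              "match retrieval depth"),
]

summary = tilt_summary(inventory)
print(f"{summary[Tilt.US]} of {len(inventory)} tilt toward us")
```

Keeping the tally as a computed value, rather than a sentence maintained by hand, means the summary line changes whenever a row is added or reclassified.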

We also classify every advantage a benchmark setup gives us into one of three integrity tiers:

Tier | Meaning | What we do
HONEST | The advantage also happens in production (e.g. our memory keeps learning between questions) | Keep it, and disclose it as a kept advantage
GIFT | Free in the test but must be earned in production | Keep only if disclosed
CORRUPT | Impossible in production, i.e. cheating (e.g. leaking the answer key or a dataset fingerprint into retrieval) | Never published. We found such leaks in our own setup and removed them
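
The three tiers reduce to two questions about an advantage: could it exist in production at all, and does it come for free there? The sketch below encodes one reading of the table; the function names and boolean inputs are assumptions, and real classification is a judgment call rather than a lookup.

```python
from enum import Enum


class Tier(Enum):
    HONEST = "HONEST"
    GIFT = "GIFT"
    CORRUPT = "CORRUPT"


def classify(possible_in_production: bool, free_in_production: bool) -> Tier:
    """Map an advantage to a tier, following the table above."""
    if not possible_in_production:
        return Tier.CORRUPT  # e.g. a leaked answer key
    if free_in_production:
        return Tier.HONEST   # the advantage also happens in production
    return Tier.GIFT         # free in the test, must be earned in production


def may_publish(tier: Tier, disclosed: bool) -> bool:
    """CORRUPT is never published; kept advantages require disclosure."""
    return tier is not Tier.CORRUPT and disclosed


print(classify(True, True))                # Tier.HONEST
print(classify(True, False))               # Tier.GIFT
print(classify(False, False))              # Tier.CORRUPT
print(may_publish(Tier.GIFT, disclosed=False))    # False
print(may_publish(Tier.CORRUPT, disclosed=True))  # False
```

Note that disclosure does not rescue a CORRUPT advantage: the check on the tier comes first, so nothing in that tier can be published under any flag.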

The mandate is blunt: win, but without cheating, and disclose. Kept algorithmic advantages are disclosed, not hidden. The end state we're driving toward is a single shared, symmetric harness that runs every system through identical code — removing most of these asymmetries by construction, not by good intentions.

What this means when you read a memory number

The 16-point false alarm reduced our whole method to a short checklist that works on anyone's numbers, including ours. When you read a benchmark claim, ask for five things:

  1. Which paradigm? Everyday default, or tuned showcase ceiling? Don't let the two be compared.
  2. The comparability key. Judge model and grading prompt, retrieval depth, dataset slice. Missing any, the number is a rumour.
  3. Recall, not just pass-rate. The judge-free floor is the hard number; a high pass-rate over a low recall usually means lucky guesses or partial-credit hedging.
  4. Provenance. Can the figure be traced to a frozen recipe, a raw result file, and a code commit?
  5. The asymmetries. In a head-to-head, which way does each setup detail tilt — and did anyone write that down?

We don't claim to top a leaderboard, and we don't claim the discipline above is fully built — the provenance gate is still a target and the symmetric harness is still being assembled. What we claim is narrower and, we think, more useful: when we publish a number, the recipe travels with it, recall comes first, and the ways we could be flattering ourselves are written down where you can read them.

Common questions

Why publish the default config instead of the best-tuned score? Because the default is the score a real user actually gets, and we use it as a regression tripwire. The tuned config is kept as a labelled ceiling, never the headline — and we never compare across paradigms. The discipline came from a 16-point "regression" that turned out to be a tuned-vs-default settings difference, not a code change.

Why lead with recall@k instead of pass-rate? Recall is answer-key-checked and judge-free — the hard-to-game floor. Pass-rate bundles in the grader's leniency and can stay high even when the engine missed the memory and the answer hedged for partial credit. We show both and lead with recall.

Do any benchmark setups favour Mnemoverse? Yes — about 16 of roughly 30 catalogued asymmetries tilt in our favour, and we publish that. Production-real advantages we keep and disclose; anything that leaks the answer key is classed CORRUPT and never published.

Sources

  • The public benchmark matrix — every published cell, with its judge, reader model, retrieval depth, dataset scope, per-judge accuracy, and recall provenance: the matrix manifest and the interactive hub. This is the inspectable face of everything below.
  • The asymmetry methodology. The published catalogue each comparison references, which classifies every measurement asymmetry by direction, severity, and status (16 of 30 tilt toward us): asymmetric-v1. (The HONEST / GIFT / CORRUPT integrity tiers described above are our own framing, not part of that artifact.)
  • Regression-not-peak (ADR-008) and the run registry. The internal decision record and ledger behind the four-link provenance chain (recipe card → result file → registry row → git commit) and the paradigm taxonomy — internal today and being opened, with the public matrix above as their inspectable face.
  • Judge-variance evidence. The 48%↔90% swing on a fixed 152-answer set, the recall-versus-judge gap, and the competitor-answer replication, documented in LLM-as-Judge Variance in AI Memory Benchmarks.

By Edward Izgorodin. Last updated 2026-06-23.