Judges, Good and Evil: Why Memory Benchmark Scores Swing 40 Points

Picture two graders reading the exact same stack of answers from an AI-memory system. The first is generous — partial credit, paraphrase tolerance, dates within a couple of weeks count as right. The second is severe — miss a detail, lose the point. Same answers, same questions, same retrieval — yet the kind grader returns 0.90 while the severe grader returns 0.48. Forty-two points apart, and not one byte of the memory system changed.

That is not a thought experiment. It is a control we ran on our own numbers, and then on a competitor's — the experiment that should make you distrust every agent-memory leaderboard, including ours. In AI-agent memory benchmarks — LoCoMo, BEAM, and their kin — the LLM judge, not the memory engine, often decides the headline. This article is about why you cannot trust a memory leaderboard without the judge's recipe, and what an honest score looks like instead.

TL;DR

  • On a fixed set of LoCoMo answers, swapping only the grading prompt moved our score from 0.9013 to 0.4803 — a 42.1-point drop, with 64 of 152 verdicts flipped. The judge model moved it 1–2 points. The prompt did the rest.
  • We reproduced the swing on a competitor's own answers (Mem0 v2 OSS): 0.7413 → 0.3427, a 39.9-point drop, 57 of 143 verdicts flipped. Generation can't be the cause; the prompt is.
  • Recall@k — answer-key-checked, judge-free — is the honest floor. Across k=10→200, retrieval recall rose 25.8 points while the judge score rose only 13.1. At k=10, a third of the gold evidence was never retrieved, yet the generous judge scored 0.770.
  • The published literature agrees on direction: judge leniency varies ~5× across models, and the LoCoMo answer key is itself ~6.4% wrong.
  • Our own headline (0.9013, a generous judge) lives inside the soft metric we're criticizing. We say so up front — and we lead with recall@k.

The good judge and the evil judge are the same model

LLM-as-judge is the practice of having a large language model read a system's answer, compare it to a reference answer, and decide CORRECT or WRONG (or assign a score). It is how nearly every AI-memory leaderboard turns raw answers into a number.

We built a four-judge registry that varies exactly one axis at a time, so any swing is attributable rather than confounded. Two of the judges run the same model (gpt-5) and differ only in their grading prompt: the mem0 judge uses Mem0's published rubric — partial credit, paraphrase tolerance, dates within ±14 days, durations within ±50%. The strict judge uses an adversarial-strict rubric. Same model, same answers, one prompt apart.
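One way to picture the "one axis at a time" design is as a small registry in which any two judges can be checked for how many axes separate them. This is a minimal sketch, not the actual registry code: the `JudgeSpec` class, the `differing_axes` helper, and the `"lenient"`/`"strict"` labels are hypothetical names, and only the three judges named in this article are listed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JudgeSpec:
    """Hypothetical record for one judge: a name plus the two axes that can vary."""
    name: str
    model: str
    rubric: str


# Only the three judges named in the article; the fourth registry entry
# is not described there, so it is left out rather than guessed.
REGISTRY = {
    "mem0": JudgeSpec("mem0", "gpt-5", "lenient"),
    "mem0-4o": JudgeSpec("mem0-4o", "gpt-4o", "lenient"),
    "strict": JudgeSpec("strict", "gpt-5", "strict"),
}


def differing_axes(a: JudgeSpec, b: JudgeSpec) -> list[str]:
    """Return the axes (ignoring the name) on which two judges differ."""
    return [
        f.name
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


def is_controlled_pair(a: JudgeSpec, b: JudgeSpec) -> bool:
    """A comparison is attributable only if exactly one axis changed."""
    return len(differing_axes(a, b)) == 1


print(differing_axes(REGISTRY["mem0"], REGISTRY["strict"]))      # ['rubric']
print(differing_axes(REGISTRY["mem0"], REGISTRY["mem0-4o"]))     # ['model']
print(is_controlled_pair(REGISTRY["mem0-4o"], REGISTRY["strict"]))  # False
```

Comparing mem0-4o against strict would change both model and rubric at once, which is exactly the confounded comparison the design avoids.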

Here is the control on our own answers — LoCoMo conversation 26, 152 questions, one gpt-5 reader, top-k 200:

| Judge | Grading prompt | Score (J) |
| --- | --- | --- |
| mem0 (gpt-5) | lenient rubric | 0.9013 |
| mem0-4o (gpt-4o) | lenient rubric, different model | 0.914 |
| strict (gpt-5) | strict rubric | 0.4803 |

The "good" judge and the "evil" judge are not two different models with two different temperaments. They are the same gpt-5, reading the same 152 answers. The only thing that changed is the prompt — and 64 of those 152 answers are CORRECT to the good judge and WRONG to the evil one. Meanwhile, swapping the judge model from gpt-5 to gpt-4o under the identical lenient prompt moved the score by +1.3 points (2–4 cases). The prompt is the lever; the model barely moves it.
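The arithmetic behind the score and the flip count can be sketched as follows. The verdict lists are synthetic and only shaped to match the reported totals: the counts 137 and 73 are back-computed from 0.9013 and 0.4803 over 152 questions, and the ordering of verdicts is invented. The function names are hypothetical.

```python
def judge_score(verdicts: list[bool]) -> float:
    """J is the fraction of answers the judge marked CORRECT."""
    return sum(verdicts) / len(verdicts)


def count_flips(lenient: list[bool], strict: list[bool]) -> tuple[int, int]:
    """Count per-answer disagreements between two judges on the same answers.

    Returns (correct under lenient but wrong under strict,
             wrong under lenient but correct under strict).
    """
    if len(lenient) != len(strict):
        raise ValueError("both judges must grade the same answers")
    down = sum(1 for l, s in zip(lenient, strict) if l and not s)
    up = sum(1 for l, s in zip(lenient, strict) if s and not l)
    return down, up


# Synthetic verdicts: 137 of 152 CORRECT under the lenient prompt,
# 73 of 152 under the strict prompt (counts inferred, ordering invented).
N = 152
lenient = [i < 137 for i in range(N)]
strict = [i < 73 for i in range(N)]

print(round(judge_score(lenient), 4))  # 0.9013
print(round(judge_score(strict), 4))   # 0.4803
print(count_flips(lenient, strict))    # (64, 0)
```

Under these inferred counts, 64 downward flips account for the whole 137 to 73 drop, which would leave no room for answers that the strict prompt accepted and the lenient prompt rejected.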

So when a leaderboard publishes a single memory score, the relevant question is not "how good is the memory?" but "which grader read the answers, and with what rubric?"

The control that closes the loophole

There is an obvious objection: maybe the strict judge cratered our score because our answers are genuinely mediocre, and the lenient judge was papering over weak generation. If that were true, the swing would vanish when you grade a stronger system's answers.

So we ran the same two prompts — same gpt-5 model — over Mem0's own published v2 OSS answers on the same conversation. We changed nothing about how those answers were produced — we only judged them, twice:

| Judge | Score (J), 143 scored |
| --- | --- |
| mem0 (gpt-5) | 0.7413 |
| strict (gpt-5) | 0.3427 |
| swing | −39.9 points |

Of Mem0's 143 scored answers, 57 are CORRECT to the good judge and WRONG to the evil one. The collapse is uneven by category (open-domain questions fall 76.9 points under the strict prompt, multi-hop 48.0, temporal 38.9, single-hop 30.4), but the headline is unambiguous: a leader's own answers swing ~40 points on a single prompt swap, with their generation untouched.

The two controls — −42.1 points on our answers, −39.9 on Mem0's — agree within about two points. Generation cannot be the cause, because generation never changed. The only variable across each pair of runs is the grading prompt, which means the canonical LoCoMo headline, scored under the lenient verbatim rubric, is judge-inflated for everyone who runs that eval — us included.

Our own headline lives inside the soft metric

This is the uncomfortable part, and the point at which an honest article has to turn the instrument on itself.

Our reported 0.9013 on conversation 26 was produced by the good judge — the lenient mem0 rubric. It sits squarely inside the same soft metric we are criticizing. When the evil judge reads the identical answers, our number is 0.4803. We are not exempt from our own argument; we are the first example in it.

That is exactly why we do not let the judge score stand alone — and why the rest of this article is about the metric we trust more.

Recall@k: the floor the judge can't inflate

Recall@k is the fraction of the gold evidence a system actually retrieves within its top k results, checked directly against the answer key — no LLM in the loop. It is judge-free, costs nothing, and is hard to game, because it scores retrieval depth rather than the persuasiveness of a final sentence.
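As a minimal sketch of that definition for a single question, assuming evidence items are identified by string IDs (the `recall_at_k` name, the ID format, and the example data are hypothetical; a benchmark-level figure would then average over questions, which is also an assumption here):

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold evidence IDs that appear in the top-k retrieved results.

    No LLM is involved: this is a set intersection against the answer key.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not gold_ids:
        raise ValueError("a question needs at least one gold evidence item")
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


# Invented example: three gold items, one of which is never retrieved.
ranked = ["d3", "d7", "d1", "d9", "d4"]
gold = {"d1", "d4", "d8"}

print(recall_at_k(ranked, gold, k=3))  # one of three gold items in the top 3
print(recall_at_k(ranked, gold, k=5))  # two of three gold items in the top 5
```

Because "d8" never appears in the ranking, no value of k can lift this question to full recall, and no rubric can grant partial credit for it.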

Watch what happens when recall and the judge score look at the same k-curve (gpt-5 reader, lenient mem0 judge):

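| k | recall | judge score (J) | J − recall |
| --- | --- | --- | --- |
| 10 | 0.669 | 0.770 | +0.10 |
| 20 | 0.763 | 0.829 | +0.07 |
| 50 | 0.840 | 0.862 | +0.02 |
| 100 | 0.883 | 0.875 | −0.01 |
| 200 | 0.927 | 0.901 | −0.03 |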

At k=10, a third of the gold evidence is never retrieved (recall 0.669) — yet the reader plus the generous judge hand back 0.770, ten points of credit for answers the system could not have grounded. Across k=10→200, recall climbs 25.8 points while the judge score climbs only 13.1: the judge is half as sensitive to retrieval depth. The same gap shows up on BEAM, where one run scores J=0.610 but ground-truth atom recall@100 is only about 34–36% (see the BEAM card for the canonical figure) — the judge's partial credit inflating the headline by an estimated 10–15 points.
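The figures in that paragraph can be re-derived from the k-curve itself. The sketch below uses the five rows of the table as data; the variable and function names are hypothetical.

```python
# (k, recall, judge score J) rows from the k-curve table.
K_CURVE = [
    (10, 0.669, 0.770),
    (20, 0.763, 0.829),
    (50, 0.840, 0.862),
    (100, 0.883, 0.875),
    (200, 0.927, 0.901),
]


def gap_by_k(curve):
    """J minus recall at each k: positive means credit beyond what was retrieved."""
    return {k: round(j - r, 2) for k, r, j in curve}


def rise_in_points(curve, column: int) -> float:
    """Rise of one column (1 = recall, 2 = judge) from the first to the last k, in points."""
    return round((curve[-1][column] - curve[0][column]) * 100, 1)


recall_rise = rise_in_points(K_CURVE, 1)
judge_rise = rise_in_points(K_CURVE, 2)

print(gap_by_k(K_CURVE))
print(recall_rise, judge_rise)            # 25.8 13.1
print(round(judge_rise / recall_rise, 2))  # 0.51
```

The ratio of roughly 0.51 is where "half as sensitive" comes from, and the sign change in the gap between k=50 and k=100 marks where retrieval overtakes the judge's credit.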

The reading we take from this: recall@k is the honest memory metric; the judge score is a soft, depth-inflated overlay. Lead with the floor, show the overlay alongside it, and never let the overlay travel alone.

The field, fairly: judges drift, and so does the answer key

Our ~40-point prompt swing is at the extreme end, but the direction — the grader, not the system, moves the number — is well-documented. "Judging the Judges" (arXiv:2406.12624) finds judge leniency (the rate of accepting a wrong answer) ranging from 0.19 to 0.99 across judge models — a ~5× spread purely from model choice. "Evaluating Scoring Bias in LLM-as-a-Judge" (arXiv:2506.22316) shows a detailed rubric scores the same answers systematically harder than a minimal one. The MT-Bench study (arXiv:2306.05685) documents position bias of up to 75% for the first-placed answer, a verbosity bias toward longer answers, and a self-enhancement effect of roughly 10–25% when a model grades its own text (an effect the authors flag as suggestive, not conclusive).

And the benchmark beneath all of this is not pristine. An independent Penfield Labs audit (April 2026, one team, not peer-reviewed) found that ~6.4% of the LoCoMo answer key is wrong: 99 of 1,540 questions, including 24 speaker-attribution errors. For calibration, Northcutt et al. (NeurIPS 2021) measured an average 3.3% label-error rate across ten major ML benchmarks; LoCoMo is nearly double that. If the audit holds, the wrong keys alone put the effective ceiling near 93.6%, since a fully correct system graded faithfully against the key would be marked wrong on those questions. The same audit found the canonical gpt-4o-mini judge accepts 62.81% of intentionally wrong, topically-adjacent answers. Vendor headlines clustered at 92–95 are, at that altitude, measuring judge leniency and answer-key noise as much as memory.
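The key-error and ceiling figures follow from the audit's two counts. This is a sketch of the arithmetic only, under the assumption that a judge grading faithfully against the key marks a correct answer wrong whenever the key itself is wrong; the names are hypothetical.

```python
# Counts reported by the audit cited above.
WRONG_KEYS = 99
TOTAL_QUESTIONS = 1540

# Average label-error rate from Northcutt et al., in percent.
BASELINE_LABEL_ERROR_PCT = 3.3


def key_error_pct(wrong: int, total: int) -> float:
    """Share of answer-key entries that are wrong, in percent."""
    return wrong / total * 100


def effective_ceiling_pct(wrong: int, total: int) -> float:
    """Best score a fully correct system could get against a faithful judge.

    Assumption: every wrong key costs one point-bearing question.
    """
    return 100 - key_error_pct(wrong, total)


print(round(key_error_pct(WRONG_KEYS, TOTAL_QUESTIONS), 2))          # 6.43
print(round(effective_ceiling_pct(WRONG_KEYS, TOTAL_QUESTIONS), 1))  # 93.6
print(round(key_error_pct(WRONG_KEYS, TOTAL_QUESTIONS) / BASELINE_LABEL_ERROR_PCT, 2))  # 1.95
```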

The fair counterweight, which we will not omit: an MT-Bench GPT-4 judge agreed with human raters about 85% of the time — essentially the ~81% at which humans agree with each other. The landing here is not "LLM judges are useless." It is narrower and sturdier: uncalibrated, vendor-chosen, undisclosed judges inflate. Demand the recipe — judge model, judge prompt, reader, top-k, dataset slice, answer-key version — before you trust a score.

Five questions to ask any memory leaderboard

  1. Which judge model graded it? (Worth ~1–2 points — the least of your worries.)
  2. Which grading prompt / rubric? Lenient versus strict is the ~40-point lever.
  3. What top-k, and did recall@k keep up — or did the judge credit answers the system never retrieved?
  4. Which slice of the dataset — the full set, or an easier subset with the adversarial questions dropped?
  5. Did the system grade its own answers? Generator = judge is the textbook self-preference setup.

If a leaderboard doesn't answer all five, it isn't reporting a memory score — it's reporting a grader's mood.

What an honest memory score looks like

The field's structural problem is that every vendor runs its own judge model, prompt, reader, and question subset, then presents the result as comparable. A single system's LoCoMo number can drift upward (Mem0 reported 66.88, then 91.6, then 92.5) as the harness changes, not the engine. Some reported headlines are self-graded: in one case a 94.7 figure used the same model as reader and judge (reported, not independently reproduced), the textbook self-preference setup. And the Mem0↔Zep exchange shows how far the test rig alone moves a headline: Mem0 reported Zep at 65.99; Zep re-ran with a corrected harness and reported 75.14 (about a 9-point move); separately, Mem0's counter-analysis argues that Zep's distinct 84% claim falls to 58.44 once the adversarial category is handled and ten runs are averaged instead of one. Both sides argue in good faith; the point is that the number is a property of the harness, not the memory. We name these with their sources and mark every cross-vendor comparison as comparable: false, not as a hit piece, but because they are not measuring the same thing, and neither are we when our protocol differs.

Our own practice follows from the evidence:

  • Lead with recall@k — judge-free, answer-key-checked — as the floor.
  • Report the strict judge as the headline, not the lenient one, and show multiple judges side by side so the spread is visible, not hidden.
  • Ship the full protocol with every number: judge model, judge prompt, reader, top-k, category set, answer-key version, with the re-judge logs committed.
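A protocol record of this kind could be sketched as below, with a check that refuses to call two scores comparable unless both recipes are complete and identical. The `ScoreProtocol` class, its field names, and the helper functions are hypothetical, not the actual harness. The example values are the ones this article states for the conversation-26 strict run; the two fields the article does not state are left empty rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ScoreProtocol:
    """Hypothetical record of the recipe that must travel with a score."""
    judge_model: Optional[str] = None
    judge_prompt: Optional[str] = None
    reader: Optional[str] = None
    top_k: Optional[int] = None
    category_set: Optional[str] = None
    answer_key_version: Optional[str] = None


def missing_fields(protocol: ScoreProtocol) -> list[str]:
    """Names of recipe fields that were not disclosed."""
    return [f.name for f in fields(protocol) if getattr(protocol, f.name) is None]


def comparable(a: ScoreProtocol, b: ScoreProtocol) -> bool:
    """Two scores are comparable only if both recipes are complete and equal."""
    return not missing_fields(a) and not missing_fields(b) and a == b


# Values stated in the article; undisclosed fields stay None.
strict_run = ScoreProtocol(
    judge_model="gpt-5", judge_prompt="strict", reader="gpt-5", top_k=200
)

print(missing_fields(strict_run))          # ['category_set', 'answer_key_version']
print(comparable(strict_run, strict_run))  # False: the recipe is incomplete
```

Treating an incomplete recipe as non-comparable by default mirrors the comparable: false marking: the burden sits on the publisher of the number, not on the reader.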

We are not claiming to top a leaderboard. We are claiming something narrower and more durable: a number you can re-derive. No memory-benchmark headline is interpretable without its full protocol attached — a score without its judge recipe is a rumor, not a measurement.

Common questions

Why do memory benchmark scores disagree so much? Because most AI-memory benchmarks score answers with an LLM judge, and the judge's grading prompt — not the memory system — often decides the headline. Holding the answers fixed and swapping only the prompt moved our LoCoMo score from 0.9013 to 0.4803, a 42.1-point drop; the judge model moved it 1–2 points. Different vendors run different judges, prompts, readers, top-k settings, and question subsets, so their numbers are not measuring the same thing.

Are LLM judges reliable? On average they track human judgment well — an MT-Bench GPT-4 judge agreed with human raters about 85% of the time, near the ~81% at which humans agree with each other — so judges are not useless. The failure mode is uncalibrated, vendor-chosen, undisclosed judges: leniency varies ~5× across models, self-grading inflates, and the LoCoMo answer key is itself ~6.4% wrong. A judge score is trustworthy only when its full recipe is published and independently checked.

What is recall@k and why lead with it? Recall@k is the fraction of the gold evidence a system retrieves in its top k results, checked against the answer key with no LLM judge involved. We lead with it because it is judge-free, free to compute, and hard to game: it tracks retrieval depth, to which the judge score is only about half as sensitive. On our k-curve, recall rose 25.8 points from k=10 to k=200 while the judge score rose only 13.1, and at k=10 a third of the gold evidence was never retrieved even though the judge scored 0.770.

Sources

  • Judging the Judges — leniency (P+) 0.19–0.99 across judge models — arXiv:2406.12624
  • Evaluating Scoring Bias in LLM-as-a-Judge — prompt strictness moves same-answer scores — arXiv:2506.22316
  • Judging LLM-as-a-Judge (MT-Bench), Zheng et al., NeurIPS 2023 — position, verbosity, self-enhancement bias; ~85% judge–human agreement — arXiv:2306.05685
  • Self-Preference Bias in LLM-as-a-Judge (NeurIPS 2024 workshop) — arXiv:2410.21819
  • Penfield Labs LoCoMo answer-key audit (April 2026, one team, not peer-reviewed) — ~6.4% key errors; judge accepts 62.81% of wrong answers — dev.to/penfieldlabs
  • Northcutt, Athalye & Mueller, Pervasive Label Errors in Test Sets Destabilize ML Benchmarks (NeurIPS 2021) — avg 3.3% label-error across 10 major ML benchmarks — arXiv:2103.14749
  • Zep "Lies, Damn Lies, and Statistics" and the Mem0↔Zep correction thread — blog.getzep.com · zep-papers issue #5
  • Mem0 LoCoMo paper — arXiv:2504.19413
  • Our committed re-judge logs — the four-judge registry, the conv-26 cross (0.9013/0.4803), and the Q#8 control on Mem0's own answers (0.7413/0.3427) — committed in our benchmark experiments (rejudge_20260521_235650.json, rejudge_q8_20260531T201308Z.json; commit 9e8c3c2) and reproducible from the full protocol.

Edward Izgorodin, June 2026 — LinkedIn


— Mnemoverse is a persistent-memory API for AI agents. Free key: console.mnemoverse.com · Docs: Getting Started