The Judge Says Yes Too Easily: LLM-as-a-Judge, Leniency, and the Memory-Benchmark Number
The judge under the agent-memory leaderboard is given a written instruction, and you can read it. In the Mem0 paper (Appendix A), the grader is told: "The generated answer might be much longer, but you should be generous with your grading - as long as it touches on the same topic as the gold answer, it should be counted as CORRECT."
TL;DR
- The LLM judge is good on average — an MT-Bench GPT-4 judge agreed with humans ~85% of the time — but its error has a direction: it says yes far more readily than no.
- That leniency (agreeableness bias) clusters around one mechanism: "right topic, wrong specifics" — an answer that names the right subject while missing every detail tends to pass.
- It is baked into the memory leaderboard: the LoCoMo "LLM-as-a-Judge" score uses a Mem0 grader told to "be generous, same topic = correct," and the number swings by double digits with harness choices.
- Trustworthy judging is unglamorous: ground in a reference, check atomic facts not topics, use a jury of disjoint families, and calibrate with Cohen's kappa — never trust one absolute score.
Read that twice. Generous. Same topic = correct. That prompt — running on gpt-4o-mini, returning a binary CORRECT/WRONG — produces the "LLM-as-a-Judge accuracy" percentage that vendors now quote when they claim state-of-the-art memory. So here is the question worth a few minutes: how did a be-generous instruction become a headline SOTA number?
This is the deep-dive on the judge itself. Our companion piece on LangSmith evaluation touched it in passing; this one takes it apart.
LLM-as-a-judge is a hidden dependency
Most "quality scores" are an LLM grading an LLM. On average, that works well. The foundational MT-Bench study (Zheng et al., NeurIPS 2023) found a GPT-4 judge agreeing with human raters about 85% of the time, essentially matching the 81% at which humans agree with each other. By that measure, the judge is a fair stand-in for a person.
LLM-as-a-judge is the practice of using one LLM to grade another LLM's output, which produces most of the "quality scores" reported today and on average tracks human judgment well.
Hold the word average. An average can be high while hiding a strong directional error — the judge is good on average, and the average is the wrong question. Its error has a direction: it says yes far more readily than no.
LLM judge bias: the average hides a direction, and the direction is leniency
The sharpest measurement of this comes from Beyond Consensus (Jain, Ahmed, Sahai, Leong, 2025), which tested LLM validators on 366 high-school Python programs. The judges accepted valid outputs at a true-positive rate of about 96%, but rejected invalid ones at a true-negative rate of under 25%, which means more than three in four wrong outputs were waved through. About a quarter of the wrong answers were missed by every validator in the panel. A judge that almost never rejects a correct answer and usually accepts a wrong one is the textbook signature of over-acceptance. The authors call it agreeableness bias.
Agreeableness bias is the LLM judge's directional tendency to over-accept — flagging valid answers at a true-positive rate of ~96% while catching invalid ones at a true-negative rate under 25%.
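To see how a high average can coexist with a low true-negative rate, here is a minimal sketch. The 900/100 split between valid and invalid answers is an assumption chosen for illustration, not a figure from the paper, and the 25% true-negative rate is the upper bound quoted above rather than the measured value.

```python
def judge_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return accuracy, true-positive rate, and true-negative rate."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),  # share of valid answers the judge accepts
        "tnr": tn / (tn + fp),  # share of invalid answers the judge rejects
    }

# Hypothetical test set: 900 valid answers, 100 invalid ones.
# The judge accepts 96% of the valid and rejects only 25% of the invalid.
rates = judge_rates(tp=864, fn=36, tn=25, fp=75)
print(rates)
```

Under these assumed counts the judge is right on roughly 89% of items while rejecting only a quarter of the wrong ones, so headline accuracy alone would not reveal the leniency.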
What does the leniency actually look like in practice? It clusters around a single mechanism — "right topic, wrong specifics": an answer that names the right subject while missing every detail tends to pass. The literature is by now a catalogue of how that mechanism shows up and how easily it is exploited:
| Failure mode | What it does | Source |
|---|---|---|
| Agreeableness | TPR ~96% on valid answers vs TNR <25% on invalid — says yes to almost everything | Jain et al., 2025 |
| Relevance over-rating | Open-weight IR judges over-rate ~45–66% of non-relevant passages; keyword-stuffing flips most to "perfectly relevant" | Yu et al., 2026 |
| Adversarial persuasion | Rhetoric inflated scores on incorrect math by up to ~8% on average | Hwang et al., 2025 |
| Evaluation faking | "Your verdict has consequences" shifted judges toward leniency in 58/72 cells; 0 of 4,560 traces admitted it | Gupta et al., 2026 |
| Position | GPT-4 keeps its verdict only 65% of the time when answer order is swapped | MT-Bench |
| Verbosity | A repetitive-list "padding" attack fooled Claude-v1 and GPT-3.5 91.3% of the time (GPT-4 only 8.7%) | MT-Bench |
| Self-enhancement | GPT-4 ~10% / Claude-v1 ~25% higher self-win-rate — though the authors cannot determine it is true self-preference | MT-Bench |
Read the table with its caveats, because they are the difference between honest and not. The verbosity and position failures concentrate in weaker, cheaper judges — exactly the kind used at benchmark scale. The self-enhancement figures carry the original authors' explicit warning that limited 2023-model data means they cannot confirm it is genuinely self-preference. And the ~45–66% relevance over-rating is open-weight judges on information-retrieval grading; the rate does not transfer to a GPT-4-class QA judge. The pattern is convergent across tasks; the exact number is task- and model-specific.
Two of those findings close the door on the obvious defenses. Confidence is no signal: in the relevance study (Yu et al., 2026), models reported over 95% confidence on both their correct and incorrect verdicts. And reading the judge's reasoning does not catch the bias: Context Over Content (Gupta et al., 2026) held the evaluated text constant and merely told the judge its verdict "had consequences"; across 4,560 chain-of-thought traces, not one acknowledged that the framing had moved it. The judge does not tell you when it is being lenient, and its stated reasoning gives no sign that it knows.
The LoCoMo judge is built into the memory benchmark you cite
Now the part that should change how you read a memory leaderboard.
The original LoCoMo benchmark (Maharana et al., ACL 2024) did not score question answering with an LLM judge. It scored question-answering with token-overlap F1, summaries with ROUGE and FActScore, and multimodal dialogue with a dedicated relevance metric. The "LLM-as-a-Judge score (J)" that vendors now headline was introduced later, by the Mem0 paper, which argued, reasonably, that lexical metrics "exhibit significant limitations" for conversational factual accuracy. So they layered a judge on top. The judge is the one quoted at the top of this article: be generous, same topic = correct, binary, on gpt-4o-mini. In that setup the same small model often both generates the answer and grades it. (The prompt is stricter elsewhere: it demands the right date on temporal questions. On factual recall, though, "same topic" is the bar.) That instruction gives the agreeableness bias an opening.
The LoCoMo "LLM-as-a-Judge score (J)" is a later add-on introduced by the Mem0 paper — a binary CORRECT/WRONG grader told to be generous, running on gpt-4o-mini — not part of the original LoCoMo benchmark, which used lexical F1, ROUGE, and FActScore.
How lenient is that judge in practice? Penfield Labs — one team's audit, posted to dev.to in early April 2026, not peer-reviewed — generated intentionally wrong but topically adjacent answers for all 1,540 non-adversarial LoCoMo questions and ran them through the standard judge config. It accepted 62.81% (~63%) of them as correct. The leniency was not uniform: answers with a specific factual error (wrong name, wrong date) were caught about 89% of the time, but vague answers that named the right topic and missed every specific passed roughly two-thirds of the time. That is the mechanism, quantified. The same audit also reported that 6.4% (99 of 1,540) of the answer key itself is score-corrupting — so the lenient judge is grading against a partly-broken key.
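The audit's core measurement, the share of deliberately wrong answers that a judge rejects, can be sketched as follows. Both judges below are toy string-matching stand-ins (hypothetical, not the Mem0 prompt or any real grader), and the three answer pairs are invented for illustration rather than taken from LoCoMo.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (candidate, gold) -> True if accepted

def content_words(text: str) -> set:
    """Lower-cased words longer than three characters, punctuation stripped."""
    cleaned = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}

def topic_judge(candidate: str, gold: str) -> bool:
    """Toy lenient grader: accepts if any content word is shared."""
    return bool(content_words(candidate) & content_words(gold))

def exact_judge(candidate: str, gold: str) -> bool:
    """Toy strict grader: accepts only if the gold answer appears verbatim."""
    return gold.lower() in candidate.lower()

def catch_rate(judge: Judge, wrong_cases: List[Tuple[str, str]]) -> float:
    """Fraction of known-wrong answers that the judge rejects."""
    rejected = sum(1 for cand, gold in wrong_cases if not judge(cand, gold))
    return rejected / len(wrong_cases)

# Invented (candidate, gold) pairs: right topic, missing specifics.
WRONG_CASES = [
    ("She adopted a dog at some point.", "adopted a dog in March 2023"),
    ("He moved to a new city for work.", "moved to Denver"),
    ("They talked about the marathon.", "finished the marathon in 4:12"),
]

lenient = catch_rate(topic_judge, WRONG_CASES)
strict = catch_rate(exact_judge, WRONG_CASES)
print(f"lenient catch rate: {lenient:.2f}, strict catch rate: {strict:.2f}")
```

The verbatim-match judge would be far too harsh in real use, since it rejects correct paraphrases. The point of the sketch is only that catch rate is measured on known-wrong answers, separately from accuracy on right ones.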
And the number is not even stable. Harness choices swing the same system by roughly 9 to 26 points:
- The Mem0 paper reported Zep at 65.99% ± 0.16. Zep re-ran with a corrected harness — fixing role assignment, timestamp handling, and search parallelism — and reported 75.14% ± 0.17, about a 9-point move.
- Mem0's counter-analysis argues that Zep's separate "84%" claim falls to 58.44% once the adversarial category is handled correctly and ten runs are averaged instead of one.
The lesson is not "someone cheated." Both sides are arguing in good faith about implementation details, and both are right that those details move the score. The lesson is that the number is harness-dependent — a property of the test rig, not of the memory system. Hindsight puts the general case plainly: small changes to the judge prompt, the generation prompt, or the judge model "can swing accuracy scores by double digits." And because vendors self-report on their own judge and harness — ByteRover, for instance, reports 92.2% using a different judge model (Gemini-3-Flash) and a different prompt (Hindsight's) — there is no single, directly comparable leaderboard across them. For perspective, the systems in the Mem0 paper's own table span ~48% to ~73%; but the ~73% ceiling is a plain full-context baseline, not a memory system — Mem0 itself sits near 67%. That ~24-point spread is the same order of magnitude as the harness-induced swings.
A single absolute memory score, then, is the weakest signal you can rely on.
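One way to make that harness-dependence explicit is to store every score together with the rig that produced it and refuse to compare across rigs. The sketch below is an illustration under assumptions: the `JudgedScore` record and `comparable` helper are hypothetical names, the harness labels are descriptive strings of my own, and fields left as `None` are simply not stated in this article.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class JudgedScore:
    system: str
    score_pct: float
    harness: str
    judge_model: Optional[str] = None   # None = not stated in this article
    judge_prompt: Optional[str] = None  # None = not stated in this article

def comparable(a: JudgedScore, b: JudgedScore) -> bool:
    """True only if judge model, judge prompt, and harness are known and identical."""
    rig_a = (a.judge_model, a.judge_prompt, a.harness)
    rig_b = (b.judge_model, b.judge_prompt, b.harness)
    return None not in rig_a and rig_a == rig_b

zep_in_mem0_paper = JudgedScore(
    "Zep", 65.99, "Mem0 paper", "gpt-4o-mini", "Mem0 Appendix A"
)
zep_rerun = JudgedScore("Zep", 75.14, "Zep corrected re-run")
byterover = JudgedScore(
    "ByteRover", 92.2, "ByteRover self-report", "Gemini-3-Flash", "Hindsight"
)

# Same system, different harness: the gap is a property of the rig.
swing = round(zep_rerun.score_pct - zep_in_mem0_paper.score_pct, 2)
print(f"Zep swing across harnesses: {swing} points")
print(comparable(zep_in_mem0_paper, zep_rerun))   # different rigs
print(comparable(zep_in_mem0_paper, byterover))   # different rigs
```

Treating an unknown field as non-comparable is a deliberately conservative choice: a score whose judge or prompt is unstated cannot be lined up against anything.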
What a trustworthy judge looks like
None of this means the judge is useless — it means a naked absolute score, ungrounded and uncalibrated, is. The fixes are unglamorous, well-evidenced, and mostly about refusing to trust one number:
- Ground it in a reference. Reference-guided grading is the single most effective fix for "right topic, wrong specifics." In MT-Bench, math-grading failures fell from 14/20 to 3/20 (~79% fewer) once a reference solution was supplied. Reference quality can even beat judge strength: a weaker judge with good human references outperformed GPT-4o with synthetic ones.
- Check atomic facts, not topics. FActScore decomposes a generation into atomic facts and scores the fraction supported — the concrete antidote to topic-matching. It exposed ChatGPT biographies as only 58% factually supported despite reading fluently.
- Prefer pairwise — but know the limits. Pairwise comparison beats absolute scoring on subjective tasks, though the gap nearly vanishes on factual consistency (0.47 vs 0.46), so it is not a universal fix.
- Use a jury of disjoint families. PoLL (Command-R + Claude-Haiku + GPT-3.5) beat a single GPT-4 judge on agreement-with-humans while being over seven times cheaper, with less intra-model bias.
- Calibrate with Cohen's kappa, not raw accuracy. Kappa adjusts for chance — a judge at 0.9 accuracy and 0.1 kappa is essentially guessing. The practice: 200–500 hand-labeled traces, re-calibrated as models drift.
- Use binary labels and adversarial cases. A sharp accept/reject exposes leniency that a 1–5 scale hides, and a held-out set of plausible-but-wrong answers measures the catch-rate — the number that actually tells you the judge works.
- Lower temperature — as a tradeoff. Low temperature improves consistency, not reliability; a single repeatable sample can still be repeatably wrong, and even T=0 is not truly deterministic. Not a free win.
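The kappa point in the list above can be made concrete with a short sketch. It implements the standard two-rater Cohen's kappa for binary labels from scratch; the 100 labels are invented so that the judge lands at exactly 0.9 raw accuracy while saying yes to nearly everything.

```python
from typing import List

def cohens_kappa(judge: List[int], human: List[int]) -> float:
    """Cohen's kappa for two binary raters (1 = CORRECT, 0 = WRONG)."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge_yes = sum(judge) / n
    p_human_yes = sum(human) / n
    # Agreement expected by chance if both raters labelled independently.
    expected = p_judge_yes * p_human_yes + (1 - p_judge_yes) * (1 - p_human_yes)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)

# Invented gold set: 90 answers humans marked CORRECT, 10 marked WRONG.
human = [1] * 90 + [0] * 10
# Invented lenient judge: rejects one valid answer and only one invalid answer.
judge = [1] * 89 + [0] * 1 + [1] * 9 + [0] * 1

accuracy = sum(j == h for j, h in zip(judge, human)) / len(human)
kappa = cohens_kappa(judge, human)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```

On this invented data the judge agrees with humans 90% of the time, yet kappa comes out near 0.14, because a rater who says yes 98% of the time would agree with a 90%-positive gold set almost as often by chance.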
The shift, in one table:
| | Lenient default | Trustworthy judge |
|---|---|---|
| Decision rule | "touches the same topic" | named facts checked against a reference |
| Granularity | 1–5 score | binary accept/reject |
| Validation | raw accuracy | Cohen's kappa vs a human gold set |
| Composition | one model | jury of disjoint families |
| Test set | valid answers only | adversarial wrong-but-topical cases |
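Two rows of the table, the decision rule and the composition, can be sketched together. This is a toy under stated assumptions: `supported_fraction` uses substring matching as a stand-in for whatever model or human check decides that a fact is supported, it checks hand-written reference facts against the answer rather than decomposing the generation the way FActScore does, and the three juror verdicts are hard-coded placeholders rather than outputs of real models.

```python
from typing import Dict, List

def supported_fraction(answer: str, required_facts: List[str]) -> float:
    """Share of required reference facts found in the answer (toy substring check)."""
    text = answer.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)

def majority_verdict(verdicts: Dict[str, str]) -> str:
    """CORRECT only on a strict majority of jurors; ties fall to WRONG."""
    correct_votes = sum(1 for v in verdicts.values() if v == "CORRECT")
    return "CORRECT" if correct_votes > len(verdicts) / 2 else "WRONG"

# Hand-written facts a correct answer must contain (hypothetical example).
REQUIRED = ["denver", "2021"]

vague = "He moved to a new city for work."
specific = "He moved to Denver in 2021 for work."

print(supported_fraction(vague, REQUIRED))     # 0.0
print(supported_fraction(specific, REQUIRED))  # 1.0

# Placeholder verdicts from three hypothetical judges of different model families.
jury = {"judge_a": "CORRECT", "judge_b": "WRONG", "judge_c": "WRONG"}
print(majority_verdict(jury))                  # WRONG
```

Resolving ties to WRONG is a design choice made here to push against the judge's known lean toward yes; an odd-sized jury avoids ties altogether.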
The takeaway
The judge is not lying. On average it tracks human judgment well. But the average is the wrong question when the error has a direction, and the direction is yes. That is why a memory score is only as honest as the prompt, the harness, and the date attached to it, and why "evaluate the response and the memory," the theme of the companion piece, is two measurements, not one.
It is also why, when we report on benchmarks, the number always travels with its config, date, and limits. A score without that context is not a result. It is a judge being generous.
Common questions
What is LLM-as-a-judge? It is using one LLM to grade another LLM's output — the basis of most "quality scores" today. On average it tracks human judgment well: an MT-Bench GPT-4 judge agreed with humans about 85% of the time, near the 81% at which humans agree with each other.
What is LLM judge bias and leniency? The judge's error has a direction: it says yes far more readily than no. Beyond Consensus measured a ~96% true-positive rate on valid answers but a true-negative rate under 25% on invalid ones. The authors call this over-acceptance agreeableness bias.
Why is the LLM-as-a-judge so lenient? Leniency clusters around one mechanism — "right topic, wrong specifics": an answer that names the right subject while missing every detail tends to pass. The Mem0 grader is told to "be generous — as long as it touches on the same topic as the gold answer, it should be counted as CORRECT."
How does the MT-Bench LLM judge perform? MT-Bench found a GPT-4 judge agreed with humans ~85% of the time, but exposed failure modes: it keeps its verdict only 65% of the time when answer order swaps, and a padding attack fooled weaker judges 91.3% of the time. Reference-guided grading cut math failures from 14/20 to 3/20.
How lenient is the LoCoMo judge? Penfield Labs (one team's April 2026 audit, not peer-reviewed) ran intentionally wrong-but-topically-adjacent answers for all 1,540 non-adversarial LoCoMo questions through the standard judge; it accepted 62.81% (~63%) as correct. Specific factual errors were caught ~89% of the time.
Can you trust a single LLM-judge memory benchmark number? No — a single absolute score is the weakest signal. The number is harness-dependent: Mem0 reported Zep at 65.99%, Zep's corrected harness reported 75.14% — about a 9-point move. Vendors self-report on their own judge and harness, so there is no directly comparable leaderboard.
Sources
- Methodology — MT-Bench, Zheng et al., NeurIPS 2023 (HTML body); G-Eval, Liu et al., 2023; Constitutional AI / RLAIF, Bai et al., 2022; survey From Generation to Judgment, Li et al., EMNLP 2025
- Biases & leniency — Beyond Consensus (agreeableness bias), Jain et al., 2025; When LLM Judges Inflate Scores (relevance over-rating, open-weight), Yu et al., 2026; Can You Trick the Grader? (persuasion), Hwang et al., 2025; Context Over Content (evaluation faking), Gupta et al., 2026; Reliability of LLM-as-a-Judge, Schroeder & Wood-Doughty, 2024
- Memory-benchmark connection — LoCoMo, Maharana et al., ACL 2024; Mem0, Chhikara et al., 2025 (PDF / Appendix A prompt); Penfield Labs LoCoMo audit (one team, April 2026, not peer-reviewed); Zep harness re-run and Mem0 counter-analysis; Hindsight benchmark manifesto; ByteRover benchmark
- Mitigations — reference-guided grading (MT-Bench); reference quality over judge strength; pairwise vs absolute, kappa calibration, binary labels (eugeneyan); PoLL multi-judge jury; FActScore atomic decomposition
- Tooling note — OpenAI hosted Evals deprecation (the open-source framework remains active); migration path is Promptfoo, which OpenAI acquired on 2026-03-09 (terms undisclosed)
Related
- LangChain / LangSmith Evaluation: What It Measures — and the One Thing It Can't — the companion: evaluate the response and the memory
- AI Agent Memory: The 2026 Landscape — where these benchmarks sit in the wider field
- Benchmarks — how the Mnemoverse memory engine reports, config and limits attached
Edward Izgorodin, June 2026