Benchmarks

Evaluation results from the Mnemoverse memory engine on established benchmarks. All runs are automated, configurations are documented, and raw results are committed as JSON in the repository.

Interactive dashboard: graph.mnemoverse.com/#benchmarks — live leaderboard with detailed per-question breakdowns.

LoCoMo (Primary Benchmark)

LoCoMo evaluates conversational memory across 10 multi-session dialogues (~5,900 turns, 1,986 QA items). Questions span four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge.

We use LLM-as-Judge scoring (J score: 0.0 / 0.5 / 1.0), plus token-level F1 and retrieval recall.
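
The exact scoring code isn't shown here, but token-level F1 is conventionally computed SQuAD-style. A minimal sketch, assuming whitespace tokenization and lowercasing (the actual evaluation may normalize differently):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

token_f1("melanie went hiking yesterday", "melanie went hiking")  # ≈ 0.857
```

This definition penalizes verbose answers through the precision term, which is why F1 can sit well below the J score even when retrieval succeeds.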

Results (v0.7, February 2026)

| Category | N | J Score | Token F1 | Retrieval Recall |
|---|---|---|---|---|
| single-hop | 841 | 0.937 | — | — |
| temporal | 321 | 0.850 | — | — |
| multi-hop | 282 | 0.707 | — | — |
| open-domain | 96 | 0.703 | — | — |
| Overall | 1,986 | 0.862 | 0.236 | 0.897 |

Leaderboard comparison (from LoCoMo benchmark)

| System | J Score | single-hop | multi-hop | temporal | open-domain |
|---|---|---|---|---|---|
| MemMachine v0.2 | 0.912 | 0.979 | 0.897 | 0.912 | 0.780 |
| Mnemoverse v0.7 | 0.862 | 0.937 | 0.707 | 0.850 | 0.703 |
| Memobase | 0.758 | — | — | — | — |
| Zep | 0.751 | — | — | — | — |
| Mem0 | 0.669 | — | — | — | — |

Gap to MemMachine: -0.050 overall, largest in multi-hop (-0.190). Multi-hop retrieval is the primary bottleneck.

Version progression

The engine improved from baseline retrieval-only to J=0.862 over seven iterations:

| Version | J Score | Key change |
|---|---|---|
| v0.1 | 0.116 | Retrieval only, no LLM reader |
| v0.2 | 0.702 | Added LLM-as-Judge, basic reader prompt |
| v0.3 | 0.865 | Better embeddings (+11%), iterative retrieval (+61% multi-hop) |
| v0.5 | 0.878 | Graph-based retrieval expansion, reranking |
| v0.7 | 0.862 | Memory decay, feedback learning, scaling instrumentation |

Each version is evaluated on the same 10-conversation dataset. The v0.5→v0.7 regression is within noise (std = 0.021).


HotpotQA

HotpotQA tests multi-step reasoning with questions requiring evidence from multiple documents. Questions fall into two categories: bridge questions, which follow a chain of facts across documents, and comparison questions, which compare two entities.

Results (March 2026, 500 questions)

| Category | N | Exact Match | F1 | Retrieval Recall |
|---|---|---|---|---|
| bridge | 404 | — | 75.8% | 100% |
| comparison | 96 | — | 86.9% | 100% |
| Overall | 500 | 63% | 78% | 100% |

Retrieval recall is perfect (100%) — the memory engine finds supporting facts for every question. The 37% EM gap lies in answer extraction: the reader LLM has the right evidence but sometimes generates verbose or differently formatted answers.
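
This gap is largely an artifact of strict exact-match scoring. A minimal sketch of SQuAD-style EM, assuming the usual normalization (lowercasing, stripping punctuation and articles); a verbose but correct answer scores 0 on EM while still earning partial F1:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

exact_match("1937", "1937")                               # True
exact_match("The bridge was completed in 1937.", "1937")  # False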

Failure analysis

| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 424 | 84.8% |
| Extraction fail (has evidence, wrong answer) | 76 | 15.2% |
| Retrieval miss | 0 | 0% |

Average search time: 740ms (API-bound).


LongMemEval

LongMemEval evaluates long-term memory across six categories: three single-session variants (user, assistant, preference), plus multi-session, knowledge-update, and temporal-reasoning.

Results (March 2026, 500 questions)

| Category | N | Retrieval Recall | F1 |
|---|---|---|---|
| knowledge-update | 78 | 1.000 | 0.103 |
| multi-session | 133 | 0.993 | 0.180 |
| single-session-assistant | 56 | 1.000 | 0.142 |
| single-session-preference | 30 | 1.000 | 0.115 |
| single-session-user | 70 | 1.000 | 0.154 |
| temporal-reasoning | 133 | 0.981 | 0.164 |
| Overall | 500 | 0.993 | 0.153 |

Retrieval recall (99.3%) is strong. F1 is low (0.15) due to answer format mismatch — the reader generates verbose explanations where LongMemEval expects concise extractions.


Scaling Analysis

The v0.7 benchmark includes the first systematic scaling study: per-conversation snapshots of graph topology, quality, and latency as memory grows from 419 to 5,880 atoms (14× growth).

Graph growth dynamics

| Atoms | Edges | Concepts | Density | Hub ratio |
|---|---|---|---|---|
| 419 | 1,770 | 452 | 1.74% | 6.5× |
| 1,451 | 4,422 | 942 | 1.00% | 8.0× |
| 2,760 | 9,450 | 1,564 | 0.77% | 23.0× |
| 4,123 | 13,186 | 1,945 | 0.70% | 25.0× |
| 5,880 | 20,495 | 2,444 | 0.69% | 32.0× |

Power-law fits:

  • Edges vs atoms: E ~ A^0.94 (near-linear, each atom adds ~3.5 edges)
  • Concepts vs atoms: C ~ A^0.65 (Heaps' law — vocabulary saturates as new conversations reuse existing concepts)
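
Both exponents can be reproduced (approximately) from the five snapshots in the table above with a least-squares fit in log-log space; the published values presumably come from the full per-conversation series, so they differ slightly:

```python
import numpy as np

# Snapshot counts from the graph-growth table above
atoms    = np.array([419, 1451, 2760, 4123, 5880])
edges    = np.array([1770, 4422, 9450, 13186, 20495])
concepts = np.array([452, 942, 1564, 1945, 2444])

def power_law_exponent(x, y):
    """Fit y ≈ c * x^k by linear least squares in log-log space; return k."""
    k, _ = np.polyfit(np.log(x), np.log(y), 1)
    return k

print(power_law_exponent(atoms, edges))     # near-linear, ~0.9
print(power_law_exponent(atoms, concepts))  # sublinear (Heaps-like), ~0.65
```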

Density decreases from 1.74% to 0.69% — the graph naturally sparsifies, which is healthy for both memory and query performance.

Quality does not degrade with scale

| Metric | At 419 atoms | At 5,880 atoms | Trend |
|---|---|---|---|
| J Score | 0.845 | 0.862 | Flat (0.865 ± 0.021) |
| Retrieval Recall | 0.885 | 0.897 | Flat (0.897 ± 0.035) |
| Search P90 | 706 ms | 691 ms | Flat (API-bound) |

The quality variance across conversations (std=0.021) is larger than any trend — conversation difficulty dominates, not memory size. At 14× growth, the engine maintains stable quality.

Hub emergence

Hub ratio (P90 degree / P10 degree) grows from 6.5× to 32×, indicating scale-free network formation. The top 5.6% of concepts accumulate 50+ edges while 22.5% have only 1-2 connections — a power-law degree distribution consistent with real-world knowledge networks.
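
Taking the definition above (P90 degree over P10 degree), the hub ratio is straightforward to compute from a degree list; a sketch with an illustrative synthetic distribution:

```python
import numpy as np

def hub_ratio(degrees) -> float:
    """P90/P10 of node degree: high values indicate a heavy-tailed,
    hub-dominated degree distribution."""
    d = np.asarray(degrees, dtype=float)
    return float(np.percentile(d, 90) / np.percentile(d, 10))

# Illustrative: ten low-degree concepts plus ten hubs
hub_ratio([1] * 10 + [10] * 10)  # → 10.0
```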

Latency breakdown

| Percentile | Time | Bottleneck |
|---|---|---|
| P50 (episodic hit) | 0.04 ms | Hash lookup |
| P90 (semantic search) | 691 ms | API round-trip |
| P99 | 809 ms | API + reranking |

Graph operations themselves take <1ms. Latency is dominated by external API calls for embeddings.


LoCoMo Multi-Hop Deep Dive

Multi-hop is the hardest category (J=0.707) and the primary gap to MemMachine (J=0.897). A focused run on 43 multi-hop questions reveals the failure modes:

| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 16 | 37.2% |
| Partial evidence (found some, not all) | 12 | 27.9% |
| Extraction fail (right evidence, wrong answer) | 8 | 18.6% |
| Retrieval miss (wrong documents) | 7 | 16.3% |

Retrieval recall: 62.2%. The multi-pass retrieval strategy finds relevant context most of the time, but the evidence is spread across multiple atoms and the reader struggles to synthesize it into a single answer.
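
Because multi-hop questions have several gold evidence atoms, per-question recall is fractional, which is how "partial evidence" outcomes arise. A minimal sketch of one common definition (assumed here; the exact evaluation code isn't shown):

```python
def retrieval_recall(retrieved_ids, gold_evidence_ids) -> float:
    """Fraction of gold evidence items present in the retrieved set."""
    gold = set(gold_evidence_ids)
    if not gold:
        return 1.0  # nothing required, trivially satisfied
    return len(gold & set(retrieved_ids)) / len(gold)

# A "partial evidence" outcome: two of three supporting atoms retrieved
# (atom IDs are hypothetical)
retrieval_recall(["a12", "a47", "a99"], ["a12", "a47", "a63"])  # ≈ 0.667
```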


Ongoing work

  • Local reranking to remove API latency dependency
  • SLoD-enabled retrieval — coarse-to-fine search using prototype hierarchies
  • Multi-run variance — 10+ runs per configuration for statistical rigor
  • Reader prompt optimization — targeting the extraction gap (15-60% of errors are correct retrieval with wrong answer)