# Benchmarks
Evaluation results from the Mnemoverse memory engine on established benchmarks. All runs are automated, configurations are documented, and raw results are committed as JSON in the repository.
Interactive dashboard: graph.mnemoverse.com/#benchmarks — live leaderboard with detailed per-question breakdowns.
## LoCoMo (Primary Benchmark)
LoCoMo evaluates conversational memory across 10 multi-session dialogues (~5,900 turns, 1,986 QA items). Questions span four categories: single-hop factual recall, temporal reasoning, multi-hop inference, and open-domain knowledge.
We use LLM-as-Judge scoring (J score: 0.0 / 0.5 / 1.0), plus token-level F1 and retrieval recall.
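Token-level F1 here is the standard SQuAD-style overlap metric. A minimal sketch of how it is typically computed (the engine's exact tokenization and normalization may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Standard SQuAD-style formulation; shown for illustration, not the
    repository's exact implementation.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

token_f1("the cat sat", "cat sat")  # 0.8: full recall, 2/3 precision
```

The multiset intersection is what makes verbose answers score partial credit instead of zero, which matters for interpreting the low overall F1 alongside a high J score.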
### Results (v0.7, February 2026)
| Category | N | J Score | Token F1 | Retrieval Recall |
|---|---|---|---|---|
| single-hop | 841 | 0.937 | — | — |
| temporal | 321 | 0.850 | — | — |
| multi-hop | 282 | 0.707 | — | — |
| open-domain | 96 | 0.703 | — | — |
| Overall | 1,986 | 0.862 | 0.236 | 0.897 |
### Leaderboard comparison (from the LoCoMo benchmark)
| System | J Score | single-hop | multi-hop | temporal | open-domain |
|---|---|---|---|---|---|
| MemMachine v0.2 | 0.912 | 0.979 | 0.897 | 0.912 | 0.780 |
| Mnemoverse v0.7 | 0.862 | 0.937 | 0.707 | 0.850 | 0.703 |
| Memobase | 0.758 | — | — | — | — |
| Zep | 0.751 | — | — | — | — |
| Mem0 | 0.669 | — | — | — | — |
Gap to MemMachine: -0.050 overall, largest in multi-hop (-0.190). Multi-hop retrieval is the primary bottleneck.
### Version progression
The engine improved from a retrieval-only baseline to J=0.862 across the v0.1–v0.7 series:
| Version | J Score | Key change |
|---|---|---|
| v0.1 | 0.116 | Retrieval only, no LLM reader |
| v0.2 | 0.702 | Added LLM-as-Judge, basic reader prompt |
| v0.3 | 0.865 | Better embeddings (+11%), iterative retrieval (+61% multi-hop) |
| v0.5 | 0.878 | Graph-based retrieval expansion, reranking |
| v0.7 | 0.862 | Memory decay, feedback learning, scaling instrumentation |
Each version is evaluated on the same 10-conversation dataset. The v0.5→v0.7 regression (0.878→0.862) is within run-to-run noise (std = 0.021).
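A minimal sketch of that noise check, using hypothetical per-conversation J scores (the real values are in the committed JSON results):

```python
import statistics

# Hypothetical per-conversation J scores for v0.7 (10 LoCoMo conversations);
# the real values are committed as JSON in the repository.
j_scores = [0.83, 0.88, 0.85, 0.89, 0.84, 0.87, 0.86, 0.84, 0.88, 0.88]

mean_j = statistics.mean(j_scores)   # 0.862
std_j = statistics.stdev(j_scores)   # sample std across conversations, ~0.021

# The v0.5 -> v0.7 drop is treated as noise when it is smaller than the
# conversation-to-conversation spread.
delta = 0.878 - 0.862                # 0.016
print(f"std={std_j:.3f}, delta={delta:.3f}, within_noise={delta < std_j}")
```

With a spread of ~0.021 across conversations, a 0.016 aggregate shift cannot be distinguished from which conversations happened to be hard.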
## HotpotQA
HotpotQA tests multi-step reasoning with questions requiring evidence from multiple documents. It has two categories: bridge questions, which follow a chain of facts across documents, and comparison questions, which contrast two entities.
### Results (March 2026, 500 questions)
| Category | N | Exact Match | F1 | Retrieval Recall |
|---|---|---|---|---|
| bridge | 404 | — | 75.8% | 100% |
| comparison | 96 | — | 86.9% | 100% |
| Overall | 500 | 63% | 78% | 100% |
Retrieval recall is perfect (100%): the memory engine finds supporting facts for every question. The 37% EM gap lies in answer extraction: the reader LLM has the right evidence but sometimes generates verbose or differently formatted answers.
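This gap is a property of exact-match scoring: a verbose but correct answer fails EM even after normalization. A sketch assuming SQuAD-style normalization rules (the harness's exact rules may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop articles, collapse whitespace. Assumed here for illustration."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

# A verbose but correct answer misses EM because extra words survive
# normalization, even though token F1 would still give partial credit:
exact_match("The answer is Paris.", "Paris")  # False
exact_match("Paris", "paris")                 # True
```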
### Failure analysis
| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 424 | 84.8% |
| Extraction fail (has evidence, wrong answer) | 76 | 15.2% |
| Retrieval miss | 0 | 0% |
Average search time: 740ms (API-bound).
## LongMemEval
LongMemEval evaluates long-term memory across six categories: single-session-user, single-session-assistant, single-session-preference, multi-session, knowledge-update, and temporal-reasoning.
### Results (March 2026, 500 questions)
| Category | N | Retrieval Recall | F1 |
|---|---|---|---|
| knowledge-update | 78 | 1.000 | 0.103 |
| multi-session | 133 | 0.993 | 0.180 |
| single-session-assistant | 56 | 1.000 | 0.142 |
| single-session-preference | 30 | 1.000 | 0.115 |
| single-session-user | 70 | 1.000 | 0.154 |
| temporal-reasoning | 133 | 0.981 | 0.164 |
| Overall | 500 | 0.993 | 0.153 |
Retrieval recall (99.3%) is strong. F1 is low (0.15) due to answer format mismatch — the reader generates verbose explanations where LongMemEval expects concise extractions.
## Scaling Analysis
The v0.7 benchmark includes the first systematic scaling study: per-conversation snapshots of graph topology, quality, and latency as memory grows from 419 to 5,880 atoms (14× growth).
### Graph growth dynamics
| Atoms | Edges | Concepts | Density | Hub ratio |
|---|---|---|---|---|
| 419 | 1,770 | 452 | 1.74% | 6.5× |
| 1,451 | 4,422 | 942 | 1.00% | 8.0× |
| 2,760 | 9,450 | 1,564 | 0.77% | 23.0× |
| 4,123 | 13,186 | 1,945 | 0.70% | 25.0× |
| 5,880 | 20,495 | 2,444 | 0.69% | 32.0× |
Power-law fits:
- Edges vs atoms: E ~ A^0.94 (near-linear, each atom adds ~3.5 edges)
- Concepts vs atoms: C ~ A^0.65 (Heaps' law — vocabulary saturates as new conversations reuse existing concepts)
Density decreases from 1.74% to 0.69% — the graph naturally sparsifies, which is healthy for both memory and query performance.
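Exponents like these can be recovered from the snapshot table with an ordinary least-squares fit on log-log axes. A sketch over the five snapshots above (the published fit presumably uses every per-conversation snapshot, so this coarser estimate lands slightly below 0.94):

```python
import math

# (atoms, edges) snapshots from the scaling table above
atoms = [419, 1451, 2760, 4123, 5880]
edges = [1770, 4422, 9450, 13186, 20495]

# Fit E ~ A^alpha by least squares on log E = alpha * log A + c
xs = [math.log(a) for a in atoms]
ys = [math.log(e) for e in edges]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"alpha = {alpha:.2f}")  # near-linear edge growth
```

An exponent near 1 means edges grow roughly in proportion to atoms (~3.5 per atom), while the sublinear concept exponent (0.65) is why density falls as the graph grows.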
### Quality does not degrade with scale
| Metric | At 419 atoms | At 5,880 atoms | Trend |
|---|---|---|---|
| J Score | 0.845 | 0.862 | Flat (0.865 ± 0.021) |
| Retrieval Recall | 0.885 | 0.897 | Flat (0.897 ± 0.035) |
| Search P90 | 706ms | 691ms | Flat (API-bound) |
The quality variance across conversations (std=0.021) is larger than any trend — conversation difficulty dominates, not memory size. At 14× growth, the engine maintains stable quality.
### Hub emergence
Hub ratio (P90 degree / P10 degree) grows from 6.5× to 32×, indicating scale-free network formation. The top 5.6% of concepts accumulate 50+ edges while 22.5% have only 1-2 connections — a power-law degree distribution consistent with real-world knowledge networks.
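A sketch of the hub-ratio computation on a hypothetical degree sample, using a nearest-rank percentile convention (the engine's percentile convention may differ):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile on sorted values (one common convention)."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
    return s[idx]

def hub_ratio(degrees: list[int]) -> float:
    """P90 / P10 of the concept degree distribution, as defined above."""
    return percentile(degrees, 90) / percentile(degrees, 10)

# Illustrative (hypothetical) degree sample: one hub, many leaves.
degrees = [1, 1, 2, 2, 2, 3, 3, 4, 6, 64]
print(hub_ratio(degrees))  # 6.0 on this sample
```

Because both P90 and P10 ignore the extreme tails, a rising ratio reflects a genuinely widening degree distribution rather than a single outlier hub.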
### Latency breakdown
| Percentile | Time | Bottleneck |
|---|---|---|
| P50 (episodic hit) | 0.04ms | Hash lookup |
| P90 (semantic search) | 691ms | API round-trip |
| P99 | 809ms | API + reranking |
Graph operations themselves take <1ms. Latency is dominated by external API calls for embeddings.
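The bimodal latencies suggest a two-tier lookup: exact episodic repeats are served from an in-memory hash, and everything else falls through to API-bound semantic search. A hypothetical sketch of that pattern (class and method names are illustrative, not the engine's actual API):

```python
from typing import Callable

class TwoTierMemory:
    """Illustrative two-tier lookup: an in-memory episodic map answers
    exact repeats in microseconds; misses fall through to a slow,
    API-bound semantic search. Structure is hypothetical."""

    def __init__(self, semantic_search: Callable[[str], list[str]]):
        self._episodic: dict[str, list[str]] = {}
        self._semantic_search = semantic_search

    def search(self, query: str) -> list[str]:
        if query in self._episodic:
            # P50 path: hash lookup, sub-millisecond
            return self._episodic[query]
        # P90 path: embedding API round-trip dominates latency
        results = self._semantic_search(query)
        self._episodic[query] = results
        return results

# Usage: the lambda stands in for the embedding + vector-search call
mem = TwoTierMemory(lambda q: [f"atom matching {q!r}"])
mem.search("trip to Kyoto")  # slow path, populates the episodic map
mem.search("trip to Kyoto")  # fast path, served from the hash
```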
## LoCoMo Multi-Hop Deep Dive
Multi-hop is the hardest category (J=0.707) and the primary gap to MemMachine (J=0.897). A focused run on 43 multi-hop questions reveals the failure modes:
| Outcome | Count | % |
|---|---|---|
| Success (F1 ≥ 0.3) | 16 | 37.2% |
| Partial evidence (found some, not all) | 12 | 27.9% |
| Extraction fail (right evidence, wrong answer) | 8 | 18.6% |
| Retrieval miss (wrong documents) | 7 | 16.3% |
Retrieval recall: 62.2%. The multi-pass retrieval strategy finds relevant context most of the time, but the evidence is spread across multiple atoms and the reader struggles to synthesize it into a single answer.
## Ongoing work
- Local reranking to remove API latency dependency
- SLoD-enabled retrieval — coarse-to-fine search using prototype hierarchies
- Multi-run variance — 10+ runs per configuration for statistical rigor
- Reader prompt optimization — targeting the extraction gap (15-60% of errors are correct retrieval with wrong answer)
## Related
- Building Memory That Scales — the story behind these numbers
- Interactive dashboard — explore the benchmark data visually
- Design Language — how the 3D visualization maps to engine concepts