Building Memory That Scales
The Journey from 0.116 to 0.862
An AI that remembers which colleague recommended the Python library, and what other projects they contributed to. That tracks how the team's opinion on microservices changed after the March outage. Not keyword search. Actual memory across hundreds of sessions.
The current version of the memory engine, the one with benchmarks and a leaderboard position, was built over four months in early 2026. But the idea is older. The first prototype dates to early 2023, right after ChatGPT appeared: a knowledge base built on FAQ entries stored in SQLite, tagged with entities extracted by spaCy. No embeddings, no cosine similarity; embedding models at the time were too weak for production semantic search. It was keyword matching with NER on top.
The approach has changed completely since then. The goal has not: give AI agents the ability to remember what worked, forget what didn't, and build an internal map of what matters.
Seven versions of the current engine later, the system scores 0.862 on the LoCoMo benchmark (1,986 questions across 10 multi-session conversations), places second on the leaderboard, and maintains that quality as memory grows 14× in size.
What does 0.862 mean in practice? It means the engine can recall details from arbitrarily long conversation histories (hundreds of sessions, weeks of interaction) and answer questions about them correctly 86% of the time. Not just simple lookups ("what's Caroline's email?") but queries that require real cognitive work:
- Multi-hop: "Which project did the person who recommended the Python library also contribute to?" The engine finds the library recommendation, identifies who made it, then finds their other projects. Three separate memories, chained together.
- Temporal: "Did the team's opinion on microservices change after the outage in March?" The engine finds opinions expressed before and after a specific event, then surfaces both so they can be compared.
These are the kinds of questions a colleague answers effortlessly after months of working with you, and that every AI assistant today fails at, because it cannot remember last Tuesday.
This is the story of how we got there, and of the 3D visualization tool that became our most important research instrument along the way.
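To make the multi-hop mechanics concrete, here is a toy sketch of evidence chaining. The `Memory` and `ToyStore` classes and the keyword-overlap search are illustrative stand-ins for the engine's actual vector-and-graph retrieval, not its real API:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    entities: set  # named entities mentioned in this memory

class ToyStore:
    """Keyword-overlap store standing in for vector + graph retrieval."""
    def __init__(self, memories):
        self.memories = memories

    def search(self, terms, exclude):
        # Return memories sharing at least one entity with the query terms.
        return [m for m in self.memories
                if m.entities & terms and m not in exclude]

def multi_hop(store, seed_terms, hops=3):
    """Each hop retrieves memories; their entities seed the next hop's query."""
    terms, evidence = set(seed_terms), []
    for _ in range(hops):
        hits = store.search(terms, exclude=evidence)
        if not hits:
            break
        evidence.extend(hits)
        for m in hits:
            terms |= m.entities  # expand the query with newly surfaced entities
    return evidence

store = ToyStore([
    Memory("Dana recommended the requests library", {"dana", "requests"}),
    Memory("Dana contributed to the billing service", {"dana", "billing"}),
    Memory("The billing service uses Postgres", {"billing", "postgres"}),
])
chain = multi_hop(store, {"requests"})
# Hop 1 finds the recommendation, hop 2 pivots on "dana", hop 3 on "billing".
```

The point of the sketch is the pivot: memory 1 and memory 3 share no terms at all, yet the chain reaches both.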
Starting from nothing
Version 0.1 had no intelligence. It stored memories as vectors, retrieved by cosine similarity, and returned raw results without any processing. No LLM reader, no graph, no consolidation. J score: 0.116. Barely better than random.
We needed a baseline to measure against, and 0.116 was it. Everything that followed was an attempt to understand why retrieval alone wasn't enough.
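For the curious, v0.1 in spirit fits in a few lines: embed, rank by cosine similarity, return the raw top-k. This is a minimal sketch with toy vectors, not the production code:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, memories, k=3):
    """v0.1 in essence: rank stored vectors by cosine similarity and
    return the raw top-k. No reader, no graph, no consolidation."""
    order = sorted(range(len(memories)),
                   key=lambda i: cosine(query, memories[i]),
                   reverse=True)
    return order[:k]

memories = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
top = retrieve((1.0, 0.05), memories, k=2)  # nearest two memory indices
```

Everything after v0.1 is, in one way or another, an answer to what this loop cannot do.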
(A note on early measurements: versions 0.1-0.2 had a data leak β ground-truth evidence structure was inadvertently used to build Hebbian connections during evaluation. The graph didn't see answers directly, but it learned which memories were relevant to each question, giving retrieval an unfair structural advantage. This was discovered and removed before v0.3. All numbers from v0.3 onward use blind feedback only.)
The embedding breakthrough (v0.2 → v0.3)
The first real jump came from a single change: better embeddings.
| Version | J Score | What changed |
|---|---|---|
| v0.2 | 0.776 | Added LLM reader with prompt tuning |
| v0.3 | 0.865 | Better embedding model |
That's +11% from switching the embedding model. The old model (2021 vintage, 384 dimensions) simply couldn't distinguish between "Caroline prefers morning meetings" and "the team discussed scheduling options." The new model could. Multi-hop accuracy (questions requiring evidence from multiple memories) jumped from 0.450 to 0.723, a 61% improvement from one change.
This taught us something we should have known earlier: the quality of the embedding model dominates everything downstream. Better retrieval is worth more than clever post-processing.
Graph-based retrieval (v0.5)
Version 0.5 added the Hebbian knowledge graph: connections between memories that strengthen when memories are retrieved together and lead to good outcomes.
| Version | J Score | Multi-hop |
|---|---|---|
| v0.3 | 0.865 | 0.723 |
| v0.5 | 0.878 | 0.770 |
The overall improvement was modest (+1.5%), but multi-hop went from 0.723 to 0.770. The graph found evidence chains that cosine similarity alone couldn't: memory A links to memory B through a shared concept, even though A and B aren't directly similar.
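The strengthening rule can be sketched as follows. The learning rate, decay constant, and saturating update are illustrative assumptions for exposition, not the engine's actual parameters:

```python
from itertools import combinations

def hebbian_update(weights, retrieved, outcome_good, lr=0.1, decay=0.01):
    """weights: dict[(a, b) -> float] with a < b.
    retrieved: ids of memories co-activated on this query."""
    # Passive decay: every edge slowly weakens unless reinforced.
    for edge in weights:
        weights[edge] *= (1 - decay)
    if outcome_good:
        # Memories that fire together, wire together: strengthen every
        # co-activated pair, saturating toward a weight of 1.0.
        for a, b in combinations(sorted(retrieved), 2):
            w = weights.get((a, b), 0.0)
            weights[(a, b)] = w + lr * (1.0 - w)
    return weights

w = {}
hebbian_update(w, retrieved=[3, 7, 12], outcome_good=True)
# Three edges now exist: (3, 7), (3, 12), (7, 12), each at weight 0.1.
```

With blind feedback (v0.3 onward), `outcome_good` comes only from whether the final answer was judged correct, never from ground-truth evidence labels.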
Testing at scale (v0.7)
The question we kept avoiding: does this hold up as memory grows?
Version 0.7 added per-conversation scaling instrumentation. After processing all 10 conversations, the engine held 5,880 memories, fourteen times more than after the first conversation (419).
The results were better than we expected.
Quality stays flat
| Memory size | J Score | Retrieval Recall |
|---|---|---|
| 419 atoms | 0.845 | 0.885 |
| 1,451 atoms | 0.867 | 0.909 |
| 2,760 atoms | 0.871 | 0.897 |
| 4,123 atoms | 0.869 | 0.901 |
| 5,880 atoms | 0.862 | 0.897 |
J score variance across conversations (std=0.021) is larger than any trend with memory size. The difficulty of individual conversations matters more than how much the engine remembers. At 14× growth, quality is indistinguishable from where it started.
The graph self-organises
As memory grows, the graph develops structure without being told to:
- Density drops from 1.74% to 0.69%. The graph becomes more selective, not more noisy. Each new memory adds connections only where they matter.
- Hub ratio grows from 6× to 32×. A small number of concepts accumulate many connections while most concepts have only one or two. This is the power-law degree distribution seen in real-world knowledge networks: Wikipedia articles, citation graphs, social networks.
- 57% of edges are never accessed after creation. Most connections form once and are never reinforced. Only 4% of edges carry meaningful weight (>0.2). The graph naturally separates signal from noise.
These are emergent properties. We didn't program sparsification or hub formation. They arise from the combination of co-activation strengthening and the structure of real conversation data.
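For readers who want to compute the same statistics on their own graphs, here is a minimal sketch. Taking hub ratio as maximum degree over mean degree is one reasonable definition for illustration; the engine's exact formula may differ:

```python
from collections import Counter

def graph_stats(n_nodes, edges):
    """Density: fraction of possible undirected edges that exist.
    Hub ratio: max node degree divided by mean node degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    possible = n_nodes * (n_nodes - 1) / 2
    density = len(edges) / possible
    mean_deg = sum(degree.values()) / n_nodes
    hub_ratio = max(degree.values()) / mean_deg
    return density, hub_ratio

# Toy graph: node 0 is a hub connected to five nodes, plus one stray edge.
edges = [(0, i) for i in range(1, 6)] + [(6, 7)]
density, hub = graph_stats(8, edges)
# density = 6/28 ≈ 0.214; hub ratio = 5 / 1.5 ≈ 3.3
```

On the real graph, density falling while hub ratio climbs is exactly the signature of the power-law structure described above.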
Seeing it happen: the visualization dashboard
Numbers in tables are useful. But the moment we truly understood how memory behaves was when we built a 3D visualization of the graph and watched it grow in real time.
graph.mnemoverse.com renders the memory graph as an interactive 3D force-directed layout:
- Node size shows importance: how novel the memory was when it entered
- Node color shows category: what kind of knowledge it represents
- Node shape shows outcome: did this memory lead to good results (sphere), bad results (octahedron), or unknown (cube)?
- Edge thickness shows connection strength: how often two memories are retrieved together successfully
- Orange rings flag memories whose internal representation has drifted from their original encoding
- Glowing nodes are prototypes: consolidated summaries created during sleep cycles
Walk through the graph after a 10-conversation benchmark run and you see the scaling story play out visually. Hub nodes, the concepts that connect many memories, sit at the center, pulling related clusters toward them. Peripheral nodes with weak connections float at the edges. The 57% of dead edges are visible as thin, barely-there lines connecting nodes that never co-activated.
This is a Euclidean projection of what is conceptually a hyperbolic structure. Like a Mercator map, it distorts: nodes near the "boundary" of the knowledge space appear closer together than they should. But it gives an intuition that no table of numbers can: memory has spatial structure, and that structure tells you something about what the agent knows.
The dashboard became our primary research tool. Before running benchmarks, we check the graph visually. After a benchmark, we look for new clusters, orphaned nodes, suspiciously large hubs. We have caught bugs this way: a memory that shouldn't have connected to anything showing up at the centre of a cluster. No automated test would have flagged it.
Where we stand on the leaderboard
The LoCoMo leaderboard provides context for these numbers:
| System | J Score |
|---|---|
| MemMachine v0.2 | 0.912 |
| Mnemoverse v0.7 | 0.862 |
| Memobase | 0.758 |
| Zep | 0.751 |
| Mem0 | 0.669 |
We are second. The gap to first place is 5 percentage points, concentrated in multi-hop reasoning (our 0.707 vs their 0.897). Single-hop and temporal categories are competitive.
We also evaluated on two other benchmarks:
| Benchmark | Key result |
|---|---|
| HotpotQA (500 questions) | F1=78%, retrieval recall=100% |
| LongMemEval (500 questions) | Retrieval recall=99.3% |
Retrieval recall is consistently high across all benchmarks (89-100%). The gap between recall and final accuracy comes from answer extraction: the reader has the right evidence but doesn't always produce concise answers in the expected format. This is our primary area of active work.
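For reference, the token-overlap F1 used by SQuAD-style benchmarks such as HotpotQA looks roughly like this. It is a sketch of the standard metric, not our evaluation harness; official scorers additionally normalise articles and punctuation:

```python
from collections import Counter

def token_f1(prediction, gold):
    """SQuAD-style token-overlap F1 between a predicted and a gold answer."""
    pred = prediction.lower().split()
    gold_t = gold.lower().split()
    common = Counter(pred) & Counter(gold_t)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold_t)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the billing service", "billing service")
# Correct evidence, one extra token: precision 2/3, recall 1, F1 = 0.8
```

This is why verbose-but-correct answers bleed F1 even when retrieval recall is 100%: every extra token in the prediction lowers precision.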
What we haven't solved
Honest accounting of limitations:
Multi-hop is the bottleneck. Questions that require chaining evidence across multiple memories are where we lose to the top system. Our graph expansion helps, but not enough.
Benchmark data caps out at 6K memories. The 5,880 atoms represent all 10 LoCoMo conversations combined, the largest publicly available conversational memory dataset we have found. The engine itself can handle more, but we lack a benchmark with 10× more data to validate quality at that scale. Building one is planned.
Consolidation is disabled in production. The sleep-cycle mechanism that compresses episodic memories into prototypes works on test data but triggers a platform-specific threading issue that we haven't resolved. We ship without it, which means memory only grows, never compresses.
The visualization is Euclidean. The mathematical foundation uses hyperbolic geometry, but the dashboard renders in 3D Euclidean space. It's useful for intuition and bug-catching, but it doesn't faithfully represent the hyperbolic distances. Building a true hyperbolic visualization is future work.
Hyperbolic embeddings are the missing piece. Better embeddings transformed retrieval quality (+11% from one model swap). We expect the same effect for geometric placement: positioning memories in the Poincaré ball according to their hierarchical relationships. But current approaches project Euclidean embeddings onto the hyperbolic manifold, and the projection is noisy. Native hyperbolic embedding models are not yet mature enough for production. Until they are, the geometric layer operates on imprecise coordinates.
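For intuition about the distortion mentioned above, the closed-form geodesic distance in the Poincaré ball shows how the same Euclidean gap costs far more hyperbolic distance near the boundary (a minimal sketch):

```python
from math import acosh

def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball (all points must have norm < 1):
    d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return acosh(1 + 2 * diff2 / ((1 - nu) * (1 - nv)))

# Same Euclidean gap of 0.1, at the centre versus near the boundary:
d_center = poincare_distance((0.0, 0.0), (0.1, 0.0))
d_edge = poincare_distance((0.8, 0.0), (0.9, 0.0))
# d_edge is several times larger than d_center.
```

This is exactly the Mercator-style distortion the 3D dashboard cannot render: two nodes near the boundary that look adjacent on screen may be far apart in the geometry the engine reasons over.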
Related
- Detailed benchmark tables: full results across LoCoMo, HotpotQA, LongMemEval
- Design Language: how each visual element maps to an engine concept
- When AI Cites What Doesn't Exist: why persistent memory needs verification
- AI Memory Landscape 2026: where Mnemoverse fits in the market
Eduard Izgorodin, April 2026 · LinkedIn