Building Memory That Scales
The Journey from 0.116 to 0.862
An AI that remembers which colleague recommended the Python library, and what other projects they contributed to. That tracks how the team's opinion on microservices changed after the March outage. Not keyword search. Actual memory across hundreds of sessions.
The current version of the memory engine, the one with benchmarks and a leaderboard position, was built over four months in early 2026. But the idea is older. The first prototype dates to early 2023, right after ChatGPT appeared: a knowledge base built on FAQ entries stored in SQLite, tagged with entities extracted by spaCy. No embeddings, no cosine similarity; embedding models at the time were too weak for production semantic search. It was keyword matching with NER on top.
The approach has changed completely since then. The goal has not: give AI agents the ability to remember what worked, forget what didn't, and build an internal map of what matters.
Seven versions of the current engine later, the system scores 0.862 on the LoCoMo benchmark (1,986 questions across 10 multi-session conversations), places second on the leaderboard, and maintains that quality as memory grows 14× in size.
What does 0.862 mean in practice? It means the engine can recall details from arbitrarily long conversation histories (hundreds of sessions, weeks of interaction) and answer questions about them correctly 86% of the time. Not just simple lookups ("what's Caroline's email?") but queries that require real cognitive work:
- Multi-hop: "Which project did the person who recommended the Python library also contribute to?" The engine finds the library recommendation, identifies who made it, then finds their other projects. Three separate memories, chained together.
- Temporal: "Did the team's opinion on microservices change after the outage in March?" The engine finds opinions expressed before and after a specific event, then surfaces both so they can be compared.
These are the kinds of questions a colleague answers effortlessly after months of working with you, and that every AI assistant today fails at, because it cannot remember last Tuesday.
This is the story of how we got there, and of the 3D visualization tool that became our most important research instrument along the way.
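To make the multi-hop mechanics concrete, here is a toy sketch of evidence chaining. The `Memory` and `ToyStore` classes and the keyword-overlap search are illustrative stand-ins for the engine's actual vector-and-graph retrieval, not its real API:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    entities: set  # named entities mentioned in this memory

class ToyStore:
    """Keyword-overlap store standing in for vector + graph retrieval."""
    def __init__(self, memories):
        self.memories = memories

    def search(self, terms, exclude):
        # Return memories sharing at least one entity with the query terms.
        return [m for m in self.memories
                if m.entities & terms and m not in exclude]

def multi_hop(store, seed_terms, hops=3):
    """Each hop retrieves memories; their entities seed the next hop's query."""
    terms, evidence = set(seed_terms), []
    for _ in range(hops):
        hits = store.search(terms, exclude=evidence)
        if not hits:
            break
        evidence.extend(hits)
        for m in hits:
            terms |= m.entities  # expand the query with newly surfaced entities
    return evidence

store = ToyStore([
    Memory("Dana recommended the requests library", {"dana", "requests"}),
    Memory("Dana contributed to the billing service", {"dana", "billing"}),
    Memory("The billing service uses Postgres", {"billing", "postgres"}),
])
chain = multi_hop(store, {"requests"})
# Hop 1 finds the recommendation, hop 2 pivots on "dana", hop 3 on "billing".
```

The point of the sketch is the pivot: memory 1 and memory 3 share no terms at all, yet the chain reaches both.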
Starting from nothing
Version 0.1 had no intelligence. It stored memories as vectors, retrieved by cosine similarity, and returned raw results without any processing. No LLM reader, no graph, no consolidation. J score: 0.116. Barely better than random.
We needed a baseline to measure against, and 0.116 was it. Everything that followed was an attempt to understand why retrieval alone wasn't enough.
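For the curious, v0.1 in spirit fits in a few lines: embed, rank by cosine similarity, return the raw top-k. This is a minimal sketch with toy vectors, not the production code:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, memories, k=3):
    """v0.1 in essence: rank stored vectors by cosine similarity and
    return the raw top-k. No reader, no graph, no consolidation."""
    order = sorted(range(len(memories)),
                   key=lambda i: cosine(query, memories[i]),
                   reverse=True)
    return order[:k]

memories = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
top = retrieve((1.0, 0.05), memories, k=2)  # nearest two memory indices
```

Everything after v0.1 is, in one way or another, an answer to what this loop cannot do.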
(A note on early measurements: versions 0.1-0.2 had a data leak β ground-truth evidence structure was inadvertently used to build Hebbian connections during evaluation. The graph didn't see answers directly, but it learned which memories were relevant to each question, giving retrieval an unfair structural advantage. This was discovered and removed before v0.3. All numbers from v0.3 onward use blind feedback only.)
The embedding breakthrough (v0.2 → v0.3)
The first real jump came from a single change: better embeddings.
| Version | J Score | What changed |
|---|---|---|
| v0.2 | 0.776 | Added LLM reader with prompt tuning |
| v0.3 | 0.865 | Better embedding model |
That's +11% from switching the embedding model. The old model (2021 vintage, 384 dimensions) simply couldn't distinguish between "Caroline prefers morning meetings" and "the team discussed scheduling options." The new model could. Multi-hop accuracy (questions requiring evidence from multiple memories) jumped from 0.450 to 0.723, a 61% improvement from one change.
This taught us something we should have known earlier: the quality of the embedding model dominates everything downstream. Better retrieval is worth more than clever post-processing.
Graph-based retrieval (v0.5)
Version 0.5 added the Hebbian knowledge graph: connections between memories that strengthen when memories are retrieved together and lead to good outcomes.
| Version | J Score | Multi-hop |
|---|---|---|
| v0.3 | 0.865 | 0.723 |
| v0.5 | 0.878 | 0.770 |
The overall improvement was modest (+1.5%), but multi-hop went from 0.723 to 0.770. The graph found evidence chains that cosine similarity alone couldn't: memory A links to memory B through a shared concept, even though A and B aren't directly similar.
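The strengthening rule can be sketched as follows. The learning rate, decay constant, and saturating update are illustrative assumptions for exposition, not the engine's actual parameters:

```python
from itertools import combinations

def hebbian_update(weights, retrieved, outcome_good, lr=0.1, decay=0.01):
    """weights: dict[(a, b) -> float] with a < b.
    retrieved: ids of memories co-activated on this query."""
    # Passive decay: every edge slowly weakens unless reinforced.
    for edge in weights:
        weights[edge] *= (1 - decay)
    if outcome_good:
        # Memories that fire together, wire together: strengthen every
        # co-activated pair, saturating toward a weight of 1.0.
        for a, b in combinations(sorted(retrieved), 2):
            w = weights.get((a, b), 0.0)
            weights[(a, b)] = w + lr * (1.0 - w)
    return weights

w = {}
hebbian_update(w, retrieved=[3, 7, 12], outcome_good=True)
# Three edges now exist: (3, 7), (3, 12), (7, 12), each at weight 0.1.
```

With blind feedback (v0.3 onward), `outcome_good` comes only from whether the final answer was judged correct, never from ground-truth evidence labels.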
Testing at scale (v0.7)
The question we kept avoiding: does this hold up as memory grows?
Version 0.7 added per-conversation scaling instrumentation. After processing all 10 conversations, the engine held 5,880 memories, fourteen times more than after the first conversation (419).
The results were better than we expected.
Quality stays flat
| Memory size | J Score | Retrieval Recall |
|---|---|---|
| 419 atoms | 0.845 | 0.885 |
| 1,451 atoms | 0.867 | 0.909 |
| 2,760 atoms | 0.871 | 0.897 |
| 4,123 atoms | 0.869 | 0.901 |
| 5,880 atoms | 0.862 | 0.897 |
J score variance across conversations (std=0.021) is larger than any trend with memory size. The difficulty of individual conversations matters more than how much the engine remembers. At 14× growth, quality is indistinguishable from where it started.
The graph self-organises
As memory grows, the graph develops structure without being told to:
- Density drops from 1.74% to 0.69%. The graph becomes more selective, not more noisy. Each new memory adds connections only where they matter.
- Hub ratio grows from 6× to 32×. A small number of concepts accumulate many connections while most concepts have only one or two. This is the power-law degree distribution seen in real-world knowledge networks: Wikipedia articles, citation graphs, social networks.
- 57% of edges are never accessed after creation. Most connections form once and are never reinforced. Only 4% of edges carry meaningful weight (>0.2). The graph naturally separates signal from noise.
These are emergent properties. We didn't program sparsification or hub formation. They arise from the combination of co-activation strengthening and the structure of real conversation data.
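For readers who want to compute the same statistics on their own graphs, here is a minimal sketch. Taking hub ratio as maximum degree over mean degree is one reasonable definition for illustration; the engine's exact formula may differ:

```python
from collections import Counter

def graph_stats(n_nodes, edges):
    """Density: fraction of possible undirected edges that exist.
    Hub ratio: max node degree divided by mean node degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    possible = n_nodes * (n_nodes - 1) / 2
    density = len(edges) / possible
    mean_deg = sum(degree.values()) / n_nodes
    hub_ratio = max(degree.values()) / mean_deg
    return density, hub_ratio

# Toy graph: node 0 is a hub connected to five nodes, plus one stray edge.
edges = [(0, i) for i in range(1, 6)] + [(6, 7)]
density, hub = graph_stats(8, edges)
# density = 6/28 ≈ 0.214; hub ratio = 5 / 1.5 ≈ 3.3
```

On the real graph, density falling while hub ratio climbs is exactly the signature of the power-law structure described above.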
Seeing it happen: the visualization dashboard
Numbers in tables are useful. But the moment we truly understood how memory behaves was when we built a 3D visualization of the graph and watched it grow in real time.
graph.mnemoverse.com renders the memory graph as an interactive 3D force-directed layout:
- Node size shows importance: how novel the memory was when it entered
- Node color shows category: what kind of knowledge it represents
- Node shape shows outcome: did this memory lead to good results (sphere), bad results (octahedron), or unknown (cube)?
- Edge thickness shows connection strength: how often two memories are retrieved together successfully
- Orange rings flag memories whose internal representation has drifted from their original encoding
- Glowing nodes are prototypes: consolidated summaries created during sleep cycles
Walk through the graph after a 10-conversation benchmark run and you see the scaling story play out visually. Hub nodes, the concepts that connect many memories, sit at the center, pulling related clusters toward them. Peripheral nodes with weak connections float at the edges. The 57% of dead edges are visible as thin, barely-there lines connecting nodes that never co-activated.
This is a Euclidean projection of what is conceptually a hyperbolic structure. Like a Mercator map, it distorts: nodes near the "boundary" of the knowledge space appear closer together than they should. But it gives an intuition that no table of numbers can: memory has spatial structure, and that structure tells you something about what the agent knows.
The dashboard became our primary research tool. Before running benchmarks, we check the graph visually. After a benchmark, we look for new clusters, orphaned nodes, suspiciously large hubs. We have caught bugs this way: a memory that shouldn't have connected to anything showing up at the centre of a cluster. No automated test would have flagged it.
Where we stand on the leaderboard
The LoCoMo leaderboard provides context for these numbers:
| System | J Score |
|---|---|
| MemMachine v0.2 | 0.912 |
| Mnemoverse v0.7 | 0.862 |
| Memobase | 0.758 |
| Zep | 0.751 |
| Mem0 | 0.669 |
We are second. The gap to first place is 5 percentage points, concentrated in multi-hop reasoning (our 0.707 vs their 0.897). Single-hop and temporal categories are competitive.
We also evaluated on two other benchmarks:
| Benchmark | Key result |
|---|---|
| HotpotQA (500 questions) | F1=78%, retrieval recall=100% |
| LongMemEval (500 questions) | Retrieval recall=99.3% |
Retrieval recall is consistently high across all benchmarks (89-100%). The gap between recall and final accuracy comes from answer extraction: the reader has the right evidence but doesn't always produce concise answers in the expected format. This is our primary area of active work.
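For reference, the token-overlap F1 used by SQuAD-style benchmarks such as HotpotQA looks roughly like this. It is a sketch of the standard metric, not our evaluation harness; official scorers additionally normalise articles and punctuation:

```python
from collections import Counter

def token_f1(prediction, gold):
    """SQuAD-style token-overlap F1 between a predicted and a gold answer."""
    pred = prediction.lower().split()
    gold_t = gold.lower().split()
    common = Counter(pred) & Counter(gold_t)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold_t)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the billing service", "billing service")
# Correct evidence, one extra token: precision 2/3, recall 1, F1 = 0.8
```

This is why verbose-but-correct answers bleed F1 even when retrieval recall is 100%: every extra token in the prediction lowers precision.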
What we haven't solved
Honest accounting of limitations:
Multi-hop is the bottleneck. Questions that require chaining evidence across multiple memories are where we lose to the top system. Our graph expansion helps, but not enough.
Benchmark data caps out at 6K memories. The 5,880 atoms represent all 10 LoCoMo conversations combined, the largest publicly available conversational memory dataset we have found. The engine itself can handle more, but we lack a benchmark with 10× more data to validate quality at that scale. Building one is planned.
Consolidation is disabled in production. The sleep-cycle mechanism that compresses episodic memories into prototypes works on test data but triggers a platform-specific threading issue that we haven't resolved. We ship without it, which means memory only grows, never compresses.
The visualization is Euclidean. The mathematical foundation uses hyperbolic geometry, but the dashboard renders in 3D Euclidean space. It's useful for intuition and bug-catching, but it doesn't faithfully represent the hyperbolic distances. Building a true hyperbolic visualization is future work.
Hyperbolic embeddings are the missing piece. Better embeddings transformed retrieval quality (+11% from one model swap). We expect the same effect for geometric placement: positioning memories in the Poincaré ball according to their hierarchical relationships. But current approaches project Euclidean embeddings onto the hyperbolic manifold, and the projection is noisy. Native hyperbolic embedding models are not yet mature enough for production. Until they are, the geometric layer operates on imprecise coordinates.
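For intuition about the distortion mentioned above, the closed-form geodesic distance in the Poincaré ball shows how the same Euclidean gap costs far more hyperbolic distance near the boundary (a minimal sketch):

```python
from math import acosh

def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball (all points must have norm < 1):
    d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return acosh(1 + 2 * diff2 / ((1 - nu) * (1 - nv)))

# Same Euclidean gap of 0.1, at the centre versus near the boundary:
d_center = poincare_distance((0.0, 0.0), (0.1, 0.0))
d_edge = poincare_distance((0.8, 0.0), (0.9, 0.0))
# d_edge is several times larger than d_center.
```

This is exactly the Mercator-style distortion the 3D dashboard cannot render: two nodes near the boundary that look adjacent on screen may be far apart in the geometry the engine reasons over.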
Related
- Detailed benchmark tables: full results across LoCoMo, HotpotQA, LongMemEval
- Design Language: how each visual element maps to an engine concept
- When AI Cites What Doesn't Exist: why persistent memory needs verification
- AI Memory Landscape 2026: where Mnemoverse fits in the market
Eduard Izgorodin, April 2026 · LinkedIn