Landscape of RAG Solutions for LLM Applications
Last Updated: 2025-08-24
Version: 2.0.0 (AGI-memory essentials)
Why this doc exists
A concise, opinionated guide for building production AGI memory. No vendor catalog, no encyclopedias—only what matters to ship: stable primitives for long-term memory, retrieval quality, feedback learning, and observability.
AGI-memory essentials (August 2025)
- Memory persistence primitives: per-agent, per-identity stores with TTL/decay and reinforcement.
- Retrieval stack: hybrid vector + sparse + graph hops; cheap first, heavy second.
- Knowledge topology: local subgraph-on-demand > global monolith graphs.
- Temporal awareness: recency, episodic windows, and time-weighted scoring.
- Feedback learning: write-backs from conversations and tasks; guard for privacy and secure fields.
- Observability and eval: end-to-end traces, attribution, quality metrics, cost/latency budgets.
- Large context bridge: Combine 200K-1M token windows (Claude Sonnet 4) with targeted retrieval for attribution and cost control.
What to use now (short picks)
Orchestration / Memory framework
- LangGraph (>=0.6.1): Context API replaces config patterns; type-safe Runtime[Context]; persistent memory, HITL, resumable graphs.
  Use when you need multi-step tools + memory. Mind the complexity and the external state store.
- LlamaIndex Agents (>=0.13.0): new AgentWorkflow system replaces legacy agents (breaking migration); PropertyGraphIndex, solid doc parsing.
  Use when you need to ship fast with strong ingestion/evaluation. The graph part is basic.
- Microsoft GraphRAG (production, 2025): graph-first RAG with community detection; Azure Discovery platform.
  Use for multi-hop reasoning and domain graphs; cost-heavy, so prefer subgraph-on-demand.
Vector/Graph storage
- Qdrant: self-hosted, strong payload filters, HNSW + quantization; 4x RPS gains in latest benchmarks; great default for agent memory.
- Weaviate: GraphQL, hybrid search, multi-tenancy; MUVERA multi-vector embeddings; good for cloud/self-host projects needing flexibility.
- Pinecone: managed, sparse+dense hybrid, serverless 2.0 with auto-config; best when you want zero-ops and predictable SLOs.
- Milvus: extreme scale and GPU acceleration; CAGRA index for 10x batch performance; use when your recall/latency budget requires it.
Models & embeddings
- Text embeddings: Voyage AI voyage-3-large (top retrieval quality) or OpenAI text-embedding-3-large/3-small (5x cost reduction via Matryoshka).
- Re-ranking: Cohere Rerank or Voyage for improved MRR on small k.
- LLM context windows: Claude Sonnet 4's 1M-token window covers many use cases without retrieval; still keep retrieval for attribution, freshness, and cost control.
- Multi-modal: voyage-multimodal-3, ColPali for documents; only when it moves product KPIs; otherwise keep text-first for memory.
Production patterns that work
- Memory–retrieval fusion
- Write explicit memory events (facts, skills, preferences, tasks) with typed schemas.
- Retrieve from memory and documents jointly; merge by recency×relevance×confidence.
- Keep a small "working set" cache per session; refill from stores on demand.
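The joint merge above can be sketched in a few lines; the `Hit` fields, the multiplicative score, and the seven-day half-life are illustrative assumptions, not a fixed API:

```python
import time
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    source: str          # "memory" or "documents"
    relevance: float     # semantic similarity in [0, 1]
    confidence: float    # store-assigned confidence in [0, 1]
    created_at: float    # unix timestamp

def recency(hit: Hit, now: float, half_life_s: float = 7 * 24 * 3600) -> float:
    """Exponential recency weight: 1.0 now, 0.5 after one half-life."""
    age = max(0.0, now - hit.created_at)
    return 0.5 ** (age / half_life_s)

def fuse(memory_hits: list[Hit], doc_hits: list[Hit], k: int = 5) -> list[Hit]:
    """Merge memory and document candidates by recency x relevance x confidence."""
    now = time.time()
    pool = memory_hits + doc_hits
    pool.sort(key=lambda h: recency(h, now) * h.relevance * h.confidence, reverse=True)
    return pool[:k]
```

The top-k result of `fuse` is what feeds the per-session working set; refill it from the stores when the cache misses.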
- Subgraph-on-demand
- Build local knowledge subgraphs per query/task via entity/link extraction; expire quickly.
- Use GraphRAG patterns for multi-hop reasoning; avoid whole-corpus graph construction except for narrow domains.
- Large context + targeted retrieval
- Use 1M token context (Claude Sonnet 4) for document analysis; targeted retrieval for attribution and real-time updates.
- Balance cost: large context for reasoning, retrieval for facts and freshness.
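The cost balance can be made concrete with a rough token-budget check; the per-token price and the budget threshold are assumptions for illustration, not current rates:

```python
# Assumed price; check your provider's current rates.
PRICE_PER_MTOK_INPUT = 3.00  # USD per million input tokens

def context_cost_usd(tokens: int) -> float:
    """Rough input cost of stuffing `tokens` into the context window."""
    return tokens / 1_000_000 * PRICE_PER_MTOK_INPUT

def should_retrieve(corpus_tokens: int, budget_usd: float = 0.05) -> bool:
    """Stuff the window when it's cheap; fall back to targeted retrieval otherwise."""
    return context_cost_usd(corpus_tokens) > budget_usd
```

Even a crude gate like this keeps the large window for reasoning-heavy calls while routine fact lookups stay on the retrieval path.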
- Temporal scoring and decay
- Score = alpha·semantic + beta·recency + gamma·reinforcement; apply soft decay per identity.
- Promote items on explicit confirmations or repeated usage; demote noisy snippets.
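The scoring rule can be sketched directly; the alpha/beta/gamma weights, the 14-day half-life, and saturation after five confirmations are illustrative assumptions to tune per identity:

```python
# Illustrative weights; tune per identity and workload.
ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15

def score(semantic: float, age_s: float, reinforcement: int,
          half_life_s: float = 14 * 24 * 3600) -> float:
    """score = alpha*semantic + beta*recency + gamma*reinforcement, all terms in [0, 1]."""
    recency = 0.5 ** (age_s / half_life_s)       # soft exponential decay
    reinforced = min(1.0, reinforcement / 5)     # saturate after ~5 confirmations
    return ALPHA * semantic + BETA * recency + GAMMA * reinforced

def promote(reinforcement: int) -> int:
    """Bump on explicit confirmation or repeated usage."""
    return reinforcement + 1

def demote(reinforcement: int) -> int:
    """Penalize noisy snippets; floor at zero."""
    return max(0, reinforcement - 1)
```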
- Safety and privacy guards
- Never log raw sensitive text; mask secure fields.
- Maintain allow/deny lists for write-backs; require consent for cross-identity joins.
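A minimal write-back guard along these lines, assuming a hypothetical deny list and a single masking pattern (a real guard needs broader patterns plus the consent check for cross-identity joins):

```python
import re

DENY_FIELDS = {"password", "api_key", "ssn"}     # never written back
MASK = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # example: US SSN-shaped strings

def sanitize_event(event: dict) -> dict:
    """Drop deny-listed fields and mask sensitive patterns before write-back."""
    clean = {}
    for key, value in event.items():
        if key in DENY_FIELDS:
            continue
        if isinstance(value, str):
            value = MASK.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```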
- Observability and evaluation
- LangSmith v2 / LlamaTrace with request_id, agent_id, user_id, dataset_id; log top-k, scores, write-backs.
- Track latency budgets: p95 retrieval < 50 ms, total thought loop < 1–2 s.
- Evaluate with retrieval precision/recall, answer faithfulness, and memory hit quality.
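One way to structure those trace records (field names follow the list above; this is plain `logging` plus JSON as a sketch, not the LangSmith or LlamaTrace API):

```python
import json
import logging

logger = logging.getLogger("retrieval")

def log_retrieval(request_id: str, agent_id: str, user_id: str,
                  dataset_id: str, hits: list[dict], wrote_back: bool) -> str:
    """Emit one structured record per retrieval for tracing and offline eval."""
    record = {
        "request_id": request_id,
        "agent_id": agent_id,
        "user_id": user_id,
        "dataset_id": dataset_id,
        "top_k": [{"id": h["id"], "score": round(h["score"], 4)} for h in hits],
        "write_back": wrote_back,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Keeping the record flat and JSON-serializable makes the same payload usable for p95 latency dashboards and for precision/recall eval jobs.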
Minimal checklist (map to picks)
- Persistent memory per agent/identity → LangGraph Context API store | LlamaIndex AgentWorkflow memory | Qdrant/Weaviate.
- Hybrid retrieval (dense+sparse) → Pinecone sparse-dense | Weaviate hybrid | BM25+vector via your own search stack.
- Graph hops when needed → LlamaIndex PropertyGraphIndex (basic) | GraphRAG subgraph-on-demand | Microsoft Discovery platform.
- Large context strategy → Claude Sonnet 4 (1M tokens) for analysis + targeted retrieval for attribution/freshness.
- Best embeddings → Voyage AI voyage-3-large for quality or OpenAI text-embedding-3 for cost optimization.
- Temporal + reinforcement → implement in app layer with simple weights and TTLs.
- Observability → LangSmith v2 / LlamaTrace; structured logs with attribution.
- Cost/latency control → small k + rerank, cache working set, favor subgraphs over global graphs.
What we intentionally ignore (for core AGI memory)
- Prototyping DBs (e.g., Chroma) for production memory stores.
- All-in-one stacks with low adoption (txtai, Marqo, Jina) unless a concrete requirement demands them.
- Heavy enterprise search engines (Vespa) unless you operate at that scale already.
- Giant comparison matrices—signal gets lost; we keep a living checklist instead.
Our bets for Mnemoverse
- Graph–Vector Hybrid Memory: fuse vector recall with on-demand local graphs for reasoning.
- Memory-Aware Retrieval Engine: retrieval priorities adapt from agent’s accumulated experience.
- Hyperbolic Embeddings (R&D): hierarchical spaces for topics/skills; pilot behind a flag.
- GPU-Native Indexing: accelerate ingestion, re-index, and online updates.
- Spatial UI (3D): practical, collaborative navigation for memory curation and debugging.
Sources (verified links)
- LlamaIndex: docs.llamaindex.ai — AgentWorkflow migration guide and PropertyGraphIndex
- LangGraph: langchain-ai.github.io/langgraph — Context API and deployment docs
- LangSmith: docs.smith.langchain.com — observability and evaluation platform
- Qdrant: qdrant.tech/documentation — vector search engine and performance benchmarks
- Weaviate: weaviate.io/developers/weaviate — vector database and hybrid search
- Pinecone: docs.pinecone.io — serverless vector database and sparse-dense hybrid
- Milvus: milvus.io/docs — distributed vector database and CAGRA index
- Microsoft GraphRAG: microsoft.github.io/graphrag — knowledge graph RAG framework
- Anthropic Claude: docs.anthropic.com — model cards and context window capabilities
- Voyage AI: docs.voyageai.com — embedding models and retrieval performance
See also:
- Memory Solutions Landscape (complement)
- Vector–Graph Experience RAG pattern
- Spatial Memory Design Language
- Core Mathematical Theory (spatial retrieval)