Landscape of RAG Solutions for LLM Applications
Last Updated: 2025-08-24
Version: 2.0.0 (AGI-memory essentials)
Why this doc exists
A concise, opinionated guide for building production AGI memory. No vendor catalog, no encyclopedias—only what matters to ship: stable primitives for long-term memory, retrieval quality, feedback learning, and observability.
AGI-memory essentials (August 2025)
- Memory persistence primitives: per-agent, per-identity stores with TTL/decay and reinforcement.
- Retrieval stack: hybrid vector + sparse + graph hops; cheap first, heavy second.
- Knowledge topology: local subgraph-on-demand > global monolith graphs.
- Temporal awareness: recency, episodic windows, and time-weighted scoring.
- Feedback learning: write-backs from conversations and tasks; guard for privacy and secure fields.
- Observability and eval: end-to-end traces, attribution, quality metrics, cost/latency budgets.
- Large context bridge: Combine 200K-1M token windows (Claude Sonnet 4) with targeted retrieval for attribution and cost control.
What to use now (short picks)
Orchestration / Memory framework
- LangGraph (>=0.6.1): Context API replaces config patterns; type-safe Runtime[Context]; persistent memory, HITL, resumable graphs.
  Use when you need multi-step tools + memory. Mind the complexity and the external state store.
- LlamaIndex Agents (>=0.13.0): new AgentWorkflow system replaces legacy agents (breaking migration); PropertyGraphIndex, solid doc parsing.
  Use when you need to ship fast with strong ingestion/evaluation. The graph part is basic.
- Microsoft GraphRAG (production, 2025): graph-first RAG with community detection; Azure Discovery platform.
  Use for multi-hop reasoning and domain graphs; cost-heavy, so prefer subgraph-on-demand.
Vector/Graph storage
- Qdrant: self-hosted, strong payload filters, HNSW + quantization; 4x RPS gains in latest benchmarks; great default for agent memory.
- Weaviate: GraphQL, hybrid search, multi-tenancy; MUVERA multi-vector embeddings; good for cloud/self-host projects needing flexibility.
- Pinecone: managed, sparse+dense hybrid, serverless 2.0 with auto-config; best when you want zero-ops and predictable SLOs.
- Milvus: extreme scale and GPU acceleration; CAGRA index for 10x batch performance; use when your recall/latency budget requires it.
Models & embeddings
- Text embeddings: Voyage AI voyage-3-large (top retrieval quality) or OpenAI text-embedding-3-large/3-small (5x cost reduction via Matryoshka).
- Re-ranking: Cohere Rerank or Voyage for improved MRR on small k.
- LLM context windows: Claude Sonnet 4's 1M-token window covers many use cases without retrieval; still keep retrieval for attribution, freshness, and cost control.
- Multi-modal: voyage-multimodal-3, ColPali for documents; only when it moves product KPIs; otherwise keep text-first for memory.
Production patterns that work
- Memory–retrieval fusion
- Write explicit memory events (facts, skills, preferences, tasks) with typed schemas.
- Retrieve from memory and documents jointly; merge by recency×relevance×confidence.
- Keep a small "working set" cache per session; refill from stores on demand.
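The joint merge above can be sketched in a few lines; the `Hit` fields, the multiplicative score, and the seven-day half-life are illustrative assumptions, not a fixed API:

```python
import time
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    source: str          # "memory" or "documents"
    relevance: float     # semantic similarity in [0, 1]
    confidence: float    # store-assigned confidence in [0, 1]
    created_at: float    # unix timestamp

def recency(hit: Hit, now: float, half_life_s: float = 7 * 24 * 3600) -> float:
    """Exponential recency weight: 1.0 now, 0.5 after one half-life."""
    age = max(0.0, now - hit.created_at)
    return 0.5 ** (age / half_life_s)

def fuse(memory_hits: list[Hit], doc_hits: list[Hit], k: int = 5) -> list[Hit]:
    """Merge memory and document candidates by recency x relevance x confidence."""
    now = time.time()
    pool = memory_hits + doc_hits
    pool.sort(key=lambda h: recency(h, now) * h.relevance * h.confidence, reverse=True)
    return pool[:k]
```

The top-k result of `fuse` is what feeds the per-session working set; refill it from the stores when the cache misses.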
- Subgraph-on-demand
- Build local knowledge subgraphs per query/task via entity/link extraction; expire quickly.
- Use GraphRAG patterns for multi-hop reasoning; avoid whole-corpus graph construction except for narrow domains.
- Large context + targeted retrieval
- Use 1M token context (Claude Sonnet 4) for document analysis; targeted retrieval for attribution and real-time updates.
- Balance cost: large context for reasoning, retrieval for facts and freshness.
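The cost balance can be made concrete with a rough token-budget check; the per-token price and the budget threshold are assumptions for illustration, not current rates:

```python
# Assumed price; check your provider's current rates.
PRICE_PER_MTOK_INPUT = 3.00  # USD per million input tokens

def context_cost_usd(tokens: int) -> float:
    """Rough input cost of stuffing `tokens` into the context window."""
    return tokens / 1_000_000 * PRICE_PER_MTOK_INPUT

def should_retrieve(corpus_tokens: int, budget_usd: float = 0.05) -> bool:
    """Stuff the window when it's cheap; fall back to targeted retrieval otherwise."""
    return context_cost_usd(corpus_tokens) > budget_usd
```

Even a crude gate like this keeps the large window for reasoning-heavy calls while routine fact lookups stay on the retrieval path.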
- Temporal scoring and decay
- Score = alpha·semantic + beta·recency + gamma·reinforcement; apply soft decay per identity.
- Promote items on explicit confirmations or repeated usage; demote noisy snippets.
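The scoring rule can be sketched directly; the alpha/beta/gamma weights, the 14-day half-life, and saturation after five confirmations are illustrative assumptions to tune per identity:

```python
# Illustrative weights; tune per identity and workload.
ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15

def score(semantic: float, age_s: float, reinforcement: int,
          half_life_s: float = 14 * 24 * 3600) -> float:
    """score = alpha*semantic + beta*recency + gamma*reinforcement, all terms in [0, 1]."""
    recency = 0.5 ** (age_s / half_life_s)       # soft exponential decay
    reinforced = min(1.0, reinforcement / 5)     # saturate after ~5 confirmations
    return ALPHA * semantic + BETA * recency + GAMMA * reinforced

def promote(reinforcement: int) -> int:
    """Bump on explicit confirmation or repeated usage."""
    return reinforcement + 1

def demote(reinforcement: int) -> int:
    """Penalize noisy snippets; floor at zero."""
    return max(0, reinforcement - 1)
```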
- Safety and privacy guards
- Never log raw sensitive text; mask secure fields.
- Maintain allow/deny lists for write-backs; require consent for cross-identity joins.
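A minimal write-back guard along these lines, assuming a hypothetical deny list and a single masking pattern (a real guard needs broader patterns plus the consent check for cross-identity joins):

```python
import re

DENY_FIELDS = {"password", "api_key", "ssn"}     # never written back
MASK = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # example: US SSN-shaped strings

def sanitize_event(event: dict) -> dict:
    """Drop deny-listed fields and mask sensitive patterns before write-back."""
    clean = {}
    for key, value in event.items():
        if key in DENY_FIELDS:
            continue
        if isinstance(value, str):
            value = MASK.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```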
- Observability and evaluation
- LangSmith v2 / LlamaTrace with request_id, agent_id, user_id, dataset_id; log top-k, scores, write-backs.
- Track latency budgets: p95 retrieval < 50 ms, total thought loop < 1–2 s.
- Evaluate with retrieval precision/recall, answer faithfulness, and memory hit quality.
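One way to structure those trace records (field names follow the list above; this is plain `logging` plus JSON as a sketch, not the LangSmith or LlamaTrace API):

```python
import json
import logging

logger = logging.getLogger("retrieval")

def log_retrieval(request_id: str, agent_id: str, user_id: str,
                  dataset_id: str, hits: list[dict], wrote_back: bool) -> str:
    """Emit one structured record per retrieval for tracing and offline eval."""
    record = {
        "request_id": request_id,
        "agent_id": agent_id,
        "user_id": user_id,
        "dataset_id": dataset_id,
        "top_k": [{"id": h["id"], "score": round(h["score"], 4)} for h in hits],
        "write_back": wrote_back,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Keeping the record flat and JSON-serializable makes the same payload usable for p95 latency dashboards and for precision/recall eval jobs.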
Minimal checklist (map to picks)
- Persistent memory per agent/identity → LangGraph Context API store | LlamaIndex AgentWorkflow memory | Qdrant/Weaviate.
- Hybrid retrieval (dense+sparse) → Pinecone sparse-dense | Weaviate hybrid | BM25+vector via your own search stack.
- Graph hops when needed → LlamaIndex PropertyGraphIndex (basic) | GraphRAG subgraph-on-demand | Microsoft Discovery platform.
- Large context strategy → Claude Sonnet 4 (1M tokens) for analysis + targeted retrieval for attribution/freshness.
- Best embeddings → Voyage AI voyage-3-large for quality or OpenAI text-embedding-3 for cost optimization.
- Temporal + reinforcement → implement in app layer with simple weights and TTLs.
- Observability → LangSmith v2 / LlamaTrace; structured logs with attribution.
- Cost/latency control → small k + rerank, cache working set, favor subgraphs over global graphs.
What we intentionally ignore (for core AGI memory)
- Prototyping DBs (e.g., Chroma) for production memory stores.
- All-in-one stacks with low adoption (txtai, Marqo, Jina) unless a concrete requirement demands them.
- Heavy enterprise search engines (Vespa) unless you operate at that scale already.
- Giant comparison matrices—signal gets lost; we keep a living checklist instead.
Our bets for Mnemoverse
- Graph–Vector Hybrid Memory: fuse vector recall with on-demand local graphs for reasoning.
- Memory-Aware Retrieval Engine: retrieval priorities adapt from agent’s accumulated experience.
- Hyperbolic Embeddings (R&D): hierarchical spaces for topics/skills; pilot behind a flag.
- GPU-Native Indexing: accelerate ingestion, re-index, and online updates.
- Spatial UI (3D): practical, collaborative navigation for memory curation and debugging.
Sources (verified links)
- LlamaIndex: docs.llamaindex.ai — AgentWorkflow migration guide and PropertyGraphIndex
- LangGraph: langchain-ai.github.io/langgraph — Context API and deployment docs
- LangSmith: docs.smith.langchain.com — observability and evaluation platform
- Qdrant: qdrant.tech/documentation — vector search engine and performance benchmarks
- Weaviate: weaviate.io/developers/weaviate — vector database and hybrid search
- Pinecone: docs.pinecone.io — serverless vector database and sparse-dense hybrid
- Milvus: milvus.io/docs — distributed vector database and CAGRA index
- Microsoft GraphRAG: microsoft.github.io/graphrag — knowledge graph RAG framework
- Anthropic Claude: docs.anthropic.com — model cards and context window capabilities
- Voyage AI: docs.voyageai.com — embedding models and retrieval performance
See also:
- Memory Solutions Landscape (complement)
- Vector–Graph Experience RAG pattern
- Spatial Memory Design Language
- Core Mathematical Theory (spatial retrieval)